See PHI1403, where I'd like to do a local repository migration in the current cluster to spread load.
This most recently happened in PHI1040, and currently has some manual steps. The general flow is to set up a host:
- Select or provision a repository shard.
- Provisioning was once close-ish to automated. Is this close enough to automate?
- If this is becoming dedicated or "one big tenant", close it to new allocations.
- Original shard is A. New shard is B.
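Provisioning itself isn't sketched here, but a minimal by-hand preflight on a candidate shard B might look something like this (the hostname is a placeholder, and /var/repo is only the stock default repository path):

```
# Sanity-check the candidate shard B before pointing anything at it.
$ ssh repo2.example.com 'df -h /var/repo && git --version'
```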
Then, ideally:
- Add service B to the service bindings for the instance on the central authority server.
- Synchronize services so the instance knows about both services A and B.
- Does this work today? It's not normal for instances to have more than one repository service.
- Does having two synchronized services result in an error when trying to create a new repository? (This is probably fine to suffer briefly, but could it easily be avoided if it's a problem?)
- Do the current settings allow us to close "A" to new allocations here easily?
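I don't have answers to the sync questions above, but verifying the outcome from the instance side is easy enough: after a sync, both services should be visible over Conduit. A rough sketch, assuming the instance URI and that Conduit credentials are already configured (this isn't part of any existing runbook):

```
# After synchronizing, both repository services (A and B) should show up here.
$ echo '{}' | arc call-conduit \
    --conduit-uri https://instancename.phacility.com/ \
    -- almanac.service.search
```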
Then, for each repository (a rough consolidated sketch follows this list):
- Put it in the best approximation of read-only mode that we can.
- This doesn't really exist yet. Is it easy to build / worth building in the general case?
- If this won't exist, what's the best approximation? (e.g., intentionally break the object policy to make it unwritable)
- Ideally, "Read Only" should probably mean "stop observation" for observed repositories.
- Copy the actual working copy from shard A to shard B.
- Anecdotally, from the last time around, gzipping the tarball didn't really do much. This might more broadly imply that we'd be better off not compressing repository backups.
- Is the 2GB HTTP stuff in T12907 realistic to fix? scp works fine if the answer is "no".
- Change the service PHID from A to B in the database.
- Can this be formalized with a real transaction?
- `bin/repository thaw --promote --force` the working copy on B to let clustering mechanisms know that we've forcefully copied the data over.
- Pull the repository out of read-only mode.
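Pulling the per-repository steps together, a very rough sketch of the by-hand version. Hostnames, the rXYZ identifier, paths, and the database/table/column names are placeholders or from memory, not a vetted procedure; the UPDATE in particular is exactly the ad-hoc edit the "real transaction" question above is about.

```
# 1. Best-effort "read only": stop the daemons on shard A so observation and
#    scheduled pulls stop touching the working copy. (On a multi-tenant shard
#    this is heavier-handed than we'd really want.)
repo1$ ./bin/phd stop

# 2. Copy the working copy across, uncompressed (gzip didn't buy much last time).
repo1$ tar -cf /tmp/rXYZ.tar -C /var/repo rXYZ
repo1$ scp /tmp/rXYZ.tar repo2.example.com:/tmp/
repo2$ tar -xf /tmp/rXYZ.tar -C /var/repo

# 3. Repoint the repository at service B. Today this is a raw database edit
#    (run wherever the instance's database actually lives); verify the schema
#    names before running anything like this.
repo2$ mysql instancename_repository -e \
    "UPDATE repository SET almanacServicePHID = 'PHID-ASRV-bbbb'
       WHERE phid = 'PHID-REPO-xyz';"

# 4. Tell the cluster version clocks that B's copy is now authoritative
#    (the device name here is an assumption; see "bin/repository help thaw").
repo2$ ./bin/repository thaw rXYZ --promote repo2.example.com --force

# 5. Leave "read only" mode by starting the daemons again.
repo2$ ./bin/phd start
```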
Finally:
- Delete or disable A if we didn't already effectively take it out of service earlier.
- Whatever state things end up in should be robust to the next Service Sync operation.
- Whatever state things end up in should know that this instance no longer starts daemons on A.
- Restart the daemons on all active shards.
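The daemon restart itself is mechanical; per host it's just the usual restart, and the real work is making sure A is no longer in the set of hosts this runs on (hostnames are placeholders):

```
# Restart daemons on every shard still in service; A is intentionally absent.
repo2$ ./bin/phd restart
repo3$ ./bin/phd restart
```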
In a perfect world, some maintenance side channel would also explain this process to users. This doesn't exist today.