
Improve repository shard migration pathway in the shared cluster
Closed, Resolved · Public

Description

See PHI1403, where I'd like to do a local repository migration in the current cluster to spread load.

This most recently happened in PHI1040, and the process currently has some manual steps. The general flow is to set up a host:

  • Select or provision a repository shard.
    • Provisioning was once close-ish to automated. Is this close enough to automate?
  • If this is becoming dedicated or "one big tenant", close it to new allocations.
  • Original shard is A. New shard is B.

Then, ideally:

  • Add service B to the service bindings for the instance on the central authority server.
    • Does this work normally today?
    • Is the policy bug in T13156 fixed (expectation: yes)?
    • Can the policy behavior be simplified after the policy changes in Phortune connected to T13366?
    • Can we get rid of the instance-specific services completely now, after changes connected to T11413?
  • Synchronize services so the instance knows about both services A and B.
    • Does this work today? It's not normal for instances to have more than one repository service.
  • Does having two synchronized services result in an error when trying to create a new repository? (This is probably fine to suffer briefly, but could it easily be avoided if it's a problem?)
    • Do the current settings allow us to close "A" to new allocations here easily?

Then, for each repository:

  • Put it in the best approximation of read-only mode that we can.
    • This doesn't really exist yet. Is it easy to build / worth building in the general case?
    • If this won't exist, what's the best approximation? (e.g., intentionally break the object policy to make it unwritable)
    • Ideally, "Read Only" should probably mean "stop observation" for observed repositories.
  • Copy the actual working copy from shard A to shard B.
    • Anecdotally from the last time around, gzipping the tarball didn't really do much. This might more broadly imply that we'd be better off not compressing repository backups.
    • Is the 2GB HTTP stuff in T12907 realistic to fix? scp works fine if the answer is "no".
  • Change the service PHID from A to B in the database.
    • Can this be formalized with a real transaction?
  • bin/repository thaw --promote --force the working copy on B to let clustering mechanisms know that we've forcefully copied the data over.
  • Pull the repository out of read-only mode.

Finally:

  • Delete or disable A if we didn't already effectively take it out of service earlier.
    • Whatever state things end up in should be robust to the next Service Sync operation.
    • Whatever state things end up in should know that this instance no longer starts daemons on A.
  • Restart the daemons on all active shards.

In a perfect world, some maintenance side channel would also explain this process to users. This doesn't exist today.

Revisions and Commits

Restricted Differential Revision
rP Phabricator
D20754
D20748

Event Timeline

epriestley triaged this task as Normal priority. Aug 29 2019, 3:04 PM
epriestley created this task.

Is the 2GB HTTP stuff in T12907 realistic to fix?

I've reframed this as T13396. This is probably the most complicated question this task poses and has also been a thorn in my side forever, so I'm hoping this is a reasonable excuse to find a way forward here.

I'm hoping this is a reasonable excuse to find a way forward [on >2GB downloads] here.

I'm vetting this on a real production workload now (internally, see PHI1329) but it looks promising locally.

Put [the repository] in the best approximation of read-only mode that we can. Is it easy to build / worth building in the general case?

I'm going to look at writing a patch here under the assumption that the answer to this question is "yes". My expectation is:

  • Some kind of "read only" mode flag with an optional human-readable message.
  • This flag should appear in the web UI and stop observation pulls, HTTP pushes, and SSH pushes. It's still fine to serve reads.
  • I'm also going to do phase 1 of T13286 here since the testing overlaps.
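The expected behavior above might be sketched roughly like this. This is a guess at the shape, not the real implementation; all class, flag, and method names here are hypothetical:

```python
class Repository:
    """Sketch of the proposed "read only" (maintenance) mode.

    A flag with an optional human-readable message; while set, it
    blocks observation pulls and HTTP/SSH pushes, but reads are
    still served. All names here are hypothetical.
    """

    def __init__(self):
        self.maintenance_message = None

    def start_maintenance(self, message="Repository is under maintenance."):
        self.maintenance_message = message

    def stop_maintenance(self):
        self.maintenance_message = None

    def can_write(self):
        # Pushes (and observation pulls) are refused in maintenance mode.
        return self.maintenance_message is None

    def can_read(self):
        # Serving reads is still fine in maintenance mode.
        return True

repo = Repository()
repo.start_maintenance("Migrating to a new shard.")
print(repo.can_write(), repo.can_read())  # False True
```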

Mostly-promising answers on much of the rest of this:

Does [adding repository services] work normally today?

Yep.

Caveat: adding a dead service (repo012) causes sync failure because daemons on the host can't restart.

Is the policy bug in T13156 fixed (expectation: yes)?

Yes.

Does [instances with multiple repository services] work today? It's not normal for instances to have more than one repository service.

Yes, this works fine. All the services sync properly.

Does having two synchronized services result in an error when trying to create a new repository? (This is probably fine to suffer briefly, but could it easily be avoided if it's a problem?)

This works fine: we shuffle valid services and pick one randomly.
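The shuffle-and-pick behavior described above can be sketched as follows. This is only an illustration of the selection rule; the `closed` flag is a hypothetical stand-in for the real Almanac service property:

```python
import random

def pick_repository_service(services):
    """Pick a repository service for a new repository.

    Filter to services that are still open to new allocations, then
    choose one at random, per the behavior described above.
    """
    open_services = [s for s in services if not s.get("closed")]
    if not open_services:
        raise RuntimeError("No repository service is open to allocations.")
    return random.choice(open_services)

services = [
    {"name": "repo001", "closed": True},   # shard A, closed to allocations
    {"name": "repo002", "closed": False},  # shard B, open
]
print(pick_repository_service(services)["name"])  # repo002
```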

Do the current settings allow us to close "A" to new allocations here easily?

No:

  • There's no easy way to actually write the property, e.g. bin/almanac set-property <object> <key> <value> or whatever.
  • Synchronizing the service overwrites modifications to the properties, since the central service definition is authoritative.

Can [changing the service PHID] be formalized with a real transaction?

Already formalized and accessible via bin/repository clusterize. This workflow is missing a --force flag to get by some thaw prompts but we won't currently hit them anyway.

Whatever state things end up in should be robust to the next Service Sync operation.
Whatever state things end up in should know that this instance no longer starts daemons on A.

Daemon restarts come from admin, so unless I come up with something better, this most likely makes sense as a set of instance properties that service sync knows about.


Can we get rid of the instance-specific services completely now, after changes connected to T11413?

I strongly suspect the answer is "yes" but still need to do some legwork to confirm this.

Can the policy behavior be simplified after the policy changes in Phortune connected to T13366?

Not sure yet.

Can we get rid of the instance-specific services completely now, after changes connected to T11413?

To refresh context here, when a request hit turtle.phacility.com we once called almanac.service.search to find a related turtle.phacility.com service, then read properties from it to figure out what to do with the request (for example: is the instance active, or suspended?).

This mostly worked fine, but eventually ran into some scaling/performance issues since almanac.service.search doesn't work especially hard to do any caching and our production cache access pattern isn't terribly well-served by doing single-key lookup. Specifically, as the size of the web tier expands, this pattern tends to mean a large fraction of requests to turtle.phacility.com are served by a host which hasn't served a turtle.phacility.com request recently and needs to do a central service lookup.

D20082 removed this call, and we now use instances.state, which has tailored performance/cache behavior and versioning and is generally a more specialized tool. This has been stable since earlier this year.

Service synchronization still uses almanac.service.search, but has never synchronized the instance service, so it is not impacted here.

The service "wipe" workflow (during import of external data) still makes an almanac.service.search call, but only to determine instance status. This should be trivial to replace.

The instance profile controller loads the instance service, but purely for display.

The InstanceEditor creates and synchronizes the service, but can trivially stop doing that.

The "Instance" almanac service type can be destroyed.

"Instance->getAlmanacServiceName()" can be removed, as can the code to destroy the service on instance destruction.

"InstancesWorker" has a "loadService()" method but this has no callers.


Adjacent is the older instances.queryinstances API method. This is still used by service synchronization. This isn't trivial to throw out since it does some one-off OAuth stuff, but it looks like it has no meaningful dependency on the Almanac instance service.


Can the policy behavior be simplified after the policy changes in Phortune connected to T13366?

It looks like the answer is "yes". User authority is used to let merchant managers view instances, but all of these users are instances application managers anyway, so this check is redundant. This may have been an attempt to carve out a narrower support role without instance management authority, but there are better ways to do that.

It is also used to cache InstancesManageCapability::CAPABILITY but this can easily just be cached in the request cache instead.

epriestley added a revision: Restricted Differential Revision. Aug 30 2019, 12:29 AM
epriestley added a revision: Restricted Differential Revision. Aug 30 2019, 1:50 AM

The "Instance" almanac service type can be destroyed.

This is still alive but we can't cleanly get rid of it until all the actual services are destroyed.

epriestley added a commit: Restricted Diffusion Commit. Aug 30 2019, 1:54 AM
epriestley added a commit: Restricted Diffusion Commit. Aug 30 2019, 1:58 AM

It is also used to cache InstancesManageCapability::CAPABILITY but this can easily just be cached in the request cache instead.

Path forward here is:

  • Swap this test for a request cache based test, analogous to D20716.
  • Replace the "InstancesInstance" test callsites.
  • Get rid of the merchant policy test, which has no clear reason (?) to exist.
  • Remove the "Authority" mechanism in Phabricator.
epriestley added a revision: Restricted Differential Revision. Aug 30 2019, 4:12 PM

Adjacent is the older instances.queryinstances API method. This is still used by service synchronization.

I moved this to T13399.

I think this leaves us with:

  • admin needs to know which services are closed to allocations and which services should not start daemons.
    • These are not the same thing in the general case: a service may be "full" so we don't want to put new repositories there, but may still have active repositories which need daemons present.
    • These behaviors are both per-instance, not per-service. A service may be in any open-to-allocation/daemon state for a given instance independent of the service state for other instances, except that "open to allocation, no daemons" is a questionable state.
  • Service synchronization needs to propagate the "closed" flag to instance repository services in Almanac. This may end up bringing T13399 back into scope despite my efforts to separate it.
  • Service restarts via the bin/services workflows and bin/host workflows need to respect the daemon flag.
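The per-instance, per-service state described above might be modeled roughly like this. This is a hypothetical data model for illustration; the real implementation would store these as instance properties that service sync knows about:

```python
from dataclasses import dataclass

@dataclass
class InstanceServiceState:
    """Hypothetical per-instance flags for one repository service.

    "open_to_allocations" controls whether new repositories may be
    placed on the service; "run_daemons" controls whether this
    instance starts daemons there. These vary independently per
    instance, except that "open, no daemons" is a questionable state.
    """
    service: str
    open_to_allocations: bool
    run_daemons: bool

    def is_questionable(self):
        # Accepting new repositories while running no daemons makes no sense.
        return self.open_to_allocations and not self.run_daemons

# A "full" shard keeps daemons for its existing repositories but takes
# no new ones; a fresh shard is open and runs daemons.
full_shard = InstanceServiceState("repo011", open_to_allocations=False, run_daemons=True)
new_shard = InstanceServiceState("repo014", open_to_allocations=True, run_daemons=True)
print(full_shard.is_questionable(), new_shard.is_questionable())  # False False
```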

Once these things work, I think every step can be carried out manually without any funny business, and we're just missing glue to tie the bits together.

The existing bin/instances move flow has some existing glue that should probably either be used, built upon, or removed.

This question is also still open, I think I'm going to see if I can scope it next and figure out if it's realistic to get online here:

Provisioning was once close-ish to automated. Is this close enough to automate?

Other addenda:

The "Instance" almanac service type can be destroyed [in the future].
I'm also going to do phase 1 of T13286 here since the testing overlaps.

Provisioning was once close-ish to automated. Is this close enough to automate?

This is in somewhat rough shape. I think it was a little closer in a scratch branch on my old laptop, but I missed it when pulling stuff off the machine.

I think this is probably not worth pulling into scope. This process is pretty separate and theoretically doomed in the long run anyway. There are a lot of small pieces (like moving phage fully to pools) which it would be hard to keep out of scope and which increase risk here, too.

I'm going to focus on defining and propagating the closed/daemon behaviors instead, then sort through the "move" glue.

epriestley added a revision: Restricted Differential Revision. Aug 30 2019, 8:21 PM

daemon behaviors

Daemons restart along one indirect pathway and one direct pathway.

The indirect pathway is InstancesSyncServicesWorker (which is now very confusingly named, since it does not correspond to "Sync Services" in the UI and does not synchronize service definitions; that is InstancesSyncWorker). This runs bin/host restart or bin/host stop based on overall instance state.

bin/host restart operates in two modes, either "all instances on the host" or "a specific set of instances".

Possibly, this should really be two different workflows, because we get an ambiguous answer to this question:

  • If you run bin/host restart --instance X, and instance X normally should not have any daemons running on the host, should we start daemons?

For bin/host stop, the answer to the related question is "yes": we stop daemons even if we don't expect them to be running. This seems correct.

bin/host restart can really mean two different things:

  • Synchronize this entire host to the right state (when run in "all" mode). This stops everything, then starts all the correct instance daemons.
  • Start or restart daemons for one instance. This just runs restart on that instance.

The "all" mode is also misleading because it does not actually restart host-level daemons like MySQL and Apache. These are restarted by bin/host upgrade.

In a perfect world, the "all" mode would probably be bin/host synchronize-instance-services, and bin/host upgrade would be bin/host synchronize-host-services. This is reinforced by the recent addition of bin/host upgrade --keep-versions, which means "do not upgrade". Yikes.

For now, I think the least-likely-to-do-anything-surprising behavior is:

  • bin/host restart does not start no-daemon services.
  • bin/host restart --instance X also does not start no-daemon services. The only reason for this to start services is diagnostic, and it's an edge case anyway.
  • SyncServices makes a separate start/stop decision based on no-daemon state.

That unfortunately means that both restart and SyncServices need to make daemon/no-daemon decisions, but the former is the much more complicated one and very clearly needs to make a decision no matter what, so this isn't really a great loss.
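Under those rules, the restart decision might look like the sketch below. The instance list shape and the "no_daemons" flag are hypothetical; this only illustrates that both the "all" mode and the "--instance X" mode skip no-daemon services:

```python
def host_restart_plan(instances, requested=None):
    """Sketch of the proposed `bin/host restart` behavior.

    In "all" mode (requested is None), restart daemons only for
    instances not flagged no-daemon. In "--instance X" mode, the
    same rule applies: no-daemon services are not started even
    when named explicitly.
    """
    if requested is not None:
        instances = [i for i in instances if i["name"] in requested]
    return [i["name"] for i in instances if not i["no_daemons"]]

instances = [
    {"name": "turtle", "no_daemons": False},
    {"name": "tortoise", "no_daemons": True},
]
print(host_restart_plan(instances))                          # ['turtle']
print(host_restart_plan(instances, requested=["tortoise"]))  # []
```

`bin/host stop`, by contrast, stops daemons unconditionally, which the discussion above concludes is correct.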

epriestley added a revision: Restricted Differential Revision. Aug 31 2019, 1:40 AM
epriestley added a revision: Restricted Differential Revision. Aug 31 2019, 1:46 AM
epriestley added a commit: Restricted Diffusion Commit. Aug 31 2019, 2:23 PM
epriestley added a commit: Restricted Diffusion Commit. Aug 31 2019, 2:25 PM
epriestley added a commit: Restricted Diffusion Commit.

bin/host restart does not start no-daemon services.

This turns out to be somewhat complicated to sequence.

bin/host is actually also driven by instances.queryinstances (see T13399). I'd prefer it be driven by instances.state. This is mostly easy to swap since we use very few properties of instances.queryinstances, but instances.state needs additional constraints (device name, internal instance name).

The InstanceRef and related classes also currently live in "services/", but should move to "core/", since it's about to start calling instances.state.

I think it's likely safer to put the capabilities into production before changing "core/", even though deploying admin theoretically does not require making API calls, and deployment is theoretically robust against moving the "Ref" classes. But there's a whole huge stack of stuff in the deployment queue right now (including all the Week 34 changes to Phortune) and I'd like to spread risk into smaller events. I'm going to cut a release and deploy now, then look at moving forward on the daemon pieces.

epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 12:19 AM
epriestley added a revision: Restricted Differential Revision.
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 12:35 AM
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 4:13 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 4:16 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 4:41 PM
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 4:44 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 4:45 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 4:51 PM
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 5:09 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 5:10 PM
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 5:12 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 5:13 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 5:19 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 5:19 PM
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 5:29 PM
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 5:31 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 5:31 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 5:34 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.

I believe I've now moved "core/" off of "instances.queryinstances" and sequenced all the followup changes properly, and that the only remaining piece is glue.

The glue needs to do this service stuff, although it's easy to do manually:

  • Add a service binding to the instance if it does not already exist.
  • Disable allocations on the old binding (based on a flag?). Maybe this is just manual for now.
  • Synchronize service definitions inline.

Then the not-so-easy-to-do-manually mechanical part. For each repository:

  • Put it in maintenance mode.
  • Archive and upload the working copy.
  • Download the working copy and place it on disk, probably moving any existing directory aside?
  • Clusterize to swap the host service.
  • Thaw the repository.
  • Pull it out of maintenance mode.

At the very end:

  • Restart the daemons for good measure? This shouldn't actually be necessary.

Most of these steps already have subprocesses:

$ ./bin/repository maintenance rXYZ --start <message>
$ # Archive-and-upload.
$ # Download-and-unarchive.
$ ./bin/repository clusterize rXYZ --service <servicePHID>
$ ./bin/repository thaw rXYZ --promote <device> --force
$ ./bin/repository maintenance rXYZ --stop

So the file transfer pieces are really the only missing mechanics.
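Glued together, the per-repository sequence above might be driven by something like the sketch below. The archive/transfer steps are left as placeholders since they are the missing mechanics, and the service PHID, device name, and message values are illustrative only:

```python
def migration_commands(repo, service_phid, device, message):
    """Build the per-repository command sequence shown above.

    The upload/download steps are placeholders for the missing file
    transfer glue; everything else maps onto existing
    `bin/repository` subcommands from the listing above.
    """
    return [
        ["bin/repository", "maintenance", repo, "--start", message],
        # ... archive-and-upload from shard A, then
        # ... download-and-unarchive onto shard B (missing glue) ...
        ["bin/repository", "clusterize", repo, "--service", service_phid],
        ["bin/repository", "thaw", repo, "--promote", device, "--force"],
        ["bin/repository", "maintenance", repo, "--stop"],
    ]

# Placeholder values for illustration; real PHIDs and device names differ.
for cmd in migration_commands(
        "rXYZ", "PHID-SERV-placeholder", "repo014.example", "Migrating shards"):
    print(" ".join(cmd))
```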

The actual "upload" and "download" parts are no problem, but the "archive" and "unarchive" pieces aren't entirely straightforward, and overlap somewhat with backup workflows (for example, we archive a Git repository by cloning it, not by copying it).

Anecdotally from the last time around, gzipping the tarball didn't really do much. This might more broadly imply that we'd be better off not compressing repository backups.

Anecdata on this:

Mirror                      | Time     | .tar Time | .tar Size | .tgz Time | .tgz Size | Breakpoint
Phabricator (Non-Bare / cp) | 3,500ms  | 700ms     | 188MB     | 6,600ms   | 154MB     | 6MB/sec
Phabricator (Bare)          | 6,700ms  | 230ms     | 139MB     | 4,700ms   | 136MB     | 670KB/sec
Phabricator (Bare + Repack) | "        | "         | "         | "         | "         | "
PHI1403 Primary Repository  | 42,000ms | 700ms     | 400MB     | 27,800ms  | 387MB     | 480KB/sec
" + Repack                  | 36,000ms | 1,200ms   | 373MB     | 18,800ms  | 377MB     | 230KB/sec

The first row is kind of bogus since it's working copy + history, not just history.

Network speed should be substantially better than ~700KB/sec, the worst breakpoint in the other cases. In the save/load workflow we technically pay the transfer cost twice (once to upload, once to download) and the compression cost twice (once to compress, once to decompress, which is normally much faster), so the real breakpoints are probably higher than this suggests. Even so, it looks like we're probably better off without compression, assuming these repositories are approximately typical.
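The breakpoints above are consistent with a simple break-even calculation: compression saves some bytes of transfer at the cost of extra CPU time, so it only wins when the network is slower than (bytes saved / extra seconds spent). This sketch assumes that is how the table's figures were derived:

```python
def compression_breakpoint(tar_size_mb, tar_time_ms, tgz_size_mb, tgz_time_ms):
    """Return the transfer rate (KB/sec) at which compression breaks even.

    Compression saves (tar_size - tgz_size) of transfer but costs
    (tgz_time - tar_time) extra seconds. If the network can move the
    saved bytes faster than this rate, compression is a net loss.
    """
    saved_kb = (tar_size_mb - tgz_size_mb) * 1024
    extra_sec = (tgz_time_ms - tar_time_ms) / 1000
    return saved_kb / extra_sec

# Phabricator (Bare) row: 139MB/230ms tar vs 136MB/4,700ms tgz.
print(round(compression_breakpoint(139, 230, 136, 4700)))  # 687, near the table's ~670KB/sec
```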

This is also mild evidence that git repack -a -d (see T13111) does more than purely ward off evil curses.

epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 8:08 PM
epriestley added a revision: Restricted Differential Revision. Sep 1 2019, 8:12 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 1 2019, 8:13 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a revision: Restricted Differential Revision. Sep 2 2019, 12:57 PM
epriestley added a commit: Restricted Diffusion Commit. Sep 2 2019, 1:05 PM

The migration in PHI1403 seems to have gone through cleanly. This workflow can continue to improve, but it's in relatively good shape now.