
Move instances from repo012 to repo025
Closed, Resolved · Public

Authored by epriestley · Jun 9 2017, 4:14 PM

Description

See T12798. I think I'm ready to start breaking production. My overall plan is:

  • Bring up repo025 (I'm just going to do this normally -- in the main subnet, with a public IP -- see T12816).
  • Provision and deploy it, but don't open it for new instance allocation.
  • Close db012 and repo012 to instance allocation so new instances cannot be allocated there.
  • Merge bin/instances move and the various improvements to bin/host restore, etc., to stable.
  • Update code on repo012, repo025, and admin to pick up these changes.
  • Use the new staff tools to forcibly allocate a new instance on db012 / repo012.
  • Add a test repository and push some code to it.
  • Use the new tools to move the instance from repo012 to repo025 (roughly the command sketched just after this list).
  • Push more code, then make sure everything still works, the data moved properly, and new writes go to repo025, not repo012.
  • Move all suspended and disabled instances from repo012 to repo025.
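
For reference, each move is a single command run as the "admin" instance, the same invocation that appears verbatim in the transcript further down; roughly (the instance name is a placeholder here):

# Run on admin as the admin instance; --to names the destination repo service.
$ PHABRICATOR_INSTANCE=admin /core/lib/instances/bin/instances move --instance <instance-name> --to repox025.phacility.net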

Then, tomorrow during the normal deploy window:

  • Deploy normally.
  • Move all the remaining live instances.
  • Force-allocate a new instance on db012 / repo012 and put some repository data on it, so we can try to recover the shard on June 19th, after AWS helps us run an operational drill by creating an "abrupt" shard failure on the host.

Event Timeline

  • I launched repo025.phacility.net, like other hosts in the tier.
  • I added repo025.phacility.net to DNS in Route 53 (172.30.0.115).
  • I added a device record (repo025.phacility.net) to Almanac on admin.phacility.com.
  • I added a service record (repox025.phacility.net) to Almanac on admin.phacility.com, and bound it to the device.
  • I added a device record for the backup volume, rback025.phacility.net.

  • I used bin/provision launch --device repo025 to mount and attach EBS volumes.
  • I used bin/remote deploy repo025 to deploy the host. This is currently installing all the package junk and then will probably spend a while formatting the swap partition.
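
In command form, the parts of this that aren't clicking around in Route 53 and Almanac are just:

# Attach and mount the EBS volumes for the new device.
$ bin/provision launch --device repo025

# Deploy the host: install packages, format swap, sync code.
$ bin/remote deploy repo025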

While that's running, I closed dbx012 and repox012 to new allocations. The shard status panel correctly reflects repox025 with no paired DB service and the 012 shard as closed:

Screen Shot 2017-06-09 at 9.25.33 AM.png (541×486 px, 97 KB)

  • Swap stuff finished up.
  • I merged master to stable for services/ to pick up bin/services sync --src ... --dst ....
  • I merged master to stable for instances/ to pick up bin/instances move ... and related changes.
  • core/ just runs master, so there was nothing to merge.
  • I upgraded repo025, then repo012, then admin001. None of these seemed to hit any issues.
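
Concretely, the merges are just the usual promote-to-stable dance (paths are illustrative):

# Promote master to stable so hosts pick up the new tools on the next deploy.
$ cd services/ && git checkout stable && git merge master && git push
$ cd ../instances/ && git checkout stable && git merge master && git push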

I'm going to use the new "force shard placement" tools to launch a couple of test instances next.

I launched yellow-underneathie.phacility.com with forced services dbx012 and repox025. My goal here is to check for any code I missed which makes the assumption that dbX = repoX. I believe we don't have any of this code, but if we do, this instance should point out whatever problems we have that still need to be cleaned up.

If this works, it suggests that only the allocation algorithm will need cleanup when we want to separate the db and repo tiers so we can resize them independently.

I read and wrote the repository and verified that the data really ended up on db012 and repo025.

Screen Shot 2017-06-09 at 9.49.54 AM.png (1×1 px, 150 KB)

Next, I'm going to launch a similar instance but force placement onto db012 + repo012, write a repository, then move it to repo025.

bin/host restore failed with this error:

$ PHABRICATOR_INSTANCE=admin /core/lib/instances/bin/instances move --instance red-underneathie --to repox025.phacility.net
[2017-06-09 16:58:36] EXCEPTION: (CommandException) Command failed with error #255!
COMMAND
'/core/bin/remote' --internal -u instances -i /core/conf/keystore/instances.key restore repo025.phacility.net --instance 'red-underneathie' --download 'PHID-FILE-qj3g6sya4bptjldhgjhl' --kind 'repository' 

STDOUT
(empty)

STDERR
Usage Exception: Unknown instances: red-underneathie!

This is because red-underneathie is not an instance on repo025, and --instance can only select instances on the current host.

I can come up with two ways we could fix this:

  • We could swap the services in the middle, instead of at the end, so the instance was on repo025 by the time restore ran. But I don't like this much because it makes recovering from errors harder.
  • We could add a flag like --do-not-restrict-the-instance-query-to-just-instances-on-this-host. That name is silly, but the flag itself seems generally reasonable, and there are other cases where it might be useful (for example, bin/host query doesn't really need to be executed from a host the instance is bound to, and this flag would let bin/host instances list every instance if you wanted that for some reason).

I think the flag is fairly reasonable so I'm going to look at providing that.
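
Roughly, the shape of that fix at the call site; the flag sketched here is the --global flag that shows up later in this log, and <file-phid> is a placeholder:

# Today: --instance only matches instances bound to the executing host, so
# this fails on repo025 before the instance has actually moved there.
$ bin/host restore --instance red-underneathie --download <file-phid> --kind repository

# With the new flag: consult the global instance pool instead of only this
# host's own bindings.
$ bin/host restore --instance red-underneathie --download <file-phid> --kind repository --global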

(I missed this locally because all "local" instances always belong to all "local" devices so this situation isn't possible in development.)

epriestley added 3 commits: Restricted Diffusion Commits. Jun 9 2017, 5:11 PM
epriestley added 3 commits: Restricted Diffusion Commits. Jun 9 2017, 5:19 PM

I hit two more minor issues:

  • bin/host purge-cache --instance X did not work on admin because instance X is not bound to admin. I added --global.
  • --global did not work when executed on admin directly, because admin is a special tier with a hard-coded binding to the admin instance. I changed --global to query the global instance pool instead of relying on the host's bindings.

As a sort of aside, --global --instance X is really slow right now (about 10s), and will only get slower in the future: it loads every instance page by page over Conduit, then does the remaining filtering by name locally. But it gets the right result, so this is reasonable enough for the moment. I may try to tune this before moving a lot of instances.
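
The slow path is, roughly: fetch every page of instances over Conduit, then do the name filtering locally at the very end. As a sketch only (the instances.search method name and response fields here are assumptions modeled on the standard *.search pagination contract, not a confirmed API):

# One Conduit round trip per page; the name filter only runs after all pages.
after=null
while true; do
  page=$(echo "{\"after\": $after}" | arc call-conduit -- instances.search)
  echo "$page" | jq -r '.response.data[].fields.name'
  after=$(echo "$page" | jq '.response.cursor.after')
  [ "$after" = "null" ] && break
done | grep -x 'X'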

With those changes, bin/instances move completed successfully.

I pulled and pushed the test repository and verified the new data went to repo025, not repo012. Almanac also synchronized properly. So it looks like this more or less works.

I'm going to go run some errands and probably grab lunch, then settle in and start moving disabled/suspended instances.

epriestley added a commit: Restricted Diffusion Commit. Jun 9 2017, 7:25 PM

I'm moving all the disabled/suspended instances now.

A verrrry minor thing is that because we always transfer a repository backup (even if it's empty), a migration always generates a repository directory on the far end. We could maaaybe have restore remove the directory if it's empty after the restore finishes to tidy this up just a touch.
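
A minimal sketch of that cleanup, assuming restore knows the directory it just wrote (the path here is illustrative):

# After the restore finishes, drop the directory if it ended up empty.
# rmdir refuses to remove non-empty directories, so real data is never touched.
$ rmdir /var/repo/<repository> 2>/dev/null || true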

The disabled/suspended instances finished up with no apparent issues, so I expect to move the live instances tomorrow.

Deploying, then picking this up again.

Instances are moving now.

๐ŸŒด ๐Ÿซ ๐Ÿซ ๐Ÿซ ๐ŸŒด ๐Ÿซ ๐Ÿซ ๐Ÿซ

The move completed without errors.

The total amount of data on repo025 is appreciably smaller than the total amount of data on repo012, but this appears to be because of the git clone --mirror we use to create repository copies during backup/migration: it only clones reachable objects, so it's effectively like running git gc in each repository first. This completely accounts for the size difference as far as I can tell: I spot-checked several repositories and they have the same number of reachable commits and refs, and the repo012 version of the repository is reduced to the repo025 size by running git gc.

(Currently, we never run git gc automatically, so it is not surprising that repositories would collect some garbage.)
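
The spot check amounts to something like this, run against the two copies of the same repository (paths are illustrative):

# Reachable history should match between the two copies...
$ git -C /srv/repo012-copy/repo.git rev-list --all --count
$ git -C /srv/repo025-copy/repo.git rev-list --all --count
$ git -C /srv/repo012-copy/repo.git for-each-ref | wc -l
$ git -C /srv/repo025-copy/repo.git for-each-ref | wc -l

# ...and on the repo012 copy, unreachable objects account for the size gap.
$ du -sh /srv/repo012-copy/repo.git
$ git -C /srv/repo012-copy/repo.git gc
$ du -sh /srv/repo012-copy/repo.git   # now roughly matches the repo025 copy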

I've forced blue-underneathie.phacility.com to allocate on the dead shard and am setting up some repository data for it now.

No one has complained that all their data is gone so I'm going to assume the best here.