
Decommission cluster host `repo012`
Closed, Resolved · Public

Description

We skated by for a few years, but the AWS instance reaper has finally come for repo012. The email says "it may already be unreachable" but it appears to be working normally, so we likely have until the June 19 deadline before we're forced to deal with this.

Countdown: Jun 19 2017, 3:00 PM


At the extremes I think there are two ways we can approach this, and we'll probably end up somewhere in the middle:

Very Near Plan: During the next deploy, stop the host. Then start it, deploy it (since the root volume will be wiped), and update DNS and Almanac. Continue on as though nothing happened. (A rough sketch of this sequence follows the pros/cons below.)

  • Pros: Smallest operational investment. Will probably work fine. We'd basically want/have to do this if the hardware actually died and it wouldn't necessarily be bad to know it works.
  • Cons: Relatively long downtime for installs (an hour?), although only repositories (not web) will be impacted. High risk: all-or-nothing. We can't preview it or go back if it doesn't work. Doesn't build us toward anything else.
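
For concreteness, a rough sketch of that stop/start/redeploy sequence with the stock AWS CLI plus our existing deploy tooling (the instance ID is a placeholder, and the bin/remote argument form here is assumed, not verified):

# Placeholder instance ID; not a real value.
# Stop the host, then start it again (per the plan above, the root
# volume comes back empty, so the host has to be fully redeployed).
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0

# Redeploy the host from scratch (argument form assumed).
./bin/remote deploy repo012.phacility.net

# Then update DNS and the Almanac device record if the private
# address changed.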

Very Long Plan: Between now and when the host explodes, migrate instances to a new shard one at a time so that it's empty and unused when it dies.

  • Pros: Very safe. Lowest realistic impact on installs. Builds toward migration tools (T11670, then private clusters beyond that).
  • Cons: Large operational investment. Doesn't necessarily help much if we actually lose hardware.

Other things we may want to do here:

Consider New Instance Types: (https://meta.phacility.com/T32) If we migrate, we could migrate to a new m4.large host or some other chassis.

I am inclined not to pursue this here. If we migrate by moving and it works, we'd have the tools to swap instances later. Doing a chassis swap here feels like it's increasing the risk for little benefit (small downtime reduction for instances on this shard only).

Move Away from dbX = repoX: (https://meta.phacility.com/T32) Currently, all instances on db004 are also on repo004. This is slightly nice from a human operations point of view but probably an assumption we should move away from. It also has a very small isolation effect -- if something goes weird on repo004 it will generally only bring db004 down with it -- but these kinds of issues are virtually unheard of (it's much more likely that an instance will do something crazy with load than an entire host).

I believe we encode this assumption in only one place (when picking a shard pair to allocate a new instance onto) but it might be nice to give the CLI tooling more support (e.g., connect to host by instance name instead of by host name).

The changes in T12217 (free instances must be active; daemons use less RAM) have given us significant headroom on the repo tier and we could easily shrink it.

I'm inclined not to pursue this here, exactly. As above, if we migrate by moving and it works, we'll have all the tools we need to do this later. This feels like we're adding risk for a tiny tiny benefit (half as much downtime for these instances if we do consolidate later, but most will be down for only minutes anyway).

Move Away From repoNNN Naming: (T12605#221039) If we launch a new shard, we could call it repo-asdflinwaln (e.g., AWS instance ID or some other random hash), or just repo-1 (and autoincrement across all devices we launch in the future).

repoNNN is sort of nice for humans, but discontinuities are inevitable (and we already have some) and the tools should not expect the range 120-129 to contain exactly 10 hosts.

I believe we encode this assumption only in Phage (the --hosts flag) and when allocating instances (db004 = repo004), and fixing it is mostly an issue of giving Phage a flag like --pool repo or something instead.

I am inclined to spend at least some time fixing Phage here, although I'm not 100% sure where it should draw the pool from. I think we reduce risk if we don't need to launch the new host as repo012, and can launch it as, at least, repo025, even if we don't launch it as repo-1 or repo-abcdef.

A corollary of dbX = repoX and repoNNN is that if we launch repo025, move all the instances there, then decommission repo012, no new instances will allocate on the db012/repo025 pair because the UI won't see them as paired. This is basically fine (we have 23 other shards with plenty of room) and would be easy to fix in a second phase that moved toward severing dbX = repoX once the primary migration finishes.

Outbound NAT: (T11336) Currently, each host has a public IPv4 address. We want to swap these to outbound NAT at some point, partly as a general security/sanitation issue and partly to provide a stable address range for outbound requests.

To some degree doing this here concentrates risk, but the risk is pretty small (we can bring the host up first, make sure outbound works, then move instances to it), and it has a larger material impact on those instances than just a few minutes of downtime (they get a stable IP forever after the change, rather than eternal shifting sands).

I'm inclined to try to pursue this now, unless we get into it and discover that it's much more complicated than I think.

As a half-solution, we could give the new host one of the reserved EIPs at least, so that it was stable across the eventual transition to outbound NAT. But we can probably only really do this trick once, and it makes that transition riskier.
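
The mechanical side of this is fairly small if we use a managed NAT gateway; a minimal sketch looks something like the following (all IDs are placeholders, and this isn't necessarily how T11336 will actually do it):

# Placeholder subnet/allocation/route-table IDs.
# Allocate (or reuse) an EIP and create a NAT gateway in a public subnet.
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway \
  --subnet-id subnet-PUBLIC_PLACEHOLDER \
  --allocation-id eipalloc-PLACEHOLDER

# Point the private subnet's default route at the NAT gateway, so hosts
# without public IPv4 addresses still get a stable outbound address.
aws ec2 create-route \
  --route-table-id rtb-PRIVATE_PLACEHOLDER \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-PLACEHOLDER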

Moving Instances Across Shards: (T11670) I did this once before very manually, but since we're moving an entire shard worth of instances this process should be automated. This is a tool we should have in general, for: load shedding/rebalancing, rotating chassis, consolidating tiers, doing this same thing the next time, and so on.

I am inclined to pursue this now. I'd like to migrate all the instances off this host before it explodes. If we don't make it, we can do the "stop + start" plan instead. Possibly, we should leave a couple of test instances there and see if they survive stop + start so we have a better sense of what issues arise if AWS sends us a real "your stuff is all broken forever" mail.
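
Purely to illustrate the moving parts (the paths below are hypothetical and this is not how the eventual bin/host tooling is implemented), a single-instance move conceptually amounts to something like:

# Conceptual sketch only; paths are hypothetical.

# 1. Warm copy while the instance is still live (run from repo025,
#    pulling from repo012):
rsync -az repo012.phacility.net:/path/to/instance-data/ /path/to/instance-data/

# 2. Stop writes (suspend the instance's daemons on repo012), then run a
#    final incremental rsync to pick up anything written in the meantime.
# 3. Repoint the instance's Almanac service binding from repo012 to
#    repo025, restart daemons there, and verify pushes/pulls work.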

Notify Instances About Maintenance: Currently, we just use @phacilitystatus on Twitter. Pretty much all instances should have only a tiny amount of downtime (minutes or less) that they probably won't even notice, but this might be a good opportunity to look at the per-instance status notifications we offer and consider refining them.

This is tricky because pretty much nothing that would be useful to us overlaps with anything that's useful to Phabricator as a whole, but whatever we write needs to at least sort of be integrated with Phabricator (e.g., "this instance is scheduled for a migration in <22 hours>, click here to ignore this"). The last-second stuff ("This instance is down right now!") does not need to be integrated, but the last-second stuff is pretty obvious anyway (since the instance is down). Also not clear if we should really be notifying all users (inciting mass panic?) or just notifying administrators and maybe giving them tools for notifying users, or what.

I don't think we really need to do anything here (I suspect no one will notice if we just do this during normal maintenance) but I'll at least review the open requests we have about this stuff and see if there are any small steps we could take.

Placement Instantiation: Somewhat tangential, but it would be helpful if operations staff could launch instances in an "advanced mode" which let them force allocation onto particular shards. This would let us:

  • Force instances onto the 012 shard pair to test migration.
  • Force instances onto the 012 shard pair so we could see if they survive the end of the world.
  • Force instances onto invalid (e.g., db022 + repo017) pairs to test that we don't have problems with that.

This should be a small change which makes several real things easier. I've forced these allocations in the past, but in a funky ad-hoc way that's enough of a pain that even using the tool once will probably pay for itself; I just hadn't thought of building it before.

Renaming Instances: (T11413) This is not directly related, but if the instance UI gets touched during this it would be good to start separating getDisplayName() from getInternalInstanceIdentity() or whatever to make a "Rename Instance" operation (and private clusters) easier in the long run. Work can begin here trivially without doing any heavy lifting.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision

Event Timeline

I plan to take these steps specifically:

  • (With @amckinley, T11336) Get a better handle on what we need to do for NAT, and try to get that up before we launch the replacement host.
  • (T11413) Adjust the InstancesInstance API to prepare for renameable instances if we aren't already in good shape.
  • Write "Launch Instance (Advanced)" so operations staff can select shards when launching an instance.
  • Come up with a plan for where phage --pool repo is going to get the host list from.
    • Probably implement that if it's easy, maybe skip it if it's tough.
    • Maybe also implement ^C to show status if I'm in there, since that's my highest-priority feature request for Phage.
  • (T11670) Plan and implement bin/host migrate --instance X --from repo012 --to repo025.
    • Migrate everything off the host. We can do self-owned test instances and disabled/suspended instances anytime. We can do free/test instances without too much planning. Live instances should probably happen off-peak, though.
  • Force a new test instance onto the host after emptying it, then try to survive the explosion.

I'd like to try to hit roughly this timeline:

  • NAT / launch the new shard / phage / "Advanced Launch" early this week.
  • bin/host migrate mid-week.
  • Move all the unused instances late in the week.
  • Move the live instances during the normal deployment window on Saturday.
  • On the 19th, watch the host explode and then recover it.

That gives us an extra week if there are issues or things slip.

If we get all that done, we can look at doing more work on the dbX=repoX stuff or shrinking the tier or swapping the chassis or improving notifications or whatever else.

Another adjacent piece of work here is automating provisioning (e.g., through autoscale groups or bin/provision) but I think we should mostly leave that for the future too since we're theoretically bringing up only one new host (or maybe two, natXXX + repoXXX). Many of the other changes here give us tools toward that anyway, and it would probably be better to pursue that as part of a future change that involves launching a larger number of new hosts (e.g., chassis swap on the whole tier).

epriestley added a revision: Restricted Differential Revision.Jun 5 2017, 9:05 PM
epriestley added a revision: Restricted Differential Revision.Jun 5 2017, 11:15 PM

Come up with a plan for where phage --pool repo is going to get the host list from.

I think there are three real candidates here: secure, admin, or directly querying EC2.

Some things I don't like about secure:

  • If admin can ever deploy hosts on its own (e.g., a "Provision New Hosts" button for private clusters, or, more realistically, autoscaling build hosts with Drydock) it would need to write to secure. We pull from secure but currently never write to it, and it feels good that this dependency flows in only one direction. Currently, too, secure only needs to be up when someone is manually running deployment commands: it's fine for it to be down/broken if we aren't actively doing ops stuff. This seems really good.
  • admin already has a list of hosts, so we'd need to duplicate or make admin sync from secure. The former is real bad and the latter is real complicated.

Some things I don't like about admin:

  • We don't have credentials for it in phage by default.
  • We could connect to the bastion and use credentials there, but then we have to hard-code the bastion (though this probably isn't a huge issue). This is also a little slow, although if you're hitting a whole pool that's likely fine.
  • It creates a cluster > cluster dependency, where you can't get a list of hosts for, say, the repo pool if admin is down (and maybe admin is down because we need to deploy the repo pool). This is bad, but rare/hypothetical and could be mitigated by caching lists and falling back. We can always just go export the list from EC2 and --hosts x,y,z in the worst case which only takes a little extra time even if the sky is falling.

Some things I don't like about EC2:

  • Ties us to AWS vs other stuff. We're probably not going to mix-and-match but it's conceptually nice that we could.
  • We don't have credentials for it in phage by default. We can go to the bastion again.
  • We have much less flexibility to control how hosts are listed, annotated, etc. I'm not totally sure we want/need to do this but it seems likely that we will.

All these choices seem like they have drawbacks. At least for now, I'm inclined to use admin since it seems maybe the least-bad, and we must have an authoritative list of hosts there no matter what for the cluster itself. We could change how this works later if we want to swap this around, but the current approach of putting device records on admin doesn't seem to be causing any problems so I feel like we're okay to push it a little further.
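
As a sketch of what pulling the list from admin could look like from the bastion (assuming an Almanac device search Conduit method is available; the method name, URI, and token below are assumptions, not the shipped implementation):

# Hypothetical sketch; the Conduit method, URI, and token are assumptions.
# List Almanac devices from admin, then filter down to the repo pool
# (however we end up modeling "pool" on the device records).
echo '{}' | arc call-conduit \
  --conduit-uri https://admin.phacility.com/ \
  --conduit-token api-PLACEHOLDER \
  almanac.device.search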

As far as pulling this state from EC2, it's really easy to fetch the list of hosts that have a given tag: http://docs.aws.amazon.com/cli/latest/reference/ec2/describe-instances.html

Since I feel like tags are the basic semantics we're looking for (nodes belong to 0 or more pools, aka tags), I don't think we'd be overly committing to AWS to assume that some kind of tag-like infrastructure is always going to be useful/available for filtering hosts (even if we eventually build our own infrastructure once we're running 100k EC2 instances). Alternately, if the semantics should be "nodes belong to exactly one pool", we can just have a tag called pool_id with some value like web, repo, db, etc.
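
For example, with a hypothetical pool_id tag the whole lookup is a one-liner with the stock AWS CLI:

# "pool_id" is a hypothetical tag name; the filter/query syntax is
# standard AWS CLI.
aws ec2 describe-instances \
  --filters "Name=tag:pool_id,Values=repo" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PrivateIpAddress" \
  --output text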

Also it's nice that you can easily set up IAM users to get API tokens that can only invoke describe-instances inside a given AZ, making it less of an issue to worry about securing the credentials. Also we could just use IAM roles (which are awesome) and give the bastion host itself the ability to call that API without having to hardcode any creds on it.

No comment on admin vs secure.

epriestley added a commit: Restricted Diffusion Commit.Jun 6 2017, 12:48 AM
epriestley added a commit: Restricted Diffusion Commit.Jun 6 2017, 1:02 AM

phage remote status --pools db,repo and such work now, using Conduit credentials from the bastion to query admin. We can swap that stuff later if we want to switch to EC2.

I inched us forward on getting EC2 and Almanac to agree with one another, but need T12414 to make much more headway:

$ ./bin/provision sync
Looking up EC2 hosts...
WARNING: Host "i-08d530220c7ba9252" has a name ("test-nat") which does not match the pattern "*.phacility.net". This host will be ignored.
ID                  Name                     IPv4
i-429b998a          admin001.phacility.net   172.30.0.177
i-fcf6823c          aux001.phacility.net     172.30.0.220
i-f172d33a          bastion005.phacility.net 172.30.0.252
i-3a9b99f2          db001.phacility.net      172.30.0.71
...
i-f46acaab          web004.phacility.net     172.30.0.248
WARNING: Host "aux001.phacility.net" has no Almanac device record.
WARNING: Host "bastion005.phacility.net" has no Almanac device record.
...
WARNING: Host "web004.phacility.net" has no Almanac device record.

Er, that was rCORE085cd7e4 but I typed D instead of T.

epriestley added a commit: Restricted Diffusion Commit.Jun 6 2017, 11:31 AM
epriestley added a commit: Restricted Diffusion Commit.

(Slightly related: I killed the test-nat instance this morning since it doesn't help actually test the NAT).

Just to clarify, it's ok that I'm not following, nor understanding, any of this, correct?

There will be a final written exam on June 19th.

(None of this impacts anything outside the ops realm in any way, except that some of the admin/ops-flavored UIs are changing a little.)

epriestley lowered the priority of this task from High to Low.Jun 12 2017, 6:34 PM

I moved everything except one test instance off in T12817, and we appear to be ready for this host to die.

When it does, I'll revive it and see if we hit any surprises (simulating a more urgent situation where we lose the host suddenly). If we don't, I'll decommission it permanently.

Stuff that works now:

  • Moving across shards: Works now.
  • Placement instantiation: Works now.

Stuff that we made some progress on:

  • We now have a more specific path to break out of dbX = repoX, I'll file a followup with particulars. Concretely, db012 + repo025 is in production today.
  • Still using repoNNN naming for now, but --pools works and I deployed with it, so there's no longer too much of a technical barrier here. We'd probably need to fiddle with a few more things to actually make this work.
  • Outbound NAT: We now have a specific action plan in T12816 which seems likely to work.
  • Renaming: we inched forward on separating "internal" and "display" names although this is still more than a stone's throw away.

No changes:

  • No chassis changes at this time (although the other changes here support chassis changes in the future).
  • Downtime notifications: instances were only down briefly, only some services were affected, and this all happened at 3AM on a Saturday so nothing got built here.

AWS stopped the instance; I'm starting it again now.

I started the instance, then used bin/remote deploy to deploy it. Everything came back up cleanly with no additional steps. Things were back online after about 10 minutes (maybe a bit less).

During deploy, we initialize swap, which takes a big chunk of time. We could possibly move that to later in the sequence -- we don't strictly need swap to bring hosts back online. This could give us a ~5 minute recovery process instead of a ~10 minute recovery process. I'm not going to touch this for now since we've never had a real incident, but we could consider it if we start losing tons of hosts for some reason.
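
For context, the slow part is just the normal swapfile dance, and nothing in it strictly blocks bringing services up first (the size and path here are placeholders, not what deploy actually uses):

# Placeholder size/path; writing out the file is the slow step, and it
# could in principle run after services are already online.
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile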

I'm going to stop the instance for now, then clean it up completely in a few weeks if nothing crops up.

epriestley claimed this task.
  • I filed T12854 for final cleanup.
  • I stopped repo012.phacility.net.
  • The repo tier now has a discontinuity. --timeout will "fix" it but this should be fixed properly, except that we're presumably throwing away Phage in favor of Chef/Puppet/Ansible in T12847 which will moot this.