
Decommission cluster host `repo012`
Closed, Resolved · Public


We skated by for a few years, but the AWS instance reaper has finally come for repo012. The email says "it may already be unreachable", but it appears to be working normally, so we likely have the full grace period before we're forced to deal with this.

Jun 19 2017, 3:00 PM

At the extremes I think there are two ways we can approach this, and we'll probably end up somewhere in the middle:

Very Near Plan: During the next deploy, stop the host. Then start it, deploy it (since the root volume will be wiped) and update DNS and Almanac. Continue on as though nothing happened.

  • Pros: Smallest operational investment. Will probably work fine. We'd basically want/have to do this if the hardware actually died and it wouldn't necessarily be bad to know it works.
  • Cons: Relatively long downtime for installs (an hour?), although only repositories (not web) will be impacted. High risk: all-or-nothing. We can't preview it or go back if it doesn't work. Doesn't build us toward anything else.

Very Long Plan: Between now and when the host explodes, migrate instances to a new shard one at a time so that it's empty and unused when it dies.

  • Pros: Very safe. Lowest realistic impact on installs. Builds toward migration tools (T11670, then private clusters beyond that).
  • Cons: Large operational investment. Doesn't necessarily help much if we actually lose hardware.

Other things we may want to do here:

Consider New Instance Types: If we migrate, we could migrate to a new m4.large host or some other chassis.

I am inclined not to pursue this here. If we migrate by moving and it works, we'd have the tools to swap instances later. Doing a chassis swap here feels like it's increasing the risk for little benefit (small downtime reduction for instances on this shard only).

Move Away from dbX = repoX: Currently, all instances on db004 are also on repo004. This is slightly nice from a human operations point of view but probably an assumption we should move away from. It also has a very small isolation effect -- if something goes weird on repo004 it will generally only bring db004 down with it -- but these kinds of issues are virtually unheard of (it's much more likely that an instance will do something crazy with load than an entire host).

I believe we encode this assumption in only one place (when picking a shard pair to allocate a new instance onto) but it might be nice to give the CLI tooling more support (e.g., connect to host by instance name instead of by host name).
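As a hedged sketch of that one place (all names and data here are hypothetical, not the real allocator), pair-based allocation under dbX = repoX might look like:

```python
# Hypothetical sketch of the dbX = repoX pairing assumption in the
# allocator: a new instance lands on the shard number that has BOTH a
# db host and a repo host. Names and data are illustrative only.

def pick_shard_pair(shards, load):
    """Pick the least-loaded shard number with both a db and a repo host."""
    paired = [n for n in shards["db"] if n in shards["repo"]]
    return min(paired, key=lambda n: load.get(n, 0))

shards = {
    "db": {4, 12, 22},
    "repo": {4, 17, 25},  # e.g., repo012 retired, repo025 launched
}
load = {4: 10, 12: 3, 22: 5, 17: 2, 25: 0}

# Only shard 4 is "paired" under this assumption, so any capacity on
# mismatched pairs (db012/repo025, db022/repo017) is invisible to
# allocation -- exactly the corollary described below.
print(pick_shard_pair(shards, load))  # -> 4
```

This also shows why the assumption is mostly harmless to leave in place for now: mismatched pairs simply never get picked, which is fine while other shards have room.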

The changes in T12217 (free instances must be active; daemons use less RAM) have given us significant headroom on the repo tier and we could easily shrink it.

I'm inclined not to pursue this here, exactly. As above, if we migrate by moving and it works, we'll have all the tools we need to do this later. This feels like we're adding risk for a tiny tiny benefit (half as much downtime for these instances if we do consolidate later, but most will be down for only minutes anyway).

Move Away From repoNNN Naming: (T12605#221039) If we launch a new shard, we could call it repo-asdflinwaln (e.g., AWS instance ID or some other random hash), or just repo-1 (and autoincrement across all devices we launch in the future).

repoNNN is sort of nice for humans, but discontinuities are inevitable (and we already have some) and the tools should not expect the range 120-129 to contain exactly 10 hosts.

I believe we encode this assumption only in Phage (the --hosts flag) and when allocating instances (db004 = repo004), and fixing it is mostly an issue of giving Phage a flag like --pool repo or something instead.

I am inclined to spend at least some time fixing Phage here, although I'm not 100% sure where it should draw the pool from. I think we reduce risk if we don't need to launch the new host as repo012, and can launch it as, at least, repo025, even if we don't launch it as repo-1 or repo-abcdef.
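To make the pool idea concrete, here is a hedged sketch of expanding --pool repo from a device catalog instead of a contiguous repoNNN range (DEVICES and expand_pools are illustrative stand-ins, not real Phage internals):

```python
# Sketch: resolve a pool name to concrete hosts via membership, so
# numbering gaps (repo004, repo017, repo025) don't matter. The catalog
# here is a stand-in for whatever Almanac/admin ultimately provides.

DEVICES = [
    {"name": "repo004", "pools": ["repo"]},
    {"name": "repo017", "pools": ["repo"]},
    {"name": "repo025", "pools": ["repo"]},  # note the discontinuity
    {"name": "db004",   "pools": ["db"]},
]

def expand_pools(pool_names):
    """Return hosts in any of the named pools; gaps and order don't matter."""
    wanted = set(pool_names)
    return [d["name"] for d in DEVICES if wanted & set(d["pools"])]

print(expand_pools(["repo"]))  # ['repo004', 'repo017', 'repo025']
```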

A corollary of dbX = repoX and repoNNN is that if we launch repo025, move all the instances there, then decommission repo012, no new instances will allocate on the db012/repo025 pair because the UI won't see them as paired. This is basically fine (we have 23 other shards with plenty of room) and would be easy to fix in a second phase that moved toward severing dbX = repoX once the primary migration finishes.

Outbound NAT: (T11336) Currently, each host has a public IPv4 address. We want to swap these to outbound NAT at some point, partly as a general security/sanitation issue and partly to provide a stable address range for outbound requests.

To some degree doing this here concentrates risk, but the risk is pretty small (we can bring the host up first, make sure outbound works, then move instances to it), and it has a larger material impact on those instances than just a few minutes of downtime (they get a stable IP forever after the change, rather than eternal shifting sands).

I'm inclined to try to pursue this now, unless we get into it and discover that it's much more complicated than I think.

As a half-solution, we could give the new host one of the reserved EIPs at least, so that it was stable across the eventual transition to outbound NAT. But we can probably only really do this trick once, and it makes that transition riskier.

Moving Instances Across Shards: (T11670) I did this once before very manually, but since we're moving an entire shard's worth of instances, this process should be automated. This is a tool we should have in general, for: load shedding/rebalancing, rotating chassis, consolidating tiers, doing this same thing the next time, and so on.

I am inclined to pursue this now. I'd like to migrate all the instances off this host before it explodes. If we don't make it, we can do the "stop + start" plan instead. Possibly, we should leave a couple of test instances there and see if they survive stop + start so we have a better sense of what issues arise if AWS sends us a real "your stuff is all broken forever" mail.

Notify Instances About Maintenance: Currently, we just use @phacilitystatus on Twitter. Pretty much all instances should have only a tiny amount of downtime (minutes or less) that they probably won't even notice, but this might be a good opportunity to look at the per-instance status notifications we offer and consider refining them.

This is tricky because pretty much nothing that would be useful to us overlaps with anything that's useful to Phabricator as a whole, but whatever we write needs to at least sort of be integrated with Phabricator (e.g., "this instance is scheduled for a migration in <22 hours>, click here to ignore this"). The last-second stuff ("This instance is down right now!") does not need to be integrated, but the last-second stuff is pretty obvious anyway (since the instance is down). Also not clear if we should really be notifying all users (inciting mass panic?) or just notifying administrators and maybe giving them tools for notifying users, or what.

I don't think we really need to do anything here (I suspect no one will notice if we just do this during normal maintenance) but I'll at least review the open requests we have about this stuff and see if there are any small steps we could take.

Placement Instantiation: Somewhat tangential, but it would be helpful if operations staff could launch instances in an "advanced mode" which let them force allocation onto particular shards. This would let us:

  • Force instances onto the 012 shard pair to test migration.
  • Force instances onto the 012 shard pair so we could see if they survive the end of the world.
  • Force instances onto invalid (e.g., db022 + repo017) pairs to test that we don't have problems with that.

This should be a small change which makes several real things easier. I've forced these allocations in the past, but in a funky ad-hoc way that's enough of a pain that even using the tool once will probably pay for itself; I just hadn't thought of it until now.
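The shape of the "advanced mode" might be roughly this (a sketch with hypothetical names; normal allocation stays automatic, with an operations-only override):

```python
# Sketch of "Launch Instance (Advanced)": by default the allocator
# picks a shard, but operations staff may force one, including a shard
# that normal policy would never choose. Names are hypothetical.

def allocate(shards_by_load, forced_shard=None):
    if forced_shard is not None:
        if forced_shard not in shards_by_load:
            raise ValueError("no such shard: %r" % forced_shard)
        return forced_shard
    # Default policy: least-loaded shard.
    return min(shards_by_load, key=shards_by_load.get)

shards = {"012": 3, "004": 10, "025": 0}
print(allocate(shards))                      # '025' (least loaded)
print(allocate(shards, forced_shard="012"))  # forced onto the doomed shard
```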

Renaming Instances: (T11413) This is not directly related, but if the instance UI gets touched during this it would be good to start separating getDisplayName() from getInternalInstanceIdentity() or whatever to make a "Rename Instance" operation (and private clusters) easier in the long run. Work can begin here trivially without doing any heavy lifting.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision

Event Timeline

I plan to take these steps specifically:

  • (With @amckinley, T11336) Get a better handle on what we need to do for NAT, and try to get that up before we launch the replacement host.
  • (T11413) Adjust the InstancesInstance API to prepare for renameable instances if we aren't already in good shape.
  • Write "Launch Instance (Advanced)" so operations staff can select shards when launching an instance.
  • Come up with a plan for where phage --pool repo is going to get the host list from.
    • Probably implement that if it's easy, maybe skip it if it's tough.
    • Maybe also implement ^C to show status if I'm in there, since that's my highest-priority feature request for Phage.
  • (T11670) Plan and implement bin/host migrate --instance X --from repo012 --to repo025.
    • Migrate everything off the host. We can do self-owned test instances and disabled/suspended instances anytime. We can do free/test instances without too much planning. Live instances should probably happen off-peak, though.
  • Force a new test instance onto the host after emptying it, then try to survive the explosion.
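For illustration only, here is a stubbed sketch of the general shape a migration tool like bin/host migrate might take; the step names are guesses about the process, not the real implementation:

```python
# Hedged sketch of a per-instance migration sequence: quiesce, copy,
# rebind, restart, verify. Every step here is a hypothetical stub.

def migrate(instance, src, dst, log):
    log.append(("stop-services", instance, src))   # quiesce writes on src
    log.append(("copy-data", instance, src, dst))  # e.g., rsync repositories
    log.append(("rebind", instance, dst))          # update Almanac binding
    log.append(("start-services", instance, dst))  # bring instance up on dst
    log.append(("verify", instance, dst))          # sanity-check before moving on

log = []
migrate("turtle", "repo012", "repo025", log)
for step in log:
    print(step)
```

Ordering matters: the rebind happens only after the data copy, so a failure partway through leaves the instance still served (if stopped) from the source host.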

I'd like to try to hit roughly this timeline:

  • NAT / launch the new shard / phage / "Advanced Launch" early this week.
  • bin/host migrate mid-week.
  • Move all the unused instances late in the week.
  • Move the live instances during the normal deployment window on Saturday.
  • On the 19th, watch the host explode and then recover it.

That gives us an extra week if there are issues or things slip.

If we get all that done, we can look at doing more work on the dbX=repoX stuff or shrinking the tier or swapping the chassis or improving notifications or whatever else.

Another adjacent piece of work here is automating provisioning (e.g., through autoscale groups or bin/provision) but I think we should mostly leave that for the future too since we're theoretically bringing up only one new host (or maybe two, natXXX + repoXXX). Many of the other changes here give us tools toward that anyway, and it would probably be better to pursue that as part of a future change that involves launching a larger number of new hosts (e.g., chassis swap on the whole tier).

epriestley added a revision: Restricted Differential Revision. Jun 5 2017, 9:05 PM
epriestley added a revision: Restricted Differential Revision. Jun 5 2017, 11:15 PM

Come up with a plan for where phage --pool repo is going to get the host list from.

I think there are three real candidates here: secure, admin, or directly querying EC2.

Some things I don't like about secure:

  • If admin can ever deploy hosts on its own (e.g., a "Provision New Hosts" button for private clusters, or, more realistically, autoscaling build hosts with Drydock) it would need to write to secure. We pull from secure but currently never write to it, and it feels good that this dependency flows in only one direction. Currently, too, secure only needs to be up when someone is manually running deployment commands: it's fine for it to be down/broken if we aren't actively doing ops stuff. This seems really good.
  • admin already has a list of hosts, so we'd need to duplicate or make admin sync from secure. The former is real bad and the latter is real complicated.

Some things I don't like about admin:

  • We don't have credentials for it in phage by default.
  • We could connect to the bastion and use credentials there, but then we have to hard-code the bastion (but this probably isn't a huge issue). This is also a little slow, although if you're hitting a whole pool that's likely fine.
  • It creates a cluster > cluster dependency, where you can't get a list of hosts for, say, the repo pool if admin is down (and maybe admin is down because we need to deploy the repo pool). This is bad, but rare/hypothetical and could be mitigated by caching lists and falling back. We can always just go export the list from EC2 and --hosts x,y,z in the worst case which only takes a little extra time even if the sky is falling.
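The cache-and-fall-back mitigation could look roughly like this (a sketch only; fetch_from_admin stands in for the real Conduit query to admin):

```python
# Sketch: fetch the host list from admin and cache it to disk; if admin
# is unreachable, fall back to the last list we saw. fetch_from_admin
# is a hypothetical stand-in for the real query.

import json
import os
import tempfile

CACHE = os.path.join(tempfile.gettempdir(), "phage-host-cache.json")

def get_hosts(fetch_from_admin):
    try:
        hosts = fetch_from_admin()
        with open(CACHE, "w") as f:
            json.dump(hosts, f)  # refresh the cache on every success
        return hosts
    except OSError:
        # admin is down; use the cached list (stale but usually fine).
        with open(CACHE) as f:
            return json.load(f)

print(get_hosts(lambda: ["repo004", "repo017", "repo025"]))

def admin_is_down():
    raise OSError("admin unreachable")

print(get_hosts(admin_is_down))  # falls back to the cached list
```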

Some things I don't like about EC2:

  • Ties us to AWS vs other stuff. We're probably not going to mix-and-match but it's conceptually nice that we could.
  • We don't have credentials for it in phage by default. We can go to the bastion again.
  • We have much less flexibility to control how hosts are listed, annotated, etc. I'm not totally sure we want/need to do this but it seems likely that we will.

All these choices seem like they have drawbacks. At least for now, I'm inclined to use admin since it seems maybe the least-bad, and we must have an authoritative list of hosts there no matter what for the cluster itself. We could change how this works later if we want to swap this around, but the current approach of putting device records on admin doesn't seem to be causing any problems so I feel like we're okay to push it a little further.

As far as pulling this state from EC2, it's really easy to fetch the list of hosts that have a given tag:
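For example (a hedged sketch: the canned response mirrors the describe-instances output shape, and the helper is illustrative; with boto3 the same filter can run server-side):

```python
# Sketch of filtering hosts by tag. With boto3, the server-side form is
# roughly: ec2.describe_instances(
#     Filters=[{"Name": "tag:pool", "Values": ["repo"]}])
# Below, the same filter applied client-side to a canned response.

RESPONSE = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-0aaa", "Tags": [{"Key": "pool", "Value": "repo"}]},
            {"InstanceId": "i-0bbb", "Tags": [{"Key": "pool", "Value": "db"}]},
        ]},
        {"Instances": [
            {"InstanceId": "i-0ccc", "Tags": [{"Key": "pool", "Value": "repo"}]},
        ]},
    ],
}

def instances_with_tag(response, key, value):
    """Collect instance IDs whose tags include key=value."""
    out = []
    for reservation in response["Reservations"]:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if tags.get(key) == value:
                out.append(inst["InstanceId"])
    return out

print(instances_with_tag(RESPONSE, "pool", "repo"))  # ['i-0aaa', 'i-0ccc']
```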

Since I feel like tags are the basic semantics we're looking for (nodes belong to zero or more pools, a.k.a. tags), I don't think we'd be overly committing to AWS by assuming that some kind of tag-like infrastructure is always going to be useful/available for filtering hosts (even if we eventually build our own infrastructure once we're running 100k EC2 instances). Alternately, if the semantics should be "nodes belong to exactly one pool", we can just have a tag called pool_id with some value like web, repo, db, etc.

It's also nice that you can easily set up IAM users with API tokens that can only invoke describe-instances inside a given AZ, making it less of an issue to worry about securing the credentials. Alternatively, we could just use IAM roles (which are awesome) and give the bastion host itself the ability to call that API without having to hardcode any creds on it.

No comment on admin vs secure.

epriestley added a commit: Restricted Diffusion Commit. Jun 6 2017, 12:48 AM
epriestley added a commit: Restricted Diffusion Commit. Jun 6 2017, 1:02 AM

phage remote status --pools db,repo and such work now, using Conduit credentials from the bastion to query admin. We can swap that stuff later if we want to switch to EC2.

I inched us forward on getting EC2 and Almanac to agree with one another, but need T12414 to make much more headway:

$ ./bin/provision sync
Looking up EC2 hosts...
WARNING: Host "i-08d530220c7ba9252" has a name ("test-nat") which does not match the pattern "*". This host will be ignored.
ID                  Name                     IPv4
WARNING: Host "" has no Almanac device record.
WARNING: Host "" has no Almanac device record.
WARNING: Host "" has no Almanac device record.

Er, that was rCORE085cd7e4 but I typed D instead of T.

epriestley added a commit: Restricted Diffusion Commit. Jun 6 2017, 11:31 AM
epriestley added a commit: Restricted Diffusion Commit.

(Slightly related: I killed the test-nat instance this morning since it doesn't help actually test the NAT).

Just to clarify: it's okay that I'm neither following nor understanding any of this, correct?

There will be a final written exam on June 19th.

(None of this impacts anything outside the ops realm in any way, except that some of the admin/ops-flavored UIs are changing a little.)

epriestley lowered the priority of this task from High to Low. Jun 12 2017, 6:34 PM

I moved everything except one test instance off in T12817, and we appear to be ready for this host to die.

When it does, I'll revive it and see if we hit any surprises (simulating a more urgent situation where we lose the host suddenly). If we don't, I'll decommission it permanently.

Stuff that works now:

  • Moving across shards: Works now.
  • Placement instantiation: Works now.

Stuff that we made some progress on:

  • We now have a more specific path to break out of dbX = repoX, I'll file a followup with particulars. Concretely, db012 + repo025 is in production today.
  • Still using repoNNN naming for now, but --pools works and I deployed with it, so there's no longer too much of a technical barrier here. We'd probably need to fiddle with a few more things to actually make this work.
  • Outbound NAT: We now have a specific action plan in T12816 which seems likely to work.
  • Renaming: we inched forward on separating "internal" and "display" names although this is still more than a stone's throw away.

No changes:

  • No chassis changes at this time (although the other changes here support chassis changes in the future).
  • Downtime notifications: instances were only down briefly, only some services were affected, and this all happened at 3AM on a Saturday so nothing got built here.

AWS stopped the instance; I'm starting it again now.

I started the instance, then used bin/remote deploy to deploy it. Everything came back up cleanly with no additional steps. Things were back online after about 10 minutes (maybe a bit less).

During deploy, we initialize swap, which takes a big chunk of time. We could possibly move that to later in the sequence -- we don't strictly need swap to bring hosts back online. This could give us a ~5 minute recovery process instead of a ~10 minute recovery process. I'm not going to touch this for now since we've never had a real incident, but we could consider it if we start losing tons of hosts for some reason.

I'm going to stop the instance for now, then clean it up completely in a few weeks if nothing crops up.

epriestley claimed this task.
  • I filed T12854 for final cleanup.
  • I stopped the instance.
  • The repo tier now has a numbering discontinuity. --pools will "fix" it, but this should be fixed properly, except that we're presumably throwing away Phage in favor of Chef/Puppet/Ansible in T12847, which will moot this.