
AWS is rebooting several production hosts (July 2018)
Closed, Resolved · Public

Description

AWS is rebooting these machines:

db022:   July 26, 2018 at 1:00:00 PM UTC-7
db008:   July 26, 2018 at 5:00:00 PM UTC-7
repo010: July 26, 2018 at 5:00:00 PM UTC-7
db004:   July 26, 2018 at 11:00:00 PM UTC-7
repo015: July 27, 2018 at 1:00:00 AM UTC-7
repo004: July 27, 2018 at 3:00:00 PM UTC-7
repo024: July 29, 2018 at 5:00:00 PM UTC-7

We generally have three options here:

  1. Stop and start all of these instances before the maintenance windows. They'll lose local storage, so we'll need to redeploy them, but this is generally straightforward and quick (see the sketch below).
  2. Get the upcoming rebalance online (T13076) and hit the instances on these hosts first.
  3. Do nothing and just eat the downtime, which is usually much shorter than the 2 hours they forecast.

It would be nice to do (2) so that we only have to go through this once, but the timeline on that is pretty tight since we only have 9 days until the first host is affected. Beyond that, (1) during a normal deploy window is probably much better than (3).

I'll likely aim to do (1) this Saturday unless we look like we're on a really good track for the rebalance.
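
As a rough sketch of what (1) looks like (assuming boto3 and AWS credentials on the ops host; the instance IDs and region here are placeholders, and `bin/remote deploy` is the usual redeploy step):

```python
import subprocess

import boto3

# Placeholder mapping of affected hosts to their EC2 instance IDs.
AFFECTED = {
    "db022": "i-0123456789abcdef0",
    # ...and the rest of the hosts in the schedule above.
}

ec2 = boto3.client("ec2", region_name="us-west-1")  # assumed region

for host, instance_id in AFFECTED.items():
    # Stop + Start (not Reboot) migrates the instance off the affected
    # hardware, which is the point of doing this before the window.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Local storage doesn't survive a stop/start, so redeploy the host.
    subprocess.run(["bin/remote", "deploy", host], check=True)
```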


Lumping a couple of deploy/ops-ish issues in here:

  • See PHI769. A large instance's export process is hitting some hiccups. I'd like to (see the sketch after this list):
    • Optionally pass --no-indexes through from bin/host dump.
    • Buffer the tempfile in /core/bak/tmp instead of /tmp.
    • Probably prune their ngrams.
  • See email. An instance got an invite into an awkward state by cancelling the invite after the user had accepted it but before they registered an account.
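
A rough sketch of the first two export items (the `host_dump` wrapper name is invented for illustration, and `--no-indexes` is assumed to be available on the underlying `bin/storage dump`):

```python
import os
import subprocess
import tempfile

def host_dump(output_path, no_indexes=False, tmp_root="/core/bak/tmp"):
    # Hypothetical wrapper for the `bin/host dump` flow described above.
    argv = ["bin/storage", "dump", "--compress"]
    if no_indexes:
        # Assumed flag on the underlying dump: skip index-like data
        # (ngrams and friends) that can be rebuilt after a restore.
        argv.append("--no-indexes")

    # Buffer the in-progress dump under /core/bak/tmp, which has far
    # more headroom than /tmp on these hosts.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_root, suffix=".sql.gz")
    os.close(fd)
    try:
        subprocess.run(
            argv + ["--overwrite", "--output", tmp_path], check=True)
        # Move into place once complete; assumes output_path lives on
        # the same filesystem as tmp_root.
        os.replace(tmp_path, output_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Pruning the instance's ngrams would be a separate step on top of this.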

Revisions and Commits

Restricted Differential Revision (×6)
rPHU libphutil
D19521

Event Timeline

epriestley created this task.
epriestley renamed this task from "AWS is rebooting every host (July 2018)" to "AWS is rebooting several production hosts (July 2018)". (Jul 20 2018, 4:52 PM)
epriestley updated the task description.
epriestley added a revision: Restricted Differential Revision. (Jul 20 2018, 4:54 PM)
epriestley added a revision: Restricted Differential Revision. (Jul 20 2018, 5:03 PM)
epriestley added a revision: Restricted Differential Revision. (Jul 20 2018, 5:31 PM)
epriestley added a commit: Restricted Diffusion Commit. (Jul 20 2018, 8:09 PM)
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit. (Jul 20 2018, 8:16 PM)
epriestley added a commit: Restricted Diffusion Commit.

See email. An instance got an invite into an awkward state by cancelling the invite after the user had accepted it but before they registered an account.

The general issue here is that invites really have three separate states:

  1. A user has been invited by email address.
  2. The invite has been bound to a central (Phacility) account.
  3. The user has created a corresponding instance account.

We currently treat state (2) as "you are an instance member" for login/access purposes.

But we also let you cancel these invites, which doesn't really make sense: cancelling an invite at this point doesn't actually prevent you from registering.

I'm just going to fix this by preventing the cancel action: if an invite is bound to a central account, you can't cancel it.

This creates a bit of a limbo state where you can only cancel invites in state (1), and only disable accounts in state (3). In state (2), you're powerless. But I think this is basically fine, at least for now.
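
To make the limbo concrete, a tiny sketch (all names invented) of the three states and which action each one permits:

```python
from enum import Enum, auto

class InviteState(Enum):
    EMAIL_INVITED = auto()    # (1) invited by email address only
    ACCOUNT_BOUND = auto()    # (2) bound to a central (Phacility) account
    INSTANCE_MEMBER = auto()  # (3) corresponding instance account created

def can_cancel_invite(state):
    # The fix: once an invite is bound to a central account, cancelling
    # it no longer prevents anything, so disallow the cancel action.
    return state is InviteState.EMAIL_INVITED

def can_disable_account(state):
    # There is no instance account to disable until state (3).
    return state is InviteState.INSTANCE_MEMBER

# State (2) is the limbo: neither action is available there.
assert not can_cancel_invite(InviteState.ACCOUNT_BOUND)
assert not can_disable_account(InviteState.ACCOUNT_BOUND)
```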

epriestley added a revision: Restricted Differential Revision. (Jul 20 2018, 10:00 PM)
epriestley added a commit: Restricted Diffusion Commit. (Jul 20 2018, 11:07 PM)
epriestley added a commit: Restricted Diffusion Commit. (Jul 20 2018, 11:34 PM)

I'm planning to stop/start these instances during the maintenance window today since getting the rebalance into production in the next five days seems wildly optimistic.

Beginning the stop/start stuff now.

epriestley claimed this task.

It seems like that went through cleanly. I just did Stop + Start + bin/remote deploy on the affected hosts. I then launched a test instance with placement allocations on two of the affected services; it came up cleanly.

The external IPs for the repo hosts have changed, so instances that have hard-coded those addresses into a remote address whitelist (most often, for Jenkins) may need to update them. We can't really do much about this until we upgrade the cluster topology, though, I think.
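
If we need to hand instances the new addresses, something like this would enumerate them (assuming the repo hosts carry a `Name` tag matching the hostname, and the same placeholder region as above):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-1")  # assumed region

repo_hosts = ["repo004", "repo010", "repo015", "repo024"]
resp = ec2.describe_instances(
    Filters=[{"Name": "tag:Name", "Values": repo_hosts}])

# Print each host's new public IP for whitelist updates.
for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        print(tags.get("Name"), instance.get("PublicIpAddress"))
```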