AWS is rebooting several production hosts (July 2018)
Closed, Resolved (Public)

Authored By

	epriestley
	Jul 17 2018, 10:20 PM

Description

AWS is rebooting these machines:

`db022`	July 26, 2018 at 1:00:00 PM UTC-7
`db008`	July 26, 2018 at 5:00:00 PM UTC-7
`repo010`	July 26, 2018 at 5:00:00 PM UTC-7
`db004`	July 26, 2018 at 11:00:00 PM UTC-7
`repo015`	July 27, 2018 at 1:00:00 AM UTC-7
`repo004`	July 27, 2018 at 3:00:00 PM UTC-7
`repo024`	July 29, 2018 at 5:00:00 PM UTC-7

We generally have three options here:

1. Stop and start all of these instances before the maintenance windows. They'll lose local storage so we'll need to redeploy them, but this is generally straightforward and quick.
2. Get the upcoming rebalance online (T13076) and hit the instances on these hosts first.
3. Do nothing and just eat the downtime, which is usually much shorter than the 2 hours they forecast.

It would be nice to do (2) so that we only have to go through this once, but the timeline on that is pretty tight since we only have 9 days until the first host is affected. Beyond that, (1) during a normal deploy window is probably much better than (3).
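As a quick check on that timeline, here is a minimal sketch that finds the earliest window in the schedule above and counts calendar days from the day this task was filed. The helper names are made up for illustration; only the hosts and times come from the AWS notice.

```python
from datetime import date, datetime, timedelta, timezone

# Maintenance windows as listed in the description (all UTC-7).
UTC_MINUS_7 = timezone(timedelta(hours=-7))
WINDOWS = {
    "db022": datetime(2018, 7, 26, 13, 0, tzinfo=UTC_MINUS_7),
    "db008": datetime(2018, 7, 26, 17, 0, tzinfo=UTC_MINUS_7),
    "repo010": datetime(2018, 7, 26, 17, 0, tzinfo=UTC_MINUS_7),
    "db004": datetime(2018, 7, 26, 23, 0, tzinfo=UTC_MINUS_7),
    "repo015": datetime(2018, 7, 27, 1, 0, tzinfo=UTC_MINUS_7),
    "repo004": datetime(2018, 7, 27, 15, 0, tzinfo=UTC_MINUS_7),
    "repo024": datetime(2018, 7, 29, 17, 0, tzinfo=UTC_MINUS_7),
}


def first_window(windows):
    """Return (host, start) for the earliest scheduled reboot."""
    return min(windows.items(), key=lambda item: item[1])


def calendar_days_until(start, today):
    """Calendar days from `today` to the local date of the window start."""
    return (start.date() - today).days


host, start = first_window(WINDOWS)
days_left = calendar_days_until(start, date(2018, 7, 17))
print(host, days_left)
```

This counts calendar days only; it ignores the time of day the task was filed.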

I'll likely aim to do (1) this Saturday unless we look like we're on a really good track for the rebalance.

Lumping a couple of deploy/ops-ish issues in here:

- See PHI769. A large instance's export process is hitting some hiccups. I'd like to:
  - Optionally pass `--no-indexes` through from `bin/host dump`.
  - Buffer the tempfile in `/core/bak/tmp` instead of `/tmp`.
  - Probably prune their ngrams.
- See email. An instance got an invite into an awkward state by cancelling the invite after the user had accepted it but before they registered an account.
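For the tempfile item in the PHI769 list, the idea is just to let the caller pick the directory the temporary file lands in. The actual change is in libphutil (PHP); the sketch below is a Python analogue of the same idea, and `write_dump_buffer` and its arguments are hypothetical names, not part of `bin/host`.

```python
import os
import tempfile


def write_dump_buffer(chunks, directory):
    """Buffer dump output (bytes chunks) in a temp file under `directory`.

    `directory` stands in for a path like /core/bak/tmp. This function is
    illustrative only and does not exist in any real tool.
    """
    os.makedirs(directory, exist_ok=True)
    # delete=False so a later step can pick the file up by path.
    with tempfile.NamedTemporaryFile(
        dir=directory, suffix=".sql", delete=False
    ) as handle:
        for chunk in chunks:
            handle.write(chunk)
        return handle.name
```

The caller is responsible for removing the file once the export has been handed off.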

Revisions and Commits

	Restricted Differential Revision	Restricted Diffusion Commit
	Restricted Differential Revision	Restricted Diffusion Commit
	Restricted Differential Revision	Restricted Diffusion Commit
	Restricted Differential Revision	Restricted Diffusion Commit
	Restricted Differential Revision	Restricted Diffusion Commit
	Restricted Differential Revision	Restricted Diffusion Commit
rPHU libphutil
	D19521	rPHU1613e68f4740 Allow callers to choose which directory a "TempFile" is created in

Related Objects

Mentioned Here: T13076: Plans: Phacility cluster caching, renaming, and rebalance/compaction

Event Timeline

epriestley triaged this task as Normal priority. Jul 17 2018, 10:20 PM

epriestley created this task.

Herald added a subscriber: eadler. Jul 17 2018, 10:20 PM

epriestley renamed this task from "AWS is rebooting every host (July 2018)" to "AWS is rebooting several production hosts (July 2018)". Jul 20 2018, 4:52 PM

epriestley updated the task description.

epriestley added a revision: Restricted Differential Revision. Jul 20 2018, 4:54 PM

epriestley added a revision: Restricted Differential Revision. Jul 20 2018, 5:03 PM

epriestley added a revision: D19521: Allow callers to choose which directory a "TempFile" is created in. Jul 20 2018, 5:28 PM

epriestley added a revision: Restricted Differential Revision.

epriestley added a revision: Restricted Differential Revision. Jul 20 2018, 5:31 PM

epriestley added a commit: rPHU1613e68f4740: Allow callers to choose which directory a "TempFile" is created in. Jul 20 2018, 8:02 PM

epriestley added a commit: Restricted Diffusion Commit. Jul 20 2018, 8:09 PM

epriestley added a commit: Restricted Diffusion Commit.

epriestley added a commit: Restricted Diffusion Commit. Jul 20 2018, 8:16 PM

epriestley added a commit: Restricted Diffusion Commit.

See email. An instance got an invite into an awkward state by cancelling the invite after the user had accepted it but before they registered an account.

The general issue here is that invites really sort of have three separate states:

1. A user has been invited by email address.
2. The invite has been bound to a central (Phacility) account.
3. The user has created a corresponding instance account.

We currently treat state (2) as "you are an instance member" for login/access purposes.

But we also let you cancel these invites, which doesn't really make sense: cancelling an invite in state (2) doesn't actually prevent the invited user from registering.

I'm just going to fix this by preventing the cancel action: if an invite is bound to a central account, you can't cancel it.

This creates a bit of a limbo state where you can only cancel invites in state (1), and only disable accounts in state (3). In state (2), you're powerless. But I think this is basically fine, at least for now.
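To make the limbo state concrete, here is a minimal sketch of the three states and which actions are available in each after the fix. The enum and function names are invented for illustration; the real implementation lives in restricted Phacility code and may look nothing like this.

```python
from enum import Enum


class InviteState(Enum):
    INVITED = 1      # (1) invited by email address
    BOUND = 2        # (2) invite bound to a central (Phacility) account
    REGISTERED = 3   # (3) user has created an instance account


def can_cancel_invite(state):
    """After the fix, only unbound invites can be cancelled."""
    return state is InviteState.INVITED


def can_disable_account(state):
    """An account can only be disabled once it exists on the instance."""
    return state is InviteState.REGISTERED


def has_instance_access(state):
    """State (2) already counts as membership for login/access purposes."""
    return state in (InviteState.BOUND, InviteState.REGISTERED)


# The limbo: in state (2) the user has access, but neither action applies.
limbo = InviteState.BOUND
print(has_instance_access(limbo), can_cancel_invite(limbo), can_disable_account(limbo))
```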

epriestley added a revision: Restricted Differential Revision. Jul 20 2018, 10:00 PM

epriestley added a commit: Restricted Diffusion Commit. Jul 20 2018, 11:07 PM

epriestley added a commit: Restricted Diffusion Commit. Jul 20 2018, 11:34 PM

I'm planning to stop/start these instances during the maintenance window today since getting the rebalance into production in the next five days seems wildly optimistic.

Beginning the stop/start stuff now.

It seems like that went through cleanly. I just did Stop + Start + bin/remote deploy on the affected hosts. I then launched a test instance with placement allocations on two of the affected services; it came up cleanly.
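For reference, the Stop + Start part could be scripted roughly as below. This is a sketch under assumptions, not a record of how it was actually done here (the console works just as well): it assumes `ec2` is a boto3 EC2 client, and the instance IDs are placeholders. The redeploy step is left out because its invocation isn't described in this task.

```python
def stop_start(ec2, instance_ids):
    """Stop, then start, the given instances, waiting for each transition.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    A stop followed by a start (unlike a reboot) generally lands the
    instance on different underlying hardware.
    """
    ids = list(instance_ids)
    ec2.stop_instances(InstanceIds=ids)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)
    ec2.start_instances(InstanceIds=ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)
    return ids


# Hypothetical usage (placeholder instance ID):
#   import boto3
#   stop_start(boto3.client("ec2"), ["i-0123456789abcdef0"])
```

Instance-store volumes do not survive the stop, which is why the hosts need a redeploy afterward.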

The external IPs for the repo hosts have changed, so it's possible some instances with hard-coded IP addresses in a remote address whitelist (most often, Jenkins) will need to adjust them. We can't really do too much about this until we upgrade the cluster topology, though, I think.
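To illustrate why a hard-coded list breaks, here is a minimal sketch of the kind of check a remote system might be doing. The function name is made up and the addresses are from the reserved documentation ranges, not real host IPs.

```python
import ipaddress


def is_allowed(remote_ip, allowlist):
    """Return True if `remote_ip` matches any address or CIDR in `allowlist`."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in ipaddress.ip_network(entry) for entry in allowlist)


# Placeholder addresses: the allowlist still holds the old external IP.
allowlist = ["203.0.113.10"]
old_ip = "203.0.113.10"
new_ip = "198.51.100.20"
print(is_allowed(old_ip, allowlist), is_allowed(new_ip, allowlist))
```

After a stop/start, requests from the repo host arrive from the new address and fail this check until the list is updated.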
