AWS is rebooting these machines:
| db022 | July 26, 2018 at 1:00:00 PM UTC-7 |
| db008 | July 26, 2018 at 5:00:00 PM UTC-7 |
| repo010 | July 26, 2018 at 5:00:00 PM UTC-7 |
| db004 | July 26, 2018 at 11:00:00 PM UTC-7 |
| repo015 | July 27, 2018 at 1:00:00 AM UTC-7 |
| repo004 | July 27, 2018 at 3:00:00 PM UTC-7 |
| repo024 | July 29, 2018 at 5:00:00 PM UTC-7 |
We generally have three options here:
1. Stop and start all of these instances before the maintenance windows. They'll lose local storage so we'll need to redeploy them, but this is generally straightforward and quick.
2. Get the upcoming rebalance online (T13076) and hit the instances on these hosts first.
3. Do nothing and just eat the downtime, which is usually much shorter than the 2 hours they forecast.
It would be nice to do (2) so that we only have to go through this once, but the timeline on that is pretty tight since we only have 9 days until the first host is affected. Beyond that, (1) during a normal deploy window is probably much better than (3).
I'll likely aim to do (1) this Saturday unless we look like we're on a really good track for the rebalance.
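For reference, the stop/start in (1) could be scripted roughly like this. This is a minimal sketch, not what we actually run: it assumes a boto3-style EC2 client is passed in, and `stop_start` is a hypothetical helper name. It does not cover the redeploy afterwards.

```python
def stop_start(ec2, instance_ids):
    """Stop the given instances, wait for them to stop, then start them again.

    `ec2` is assumed to behave like a boto3 EC2 client (hypothetical wiring;
    nothing here is taken from the real deploy tooling).
    """
    ids = list(instance_ids)
    if not ids:
        return []

    # A full stop (not a reboot) is what normally gets the instance placed
    # elsewhere before the maintenance window. Local storage does not
    # survive this, so the host needs a redeploy afterwards.
    ec2.stop_instances(InstanceIds=ids)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)

    # Bring the instances back up and block until they report as running.
    ec2.start_instances(InstanceIds=ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)
    return ids
```

With boto3 installed, this would be called as `stop_start(boto3.client("ec2"), [...])`, passing the EC2 instance IDs behind the affected hosts. Those IDs aren't listed in this task, so they'd need to be looked up first.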
Lumping a couple of deploy/ops-ish issues in here:
- See PHI769. A large instance's export process is hitting some hiccups. I'd like to:
  - Optionally pass --no-indexes through from bin/host dump.
  - Buffer the tempfile in /core/bak/tmp instead of /tmp.
  - Probably prune their ngrams.
- See email. An instance got an invite into an awkward state by cancelling the invite after the user had accepted it but before they registered an account.