T13183 AWS is rebooting instances in late August 2018

epriestley triaged this task as Low priority. · Aug 13 2018, 2:30 PM

epriestley created this task.

Herald added a subscriber: eadler. · Aug 13 2018, 2:30 PM

epriestley updated the task description. · Aug 13 2018, 3:09 PM

epriestley updated the task description. · Aug 13 2018, 4:19 PM

epriestley updated the task description. · Aug 13 2018, 6:14 PM

epriestley mentioned this in T13185: AWS Reboots Part 2, Electric Boogaloo. · Aug 13 2018, 8:45 PM

amckinley claimed this task. · Aug 13 2018, 9:25 PM

epriestley updated the task description. · Aug 15 2018, 4:56 PM

epriestley updated the task description. · Aug 15 2018, 6:43 PM

I'm going to stop/start at least some of these now.

epriestley updated the task description. · Aug 18 2018, 9:00 PM

epriestley updated the task description. · Aug 18 2018, 9:04 PM

epriestley updated the task description. · Aug 18 2018, 9:10 PM

Think I got through the easy ones without any issues. I suspect admin and secure may be a little more involved so I'm going to leave the cat in the bag for the moment.

I'm going to do admin001 and secure001 today.

I think the only thing on secure or admin which isn't properly covered by deploy automation is the crontab on secure001:

0 6 * * * /core/bin/host backup
0 7 * * * /core/bin/host prune --force
0 8 * * * /core/conf/util/generate-documentation
0 9 * * * /core/conf/util/generate-symbols

(The crontab not being covered by deploy automation is tracked in T12879.)
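As a rough illustration of what managing this crontab from deploy could look like, here is a minimal sketch. It is not how deploy works today. The four entries are copied from the crontab above; the names `SECURE001_CRON`, `render_crontab`, and `install_crontab` are hypothetical, and the sketch assumes the host's `crontab` accepts `-` to read the new table from standard input, which common implementations do.

```python
import subprocess

# Entries copied from the secure001 crontab: (schedule, command).
SECURE001_CRON = [
    ("0 6 * * *", "/core/bin/host backup"),
    ("0 7 * * *", "/core/bin/host prune --force"),
    ("0 8 * * *", "/core/conf/util/generate-documentation"),
    ("0 9 * * *", "/core/conf/util/generate-symbols"),
]


def render_crontab(entries):
    """Render (schedule, command) pairs as crontab text, one job per line."""
    return "".join(f"{schedule} {command}\n" for schedule, command in entries)


def install_crontab(text, run=subprocess.run):
    """Replace the current user's crontab by piping text to `crontab -`.

    Assumption: the installed cron accepts "-" to mean standard input.
    The `run` parameter exists only so the call can be stubbed in tests.
    """
    return run(["crontab", "-"], input=text, text=True, check=True)
```

A deploy step could then call `install_crontab(render_crontab(SECURE001_CRON))` as the user that owns the jobs, so a stop/start no longer needs a manual restore.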

Kicking secure001 now.

This seems to have worked. Two issues:

Crontabs aren't managed by deploy today (T12879), so I had to restore the crontab manually.
secure001 can't fully bootstrap itself automatically because it tries to git pull repositories from secure.phabricator.com, which doesn't work while the master database is down. I just commented out the git operations and made it through. This could be smoothed out, and bin/upgrade --do-not-pull seems like a reasonable tool to have in the toolbox. If we actually lost secure001, we'd need to either manually fail over secure003 to become the master database or deploy a copy of the code to secure001 via some other mechanism (like scp instead of git pull). I'd theoretically like to move away from git-based deploy some day anyway, and secure001 isn't an especially critical service, so I'm not going to take any specific action for now.
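The `--do-not-pull` idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the flag does not exist today, `plan_upgrade` and `parse_args` are hypothetical names, the repository paths are supplied by the caller, and the final step is a placeholder for whatever the real upgrade does after updating code.

```python
import argparse


def plan_upgrade(repositories, pull=True):
    """Return the commands an upgrade would run, in order, without running them."""
    commands = []
    if pull:
        # Normal path: update each working copy from its remote.
        for path in repositories:
            commands.append(["git", "-C", path, "pull"])
    # Placeholder for the remaining (non-git) upgrade steps.
    commands.append(["echo", "run remaining upgrade steps"])
    return commands


def parse_args(argv):
    """Parse the hypothetical command-line flags for the upgrade wrapper."""
    parser = argparse.ArgumentParser(description="Hypothetical upgrade wrapper")
    parser.add_argument(
        "--do-not-pull",
        action="store_true",
        help="skip git pull and use the code already on disk",
    )
    return parser.parse_args(argv)
```

With the flag set, `plan_upgrade(paths, pull=not args.do_not_pull)` contains no git commands, which is the behavior needed when the host serving the repositories is the one being bootstrapped.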

Doing admin001 now.

That one seemed straightforward.

RIP secure001

$ uptime
 13:18:01 up 1155 days, 16:08, 12 users,  load average: 0.20, 0.15, 0.14

Closed, Resolved · Public