AWS is rebooting instances in late August 2018
Closed, Resolved | Public

Description

Affected hosts are:

  • admin001
  • db001
  • db002
  • db005
  • db015
  • repo018
  • repo025
  • web001
  • web003
  • web007
  • secure001

I'll plan to stop and start these during the regular upgrade/maintenance window this week.

Since admin001 is included, this might be a touch tricky. The db and repo hosts didn't have any issues in earlier rounds, but admin is a bit special.

The most recent round includes secure001, which may also be a bit special. Offhand, T12171 probably needs to be manually repaired after the host comes back up.

Event Timeline

epriestley created this task.

I'm going to stop/start at least some of these now.

Think I got through the easy ones without any issues. I suspect admin and secure may be a little more involved so I'm going to leave the cat in the bag for the moment.

I'm going to do admin001 and secure001 today.

I think the only thing on secure or admin which isn't properly covered by deploy automation is the crontab on secure001:

0 6 * * * /core/bin/host backup
0 7 * * * /core/bin/host prune --force
0 8 * * * /core/conf/util/generate-documentation
0 9 * * * /core/conf/util/generate-symbols
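
As a rough illustration of what managing this crontab from deploy could look like, here is a minimal sketch in Python. It is not how deploy works today: `JOBS`, `render_crontab` and `missing_jobs` are hypothetical names, and only the schedules and commands are taken from the crontab above. The idea is to declare the jobs in one place, render them as crontab text, and check an installed crontab for missing lines after a host comes back up.

```python
# Hypothetical declaration of the jobs deploy would own. The schedules and
# commands are copied from the secure001 crontab; everything else is assumed.
JOBS = [
    ("0 6 * * *", "/core/bin/host backup"),
    ("0 7 * * *", "/core/bin/host prune --force"),
    ("0 8 * * *", "/core/conf/util/generate-documentation"),
    ("0 9 * * *", "/core/conf/util/generate-symbols"),
]


def render_crontab(jobs):
    """Render (schedule, command) pairs as crontab text, one job per line."""
    return "".join(f"{schedule} {command}\n" for schedule, command in jobs)


def missing_jobs(installed_text, jobs):
    """Return the declared crontab lines that are absent from installed_text."""
    installed = {line.strip() for line in installed_text.splitlines()}
    return [
        line
        for line in render_crontab(jobs).splitlines()
        if line not in installed
    ]


if __name__ == "__main__":
    # An empty installed crontab, as after a fresh bootstrap, reports every job.
    for line in missing_jobs("", JOBS):
        print("missing:", line)
```

On common cron implementations, the rendered text could then be installed by piping it to `crontab -`, which would replace the manual restore step.
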

Kicking secure001 now.

(The crontab not being covered by deploy automation is tracked in T12879.)

This seems to have worked. Two issues:

  • Crontabs aren't managed by deploy (T12879) today, so I had to restore the crontab manually.
  • secure001 can't fully bootstrap itself automatically because it tries to git pull repositories from secure.phabricator.com, which doesn't work while the master database is dead. I just commented out the git operations and made it through. This could be smoothed out, and bin/upgrade --do-not-pull seems like a reasonable tool to have in the toolbox. If we actually lost secure001, we'd need to manually fail over so that secure003 becomes the master database, or deploy a copy of the code to secure001 via some other mechanism (like scp instead of git pull). I'd theoretically like to move away from git-based deploy some day anyway, and secure001 isn't an especially critical service, so I'm not going to take any specific follow-up steps for now.
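
To make the `--do-not-pull` idea concrete, here is a minimal sketch in Python of how such a flag could gate the git step. The flag does not exist yet, and the step names after the pull are placeholders rather than what bin/upgrade really does; `plan_upgrade` is a hypothetical helper that only returns the planned steps instead of running anything.

```python
import argparse


def plan_upgrade(argv):
    """Return the ordered list of steps an upgrade run would perform."""
    parser = argparse.ArgumentParser(prog="upgrade")
    parser.add_argument(
        "--do-not-pull",
        action="store_true",
        help="skip git operations and use the code already on disk",
    )
    args = parser.parse_args(argv)

    steps = []
    if not args.do_not_pull:
        # Normal path: fetch code first. This is the step that fails on
        # secure001 while the master database is unavailable.
        steps.append("git pull")

    # Placeholder steps, assumed for illustration only.
    steps += ["stop services", "apply upgrades", "start services"]
    return steps


if __name__ == "__main__":
    print(plan_upgrade(["--do-not-pull"]))
```

With a flag like this, bootstrapping secure001 would not require commenting out the git operations by hand, although the code on disk would still need to get there some other way if the host were actually lost.
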

That one seemed straightforward.

RIP secure001

$ uptime
 13:18:01 up 1155 days, 16:08, 12 users,  load average: 0.20, 0.15, 0.14