Page MenuHomePhabricator

Unannounced/unlogged AWS reboots are apparently routine procedure, not cosmic rays / ghosts
Closed, ResolvedPublic

Description

See PHI1315, where an instance restarted without any apparent notification. This happened once before previously, although I'm not immediately able to dig up the issue.

Some user on Reddit suggests this is routine, and/or fabricates an imaginary story about it being routine: https://www.reddit.com/r/aws/comments/ax944r/is_it_possible_that_amazon_itself_rebooted_my_ec2/ehsgb7y/

...although other users in that thread suggest this isn't expected:

Whenever aws schedules an ec2 for migration/reboot due to underlying hardware/software issues you will receive usually an email with the notice and also you will have a notification in the console.

I checked these places for any kind of "host is restarting because X" information:

  • AWS email.
  • AWS notifications bell in the menu.
  • CloudTrail logs.
  • /var/log/syslog
  • /var/log/dmesg.*
  • last reboot
  • /var/log/boot.log

No luck finding any actual cause for the restart.


Prior to these recent cases, all restarts have been scheduled/announced. However, now that this has happened more than once, it seems like something we should handle better.

The major issues are:

  • After restart, volumes don't automatically remount.
  • After restart, services don't automatically start.

These can be remedied by running bin/host upgrade on restart. However, this also upgrades application libraries (e.g., Phabricator), which we don't want. The shortest path to a reasonable fix is probably:

  • Add a flag like bin/host upgrade --keep-deployed-version to disable the git pull steps.
  • Run this on host startup.
  • Kick some hosts to make sure it works.

Ideally, this would also include a flag like --and-complain to generate a notification/alert.

Revisions and Commits

Restricted Differential Revision

Event Timeline

epriestley triaged this task as Normal priority.Jun 20 2019, 2:58 PM
epriestley created this task.

If you are just writing an upstart job that needs to start the service after the basic facilities are up, either of these will work:

start on (local-filesystems and net-device-up IFACE!=lo)

I am led to believe that some sort of grand crusade was waged over "Upstart" vs "SystemV" so our children could have a brighter future where the correct way to do the most common task with the startup system is start on (local-filesystems and net-device-up IFACE!=lo)?

But wait! It says there are two ways! The other way is:

start on runlevel [2345]

As a reminder, here's what the runlevels mean, from the same document:

2 : Graphical multi-user plus networking (DEFAULT)
3 : Same as "2", but not used.
4 : Same as "2", but not used.
5 : Same as "2", but not used.

Do computers really work?

I'm going to install an upstart task on secure004 and kick it a few times, some stuff might be sketchy until I figure that out.

This incantation seems to work as an upstart job:

/etc/init/phacility.conf
# See T13320. AWS hosts occasionally restart abruptly. When they do, we want
# to remount volumes and restart services.

start on runlevel [2345]

task

exec sudo -u ubuntu -- /core/bin/host upgrade --hold-versions

The start on runlevel [2345] is a magic incantation which the manual says probably does what we want ("start a normal thing on startup").

The task line tells upstart that this is a one-shot script, not a daemon.

The exec runs bin/host upgrade with the new --hold-versions flag.

This can be tested with:

$ sudo service phacility start

If you kill some of the services on the box (like aphlict) and run that, they come back up.

Then I did a:

$ sudo shutdown -r now

...and the box popped back up a minute later with all the services running.

It also saves logs in /var/log/upstart/phacility.log. They show the job performing volume attachment, etc., correctly on restart.

epriestley added a revision: Restricted Differential Revision.Jun 25 2019, 5:50 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 26 2019, 1:27 AM
epriestley claimed this task.

This is deployed everywhere now, and kicking secure004 over seemed to work correctly, at least.