See PHI1315, where an instance restarted without any apparent notification. This has happened once before, although I'm not immediately able to dig up the issue.
A user on Reddit suggests this is routine (or is making up a story about it being routine): https://www.reddit.com/r/aws/comments/ax944r/is_it_possible_that_amazon_itself_rebooted_my_ec2/ehsgb7y/
...although other users in that thread suggest this isn't expected:
Whenever aws schedules an ec2 for migration/reboot due to underlying hardware/software issues you will receive usually an email with the notice and also you will have a notification in the console.
I checked these places for any kind of "host is restarting because X" information:
- AWS email.
- AWS notifications bell in the menu.
- CloudTrail logs.
- /var/log/syslog
- /var/log/dmesg.*
- last reboot
- /var/log/boot.log
No luck finding any actual cause for the restart.
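For next time, the local checks above could be scripted so the evidence is gathered the same way on every host. This is a rough sketch, not an existing tool: the keyword list is a guess, the function names are made up, and it only covers /var/log/syslog, /var/log/boot.log, and last reboot (not the AWS-side sources or /var/log/dmesg.*).

```python
import subprocess
from pathlib import Path

# Keywords that may appear near an orderly shutdown or a crash. This list is
# an assumption for illustration, not an exhaustive or authoritative set.
HINTS = ("shutdown", "reboot", "power key", "kernel panic", "out of memory")


def find_restart_hints(lines):
    """Return the lines that mention any of the HINTS, case-insensitively."""
    return [line for line in lines if any(h in line.lower() for h in HINTS)]


def read_lines(path):
    """Read a log file, returning [] if it is missing or unreadable."""
    try:
        return Path(path).read_text(errors="replace").splitlines()
    except OSError:
        return []


def collect_local_evidence():
    """Gather restart hints from a subset of the local sources listed above."""
    evidence = {}
    for path in ("/var/log/syslog", "/var/log/boot.log"):
        evidence[path] = find_restart_hints(read_lines(path))
    try:
        result = subprocess.run(
            ["last", "reboot"], capture_output=True, text=True, check=False
        )
        evidence["last reboot"] = result.stdout.splitlines()
    except OSError:
        # The `last` binary is not available on this host.
        evidence["last reboot"] = []
    return evidence


if __name__ == "__main__":
    for source, lines in collect_local_evidence().items():
        print(f"== {source}: {len(lines)} line(s)")
        for line in lines:
            print("  " + line)
```

An empty result here doesn't prove much, which matches what happened in this case: none of these sources recorded a cause.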
Prior to these recent cases, all restarts had been scheduled/announced. However, now that this has happened more than once, it seems like something we should handle better.
The major issues are:
- After restart, volumes don't automatically remount.
- After restart, services don't automatically start.
These can be remedied by running bin/host upgrade on restart. However, this also upgrades application libraries (e.g., Phabricator), which we don't want. The shortest path to a reasonable fix is probably:
- Add a flag like bin/host upgrade --keep-deployed-version to disable the git pull steps.
- Run this on host startup.
- Kick some hosts to make sure it works.
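To make the intended behavior of the proposed flag concrete, here is a small sketch of how it could gate the upgrade steps. Everything in it is hypothetical: the step names are placeholders (the real steps inside bin/host upgrade aren't described here), and it is written in Python purely for illustration rather than as a patch to bin/host.

```python
import argparse

# Placeholder step names. The actual steps run by bin/host upgrade are not
# described in this task, so these are assumptions for illustration only.
ALL_STEPS = ["mount-volumes", "git-pull", "restart-services"]


def plan_steps(keep_deployed_version):
    """Return the steps to run, skipping git pulls when the flag is set."""
    if keep_deployed_version:
        return [step for step in ALL_STEPS if step != "git-pull"]
    return list(ALL_STEPS)


def parse_args(argv):
    """Parse the proposed flag; the flag does not exist yet."""
    parser = argparse.ArgumentParser(prog="host-upgrade-sketch")
    parser.add_argument(
        "--keep-deployed-version",
        action="store_true",
        help="skip the git pull steps and keep the currently deployed code",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    import sys

    args = parse_args(sys.argv[1:])
    for step in plan_steps(args.keep_deployed_version):
        print("would run:", step)
```

The point is that the flag only removes steps; remounting volumes and starting services happen either way.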
Ideally, this would also include a flag like --and-complain to generate a notification/alert.
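A sketch of what the startup hook plus the complaint could look like, under some assumptions: both flags are only proposals at this point, the notify callable stands in for whatever alerting channel we end up using, and the function names are invented for this example.

```python
import socket
import subprocess


def build_complaint(hostname, returncode):
    """Format a short alert describing the restart and the recovery result."""
    if returncode == 0:
        status = "succeeded"
    else:
        status = f"failed with exit code {returncode}"
    return f"Host {hostname} restarted; recovery {status}."


def recover_and_complain(command, notify, run=subprocess.run):
    """Run the recovery command, then always send a notification.

    `notify` is any callable that accepts a message string. `run` is
    injectable so the flow can be exercised without touching a real host.
    """
    result = run(command, check=False)
    notify(build_complaint(socket.gethostname(), result.returncode))
    return result.returncode


if __name__ == "__main__":
    # The flag below is the proposed one from this task and does not exist yet.
    command = ["bin/host", "upgrade", "--keep-deployed-version"]
    raise SystemExit(recover_and_complain(command, notify=print))
```

Sending the notification even when recovery succeeds is deliberate: the restart itself is the thing we currently have no visibility into.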