See PHI2063. See PHI2062. See PHI2089.
I failed to manually reboot an instance in response to a scheduled AWS event. I normally do this during weekly deploy windows, but didn't deploy in the event window and forgot that I'd received an event notification. This is a problem on its own, but isn't fundamentally a technical problem.
In theory, AWS could also apply this kind of reboot without a schedule notification, so even if there was no operator error this could still have caused issues.
Since I didn't manually do the reboot, AWS rebooted the instance automatically. Technically, there were two instances with events at similar times. It seems like the initial reboot took much longer than a manual reboot does, which caused PHI2062. Both instances came back up without swap, which later caused PHI2063.
The code that sets up swap just tests for the existence of /mnt/swap and assumes swap is properly configured if it exists:
$swapfile = '/mnt/swap'; if (Filesystem::pathExists($swapfile)) { return; }
Normally, this test appears to produce the correct result. These instances came up into a state where /mnt/swap existed but swap was not configured.
This test should be more surgical and examine swap state -- possibly by parsing swapon -a or /proc/swaps.
It's less clear why the initial reboot took so long ("about an hour", from PHI2062) and I'm not sure this can be reproduced from the AWS console.