
After an AWS event, Phacility hosts may come up with swap only partially configured
Open, Low, Public

Description

See PHI2063. See PHI2062. See PHI2089.

I failed to manually reboot an instance in response to a scheduled AWS event. I normally do this during weekly deploy windows, but didn't deploy in the event window and forgot that I'd received an event notification. This is a problem on its own, but isn't fundamentally a technical problem.

In theory, AWS could also apply this kind of reboot without a scheduled-event notification, so this could still have caused issues even without any operator error.

Since I didn't manually do the reboot, AWS rebooted the instance automatically. Technically, there were two instances with events at similar times. It seems like the initial reboot took much longer than a manual reboot does, which caused PHI2062. Both instances came back up without swap, which later caused PHI2063.

The code that sets up swap just tests for the existence of /mnt/swap and assumes swap is properly configured if it exists:

$swapfile = '/mnt/swap';

if (Filesystem::pathExists($swapfile)) {
  return;
}

Normally, this test appears to produce the correct result. These instances, however, came up in a state where /mnt/swap existed but swap was not configured.

This test should be more surgical and examine the actual swap state -- possibly by parsing the output of swapon --show or the contents of /proc/swaps.
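As a rough illustration of what a state-based check might look like, here is a minimal sketch that looks for the swapfile in the contents of /proc/swaps. The function name is_swap_active() is hypothetical and not part of the existing code. The sketch assumes the usual /proc/swaps layout: one header line, then one whitespace-separated row per active swap area with the path in the first column.

```php
<?php

/**
 * Hypothetical helper: return true if $swapfile is listed as an active
 * swap area in $proc_swaps (the raw contents of /proc/swaps).
 */
function is_swap_active($swapfile, $proc_swaps) {
  $lines = preg_split('/\R/', trim($proc_swaps));

  // Assumption: the first line is the column header
  // ("Filename Type Size Used Priority"), so skip it.
  array_shift($lines);

  foreach ($lines as $line) {
    // Columns are separated by runs of tabs and spaces.
    $fields = preg_split('/\s+/', trim($line));
    if ($fields[0] === $swapfile) {
      return true;
    }
  }

  return false;
}
```

The existing early return could then be taken only when /mnt/swap both exists and is reported as active, for example by passing file_get_contents('/proc/swaps') as the second argument. What to do when the file exists but is inactive is left open here.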

It's less clear why the initial reboot took so long ("about an hour", from PHI2062) and I'm not sure this can be reproduced from the AWS console.