
After an AWS event, Phacility hosts may come up with swap only partially configured
Open, Low, Public

Description

See PHI2063. See PHI2062.

I failed to manually reboot an instance in response to a scheduled AWS event. I normally do this during weekly deploy windows, but didn't deploy in the event window and forgot that I'd received an event notification. This is a problem on its own, but isn't fundamentally a technical problem.

In theory, AWS could also apply this kind of reboot without a scheduled event notification, so this could still have caused issues even if there had been no operator error.

Since I didn't manually do the reboot, AWS rebooted the instance automatically. Technically, there were two instances with events at similar times. It seems like the initial reboot took much longer than a manual reboot does, which caused PHI2062. Both instances came back up without swap, which later caused PHI2063.

The code that sets up swap just tests for the existence of /mnt/swap and assumes swap is properly configured if it exists:

$swapfile = '/mnt/swap';

if (Filesystem::pathExists($swapfile)) {
  return;
}

Normally, this test appears to produce the correct result. However, these instances came up into a state where /mnt/swap existed but swap was not active.

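This test should be more surgical and examine the actual swap state -- possibly by parsing the output of swapon -s or the contents of /proc/swaps. (Note that swapon -a activates swap listed in /etc/fstab rather than reporting state.)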
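
As a rough illustration of the /proc/swaps approach, here is a minimal sketch, not the actual fix. The function name is_swap_active() is hypothetical and is not part of the existing code. It assumes the usual /proc/swaps layout: one header line, then one whitespace-separated row per active swap area with the path in the first column.

```php
<?php

/**
 * Hypothetical helper: return true if $swapfile appears as an active
 * swap area in the given /proc/swaps content.
 *
 * Assumes the first line is the column header and each following row
 * starts with the swap path, followed by whitespace-separated fields.
 */
function is_swap_active($proc_swaps, $swapfile) {
  $lines = preg_split('/\R/', trim($proc_swaps));

  // Discard the header row ("Filename Type Size Used Priority").
  array_shift($lines);

  foreach ($lines as $line) {
    $fields = preg_split('/\s+/', trim($line));
    if ($fields[0] === $swapfile) {
      return true;
    }
  }

  return false;
}
```

The existing check could then skip setup only when the file exists and is_swap_active(file_get_contents('/proc/swaps'), $swapfile) also returns true. What to do when the file exists but is inactive (re-run mkswap, or only swapon) is left open here.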

It's less clear why the initial reboot took so long ("about an hour", from PHI2062) and I'm not sure this can be reproduced from the AWS console.

Event Timeline

epriestley created this task.