Phacility repository shards may restart incompletely
Open, Low, Public

Description

Via email:

  • On February 13, 2021, repo023 was one of ten hosts scheduled for an AWS reboot.
  • After the reboot, I spot-checked each repo instance and confirmed that daemons were running, and spot-checked each db instance and confirmed MySQL was accepting connections.
  • On February 15, 2021, an instance on the shard reported a "Daemons Not Running" error.
  • Although repo023 was running daemons, it wasn't running all the daemons.
  • I restarted services on repo023, then restarted all other affected repo host services for good measure.

Here's the full cohort:

2021-02-25 8:00:00 AM  i-3a7aa165          db007.phacility.net   system-reboot [Completed] scheduled reboot 
2021-02-25 12:00:00 PM i-0db1f9795347be785 db017.phacility.net   system-reboot [Completed] scheduled reboot 
2021-02-28 8:00:00 PM  i-02547e8e824bcab08 db019.phacility.net   system-reboot [Completed] scheduled reboot 
2021-02-26 8:00:00 AM  i-bbbca87b          repo005.phacility.net system-reboot [Completed] scheduled reboot 
2021-02-24 8:00:00 AM  i-0c7ba053          repo008.phacility.net system-reboot [Completed] scheduled reboot 
2021-02-24 10:00:00 AM i-0e43e8acb921cf250 repo014.phacility.net system-reboot [Completed] scheduled reboot 
2021-02-25 10:00:00 AM i-0a287448a2ef41563 repo019.phacility.net system-reboot [Completed] scheduled reboot 
2021-02-28 6:00:00 PM  i-0b09689d7feccca4d repo023.phacility.net system-reboot [Completed] scheduled reboot 
2021-02-26 10:00:00 AM i-0f0d656cd882c6e4b repo031.phacility.net system-reboot [Completed] scheduled reboot 
2021-02-28 4:00:00 PM  i-08e3bdee526cea358 web008.phacility.net  system-reboot [Completed] scheduled reboot

Since I believe daemons were visible during the spot-check, I suspect the upgrade-after-reboot process aborted midway through: it launched some of the daemons, but not all of them.
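One way to make the post-reboot check catch this case is to compare the instances a shard is expected to serve against the instances that actually have daemons running, instead of only confirming that some daemons exist on the host. The following is a minimal sketch of that comparison. The function name and both inputs are hypothetical, and how either list would be collected on a real shard is not specified here.

```python
from typing import Dict, Iterable, List


def instances_without_daemons(
    expected_instances: Iterable[str],
    running_daemons_by_instance: Dict[str, List[str]],
) -> List[str]:
    """Return expected instances that have no running daemons.

    Both arguments are hypothetical inputs: the first is the list of
    instances the shard should be serving, the second maps an instance
    name to whatever daemons were observed running for it.
    """
    return sorted(
        name
        for name in expected_instances
        # A missing key and an empty list both mean "nothing running".
        if not running_daemons_by_instance.get(name)
    )
```

A host passes the check only when this returns an empty list. A spot-check that merely sees some daemon processes would pass even when several instances are missing theirs.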

A possible cause of a midway abort is a database failure: the restart script may have attempted to restart an instance on db007, db017, or db019 and failed because it could not connect to the database. If so, the restart process should probably be made more robust against temporary failures on a subset of instances. Although it may still be correct to fail the process overall if any instance fails to restart, we could reasonably continue restarting the remaining instances and then retry the failed ones, with the expectation that this condition is relatively routine (it is probably the most likely reason an instance would fail to restart).
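The "continue, then retry, then fail overall" behavior could look roughly like the sketch below. This is not the actual upgrade-after-reboot script: `restart_all`, `restart_one`, `max_attempts`, and `delay_seconds` are all hypothetical names, and the real restart step is stood in for by a callable that raises on failure.

```python
import time
from typing import Callable, Dict, Iterable


def restart_all(
    instances: Iterable[str],
    restart_one: Callable[[str], None],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Restart every instance, continuing past failures and retrying them.

    Returns a map of instance name to last error message for instances
    that never restarted. An empty map means everything succeeded.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    pending = list(instances)
    failures: Dict[str, str] = {}
    for attempt in range(1, max_attempts + 1):
        failures = {}
        for name in pending:
            try:
                restart_one(name)
            except Exception as exc:
                # Record the failure and keep going with other instances.
                failures[name] = str(exc)
        if not failures:
            break
        # Only the instances that failed are retried on the next pass.
        pending = list(failures)
        if attempt < max_attempts:
            sleep(delay_seconds)
    return failures
```

The caller can still treat a non-empty result as an overall failure, for example by exiting non-zero, but by then every instance has been attempted at least once and a temporarily unreachable database has had several chances to recover.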

More broadly, when upgrade-after-reboot fails, there is currently no upstream channel by which the error can be reported. A general channel to escalate this class of error (whether it really originated with database reachability or not) would be desirable.

Event Timeline

epriestley created this task.