When we "discover" new fatal setup issues, stop serving traffic
Closed, Public

Authored by epriestley on Nov 21 2016, 2:51 PM.

Details

Summary

Ref T10759. We may "discover" the presence of a fatal setup error later, after starting Phabricator.

This can happen in a few ways, but most are unlikely. The one I'm immediately concerned about is:

  • Phabricator starts up during a disaster with some databases unreachable.
  • We start with warnings (unreachable databases are generally not fatal, since it's OK for some subset of hosts to be down in replicated/partitioned setups).
  • The unreachable databases later recover and become accessible again.
  • When we run checks against them, we discover that they are misconfigured.

Currently, "fatal" setup issues are not truly fatal if we're "in flight", meaning we've survived setup checks at least once in the past. This is bad in the scenario above: we detect the misconfiguration but keep serving requests anyway.

Especially with partitioning, it could lead to mangled data in a disaster scenario where operations staff makes a small configuration mistake while trying to get things running again.

Instead, if we "discover" a fatal error while already "in flight", reset the whole setup process as though the webserver had just restarted. Don't serve requests again until we can make it through setup without hitting fatals.
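
The behavior described above can be sketched as a small state machine. This is only an illustrative model in Python, not the actual Phabricator (PHP) implementation: `Database`, `SetupGate`, `find_fatals`, `recheck`, and `serve` are hypothetical names, and the rule "unreachable is a warning, reachable but misconfigured is fatal" is a simplification of the scenario in the summary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Database:
    """Hypothetical stand-in for one database host."""
    name: str
    reachable: bool = True
    misconfigured: bool = False


@dataclass
class SetupGate:
    """Hypothetical model of the 'in flight' flag guarding request serving."""
    databases: List[Database]
    in_flight: bool = False
    warnings: List[str] = field(default_factory=list)

    def find_fatals(self) -> List[str]:
        # Unreachable hosts only produce warnings; a host we can reach
        # and find to be misconfigured produces a fatal issue.
        self.warnings = []
        fatals = []
        for db in self.databases:
            if not db.reachable:
                self.warnings.append(f"{db.name} is unreachable")
            elif db.misconfigured:
                fatals.append(f"{db.name} is misconfigured")
        return fatals

    def recheck(self) -> None:
        # A later check (for example, after hosts recover). If it
        # "discovers" a fatal, reset as though we had just restarted.
        if self.find_fatals():
            self.in_flight = False

    def serve(self) -> str:
        # Until setup passes without fatals, refuse to serve requests.
        if not self.in_flight:
            if self.find_fatals():
                return "setup error"
            self.in_flight = True
        return "ok"
```

In this sketch, a gate that started with a down-and-broken master serves requests with a warning, then stops serving once `recheck()` runs after the master becomes reachable, and resumes only after the misconfiguration is fixed.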

Test Plan
  • Started Phabricator with multiple masters, one of which was down and broken.
  • Got a warning about the bad master.
  • Revived the master.
  • Before: Phabricator detected the fatal, but kept serving requests.
  • After: Phabricator detected the fatal, reset the setup process as though the webserver had restarted, and stopped serving requests until the fatal was resolved.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision to When we "discover" new fatal setup issues, stop serving traffic.
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
chad edited edge metadata.
This revision is now accepted and ready to land. (Nov 21 2016, 4:49 PM)