
Run "DatabaseSetup" checks against all configured hosts
Closed, Public

Authored by epriestley on Nov 21 2016, 1:39 PM.
Tags
None
Subscribers
None

Details

Summary

Ref T10759. Currently, these checks run only against configured masters. Instead, check every host.

These checks also sort of cheat their way through a restart during a recovery, when some hosts will be unreachable: they test for "disaster" by checking whether no masters are reachable, and skip all the checks in that case.

This is bad for at least two reasons:

  • After recent changes, it is possible that some masters are dead but it's still OK to start. For example, "slowvote" may have no master, but everything else is reachable. We can safely run without slowvote.
  • It's possible to start during a disaster and miss important setup checks completely, since we skip them, get a clean bill of health, and never re-test them.

Instead:

  • Test each host individually.
  • Fundamental problems (lack of InnoDB, bad schema) are fatal on any host.
  • If we can't connect, raise it as a warning to make sure we check it later. If you start during a disaster, we still want to make sure that schemata are up to date if you later recover a host.
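
The per-host policy described above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the actual Phabricator (PHP) implementation: `HostReport`, `classify_host`, `run_setup_checks`, and `may_start` are hypothetical names, and the probe results are passed in as plain booleans rather than gathered from a real database connection.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    FATAL = "fatal"


@dataclass
class HostReport:
    """Hypothetical result of probing one configured database host."""
    name: str
    reachable: bool
    has_innodb: bool = True
    schema_ok: bool = True


def classify_host(report: HostReport) -> Severity:
    # An unreachable host gets a warning, so the checks are retried later
    # instead of being skipped and forgotten.
    if not report.reachable:
        return Severity.WARNING
    # Fundamental problems are fatal on any host, master or replica.
    if not report.has_innodb or not report.schema_ok:
        return Severity.FATAL
    return Severity.OK


def run_setup_checks(reports: List[HostReport]) -> Dict[str, Severity]:
    # Every configured host is tested individually.
    return {report.name: classify_host(report) for report in reports}


def may_start(results: Dict[str, Severity]) -> bool:
    # Only fatal issues block startup; warnings do not.
    return Severity.FATAL not in results.values()
```

Under this sketch, a cluster with one unreachable master still starts (with a warning recorded for that host), while a reachable host lacking InnoDB blocks startup.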

In particular, I'm going to add these checks soon:

  • Fatal if a "master" is replicating.
  • Fatal if a "replica" is not replicating.
  • Fatal if a database partition config differs from web partition config.
  • When we let a database off with a warning because it's down, and later upgrade that warning to a fatal because we discover the database is broken once it comes back up, fatal everything. Currently, we keep running if we "discover" the presence of new fatals after surviving setup checks for the first time.
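
As a rough illustration of the planned checks listed above, the rules might look something like the following. This is a hypothetical Python sketch, not code from this revision: `replication_issues`, `partition_issues`, and `SetupState` are invented names, and how replication state or partition configuration is actually read from a host is not specified here, so both are taken as plain inputs.

```python
from typing import Dict, List, Set


def replication_issues(role: str, is_replicating: bool) -> List[str]:
    # A "master" must not be replicating; a "replica" must be.
    issues = []
    if role == "master" and is_replicating:
        issues.append("master is replicating")
    if role == "replica" and not is_replicating:
        issues.append("replica is not replicating")
    return issues


def partition_issues(database_config: Dict[str, str],
                     web_config: Dict[str, str]) -> List[str]:
    # The partition config seen by the database must match the web config.
    if database_config != web_config:
        return ["database partition config differs from web partition config"]
    return []


class SetupState:
    """Tracks hosts that were let off with a warning because they were down."""

    def __init__(self) -> None:
        self.deferred: Set[str] = set()
        self.fatal = False

    def host_down(self, host: str) -> None:
        # The host could not be checked; remember to check it later.
        self.deferred.add(host)

    def host_recovered(self, host: str, issues: List[str]) -> None:
        # Once the host is back, any issue found escalates to a fatal
        # for the whole install rather than being ignored.
        self.deferred.discard(host)
        if issues:
            self.fatal = True
```

The `SetupState` class models the last bullet: a host that was only warned about while down can still turn the overall state fatal when it recovers and its checks finally run.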

Test Plan

  • Configured with multiple masters, intentionally broke one (simulating a disaster where one master is lost), and saw Phabricator still start up.
  • Tested individual setup checks by intentionally breaking them.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision to Run "DatabaseSetup" checks against all configured hosts.
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
  • Also, run bin/storage destroy against ALL configured masters by default.
  • While this is probably the best behavior anyway, it directly makes unit test cleanup work correctly.
chad edited edge metadata.
This revision is now accepted and ready to land. (Nov 21 2016, 3:24 PM)
This revision was automatically updated to reflect the committed changes.