Ref T10759. Currently, these checks run only against configured masters. Instead, check every host.
These checks also sort of cheat through restart during a recovery, when some hosts will be unreachable: they test for "disaster" by seeing if no masters are reachable, and just skip all the checks in that case.
This is bad for at least two reasons:
- After recent changes, it is possible that some masters are dead but it's still OK to start. For example, "slowvote" may have no master, but everything else is reachable. We can safely run without slowvote.
- It's possible to start during a disaster and miss important setup checks completely, since we skip them, get a clean bill of health, and never re-test them.
Instead:
- Test each host individually.
- Fundamental problems (lack of InnoDB, bad schema) are fatal on any host.
- If we can't connect, raise it as a warning to make sure we check it later. If you start during a disaster, we still want to make sure that schemata are up to date if you later recover a host.
In particular, I'm going to add these checks soon:
- Fatal if a "master" is replicating.
- Fatal if a "replica" is not replicating.
- Fatal if a database partition config differs from web partition config.
- When we let a database off with a warning because it's down, and later upgrade it to a fatal because we discover it is broken after it comes up again, fatal everything. Currently, we keep running if we "discover" the presence of new fatals after surviving setup checks for the first time.