Build general healthcheck infrastructure for monitoring services
Open, Low, Public
Assigned To

None

Authored By

	epriestley
	May 10 2019, 5:45 PM

Description

See PHI1211, where a JIRA service failure cascaded into a Phabricator service failure. Supporting health checks for Doorkeeper services could reduce the impact of this kind of external service failure by letting us drop traffic instead of waiting for it to time out.
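To make "drop traffic instead of waiting" concrete, here is a minimal sketch of a fail-fast gate in Python. It is an illustration only: `HealthGate`, `ServiceUnavailable`, the failure threshold, and the cool-down are all hypothetical names and values, not existing Phabricator or Doorkeeper APIs.

```python
import time
from typing import Callable, Optional


class ServiceUnavailable(Exception):
    """Raised when a request is dropped because the service is marked down."""


class HealthGate:
    """Hypothetical fail-fast wrapper around calls to one external service."""

    def __init__(self, failure_threshold: int = 3, retry_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed value
        self.retry_after = retry_after              # assumed cool-down, seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_healthy(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let requests through again as probes.
        return self.clock() - self.opened_at >= self.retry_after

    def call(self, request: Callable[[], object]) -> object:
        if not self.is_healthy():
            # Fail immediately instead of waiting for a timeout.
            raise ServiceUnavailable("dropping request; service marked down")
        try:
            result = request()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success resets the failure count and clears the down mark.
        self.failures = 0
        self.opened_at = None
        return result
```

After enough consecutive failures, callers get an immediate `ServiceUnavailable` rather than each request separately waiting on a dead service. A real implementation would presumably share this state across web processes, which this sketch does not attempt.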

See PHI1206, which discusses read routing in a repository cluster with some down nodes. Performing health checks on cluster nodes could reduce how often we need to mis-route to downed nodes, retry, and reach retry exhaustion.

In both cases, we could benefit from some shared infrastructure for managing health check state. Although it's possible this is useful as a general capability, I suspect most of the value in building this as an infrastructure component is in making health checks observable and debuggable. It is probably reasonable to assume that health checked services will be at-least-mostly managed by Phabricator and that companies won't be putting production-server001 into this for at least a while.

An adjacent useful capability would be to give a node some sort of "health score" so we can represent nodes in a degraded state, like repository nodes performing a repack (T13111). These nodes can still receive traffic, but we'd prefer not to send them traffic if we don't need to.
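As a rough illustration of how a health score could influence routing, the sketch below prefers fully healthy nodes and falls back to degraded ones only when nothing better is available. The function name `pick_node`, the score range of 0.0 to 1.0, and the `healthy_at` threshold are assumptions made for the example, not part of any existing design.

```python
import random
from typing import Dict, Optional


def pick_node(scores: Dict[str, float], healthy_at: float = 0.8,
              rng: Optional[random.Random] = None) -> Optional[str]:
    """Pick a node name given hypothetical health scores in [0.0, 1.0].

    A score of 0.0 means "down"; anything at or above `healthy_at` is
    treated as fully healthy; values in between are "degraded".
    """
    rng = rng or random.Random()
    up = {name: score for name, score in scores.items() if score > 0.0}
    if not up:
        return None  # nothing is routable

    preferred = sorted(name for name, score in up.items() if score >= healthy_at)
    if preferred:
        # At least one healthy node exists, so degraded nodes get no traffic.
        return rng.choice(preferred)

    # Only degraded nodes remain: weight the choice by score.
    names = sorted(up)
    return rng.choices(names, weights=[up[n] for n in names], k=1)[0]
```

For example, `pick_node({"repo001": 1.0, "repo002": 0.4})` always returns `"repo001"`, while a node that is mid-repack with a score of 0.4 still receives traffic if it is the only node left.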

We could also health check mailers and notification nodes, but since requests to these services happen in the daemons it's generally not important to avoid mis-routed requests.

This may also tie into Drydock nodes (T8153) but I'm not going to try to plan that far out for now.

This probably looks roughly like:

Service Health Object
- Node Health Object
  - Node Health Log Event Thing

...where each Service (like "JIRA") has zero or more "Nodes", each node has some health events, and we aggregate the health events into a health score/status for the node.
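The hierarchy above could be sketched as follows. This is Python for brevity and every name, field, and threshold here is hypothetical; in particular, the aggregation rule (fraction of recent checks that passed) is just one plausible choice, not a decided design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HealthEvent:
    """One logged health check result for a node."""
    timestamp: float
    ok: bool
    detail: str = ""


@dataclass
class Node:
    """One node of a service, with its health event log."""
    name: str
    events: List[HealthEvent] = field(default_factory=list)

    def record(self, timestamp: float, ok: bool, detail: str = "") -> None:
        self.events.append(HealthEvent(timestamp, ok, detail))

    def score(self, window: int = 5) -> float:
        # Assumed aggregation: fraction of the last `window` checks that passed.
        recent = self.events[-window:]
        if not recent:
            return 1.0  # assumption: a node with no events is treated as healthy
        return sum(e.ok for e in recent) / len(recent)

    def status(self) -> str:
        s = self.score()
        if s >= 0.8:  # assumed threshold
            return "healthy"
        if s > 0.0:
            return "degraded"
        return "down"


@dataclass
class Service:
    """A service (like "JIRA") with zero or more nodes."""
    name: str
    nodes: List[Node] = field(default_factory=list)
```

Keeping the raw events rather than only the derived score is what would make the health state observable and debuggable: the log shows why a node was marked degraded or down.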

In the case of services like JIRA, the "Node" object has no other object as an analog.

In the case of repositories, the "Node" object is analogous to a service binding. It's probably still worth making a separate "real" object here, but maybe these objects serve as very lightweight proxies for the bindings in this case.

Related Objects

Status	Assigned	Task
Open	None	T13286 When nodes in a cluster repository fail, reads are still routed with the same weight and failed reads do not recover
Open	None	T13285 Service failures in JIRA can cascade into service failures in Phabricator
Open	None	T13287 Build general healthcheck infrastructure for monitoring services