
Provide tools to drop severed nodes from load balancer pools by failing status checks
Open, Normal, Public

Description

When a web node is unable to reach any database replica, it should report an unavailable status from /status/.

This would allow a configuration across multiple datacenters (where some replicas are mutually unreachable) to automatically stop sending traffic to web nodes in the bad datacenter after losing services there.
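As a rough illustration of the proposed behavior, here is a minimal Python sketch of the decision a `/status/` endpoint could make. It is not Phabricator's implementation: the function names (`replica_reachable`, `status_response`), the TCP-connect probe, and the choice of HTTP 503 as the "unavailable" status are all assumptions made for the example.

```python
import socket
from typing import Callable, Iterable, Tuple

# Hypothetical probe: treats a replica as reachable if a TCP connection
# to it succeeds within the timeout. A real check might run a query instead.
def replica_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical handler logic: report healthy if at least one replica is
# reachable, otherwise report unavailable so the load balancer can drop
# this node from the pool. The 503 status code is an assumption.
def status_response(
    replicas: Iterable[Tuple[str, int]],
    probe: Callable[[str, int], bool] = replica_reachable,
) -> Tuple[int, str]:
    if any(probe(host, port) for host, port in replicas):
        return 200, "OK"
    return 503, "No database replica reachable"
```

With this shape, a node with an empty replica list also reports unavailable, and the `probe` parameter lets the check be swapped out or stubbed without touching the status logic.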

Doing this for SSH might be a little trickier, but you should be able to use the same health check to decide whether to connect to a box over SSH. I think the configuration in the Phacility cluster, where SSH application servers and web application servers share the same nodes, is generally a sensible one, so we may not really need more than this.

Event Timeline

This is probably fairly straightforward, but I'd like to see some evidence that installs would actually benefit from it before pursuing it.

In the case of this host, all database replicas are reachable from all web nodes, so there's no way web nodes can become severed from the service.