Use stdio, not signals, to heartbeat from the daemons
Closed, Public

Authored by epriestley on Feb 22 2015, 2:06 PM.

Details

Reviewers

btrahan

Maniphest Tasks

T7352: Improve daemon scalability in the cluster

Commits

rPHU55861bcbd6a5: Use stdio, not signals, to heartbeat from the daemons

Summary

Ref T7352. We currently run one overseer per daemon. I want to run one overseer for a group of daemons, to reduce the minimum memory footprint of an instance.

One barrier is how hang detection works: we detect daemon hangs by requiring them to send a periodic heartbeat. If a daemon doesn't heartbeat for a while, we assume it has hung and restart it.
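The deadline logic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual overseer code: the class name `HeartbeatDeadline`, its methods, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class HeartbeatDeadline:
    """Hypothetical helper: tracks whether a daemon has heartbeated recently."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        # `clock` is injectable so the logic can be exercised without sleeping.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.deadline = clock() + timeout_seconds

    def heartbeat(self):
        # Each heartbeat pushes the deadline forward by the full timeout.
        self.deadline = self.clock() + self.timeout_seconds

    def is_hung(self):
        # A daemon that has not heartbeated before the deadline is assumed hung.
        return self.clock() > self.deadline
```

An overseer built along these lines would call `heartbeat()` whenever a heartbeat arrives and restart the daemon once `is_hung()` returns true.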

Currently, this heartbeat is sent by having the daemons send SIGUSR1 to the overseer. When the overseer receives the signal, it extends the deadline for the next heartbeat.

However, the overseer can't tell where the signal came from. Right now it can only come from one place, but in a world where overseers run multiple daemons it could have come from any of the children.

Instead of using signals, this turns the daemon's stdout (which we already consume) into a structured message pipeline, and sends the heartbeat over stdout.
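As a rough illustration of what a structured message pipeline over stdout could look like, here is a minimal Python sketch. The wire format (one JSON array per line), the message type names, and the functions `encode_message` and `decode_line` are assumptions for this example and are not taken from the actual change.

```python
import json

# Hypothetical message types; the real names may differ.
MESSAGE_STDOUT = "stdout"
MESSAGE_HEARTBEAT = "heartbeat"


def encode_message(message_type, payload=None):
    """Daemon side: serialize one message as a single line of JSON."""
    return json.dumps([message_type, payload]) + "\n"


def decode_line(line):
    """Overseer side: parse one line read from the daemon's stdout.

    Lines that are not well-formed messages are treated as plain output,
    so a stray unstructured write does not break the pipeline.
    """
    try:
        decoded = json.loads(line)
    except ValueError:
        decoded = None
    if (
        isinstance(decoded, list)
        and len(decoded) == 2
        and isinstance(decoded[0], str)
    ):
        return decoded[0], decoded[1]
    return MESSAGE_STDOUT, line.rstrip("\n")
```

Because the heartbeat travels on the same pipe as ordinary output, the reader of that pipe already knows which process it belongs to, which is the property signals lack.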

In a future diff, the overseer will be able to attribute heartbeats to the correct child process.
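To show why per-child pipes make attribution straightforward, here is a deliberately simplified Python sketch. It is not the planned implementation: `spawn_child` and `collect_heartbeats` are hypothetical names, the children are trivial stand-ins for daemons, and a real overseer would read from its children concurrently rather than one at a time.

```python
import subprocess
import sys

# Stand-in for a daemon: prints N heartbeat lines, then exits.
CHILD_CODE = "import sys\nfor _ in range(int(sys.argv[1])): print('heartbeat')"


def spawn_child(heartbeat_count):
    """Start a hypothetical child whose stdout is a pipe owned by the overseer."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(heartbeat_count)],
        stdout=subprocess.PIPE,
        text=True,
    )


def collect_heartbeats(children):
    """Return {label: heartbeat count}.

    Each heartbeat is attributed to the child whose pipe it arrived on,
    so no extra sender information is needed in the message itself.
    """
    counts = {}
    for label, process in children.items():
        output, _ = process.communicate()
        counts[label] = sum(
            1 for line in output.splitlines() if line == "heartbeat"
        )
    return counts
```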

Test Plan

Ran daemon in the raw, saw sensible output.
Made daemon use plain echo, saw output get wrapped.
Artificially set heartbeat deadline to 10 seconds, saw heartbeating daemons continue running and hung daemons restart.