When instances change up/down status, start or stop their daemons
Closed, Resolved · Public

Description

See PHI1363, PHI1367, and PHI1377. We've seen three recent cases where shards suffered load-related issues.

In PHI1363 the host was totally unresponsive so it was difficult to do much to establish a root cause, and in PHI1367 there was a confounding factor related to repository compaction (see T13111), but PHI1377 had a sufficiently responsive host to offer a more compelling explanation:

When instances are taken out of service, we don't currently proactively stop their daemons. Instead, daemons keep running and are stopped during the next deploy.

This means that, in the absence of a deploy to wipe the slate clean, hosts accumulate stray daemons without bound over time. The deploy schedule has been unusually off-kilter recently, so hosts have had an unusually long time to accumulate them. I believe this is just leading to a mundane memory-pressure-into-swapping-to-death situation.

Some evidence is that the host in PHI1377 had "like, quite a lot" of daemons in ps auxww and was using ~15GB of the ~16GB swap partition, and recovered quickly after a restart.

To resolve this:

  • When instances stop or start, we should also stop or start their daemons.
    • This should probably be a RestartWorker, and RestartWorker or bin/host restart should be adjusted to actually mean "synchronize state", i.e. bring the daemons down or up appropriately. This protects against cases where "stop" workers end up in queue and execute after a delay and put things in the wrong state.
  • Also, we should deploy normally. We're ripe to do this anyway.
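The "synchronize state" idea above can be sketched as follows. This is a minimal illustration in Python, not the actual RestartWorker or bin/host implementation: the `Host` model and the `synchronize` function are hypothetical names, and real daemons are stood in for by a set of instance names. The point it shows is that the action is derived from the instance's current status when the task runs, so a task that sat in the queue cannot push the daemons into the wrong state.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical model of one host (not a real Phabricator class)."""
    # Instance name -> whether the instance is currently in service.
    instance_up: dict = field(default_factory=dict)
    # Names of instances whose daemons are currently running on this host.
    running: set = field(default_factory=set)


def synchronize(host: Host, instance: str) -> str:
    """Bring one instance's daemons in line with its current status.

    The decision is made from present state, not from what the queued
    task originally intended, so delayed or duplicate tasks are harmless.
    Only the named instance is touched.
    """
    want_up = host.instance_up.get(instance, False)
    is_up = instance in host.running
    if want_up and not is_up:
        host.running.add(instance)
        return "started"
    if not want_up and is_up:
        host.running.discard(instance)
        return "stopped"
    return "unchanged"


if __name__ == "__main__":
    host = Host(
        instance_up={"alpha": True, "beta": True},
        running={"alpha", "beta"},
    )
    # "alpha" is taken out of service, then brought back before the
    # queued task for the first change gets a chance to run.
    host.instance_up["alpha"] = False
    host.instance_up["alpha"] = True
    # The stale task runs late; it inspects current state and does nothing.
    print(synchronize(host, "alpha"))
```

Under this model, a literal "stop" task executing after the delay would have left an in-service instance without daemons, whereas the synchronizing version leaves it running.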

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision

Event Timeline

epriestley triaged this task as Normal priority. (Aug 11 2019, 3:39 PM)
epriestley created this task.
epriestley added a revision: Restricted Differential Revision. (Aug 11 2019, 4:35 PM)
epriestley added a revision: Restricted Differential Revision. (Aug 11 2019, 4:38 PM)
epriestley added a commit: Restricted Diffusion Commit. (Aug 11 2019, 4:41 PM)
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit. (Aug 12 2019, 2:39 PM)

There was one issue with this: bin/host stop --instance X (which is rarely used, and has no prior automated callers) used --force, which stops all daemons on the host. The recent changes to the --force flag in D20601 also had an indirect impact here. I fixed this and redeployed the repo tier.

After the change, stopping and starting a test instance appears to correctly synchronize daemon state without affecting other instances.