We've seen three recent cases where shards suffered load-related issues; see PHI1363, PHI1367, and PHI1377.
In PHI1363, the host was totally unresponsive, so it was difficult to establish a root cause. In PHI1367, there was a confounding factor related to repository compaction (see T13111). PHI1377, however, had a sufficiently responsive host to offer a more compelling explanation:
When instances are taken out of service, we don't currently proactively stop their daemons. Instead, daemons keep running and are stopped during the next deploy.
This means that, in the absence of a deploy to wipe the slate clean, hosts accumulate more and more daemons over time. The deploy schedule has been unusually off-kilter recently, so hosts have had an unusually long time to accumulate stray daemons. I believe this is just leading to a mundane memory-pressure-into-swapping-to-death situation.
Some evidence: the host in PHI1377 had "like, quite a lot" of daemons in ps auxww, was using ~15GB of the ~16GB swap partition, and recovered quickly after a restart.
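For reference, here's a rough sketch of the kind of check that surfaces this condition: counting daemon-looking processes and reporting swap pressure on a host. This is illustrative only, not part of the fix; it assumes psutil is available, and the "phd-daemon" marker used to spot daemon processes is an assumption about how they show up in the process list on our hosts.

```python
# Sketch: count daemon-ish processes and report swap pressure on a host.
# Assumes psutil is installed; "phd-daemon" as a command-line marker for
# daemon processes is an assumption, not a guarantee.
import psutil

DAEMON_MARKER = "phd-daemon"  # assumed marker for daemon processes


def count_stray_daemons():
    """Count processes whose command line looks like a Phabricator daemon."""
    count = 0
    for proc in psutil.process_iter(["cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if DAEMON_MARKER in cmdline:
            count += 1
    return count


def swap_pressure():
    """Return (used_gib, total_gib) for swap."""
    swap = psutil.swap_memory()
    gib = 1024 ** 3
    return swap.used / gib, swap.total / gib


if __name__ == "__main__":
    daemons = count_stray_daemons()
    used_gib, total_gib = swap_pressure()
    print(f"daemon processes: {daemons}")
    print(f"swap in use: {used_gib:.1f}GiB of {total_gib:.1f}GiB")
```

A host in the PHI1377 state would show a large daemon count and swap usage near the partition size.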
To resolve this:
- When instances stop or start, we should also stop or start their daemons.
- This should probably be a RestartWorker, and RestartWorker or bin/host restart should be adjusted to actually mean "synchronize state", i.e. bring the daemons up or down as appropriate for the instance's current status. This protects against cases where queued "stop" workers execute after a delay and put things in the wrong state. (See the sketch after this list for the shape of this "synchronize" behavior.)
- Also, we should deploy normally; the timing is ripe for this anyway.
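For the second bullet, here's a minimal sketch of what "synchronize state" might look like, written in Python for illustration rather than as actual worker code (which would be PHP). The Instance fields and the start/stop helpers are hypothetical placeholders; the point is the shape of the logic: the worker carries no "stop" or "start" intent in its payload, it reads the desired status at execution time and converges the daemons to match, so a delayed or re-ordered task can't push a host into the wrong state.

```python
# Sketch of a "synchronize state" worker. Not Phabricator's actual API;
# Instance and the start/stop helpers below are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    status: str            # desired state: "up" or "down" (hypothetical field)
    daemons_running: bool   # observed state on the host (hypothetical field)


def start_daemons(instance: Instance) -> None:
    # Placeholder for whatever actually launches the daemons.
    instance.daemons_running = True


def stop_daemons(instance: Instance) -> None:
    # Placeholder for whatever actually stops the daemons.
    instance.daemons_running = False


def synchronize_daemons(instance: Instance) -> None:
    """Bring daemons up or down to match the instance's current status."""
    if instance.status == "up" and not instance.daemons_running:
        start_daemons(instance)
    elif instance.status == "down" and instance.daemons_running:
        stop_daemons(instance)
    # Otherwise the host already matches the desired state: do nothing.
    # Running this late, or twice in a row, is harmless (idempotent).


if __name__ == "__main__":
    host = Instance(name="example-shard", status="down", daemons_running=True)
    synchronize_daemons(host)   # converges: daemons get stopped
    synchronize_daemons(host)   # no-op: already in the desired state
```

The contrast with discrete "stop"/"start" tasks is that those encode intent at enqueue time; a synchronize task encodes it at execution time, which is what makes delayed execution safe.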