We've seen three recent cases where shards suffered load-related issues; see PHI1363, PHI1367, and PHI1377.
In PHI1363, the host was totally unresponsive, so it was difficult to establish a root cause. In PHI1367, there was a confounding factor related to repository compaction (see T13111). PHI1377, however, had a sufficiently responsive host to offer a more compelling explanation:
When instances are taken out of service, we don't currently proactively stop their daemons. Instead, daemons keep running and are stopped during the next deploy.
This means that, in the absence of a deploy to wipe the slate clean, hosts accumulate more and more daemons over time. The deploy schedule has been unusually off-kilter recently, so hosts have had an unusually long time to accumulate stray daemons. I believe this is just leading to a mundane memory-pressure-into-swapping-to-death situation.
Some evidence: the host in PHI1377 had "like, quite a lot" of daemons in ps auxww, was using ~15GB of the ~16GB swap partition, and recovered quickly after a restart.
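For reference, here's a rough sketch of the kind of check that surfaces this condition: counting daemon-looking processes and reporting swap pressure on a host. This is illustrative only, not part of the fix; it assumes psutil is available, and the "phd-daemon" marker used to spot daemon processes is an assumption about how they show up in the process list on our hosts.

```python
# Sketch: count daemon-ish processes and report swap pressure on a host.
# Assumes psutil is installed; "phd-daemon" as a command-line marker for
# daemon processes is an assumption, not a guarantee.
import psutil

DAEMON_MARKER = "phd-daemon"  # assumed marker for daemon processes


def count_stray_daemons():
    """Count processes whose command line looks like a Phabricator daemon."""
    count = 0
    for proc in psutil.process_iter(["cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if DAEMON_MARKER in cmdline:
            count += 1
    return count


def swap_pressure():
    """Return (used_gib, total_gib) for swap."""
    swap = psutil.swap_memory()
    gib = 1024 ** 3
    return swap.used / gib, swap.total / gib


if __name__ == "__main__":
    daemons = count_stray_daemons()
    used_gib, total_gib = swap_pressure()
    print(f"daemon processes: {daemons}")
    print(f"swap in use: {used_gib:.1f}GiB of {total_gib:.1f}GiB")
```

A host in the PHI1377 state would show a large daemon count and swap usage near the partition size.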
To resolve this:
- When instances stop or start, we should also stop or start their daemons.
- This should probably be a RestartWorker, and RestartWorker or bin/host restart should be adjusted to actually mean "synchronize state", i.e. bring the daemons up or down as appropriate for the instance's current status. This protects against cases where queued "stop" workers execute after a delay and put things in the wrong state. (See the sketch after this list for the shape of this "synchronize" behavior.)
- Also, we should deploy normally; the timing is ripe for this anyway.
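For the second bullet, here's a minimal sketch of what "synchronize state" might look like, written in Python for illustration rather than as actual worker code (which would be PHP). The Instance fields and the start/stop helpers are hypothetical placeholders; the point is the shape of the logic: the worker carries no "stop" or "start" intent in its payload, it reads the desired status at execution time and converges the daemons to match, so a delayed or re-ordered task can't push a host into the wrong state.

```python
# Sketch of a "synchronize state" worker. Not Phabricator's actual API;
# Instance and the start/stop helpers below are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    status: str            # desired state: "up" or "down" (hypothetical field)
    daemons_running: bool   # observed state on the host (hypothetical field)


def start_daemons(instance: Instance) -> None:
    # Placeholder for whatever actually launches the daemons.
    instance.daemons_running = True


def stop_daemons(instance: Instance) -> None:
    # Placeholder for whatever actually stops the daemons.
    instance.daemons_running = False


def synchronize_daemons(instance: Instance) -> None:
    """Bring daemons up or down to match the instance's current status."""
    if instance.status == "up" and not instance.daemons_running:
        start_daemons(instance)
    elif instance.status == "down" and instance.daemons_running:
        stop_daemons(instance)
    # Otherwise the host already matches the desired state: do nothing.
    # Running this late, or twice in a row, is harmless (idempotent).


if __name__ == "__main__":
    host = Instance(name="example-shard", status="down", daemons_running=True)
    synchronize_daemons(host)   # converges: daemons get stopped
    synchronize_daemons(host)   # no-op: already in the desired state
```

The contrast with discrete "stop"/"start" tasks is that those encode intent at enqueue time; a synchronize task encodes it at execution time, which is what makes delayed execution safe.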