
Cluster daemons can spiral to death over contentious cache fills
Open, Normal, Public

Description

Daemons using the disk-based instance ref cache can easily turn into a thundering herd after failing to acquire the lock on that cache.

In updateRefCache(), we do the following. The read can fail and return an age of 0, which triggers a full cache fill:

$this->updateCacheAgeEpoch();
$age = $this->getCacheAge();

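The read pathway currently does not distinguish between "not in cache" and "read failed for lock reasons". Normally this is good, since programs mostly shouldn't distinguish between the two cases. However, it causes trouble here: a daemon that fails to get the lock behaves as if the cache were empty and starts a full cache fill, which adds to the contention that caused the failure in the first place.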
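One way to avoid the spurious fill would be to have the read report lock failure as its own outcome. The following is only a rough sketch of that idea, not the actual cache code: `CacheReadResult`, `read_cache_age()` and `should_fill_cache()` are hypothetical names, and it assumes the cache age is stored as a plain integer in a file.

```php
<?php

// Hypothetical result type: separates "not in cache" from "could not lock".
final class CacheReadResult {
  const HIT = 'hit';
  const MISS = 'miss';
  const LOCK_FAILED = 'lock-failed';

  public $status;
  public $value;

  public function __construct($status, $value = null) {
    $this->status = $status;
    $this->value = $value;
  }
}

// Hypothetical reader. Assumes the file at $path holds an integer age.
function read_cache_age($path, $wait_seconds) {
  if (!file_exists($path)) {
    return new CacheReadResult(CacheReadResult::MISS);
  }

  $handle = @fopen($path, 'rb');
  if ($handle === false) {
    return new CacheReadResult(CacheReadResult::MISS);
  }

  // Poll for a shared lock until the wait budget runs out.
  $deadline = microtime(true) + $wait_seconds;
  while (!flock($handle, LOCK_SH | LOCK_NB)) {
    if (microtime(true) >= $deadline) {
      fclose($handle);
      return new CacheReadResult(CacheReadResult::LOCK_FAILED);
    }
    usleep(10000);
  }

  $raw = stream_get_contents($handle);
  flock($handle, LOCK_UN);
  fclose($handle);

  if ($raw === false || $raw === '') {
    return new CacheReadResult(CacheReadResult::MISS);
  }

  return new CacheReadResult(CacheReadResult::HIT, (int)$raw);
}

// Only a real miss or a stale hit should trigger a full fill.
function should_fill_cache(CacheReadResult $result, $max_age) {
  switch ($result->status) {
    case CacheReadResult::MISS:
      return true;
    case CacheReadResult::LOCK_FAILED:
      // Another process holds the lock, so don't pile on with a fill.
      return false;
    default:
      return $result->value > $max_age;
  }
}
```

With this shape, a caller that sees `LOCK_FAILED` can back off and retry instead of filling. Whether the extra outcome is worth exposing outside the daemon code path is a judgment call, given that most callers are better off treating both cases the same.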

For now, I'm just going to extend the daemon lock wait to 15s, which seems to mostly "fix" the problem.


  • The unserialize() in loadCache() can safely come out of the locked section.
  • Conduit over the wire probably isn't compressed, and the raw cache fills are ~1MB.
  • We could let Conduit methods cache the raw output so they don't have to re-serialize it. This might be cheap but probably isn't free.
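As a rough illustration of the first point above, the lock only needs to cover copying the raw bytes off disk, and decoding can happen after it is released. This is a sketch under assumptions, not the real loadCache(): the function name `load_cache_unlocked_decode()` is hypothetical, and it assumes the cache file holds a single serialize()d value.

```php
<?php

// Hypothetical loader: holds the lock only while reading raw bytes.
function load_cache_unlocked_decode($path) {
  $handle = @fopen($path, 'rb');
  if ($handle === false) {
    return null;
  }

  $raw = false;
  if (flock($handle, LOCK_SH)) {
    $raw = stream_get_contents($handle);
    flock($handle, LOCK_UN);
  }
  fclose($handle);

  if ($raw === false || $raw === '') {
    return null;
  }

  // Decoding a ~1MB payload happens here, after the lock is released,
  // so other processes are not blocked while we unserialize.
  $data = @unserialize($raw);

  // Note: this treats a stored boolean false the same as a decode failure.
  return ($data === false) ? null : $data;
}
```

This shortens the time each reader holds the lock, which should reduce contention, though it does not by itself fix the fill-on-lock-failure behavior described above.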