Improve daemon scalability in the cluster
Closed, ResolvedPublic

Description

See T7346 for discussion.

Event Timeline

epriestley raised the priority of this task from to Normal.
epriestley updated the task description.
epriestley added a project: Phacility.
epriestley added a subscriber: epriestley.
epriestley added a commit: Restricted Diffusion Commit.
Feb 21 2015, 10:12 PM

Remaining stuff here:

  • Get daemonID into the storage table so I can clean up phd status.
  • Change taskmasters to use autoscale pools when started with phd start.
  • Merge the GarbageCollector and Trigger daemons.
  • Double-check that phd stop will still stop old daemons before landing all this stuff (it should, and users "shouldn't" hit this, but...)

I consider auto-anything to be generally scary because things can easily autopilot out of control. I've seen more than a couple of postmortems where a minor failure cascaded into a total failure because of a misbehaving automated recovery, automated scaling, etc.

These autoscaling pools have a dangerous, cascading failure mode: when a pool scales up, it consumes more resources and tends to exert pressure on other pools to scale up, since they'll have fewer resources and take longer to complete work. In an extreme case, several pools can scale up together and push a box into swap, and that could impose a huge performance penalty on other pools and make them all scale up, too. Then all the pools max out and the box thrashes itself to death and can probably never complete work fast enough to recover.

We're particularly susceptible to this failure immediately following a problem with the task queue. If the daemons restart into an existing task backlog, that will create a scaling pressure across all the pools (more work to do than usual), which will reinforce itself and push the box toward a thrashing death spiral where all pools scale up at once and starve each other out. This is an especially bad time to be susceptible to failures, since it could mean that one failure with the queue cascades into a second, larger and more complex failure if we restart daemons to try to fix it. This problem also gets harder to resolve over time, because the backlog will exert more upward pressure on pool sizes and scale pools up more quickly after a restart.

I think the cleanest way to prevent this is to put a hard free memory limit on pool scale-up: say, pools never autoscale up if the machine has less than 20% of its RAM free. This prevents the box from swapping itself to death. Resource allocation may not be totally fair (the first pools to grow get to stay large, and later pools don't get to expand), but the system will self-heal over time as long as the work completion rate exceeds the rate at which new work is being generated.
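As a rough illustration of the free-memory guard described above, here is a minimal Python sketch. It is not Phabricator's implementation (which is PHP). The names `may_scale_up`, `Pool`, and `maybe_grow` are hypothetical. The 20% threshold is the example figure from the comment, and the caller is assumed to supply the memory readings from whatever source the host provides.

```python
from dataclasses import dataclass

# Example threshold from the comment above: never grow below 20% free RAM.
MIN_FREE_FRACTION = 0.20


def may_scale_up(free_bytes: int, total_bytes: int,
                 min_free_fraction: float = MIN_FREE_FRACTION) -> bool:
    """Return True only if at least min_free_fraction of RAM is free."""
    if total_bytes <= 0:
        # Unknown memory state: refuse to grow rather than guess.
        return False
    return free_bytes / total_bytes >= min_free_fraction


@dataclass
class Pool:
    """Illustrative stand-in for an autoscaling daemon pool (hypothetical)."""
    size: int = 1
    max_size: int = 4

    def maybe_grow(self, backlog: int, free_bytes: int, total_bytes: int) -> bool:
        """Add one worker if there is backlog, pool headroom, and free RAM."""
        wants_more = backlog > 0 and self.size < self.max_size
        if wants_more and may_scale_up(free_bytes, total_bytes):
            self.size += 1
            return True
        # Either no pressure to grow, or the memory guard vetoed it.
        return False
```

The guard only blocks growth; it never shrinks a pool. That matches the trade-off noted above: pools that grew first keep their size, and later pools wait until memory frees up.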

This stuff seems to be working (daemons restarted cleanly; autoscale worked; queue flushed; no errors) so I'm going to roll it to the cluster.

Seems to be working in the cluster, too.