
Allow daemon pools to autoscale down to 0 processes
Closed, Wontfix (Public)

Description

After T7352, daemons are organized into "pools" which can autoscale up and down, so each instance can, for example, run 1 taskmaster by default and scale up to 4 when there's work to be done. This allowed us to get about 100 instances onto each repo host.

Now that we have free instances, scalability is again bottlenecked by daemon memory pressure. The most readily available fix is to allow pools to scale down below 1 process, to 0, so they use no resources while asleep. In theory, this gives us about 4x headroom (four processes per instance down to one) by stopping the Taskmaster, PullLocal, and Trigger processes and leaving only the Overseer running. Realistically, we can't use all of that, since we need some free memory for actual work, but 2x-3x is likely safe.

The infrastructure changes that came out of T7352 work, but they aren't especially clean. In particular, the Overseer has some awkward responsibilities, not everything is really a pool, autoscaling does some magic, there's a lot of "dictionary of keys" stuff instead of "actual object" stuff, and so on.

I plan to clean this up first (let the Overseer have a list of Pools, not a list of DaemonHandles, then make the Pools deal with the daemon/autoscale stuff), then give daemons tools to hibernate, so a pool can scale itself down to 0.
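
To make the intended shape concrete, here is a rough sketch of the Overseer > Pool > Daemon split with a pool minimum of 0. It is written in Python purely for illustration and is not the actual implementation. The class names, fields, and the `update`/`tick` methods are hypothetical. It also simplifies scaling to "one daemon per pending item, applied immediately", which is an assumption and not how the real autoscaler decides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Daemon:
    """Stand-in for one daemon process (hypothetical; no real process is started)."""
    name: str


@dataclass
class Pool:
    """A group of identical daemons that grows and shrinks with pending work."""
    daemon_class: str
    max_size: int
    min_size: int = 0  # a minimum of 0 lets the whole pool hibernate
    daemons: List[Daemon] = field(default_factory=list)

    def update(self, pending_work: int) -> None:
        # Simplifying assumption: one daemon per pending item,
        # clamped to the range [min_size, max_size].
        target = max(self.min_size, min(self.max_size, pending_work))
        while len(self.daemons) < target:
            self.daemons.append(Daemon(f"{self.daemon_class}-{len(self.daemons)}"))
        while len(self.daemons) > target:
            self.daemons.pop()


@dataclass
class Overseer:
    """Holds a list of pools and never manages individual daemons directly."""
    pools: List[Pool]

    def tick(self, pending: Dict[str, int]) -> None:
        # Each pool decides its own size from the work pending for its class.
        for pool in self.pools:
            pool.update(pending.get(pool.daemon_class, 0))

    def process_count(self) -> int:
        # The Overseer itself counts as one process and keeps running.
        return 1 + sum(len(pool.daemons) for pool in self.pools)


overseer = Overseer([
    Pool("Taskmaster", max_size=4),
    Pool("PullLocal", max_size=1),
    Pool("Trigger", max_size=1),
])
overseer.tick({"Taskmaster": 10, "PullLocal": 1})
busy = overseer.process_count()   # Overseer + 4 Taskmasters + 1 PullLocal
overseer.tick({})
asleep = overseer.process_count() # only the Overseer remains
```

The point of the split is that all scaling logic lives in `Pool`, so allowing hibernation is just a matter of letting the pool's lower bound be 0.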

Event Timeline

I believe I have the first part of this (restructuring the code into a more sensible Overseer > Pool > Daemon sort of thing) working, but it could use more testing. I'm going to see if we have anything else in Daemons that I can fix while I'm here to help me kick the tires a bit.

Hibernating daemons currently show as "Waiting" in the Daemon console, but I'm not going to worry about that for now.

We no longer offer free instances, so I don't currently plan to pursue this.

I think it's also possible that we may want to remove all this autoscale/hibernation code, since it no longer serves any purpose but is very complicated. But it seems stable, T13052 excepted, so it's not on the chopping block immediately.