
Fix unbounded expansion of allocating resource pool
Closed, Public

Authored by epriestley on Oct 5 2015, 4:07 PM.
Tags: None
Subscribers: None

Details

Summary

Ref T9252. I think there's a more complex version of this problem discussed elsewhere, but here's what we hit today:

  • 5 commits land at the same time and trigger 5 builds.
  • All of them go to acquire a working copy.
  • Working copies have a limit of 1 right now, so 1 of them gets the lease on it.
  • The other 4 all trigger allocation of new working copies. So now we have: 1 active, leased working copy and 4 pending, leased working copies.
  • The 4 pending working copies will never activate without manual intervention, so these 4 builds are stuck forever.

To fix this, prevent WorkingCopies from giving out leases until they activate. So now the leases won't acquire until we know the working copy is good, which solves the first problem.
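
As a rough sketch of that rule (illustrative only; the names below are hypothetical, not Drydock's actual classes), lease acquisition simply skips resources that have not finished activating:

  # Hypothetical sketch, not Drydock's real API: only activated resources
  # are eligible to hand out leases.
  from dataclasses import dataclass, field
  from enum import Enum, auto

  class Status(Enum):
      PENDING = auto()   # still allocating; might never come up
      ACTIVE = auto()    # activated; known to be a good working copy

  @dataclass
  class Resource:
      status: Status = Status.PENDING
      leases: list = field(default_factory=list)
      lease_limit: int = 1   # working copies currently allow one lease each

  def try_acquire(lease, resources):
      """Acquire on an ACTIVE resource with a free slot; pending resources
      are skipped, so a lease can never get stuck on a resource that
      never activates."""
      for resource in resources:
          if resource.status is Status.ACTIVE and len(resource.leases) < resource.lease_limit:
              resource.leases.append(lease)
              return resource
      return None  # no acquisition; the caller may trigger a new allocation instead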

However, this fix creates a secondary problem:

  • As above, all 5 go to acquire a working copy.
  • One gets it.
  • The other 4 trigger allocations, but no longer acquire leases. This is an improvement.
  • Every time the leases update, they trigger another allocation, but never acquire. They trigger, say, a few thousand allocations.
  • Eventually the first build finishes up and the second lease acquires the working copy. After some time, all of the builds finish.
  • However, they generated an unboundedly large number of pending working copy resources during this time.

This is technically "okay-ish", in that it did work correctly, it just generated a gigantic mess as a side effect.

To solve this, at least for now, provide a mechanism to impose allocation rate limits and put a cap on the number of allocating resources of a given type. As hard-coded, this is the greater of "1" or "25% of the active resources in the pool".

So if there are 40 working copies active, we'll start allocating up to 10 more and then cut new allocations off until those allocations get sorted out. This prevents us from getting runaway queues of limitless size.

This also imposes a total active working copy resource limit of 1, which incidentally also fixes the problem, although I expect to raise this soon.
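
A minimal sketch of that cap (the function names are hypothetical; the 25% ratio mirrors the hard-coded behavior described above):

  def allocating_limit(active_count, ratio=0.25):
      """Cap on concurrently-allocating resources of a given type:
      the greater of 1 and 25% of the currently active resources."""
      return max(1, int(active_count * ratio))

  def may_allocate(active_count, pending_count):
      """Refuse to start new allocations once the pending pool hits the cap."""
      return pending_count < allocating_limit(active_count)

  # With 40 active working copies, up to 10 more may be allocating at once.
  assert allocating_limit(40) == 10
  # With no active resources yet, one allocation may still proceed.
  assert allocating_limit(0) == 1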

These mechanisms will need refinement, but the basic idea is:

  • Resources which aren't sure if they can actually activate should wait until they do activate before allowing leases to acquire them. I'm fairly confident this rule is a reasonable one.
  • Then we limit how many bookkeeping side effects Drydock can generate once it starts encountering limits.

Broadly, some amount of mess is inevitable because Drydock is allowed to try things that might not work. In an extreme case we could prevent this mess by setting all these limits at "1" forever, which would degrade Drydock to effectively be a synchronous, blocking queue.

The idea here is to put some amount of slack in the system (more than zero, but less than infinity) so we get the performance benefits of having a parallel, asynchronous system while keeping the mess finite and manageable.

Numbers larger than 0 but less than infinity are pretty tricky, but I think rules like "X% of active resources" seem fairly reasonable, at least for resources like working copies.

Test Plan

Ran something like this:

for i in `seq 1 5`; do sh -c '(./bin/harbormaster build --plan 10 rX... &) &'; done;

Saw 5 plans launch, acquire leases, proceed in an orderly fashion, and eventually finish successfully.

Diff Detail

Repository: rP Phabricator
Lint: Not Applicable
Unit Tests: Not Applicable

Event Timeline

epriestley retitled this revision to Fix unbounded expansion of allocating resource pool.
epriestley updated this object.
epriestley edited the test plan for this revision. (Show Details)
epriestley added reviewers: chad, hach-que.

Welcome to the world of Drydock race conditions :P

The other 4 all trigger allocation of new working copies. So now we have: 1 active, leased working copy and 4 pending, leased working copies.

The way I solved this in my patches was by allowing leases to pre-allocate against resources they allocate. This seems cleaner than just allocating lots and lots of resources?

I mention "pre-leasing" in one of the comments. What I'm thinking about there is that we sometimes get this situation:

  • Lease A asks for a resource.
  • In response, we create Resource B. Resource B starts allocating.
  • Meanwhile, resource C (which already existed) frees up.
  • Lease A is acquired on resource C.
  • (Resource B is still allocating.)

It would be nice for "Resource B" to be able to see that nothing cares about it anymore now and release itself more aggressively. We could do this by letting the lease "pre-lease" (or "express an intent to acquire" or something) the resource. This wouldn't be a real lease, but just a bookkeeping hint that would let resources stop work or release early. We shouldn't technically need this, but it might smooth out some behaviors in the queue.

In particular, once the total resource limit is raised, we'll potentially run into a mess like this:

  • 99% of working copies are type "A" (say, the main repositories), and 1% are type "B" (say, some weird unusual repository).
  • A new lease comes in asking for type "B".
  • There aren't any of them free right now, and some other limit is also maxed out so none can be built for a while.
  • The lease can fill up the entire allocation queue asking for a "B".

Although things will eventually normalize, at least in theory, it could take an unreasonably long time for this queue to flush. Letting leases express that they're waiting for a specific resource would help with this in two ways:

  • All the allocating "B" could just throw themselves away immediately once the lease was satisfied.
  • The lease could be prevented from queueing more than one allocation.

This would also help improve observability of the allocation process.
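
As a minimal sketch of what that bookkeeping hint might look like (all names here are hypothetical; none of this exists in Drydock today):

  from dataclasses import dataclass, field

  @dataclass
  class PendingResource:
      """A resource that is still allocating, tracking which leases have
      expressed an intent to acquire it."""
      interested_leases: set = field(default_factory=set)

      def register_intent(self, lease_id):
          # The lease that triggered this allocation records its interest.
          self.interested_leases.add(lease_id)

      def withdraw_intent(self, lease_id):
          # Called when the lease acquires some other resource instead.
          self.interested_leases.discard(lease_id)

      def should_release_early(self):
          # Nothing is waiting on this resource anymore; it can stop work
          # and throw itself away instead of finishing an allocation
          # nobody wants.
          return not self.interested_leases

  def may_queue_allocation(lease_id, pending_resources):
      """A lease that already has an allocation pending on its behalf is
      prevented from queueing another one."""
      return not any(lease_id in r.interested_leases for r in pending_resources)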

But there's no way to hit this right now in the upstream and I'm not totally sure we want/need it, so I'm leaving it out for now.

So if there are 40 working copies active, we'll start allocating up to 10 more and then cut new allocations off until those allocations get sorted out.

In particular, it isn't an acceptable strategy to just allocate a bunch of resources that won't be needed, since resources may have costs associated (EC2).

I mention "pre-leasing" in one of the comments. What I'm thinking about there is that we sometimes get this situation:

  • Lease A asks for a resource.
  • In response, we create Resource B. Resource B starts allocating.
  • Meanwhile, resource C (which already existed) frees up.
  • Lease A is acquired on resource C.
  • (Resource B is still allocating.)

It would be nice for "Resource B" to be able to see that nothing cares about it anymore now and release itself more aggressively. We could do this by letting the lease "pre-lease" (or "express an intent to acquire" or something) the resource. This wouldn't be a real lease, but just a bookkeeping hint that would let resources stop work or release early. We shouldn't technically need this, but it might smooth out some behaviors in the queue.

In the case of pre-allocation of leases, Lease A would already be allocated onto resource B. The availability of resource C has no effect (but in the case of minimum / maximum resource settings, resource C may be released if the last lease is freed).

The way I solved this in my patches was by allowing leases to pre-allocate against resources they allocate. This seems cleaner than just allocating lots and lots of resources?

I think we want these limits regardless, because they also smooth out allocation behavior when resources are expensive to allocate relative to the cost to do work on them.

For example, it takes us about 5 minutes to bring up a working copy (mostly spent building xhpast caches) and <15 seconds to run tests in it. These numbers could be different, but I think it's broadly reasonable to imagine that there are various cases where resources are expensive relative to the cost of using those resources.

For resources like this, if we get 1,000 requests suddenly, we probably don't want to allocate 1,000 resources, even if we can guarantee that those allocations are 1:1. With the 5 minute / 15 second numbers, the cost to use a resource is 5% of the cost to allocate it, so we can bring up ~50 resources instead of ~1,000 and only take twice as long to finish, approximately.
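
A quick back-of-the-envelope check of those numbers (times in seconds; the 50-resource pool is the example figure above, not a computed optimum):

  allocate_cost = 5 * 60   # ~5 minutes to bring up a working copy
  use_cost = 15            # ~15 seconds to run tests in it
  requests = 1000

  # One resource per request: every build waits one allocation plus one run.
  time_with_1000_resources = allocate_cost + use_cost                   # 315s

  # A pool of 50 resources, each serving ~20 requests back to back.
  time_with_50_resources = allocate_cost + (requests // 50) * use_cost  # 600s

  print(time_with_50_resources / time_with_1000_resources)  # ~1.9x slower, 20x fewer resources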

In some cases you might not care and just want to finish as fast as possible, but if the resources have some material cost you might want to smooth out the curve and accept 2x build duration in exchange for 20x resource cost reduction. Eventually, this will theoretically be configurable by adjusting the rate at which the pool is allowed to scale.

This also serves as a safety / runaway-train sort of feature, where it's harder to have something build infinite resources by accident if you screw something up, and the default state of the queue is to lock up instead of grow unboundedly.

In particular, it isn't an acceptable strategy to just allocate a bunch of resources that won't be needed, since resources may have costs associated (EC2).

You can adjust (I mean, in theory -- it's hard-coded in this diff) the queue growth rate to your tolerance for overallocation. If you have a very low tolerance, put a limit of 1 on it. If you have zero tolerance, put a total resource limit of 1 on it. If there's a use case for extremely low tolerance resources, we could let the pool grow as a function of pending requests in addition to growing as a function of active resources, so you'd only scale the pool up if there was a large backlog.

It isn't possible to completely avoid over-allocation unless you run everything in a fully blocking/synchronous mode by dropping all the limits to 1, and the more aggressive you get about performance the more you have to accept the possibility of over-allocation.

In the case of pre-allocation of leases, Lease A would already be allocated onto resource B.

This is what happens right now. If resource B never allocates, the queue deadlocks. At HEAD, if you run two builds, the second one never runs and you end up with an active, unutilized resource; a pending, blocked resource; and a waiting lease.

In this scenario, we'd rather put Lease A on resource C and throw resource B away before it allocates. I think this is always better on every dimension: lease A acquires faster and we build fewer total resources.

chad edited edge metadata.
This revision is now accepted and ready to land. Oct 5 2015, 7:05 PM
This revision was automatically updated to reflect the committed changes.

This seems to be working OK for the most part so far, although there were two bugs I missed in the original implementation:

  • (D14272) New resources and existing resources had slightly different checks applied to them, which could cause some inconsistent states.
  • (D14274) The query in the allocator soft limit method wasn't quite right and misbehaved for larger/more complex states.