Provide better tools for preventing deadlocks in build plans which acquire multiple resources
Open, Needs TriagePublic

Description

The drydock documentation is clear enough on the fact that it's designed to operate in environments where acquiring new resources is is cheap and easy, so I suppose this may be a philosophically incorrect request. However, there are some severe problems that occur in our world where we do have a limit to the number of machines we have access to.

General constraints:

  • Our build is very heavy, and leaks all over the system as well, it's not possible to run more than 1 build on any given machine
  • We use our own hardware, and can only rack servers so fast, demand regularly outpaces supply

Primary use:

  • Every revision update triggers a harbormaster build via herald

Curveball:

  • We have two "lease working copy" steps inside the build plan

We have encountered the scenario where enough revisions were created/updated in a short amount of time that most of them succeeded in obtaining one lease, then our resource limit was hit, and the builds essentially deadlocked eachother because they could never finish without getting a second lease, which all of their siblings were holding, waiting for leases as well.

I'm not quite sure how to get around this, other than just running two separate build plans, but for a number of reasons that will be difficult for us to accomplish short term.

Ideas?

See some prior discussion in T10081, which I'm going to merge here.

In T10081, the scenario was that plans needed one resource of type A and one resource of type B, but acquired them in random order. Serializing the steps so that A was always acquired first could potentially fix the problem. Is that's what's happening here, or are you acquiring two homogenous resources (i.e., both of type A)?

I think the best pathway forward here in the general case is probably to introduce a "lock" resource which acquires instantly and just provides limits, so you'd provide half as many lock slots as you have machines if each build needs two machines. Then your plan would look like:

  • Acquire lock resource.
  • (Depends on Lock) Acquire resource of type A.
  • (Depends on Lock) Acquire resource of type B (or another A).

You'd have to update the number of lock slots when you changed the size of the pool, so maybe slightly better is to let the lock slots automatically shadow the pool size and acquire multiple slots instead:

  • Acquire 2 shadow-locks on pool A.
  • (Depends on shadow locks) Acquire resource of type A.
  • (Depends on shadow locks) Acquire another resource of type A.

On the other hand, I can see non-shadow lock resources being useful to impose other types of limits on pools (e.g., to let a pool be used for both higher-priority builds and lower-priority builds, but imposing an artificial lock limit on the low-priority builds that is intentionally much smaller than the pool size). So maybe we do both, or start with the explicit ones and try to upgrade them to automatic if managing the slot count is a pain.

Does something in this vein generally sound reasonable? Do you anticipate any problems declaring maximum limits for what sort of resources you'll need up front at the top of a build plan?

We could also try to detect and resolve deadlocks, and this might be worth doing, but I think the long-term the primary solution here is locks at the start, not recovery after already doing a lot of work. No matter how good our detection/restart code is, I think we're not in a good place if we already spent 15 minutes building resources and then realize we have to give them all up because the build tier deadlocked.

The only real problem I see with this is that in the future when build plans become more complex you may not really know how many resources you will need upfront, so you'll have to lock 30 machines or something even when you usually only use 12. But I think that's actually okay, and we could give you some tools later to partially release shadow-locks as you get more information, and presumably we have some time before the resource allocation strategies for build plans become too large and complex to reason about.

I believe a possible very-short-term tactic here is to split your build pool into two pools (artificially, resources of type "A" and resources of type "B", even if they are indistinguishable) and then serialize their acquisition. This will slow things down but should prevent deadlocks. But this is also clumsy and onerous and will quickly become unreasonable.

I think providing a formal lock resource isn't too difficult, but this is adjacent to other work which I'd like to schedule at the same time (e.g., T8153). How much pain is this causing?

Actually, in this specific case (assuming two homongenous resources), would adding an explicit Release Lease step fix this problem? Presumably the builds that got their first resources are just hanging after using it. This doesn't fix the "A, B" problem but should fix the "A, A" problem.

Upshot:

  • Do you anticipate any problems with trying to solve this through a "before doing anything, lock 2 of X slots" primitive, where the lock is an artificial constraint you size separately from the pool?
  • How much pain is this causing?
  • Would an explicit way to "Release Lease" hypothetically fix this, given your understanding of the problem?
epriestley renamed this task from Improve harbormaster / drydock behavior in resource constrained environments to Provide better tools for preventing deadlocks in build plans which acquire multiple resources.
epriestley added subscribers: joshuaspence, hach-que.

You don't really need an explicit release lease to solve this problem. Nor do you need an explicit locking system for resources. You can determine both of these from the harbormaster step dependency tree.

Releasing is easy: when all of the steps that have the resource as an input finish, release the working copy artifact then instead of at the end of the build.

Allocation is harder: you need to compute the potential hard overlap in build steps for resource artifacts. An example is build step C uses resource D and depends on build step B, and build step B uses resource A. There are build steps after both B and C that continue to use the resources. In this case you need to couple the lease acquisitions together - Either you acquire a lease on both B and C, or you acquire a lease on neither (and continue retrying until you can).

You can determine both of these from the harbormaster step dependency tree.

Today, yes, but not if the tree includes custom build steps which allocate resources or if we add "if <condition>, run <another build plan>" or "foreach <item> in <artifact>, run <another build plan with 'item' as an input>". I think these capabilities are probably more interesting than being able to automatically count how many resources a build needs is.

We could do this when a plan is predictable, although an automatic lock acquisition scheme could fail in cases where an explicit one would succeed: for example, when a plan is edited, we may change discovered lock order from "A, B" to "B, A", and builds on the new version of the plan could deadlock with currently-executing builds. If both versions of the plan acquire locks explicitly builds will be able to survive edits.

Releasing is easy: when all of the steps that have the resource as an input finish, release the working copy artifact then instead of at the end of the build.

I think this is right, though. I could imagine bizarre cases where this isn't the right rule (for example, steps that use released resources to, say, log metrics), but can't come up with any reasonable ones, and we could add a dummy "Hold This Resource" or an explicit "Release This Resource" step later to let you shape behavior if we did hit them.

This does mean that future foreach/if/run another build should take inputs explicitly (not just inherit the parent environment automatically), but I think that's likely desirable anyway.

We could still hold a resource too long when running another plan with a contained if, but this strikes me as a very-distant-future edge case.

It's causing a lot of pain mostly because we don't have good visibility into it happening, and we don't have any good way to react to it once we get close to fully saturated.

I think that providing better tools for understanding utilization or warning when things are going awry would be nice, but even if we knew it was going to happen, how would we react? I suppose we could go manually kill a bunch of builds, but that seems almost worse to me than just letting things slowly proceed, rather than interrupt whoever issued the build.

An actual deadlock is very rare, and has only happened once, usually this just manifests itself by having hundreds of worker tasks queue up and then zerg rush for a lease every 15 seconds. They keep piling up as long as demand stays beyond supply.

Anyways, any one of the methods suggested here for freeing the lease sounds like it would solve the problem. If that's easy enough, awesome.

There is still a "fairness" problem here, and maybe this is an assumption that people shouldn't make, but it seemed a natural one to me: everyone assumes that if you issue a build, and then issue another build, the first build will get priority when acquiring leases or locks. I think this is not entirely ridiculous, especially when you consider many very long builds competing for leases. It's a source of frustration that you can submit a build which should take ten minutes to run, ten people submit builds after you, and their builds just happen to snap up the leases before yours does, so it takes your build 1.5 hours to finish.

Do you think it's possible to get some kind of queueing or prioritization scheme in place, whereby we could dole out resources that builds need based on some kind of ranking system, or even just always preferring older builds?

The expectation is that allocation should generally be fair ("first come, first serve") when no allocations fail.

Currently, all bets are off once an allocation fails. If we don't give up our place in line and someone asks for an impossible resource, we might completely halt the queue forever. We could distinguish between failures more granularly and potentially improve this, but the risk of trying to be more fair is that a bad request stops forward motion completely as we try to satisfy it forever.

Are you seeing behavior which differs from this (i.e., unfair allocation prior to resource exhaustion)?

Being completely fair also means we must execute without parallelism, because if we start position 2 without waiting for position 1 to finish it might win, and this could keep happening forever.

Some of this is just that the design optimizes for a surplus of resources. It sucks if you have to wait 1.5 hours for your build whether you're first in line or not, and the cost of machines is theoretically small compared to the cost of developer time: it's obviously better for everyone if the pool has a resource surplus and we don't have to solve these problems.

I'm not opposed to improving these behaviors, they've just received less focus than the surplus case because all outcomes suck when there's a resource deficit.

Generally, I think resource autorelease (using @hach-que's rule) is probably the shortest pathway forward here. You'll still have a fairness problem, but you should get better utilization from the pool and be deadlock-proof on "A", "A" type acquisitions. This also tends to trigger less adjacent work than new resource types would.

Being completely fair also means we must execute without parallelism

To clarify my comment, we would only need to acquire resources without parallelism. This is probably an average of a few seconds lost per acquisition and not a big hit relative to numerous-minutes-long builds, but not ideal in the resource surplus / short build time case.

You can determine both of these from the harbormaster step dependency tree.

Today, yes, but not if the tree includes custom build steps which allocate resources or if we add "if <condition>, run <another build plan>" or "foreach <item> in <artifact>, run <another build plan with 'item' as an input>". I think these capabilities are probably more interesting than being able to automatically count how many resources a build needs is.

We could do this when a plan is predictable, although an automatic lock acquisition scheme could fail in cases where an explicit one would succeed: for example, when a plan is edited, we may change discovered lock order from "A, B" to "B, A", and builds on the new version of the plan could deadlock with currently-executing builds. If both versions of the plan acquire locks explicitly builds will be able to survive edits.

I think for most of these cases we can also either predict, or preemptively hold a lease (without allocating it).

In the case of foreach, I would expect resources allocated within the foreach to be scoped to that foreach - I don't think it makes much sense to allow multiple resources with the same name to be allocated here. For deadlock detection inside a foreach, we can apply the same technique, but scoped to just the scope of the foreach (without worrying about resources allocated inside a foreach being used beyond that scope because we disallow it).

In the case of run another build plan, we're waiting until that build plan completes anyway, so we can still determine whether or not there's a hard resource dependency because we treat the resource as being used for the entirety of the other build plan running (within our local deadlock detection).

In the case of if we can do one of two things. We could either scope resources (allocations inside an if are only available inside an if), which would make it consistent with foreach. However I can see a scenario where you might want to allocate a resource differently based on a condition. In this case we could introduce some sort of "preemptive lease" - a lease which counts as allocated for the purpose of resource and lease limits, but which doesn't actually consume physical resources on whatever resource it's associated with. For a preemptive working copy lease, this would involve going through the allocation of limits, but not actually performing git clone or any of the other working copy commands until the preemptive lease is upgraded to a full lease. Then, we acquire a preemptive lease for the potential resources (only if there's potentially a hard resource dependency) alongside whatever other resources need to be obtained, and the build can either upgrade it to a full lease if the actual build step is hit, or release it if the build step won't be hit (i.e. it resides in the block that doesn't get executed). Hopefully that makes sense?

On the other hand, I can see non-shadow lock resources being useful to impose other types of limits on pools (e.g., to let a pool be used for both higher-priority builds and lower-priority builds, but imposing an artificial lock limit on the low-priority builds that is intentionally much smaller than the pool size). So maybe we do both, or start with the explicit ones and try to upgrade them to automatic if managing the slot count is a pain.

Separate to the deadlock detection mechanism, it would be greatly useful to us to be able to associate priorities with working copies on constrained resources. For example, at the moment we don't have resource usage limits set because when we do a rollback of production, we don't want to be blocked by commit / diff builds that are running on the same resource. If we could set a priority so that pending commit / diff builds don't get a lease until a pending rollback has acquire and released leases, that would assist us in better managing Drydock's usage of resources.

eadler added a project: Restricted Project.Jun 16 2016, 6:26 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jun 16 2016, 6:28 PM

After D16145, Harbormaster should immediately release resources which no remaining (running or queued) build step takes as an input.

This should make deadlocks impossible with homogenous resources.

I suspect it may improve pool utilization substantially too, particularly as you reach pool saturation, although it's hard to predict how much of an effect it will have.

eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jul 4 2016, 9:05 PM
Alexmoon2 moved this task from Backlog to Far Future on the Drydock board.Jan 17 2017, 8:56 AM
epriestley moved this task from Far Future to Backlog on the Drydock board.Jan 17 2017, 3:29 PM