Deadlock when a build plan leases multiple working copies
Closed, Duplicate (Public)

Description

It's possible to cause a deadlock in builds when build plans lease multiple working copies and the Drydock blueprints for those working copies have limits set on them. I think this is a different Drydock issue from the ones I've previously reported, but if not, just close it.

Replication Steps:

  1. Configure a first working copy blueprint with limit 5.
  2. Configure a second working copy blueprint with limit 5.
  3. Configure a build plan that does the following:
    • Lease from first working copy
    • Lease from second working copy
    • Perform some operation on the first working copy (or alternatively just sleep?)
    • Perform some operation on the second working copy
  4. Run lots of builds using this build plan in parallel. (I encountered this issue when running the same build plan on HEAD of lots of repositories after pushing to all of them at the same time. Since there's no "Wait for Previous Commit" here, it should also be possible to hit it just by triggering builds across a bunch of commits in the same repository.)

Expected Behaviour:

When no further build steps use a working copy artifact, it should be released at that point, rather than when the build passes / fails.

Actual Behaviour:

Some set of builds will succeed in leasing the first working copy while still waiting to obtain the second. Even when the operations on the first working copy are complete (all the run command steps that use it), those leases are held open until the builds complete, but the builds are blocked on obtaining a second working copy. Meanwhile, another set of builds has successfully leased the second working copy but is waiting to lease the first. Even when those builds are done with the second working copy, they won't release that lease until they pass / fail, which won't happen until they obtain a lease on the first working copy.
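
The difference between the actual and expected behaviour can be illustrated with a small simulation. This is only a sketch, not Drydock or Harbormaster code: the `simulate` function, the round-based scheduling, and the rule that each build gets one lease request served per round are all assumptions made to reproduce the interleaving described above.

```python
def simulate(request_orders, limits, release_early):
    """Hypothetical model: returns the indices of builds that never finish.

    request_orders[i] is the order in which build i's lease requests
    happen to reach the allocator (an assumption of this sketch).
    """
    free = dict(limits)                          # free slots per blueprint
    held = [set() for _ in request_orders]       # leases each build holds
    worked = [set() for _ in request_orders]     # working copies already used
    done = [False] * len(request_orders)

    progress = True
    while progress:
        progress = False
        # Each unfinished build gets at most one lease request served per round.
        for i, order in enumerate(request_orders):
            if done[i]:
                continue
            for bp in order:
                if bp in held[i] or bp in worked[i]:
                    continue
                if free[bp] > 0:
                    free[bp] -= 1
                    held[i].add(bp)
                    progress = True
                break
        # Each build runs its (independent) operation on every copy it holds.
        for i, order in enumerate(request_orders):
            if done[i]:
                continue
            for bp in list(held[i]):
                if bp not in worked[i]:
                    worked[i].add(bp)
                    progress = True
                    if release_early:            # expected behaviour
                        held[i].discard(bp)
                        free[bp] += 1
            if len(worked[i]) == len(order):
                for bp in held[i]:               # actual behaviour: release at end
                    free[bp] += 1
                held[i].clear()
                done[i] = True
                progress = True
    return [i for i, d in enumerate(done) if not d]


# Ten parallel builds: five reach "first" first, five reach "second" first.
orders = [("first", "second")] * 5 + [("second", "first")] * 5
limits = {"first": 5, "second": 5}
stuck_hold = simulate(orders, limits, release_early=False)
stuck_early = simulate(orders, limits, release_early=True)
print("held until build ends:", stuck_hold)
print("released after last use:", stuck_early)
```

In this model every build stalls when leases are held until the build ends, and none stall when a lease is released as soon as its working copy is no longer needed.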

Event Timeline

I think it might be possible to cause a deadlock with limited leases even if the Expected Behaviour is achieved, approximately by devising a build plan like this:

  1. Lease from first working copy
  2. Lease from second working copy
  3. Run operation on first working copy (but add a "Depends On" entry to also depend on the "Lease from second working copy" build step to complete)

However, this build plan doesn't make sense, because if you want the build plan to be sequential across working copies, you should wait until all the operations on the first working copy are complete before leasing from the second working copy. This is unlike the scenario in the bug report, where the operations on the working copies are independent of one another.
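
The dependent plan above can be sketched the same way. Again, this is a hypothetical model rather than Drydock code: `count_stuck` and its round-based scheduling are assumptions. Here the operation can only run once a build holds both leases, so releasing a lease after its last use cannot help, because no lease is ever "finished with" while the build is waiting.

```python
def count_stuck(request_orders, limit):
    """Hypothetical model: number of builds that never get both leases."""
    free = {"first": limit, "second": limit}
    held = [set() for _ in request_orders]
    done = [False] * len(request_orders)

    progress = True
    while progress:
        progress = False
        # Each unfinished build gets at most one lease request served per round.
        for i, order in enumerate(request_orders):
            if done[i]:
                continue
            for bp in order:
                if bp in held[i]:
                    continue
                if free[bp] > 0:
                    free[bp] -= 1
                    held[i].add(bp)
                    progress = True
                break
        # The operation depends on both lease steps, so it needs both leases.
        for i, order in enumerate(request_orders):
            if not done[i] and held[i] == set(order):
                for bp in held[i]:
                    free[bp] += 1
                held[i].clear()
                done[i] = True
                progress = True
    return sum(1 for d in done if not d)


# Same interleaving as in the report: ten builds, split across both orders.
orders = [("first", "second")] * 5 + [("second", "first")] * 5
stuck_at_5 = count_stuck(orders, limit=5)
stuck_at_10 = count_stuck(orders, limit=10)
print(stuck_at_5, stuck_at_10)
```

With a limit of 5 all ten builds in this model wait forever, while a limit large enough to cover every parallel build lets them all finish.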

I managed to temporarily work around this issue by upping the limits on the blueprints, but obviously that is not ideal if the machines can't sustain that number of working copies in parallel.

epriestley triaged this task as Wishlist priority. Jan 3 2016, 8:03 PM
epriestley edited projects, added Feature Request; removed Bug Report.
epriestley added a subscriber: epriestley.

It's not a bug that you can write plans which deadlock.