
Merge conflicts cause resources not to be released
Closed, Resolved · Public

Authored by nickz on Dec 15 2015, 7:12 PM
Referenced Files
F1045154: Screen Shot 2015-12-26 at 12.39.27 PM.png
Dec 26 2015, 8:40 PM
F1045150: dismiss.png
Dec 26 2015, 8:40 PM
F1045152: Screen Shot 2015-12-26 at 12.38.19 PM.png
Dec 26 2015, 8:40 PM
F1042363: pasted_file
Dec 24 2015, 12:58 AM
F1042360: pasted_file
Dec 24 2015, 12:58 AM
F1029825: Screen Shot 2015-12-18 at 6.21.27 AM.png
Dec 18 2015, 2:23 PM
F1027075: pasted_file
Dec 15 2015, 7:12 PM
F1027069: pasted_file
Dec 15 2015, 7:12 PM

Description

This is how I can reproduce the problem:
I make the "Land Revision" fail on purpose, then I go to the corresponding lease and see a Slot Lock is still there:

pasted_file (200×688 px, 24 KB)

Then I resolve the conflicts and click "Land Revision" again; it just keeps running. I go to its lease and see it is "Waiting for Activation" because no resource is available:

pasted_file (693×1 px, 179 KB)

Then I go to the previous locked lease and manually release it:

pasted_file (271×1 px, 50 KB)

Then the waiting lease can get the resource and proceed.

Event Timeline

nickz raised the priority of this task from to Needs Triage.
nickz updated the task description.
nickz changed the edit policy from "All Users" to "Custom Policy".
nickz added a project: Bug Report.
nickz added a subscriber: nickz.

Also, why can't a new lease get another resource when a previous lease holds a resource? I set the working copy limit to 5, so there should be enough resources available to be used.

It looks like it counts the total number of working copies. I thought the limit was per repo: when I set the limit to 5, I expect each repo to have at most 5 copies. This is not scalable: as more and more repos get created, we have to keep adjusting the limit.

Broadly, two issues here:

First, the lease should be getting cleaned up automatically. That looks like a bug; I'll see if I can reproduce it.

Second, the behavior of "limit". "limit" is the total resource limit, not the per-repository limit. I think this is the correct way to define "limit", because you usually want "limit" to be some function of how much hardware you're giving Drydock (e.g., you have two machines and each machine can handle about 32 working copies, so you set the limit to 64). Defined like this, you can choose a "limit" such that you don't run out of resources.
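To make that sizing concrete, here's a minimal back-of-the-envelope sketch in plain PHP, using the hypothetical two-machines-at-32-copies numbers from above (illustration only, not Drydock configuration code):

<?php

// Rough sizing: the total Drydock "limit" as a function of available hardware.
// The numbers are the hypothetical example above, not real configuration.
$hosts           = 2;   // build machines available to Drydock
$copies_per_host = 32;  // working copies each machine can comfortably hold

$limit = $hosts * $copies_per_host;

echo "Set the working copy limit to {$limit}\n"; // 64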

If "limit" was per-repository instead, everything might be OK today but you could have twice as many repositories next year and the pool start failing (e.g., disks full on the hosts).

When too much work is coming in, I want the behavior of Drydock to be to slow down, not fail. When you see that it's starting to slow down or queue up, you can add more hardware and increase the limits at your convenience, and everything will keep working fine until you do (just more slowly than usual during peak hours or whatever). This is a lot more manageable than suddenly hitting the limit and the whole thing not working anymore. Put another way, users should be sending you an email like "Landing stuff is kind of slow in the afternoon when everyone else is landing stuff, can we make that faster?", not "The land button is broken and no one can land anything and engineering is completely at a standstill until this is fixed".

However, because resources are not automatically released/reclaimed yet, we can end up in a deadlocked situation. For example:

  • You set the limit to 5.
  • You run 5 jobs on different repositories, one at a time.
  • A 6th repository cannot run jobs now (a minimal simulation of this scenario is sketched below).
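Here is that simulation in plain PHP (illustration only; the array below just stands in for working copy resources and is not Drydock code):

<?php

// Toy model of the pool: one entry per working copy resource, keyed by the
// repository it was built for. Illustration only, not Drydock internals.
$limit = 5;
$pool  = array();

foreach (array('A', 'B', 'C', 'D', 'E') as $repo) {
  // Five jobs run, one at a time, each against a different repository.
  // The resources stay around after the jobs finish (nothing reclaims them).
  $pool[$repo] = 'working-copy-for-'.$repo;
}

// A job for a 6th repository arrives: the pool is at the limit and none of
// the existing (idle) resources match repository "F", so its lease waits
// forever unless something reclaims one of the idle resources.
$repo = 'F';
if (count($pool) >= $limit && !isset($pool[$repo])) {
  echo "Lease for {$repo} waits: pool is full of idle resources for other repos.\n";
}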

You can sort of work around this by setting the limit to a large number and manually releasing resources until the pool distribution of working copies is about the same as your actual use of working copies. This is essentially what we do in the upstream right now, but we have only a tiny number of repositories and this is generally silly.

(I believe T9875 also describes this issue, although the description there is far more roundabout.)

Here's my plan for dealing with this:

  • Allow resources to be proactively reclaimed by the lease process. If a lease misses because no suitable resources exist and none can currently be built, but suitable resources could be built if not for pool limit issues, allow it to try to reclaim resources by finding existing resources with no leases against them and tearing them down. (A rough sketch of this reclaim step follows this list.)
  • Later, allow resources to be automatically recycled by GC. This would just reclaim resources which are unused for a while, as a general tidiness/housekeeping measure. This is not very important for working copies, but for some future types of resources (like VMs) there may be a dollars-per-hour cost associated with keeping unused hosts up, so it may be reasonable to want a behavior like "if nothing needs a host for 10 minutes, shut it down".
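As a rough illustration of the reclaim step in the first bullet, here's a minimal sketch in plain PHP. The data layout and function name are made up for illustration; the real logic belongs in the Drydock lease update worker and will not look like this:

<?php

// Hypothetical sketch of "reclaim": when a lease misses only because of the
// pool limit, look for a resource with no leases against it and tear it down
// so a suitable new resource can be built.
function reclaim_one_idle_resource(array $resources) {
  foreach ($resources as $resource) {
    // Only resources with no leases against them are candidates.
    if (count($resource['leases']) === 0) {
      echo "Reclaiming idle resource #{$resource['id']}\n";
      return $resource['id'];
    }
  }
  return null; // Everything is in use; the lease keeps waiting.
}

$resources = array(
  array('id' => 41, 'leases' => array('PHID-DRYL-xxxx')), // in use
  array('id' => 42, 'leases' => array()),                 // idle: reclaimable
);
reclaim_one_idle_resource($resources);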

I will just implement reclaiming for now. Once that's stable and we're closer to having resources where recycling matters more, we can implement it as well.

T9875 is different from this task: this one discusses leases being deadlocked, whereas T9875 is about there not being enough storage space on the host to hold copies of all working copies + build artifacts at once.

The recommended solution for T9875 is to set a limit lower than "available disk space / size of largest repository", which reduces it to this deadlock issue.

For example, suppose we have perfect resource reclamation and reuse algorithms which always instantly destroy a resource the moment it is no longer in use.

If you set a limit which allows use of more resources than you actually have (suppose your largest repository + artifacts is 5GB, and you have a 30GB disk, but you set a limit of 10 working copies) then a sudden burst of requests to build artifacts in that repository may exceed the available disk space by requiring 50GB of active, in-use resources at once. These resources are all active and in-use, so some builds will fail despite perfect reclamation algorithms.

Exceeding available resources is a limit problem. Lower the limit to reduce it to a queue throughput problem. The queue currently has a serious throughput problem (this deadlock), but it is expected that you will encounter resource exhaustion problems if you configure limits in excess of your infrastructure's ability to support concurrent jobs.
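Using the numbers from the example above, the arithmetic looks like this (plain PHP, illustration only):

<?php

// "Available disk space / size of largest repository (plus artifacts)", using
// the example numbers above: a 30GB disk and a 5GB largest working copy.
$disk_gb         = 30;
$largest_copy_gb = 5;

$safe_limit = (int)floor($disk_gb / $largest_copy_gb); // 6
$bad_limit  = 10;                                      // the limit from the example

echo "Safe limit: {$safe_limit}\n";
echo "A burst at limit {$bad_limit} can require ".
  ($bad_limit * $largest_copy_gb)."GB of in-use resources on a {$disk_gb}GB disk.\n"; // 50GB > 30GB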

I believe there's another deadlock issue here as well, though. Say 10 builds start with the following plan:

  1. Lease Working Copy
  2. Wait for Previous Commits
  3. Do Some Build Task

In this case, 5 builds could run "Lease Working Copy" and then sit waiting for previous commits. Meanwhile, a build for a commit prior to all 5 of those builds tries to run "Lease Working Copy", but it hangs forever waiting for a resource to become available. However, no resources can be reclaimed because leases are open on all of them.

While the solution here is obviously "make Lease Working Copy depend on Wait for Previous Commits", it's not obvious that this will deadlock, and there's no real way to diagnose the issue other than viewing all of the in-progress builds and realising that it's deadlocked. When you turn off the Drydock resource limit, this scenario will fail implicitly instead, because the builds will run into "out of disk space" errors, which is better than "deadlock forever until someone notices".
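To make that deadlock concrete, here's a toy model of the ordering in plain PHP (illustration only; the builds and pool below are just arrays, not Harbormaster or Drydock objects, and the limit of 5 is an assumption):

<?php

// Toy model: with a pool limit of 5, five of the later builds run step 1
// ("Lease Working Copy") first and then block on "Wait for Previous Commits".
$limit       = 5;
$leases_held = array('build-2', 'build-3', 'build-4', 'build-5', 'build-6');

$earlier_build = 'build-1'; // the build everything above is waiting for

if (count($leases_held) >= $limit) {
  // build-1 cannot lease a working copy, and nothing can be reclaimed
  // because builds 2-6 hold open leases while they wait for build-1.
  echo "{$earlier_build} waits for a resource; builds 2-6 wait for {$earlier_build}. Deadlock.\n";
}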

If you drive without a seatbelt, it will be easier to escape from your vehicle when it becomes submerged in the ocean.

I've reproduced the first issue with the lease breaking but not being released. Here's what I did:

I forced a merge error by applying this patch, so we run git error instead of git merge:

diff --git a/src/applications/drydock/blueprint/DrydockWorkingCopyBlueprintImplementation.php b/src/applications/drydock/blueprint/DrydockWorkingCopyBlueprintImplementation.php
index f39353a..076b1aa 100644
--- a/src/applications/drydock/blueprint/DrydockWorkingCopyBlueprintImplementation.php
+++ b/src/applications/drydock/blueprint/DrydockWorkingCopyBlueprintImplementation.php
@@ -435,7 +435,7 @@ final class DrydockWorkingCopyBlueprintImplementation
     // "--squash", but git sometimes runs code to check that a username and
     // email are configured anyway.
     $real_command = csprintf(
-      'git -c user.name=%s -c user.email=%s merge --no-stat --squash -- %R',
+      'git -c user.name=%s -c user.email=%s error --no-stat --squash -- %R',
       'drydock',
       'drydock@phabricator',
       $src_ref);

I used "Land Revision" to land a revision, then ran the tasks one by one from the CLI with bin/worker execute --trace --id <id>, reading IDs from /daemon/ in the web UI.

After a couple of allocations, the actual merge failed with an error, as expected:

[2015-12-18 06:16:00] EXCEPTION: (PhabricatorWorkerPermanentFailureException) Permanent failure while activating lease ("PHID-DRYL-dngjnuwvdi2xqachtlko"): Command failed with error #1!
COMMAND
ssh '-o' 'LogLevel=quiet' '-o' 'StrictHostKeyChecking=no' '-o' 'UserKnownHostsFile=/dev/null' '-o' 'BatchMode=yes' -l 'xxxxx' -p '22' -i 'xxxxx' '52.8.83.222' -- '(cd '\''/var/drydock/workingcopy-62/repo/git-test/'\'' && git -c user.name='\''drydock'\'' -c user.email='\''drydock@phabricator'\'' error --no-stat --squash -- refs/tags/phabricator/diff/275)'

STDOUT
(empty)

STDERR
git: 'error' is not a git command. See 'git --help'.

Did you mean this?
	rerere
 at [<phabricator>/src/applications/drydock/worker/DrydockLeaseUpdateWorker.php:773]

This resulted in a broken working-copy lease, just like in your screenshot:

Screen Shot 2015-12-18 at 6.21.27 AM.png (439×1 px, 117 KB)

This behavior is expected so far, except that there should also be an update task queued to destroy this lease (which should execute immediately in most cases), and there was not. Next, I'll look into why that task wasn't queued.

> If you drive without a seatbelt, it will be easier to escape from your vehicle when it becomes submerged in the ocean.

This is an analogy that doesn't make sense.

Temporary failures are actionable and loud: you get notified by Harbormaster build failures when things aren't working. These failures are actionable by administrators. If things don't correct themselves, administrators can manually release leases and resources (even though doing so is currently tedious). But most often things do correct themselves in the temporary-failure model, because the resources used by the failed builds would be cleaned up in the perfect-GC mode.

Deadlocks and stalled builds don't notify you at all; everything just sits in a "Building" status and there's no indication that anything has stalled or is not making progress. In my opinion, a deadlock is a worse scenario than a temporary failure.

So you'd suggest that we encourage Drydock users to configure limits in excess of available resources ("not wear a seatbelt") to improve error behavior when they configure a deadlock ("fall in the ocean")? If so, wouldn't it be better to simply remove the limits entirely ("uninstall seatbelts from cars") so that they can't accidentally configure a limit with bad behavior?

No, I'd suggest that Drydock fail (and clean up resources used by failed builds immediately) instead of deadlocking.

How would you propose Drydock detect that it should fail if the situation is a legitimate deadlock that can not be alleviated by cleaning up resources?

That is, suppose repositories take 5GB, and there is 30GB of disk available, and a limit of 6 resources. A user can still push 10 commits all at once, and 6 of the later ones happen to run first and grab all the resources. No matter how good our resource reclamation is, it can not alleviate the deadlock. What behavior would you like Drydock to have in this situation?

I would expect 4 of the builds to fail as they can't allocate more resources.

Alternatively if there was no limit set, I would expect some of the builds to fail when they hit out-of-disk-space and for the working copy resources allocated by those builds to be removed when they fail, so that future builds will pass.

> I would expect 4 of the builds to fail as they can't allocate more resources.

Do you always want leases to fail if they can't immediately allocate resources (basically, maximum queue length is 0)? If not, how does a build decide between failing and queueing when it can't immediately allocate more resources?

In the case of a limit being set, then yes, I expect it to fail immediately (and I expect leases to keep failing until the resource count drops below the limit).

However, in my case I almost certainly won't set a limit; instead I'd just let it fail on out-of-disk-space and have it automatically correct itself as working copies get released / deleted from failed builds.

Okay. You can implement the behavior you want like this:

diff --git a/src/applications/drydock/worker/DrydockLeaseUpdateWorker.php b/src/applications/drydock/worker/DrydockLeaseUpdateWorker.php
index 80b2a6f..a0507c1 100644
--- a/src/applications/drydock/worker/DrydockLeaseUpdateWorker.php
+++ b/src/applications/drydock/worker/DrydockLeaseUpdateWorker.php
@@ -204,6 +204,9 @@ final class DrydockLeaseUpdateWorker extends DrydockWorker {
       }
 
       if (!$resources) {
+        throw new PhabricatorWorkerPermanentFailureException(
+          pht('Declining to queue this lease.'));
+
         throw new PhutilAggregateException(
           pht(
             'All blueprints failed to allocate a suitable new resource when '.

I disagree that the behavior you want is desirable, and do not plan to implement it in the upstream.

epriestley triaged this task as Normal priority.

@nickz I believe both issues are now fixed at HEAD. Let me know if you run into anything else.

This works for me now! Thanks!

Another issue I noticed:
When "Land Revision" failed due to merge conflicts, I resolved the conflicts and then directly run "arc land" instead of "Land Revision", the revision was closed as expected, but the "Land Revision" failure still stayed there:

pasted_file (66×786 px, 17 KB)

If I click "Land Revision", it says the revision is still under review.

pasted_file (158×564 px, 16 KB)

Ideally, once the revision is closed by "arc land", the previous "Land Revision" failure should go away and the "Land Revision" link should not be clickable. What do you think?

I don't want the error messages to just disappear since I worry that might be confusing, but I've added an option to dismiss errors once you're done with them in D14886:

dismiss.png (227×835 px, 22 KB)

I've tailored the error messages for closed and abandoned revisions. They no longer show the misleading "still under review" text. Here's the new error for "closed":

Screen Shot 2015-12-26 at 12.38.19 PM.png (173×577 px, 20 KB)

I also reduced the scariness of the dialog itself. Here's the new one with more toned-down messaging (no more huge yellow "THE WORLD WILL EXPLODE" warnings):

Screen Shot 2015-12-26 at 12.39.27 PM.png (276×620 px, 24 KB)

This looks much better! Thanks!