
Improve detection and recovery when resources are mangled outside of Drydock's control
Open, Needs Triage, Public

Description

So I mentioned this on IRC, and I need some feedback on how @epriestley thinks this should work.

Basically at the moment Drydock assumes that resources do not close or go away unless it requests that they do so. In the real world this is not the case; spot instances on AWS can go away at any time, and more rarely, AWS can shut down or terminate machines for maintenance or hardware issues.

This scenario is also relevant for resources which depend on other resources (such as the working copy blueprint). In this case the host resource might be closed for any reason, and the working copy resource would remain open.

When this happens (and the underlying resource goes away), Drydock continues to try to lease against it, even though all future leases fail. We need some way for Drydock to recognise that the underlying resource has disappeared and to close the resource as a result.

I thought of basically checking the resource status inside lease acquisition and allowing lease acquisition to throw a ResourceGoneException or something of that nature. When this occurs, Drydock would catch the exception in the allocator worker, close the resource (breaking all of the leases open on it) and re-run the allocation operation for that lease (basically re-inserting a new task in the queue so it is reprocessed).

To me this solution seems roughly okay: it will be able to handle any type of resource and should be somewhat easy to make race-free. In addition, breaking the leases allows other systems such as Harbormaster to detect whether a resource has gone away mid-build and, depending on the settings of "Lease Host", automatically restart a build if that scenario occurs.
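As a rough illustration of the proposed flow (a minimal Python sketch with hypothetical names; Drydock itself is PHP and its real worker and blueprint classes differ), the allocator would catch the exception, break the open leases, close the resource, and re-queue the allocation:

  class ResourceGoneException(Exception):
      """Raised during lease acquisition when the backing resource no longer exists."""

  def process_lease_allocation(queue, blueprint, resource, lease):
      # Hypothetical allocator-worker logic illustrating the proposal above;
      # these are not real Drydock classes or method names.
      try:
          blueprint.acquire_lease(resource, lease)
      except ResourceGoneException:
          # The underlying resource vanished outside of Drydock's control.
          for open_lease in resource.open_leases():
              open_lease.mark_broken()           # lets Harbormaster etc. notice mid-build
          resource.close()
          queue.push(('allocate-lease', lease))  # reprocess this lease from scratch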

@epriestley, thoughts?

Event Timeline

hach-que raised the priority of this task from to Needs Triage.
hach-que updated the task description.
hach-que added a project: Drydock.
hach-que added subscribers: hach-que, epriestley.

First thought is to do this:

  • When something fails, mark resources it used as suspicious.
  • Don't lease resources flagged as suspicious.
  • Queue health check for those resources.
  • If they pass health checks, clear the flag (i.e., presume the error was with the task).
  • If they fail, recycle them.

Basically, check-after-failure instead of check-before-use. This should let us do fewer checks overall, which is probably good if checks aren't trivially cheap (which they probably aren't in many cases).
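A minimal sketch of check-after-failure, again with hypothetical names rather than Drydock's actual API:

  def handle_task_failure(resource, health_check_queue):
      # A task failed while using this resource: don't trust it until checked.
      resource.flags.add('suspicious')
      health_check_queue.push(resource)

  def is_leasable(resource):
      # The allocator skips suspicious resources entirely.
      return 'suspicious' not in resource.flags

  def run_health_check(resource):
      if resource.passes_health_check():
          # The resource is fine; presume the error was with the task.
          resource.flags.discard('suspicious')
      else:
          # The resource really is broken; tear it down and re-allocate.
          resource.recycle()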

That sounds similar to the solution I proposed, but I'm not entirely sure that will work due to the health check running as a queued task (and preemptively preventing leases against suspicious resources I guess). I'm particularly concerned about this scenario:

  • A blueprint allocates some resource of which there can be only 1 (it holds some external exclusive lock).
  • The resource gets into an undesired state which prevents leases from being obtained against it.
  • A lease comes through, attempts to allocate against the resource, fails, and the resource gets marked as suspicious.
  • A health check is queued for the resource, but is not yet processed.
  • Another lease comes through.

The question is: what happens now?

The blueprint cannot allocate another resource; one already exists. The lease cannot be made against a suspicious resource, because we know it will fail. So the only option is for that lease, and any future leases, to fail immediately until the health check cleans up the bad resource.

This issue comes about because the health check is queued as another task and run as a background / parallel process, rather than being performed immediately after the failure during allocation. By performing it inside the allocation task, we can correctly retry any lease operations that are pending against that resource, which should make Drydock far more fault tolerant than if we just make leasing fail.
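The alternative argued for here, sketched with the same caveats (hypothetical Python names, illustrative only): perform the health check inside the allocation task itself, so a pending lease can be retried immediately instead of failing until a queued check runs:

  class LeaseAcquisitionFailed(Exception):
      pass

  def acquire_with_inline_check(queue, blueprint, resource, lease):
      try:
          blueprint.acquire_lease(resource, lease)
      except LeaseAcquisitionFailed:
          if resource.passes_health_check():
              raise                              # the resource is fine; this lease just failed
          # The resource is bad: close it now, inside the allocation task,
          # and re-queue the lease so it retries against a fresh resource.
          resource.close()
          queue.push(('allocate-lease', lease))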

That situation is unavoidable. Here's a way we can hit it with pre-checks:

  • Resource fails a health check; we tear it down and begin allocating a new one. This takes a few minutes.
  • A new lease comes through while the resource is allocating.

There's no meaningful difference between "waiting for an allocation" and "waiting for a health check". Both may depend on queued tasks executing. The new lease has the same options in either case: fail permanently, fail temporarily, block and wait. "Fail temporarily" is probably correct in most situations. "Block and wait" is probably never correct.
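"Fail temporarily" could amount to re-queueing the lease task with a delay rather than blocking a taskmaster or failing permanently; a sketch with hypothetical names:

  def process_lease(queue, lease, resource):
      if resource.status in ('pending', 'awaiting-health-check'):
          # Fail temporarily: come back later instead of blocking or giving up.
          queue.push(('lease', lease), delay_seconds=15)
      elif resource.status == 'open':
          resource.attach(lease)
      else:
          # The resource was closed or destroyed: fail permanently.
          lease.mark_broken()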

"Block and wait" is what currently happens when you have more leases come through than maximum resources available (actually for blueprints that implement min-max-expiry it will optimize to only allocate as many resources as is really needed).

What this means is that if you have a blueprint with an ideal number of 5 leases per resource, and 5 leases are put through, Drydock will allocate one resource on the first lease that comes through, and the other 4 will be leased against the pending resource (they will block and wait until the resource is open, or throw an exception if it gets closed).

So "block and wait" is already the current and correct behavior for that scenario. Otherwise you would allocate 5 AWS EC2 instances when you really only need 1 (and overpaying for resources).

"Block and wait" can deadlock the queue and we're going to have to stop doing that if we're doing it anywhere.

I don't believe there's a deadlock here. What happens is:

  • The first lease task is picked up in the queue, and starts allocating the resource (with it in pending state).
  • The next lease tasks that are processed by the Drydock worker are leased against the still-pending resource. These workers wait until the resource moves out of the pending status.

One of two things can happen here: either the first lease worker succeeds and moves the Drydock resource into Open, or it fails and moves it into some other status. In either case, all of the other lease tasks continue at this point (throwing an exception if the status is not Open).

Because processing never leaves the Drydock allocation worker, it's not possible for some part of the resource allocation to go back into the queue and deadlock; at all times the taskmasters remain processing the Drydock task, and only after the resource allocation either succeeds or fails do they start processing other tasks (i.e. they never yield).
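The no-deadlock argument, as a sketch (hypothetical names): the waiting workers simply poll in-process until the first worker moves the resource out of pending, at the cost of tying up a taskmaster while they wait:

  import time

  def wait_for_resource_to_open(resource, poll_seconds=5):
      # Nothing is pushed back into the queue while waiting, so there is no
      # task-level deadlock; the taskmaster just stays busy on this lease.
      while resource.status == 'pending':
          time.sleep(poll_seconds)
      if resource.status != 'open':
          raise Exception('Resource failed to open (status: %s)' % resource.status)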

I think I'm not explicitly going to include anything to deal with this in v1 since the v1 resource types "probably won't" break for no reason, but I think we likely do need to have more formal tools for it soon. I roughly imagine:

  • v1: No real support. If you hit this case today, you can send the resource a "release" command and then schedule an update for it. This will destroy it, and your process should recover and succeed overall; it's just not formal/clean/default/nice.
  • v2: Do this formally, and probably do it by default? That is, assume errors taint everything they touched unless we're told otherwise? Maybe this is overkill and explicit tainting is fine. Recovery workflow is still destroy/re-create.
  • vFuture: New state to triage/recover the resource instead of just throwing it away? Not sure if this is necessary. Might make sense if we have expensive, fragile resources later that also have some kind of cheap health check / recovery process available.
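One way to read the v2 item above, sketched with hypothetical names: assume a failed command tainted whatever resource it touched, and schedule that resource for destroy/re-create unless the command explicitly says otherwise:

  def run_command_on_resource(command, resource):
      try:
          command.run()
      except Exception:
          # By default, assume the failure may have damaged the resource.
          if not getattr(command, 'leaves_resource_clean', False):
              resource.schedule_destroy_and_recreate()
          raise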

A concrete case where we can hit this in v1 is this:

  • Write a "unit test" which removes the .git/ directory from the working copy where it is run.

When run, this "unit test" will permanently mangle the WorkingCopy resource where the test runs, and all future leases against the resource will fail (since they won't be able to git checkout, etc.). Currently, the resource will require manual intervention to fix and will break every new lease anything tries to acquire. So this is a bad state, but with the current upstream resource types it's fairly hard to get into.

epriestley renamed this task from "Discussion: How should Drydock handle resources that were closed outside of it's control?" to "Improve detection and recovery when resource are mangled outside of Drydock's control". Oct 1 2015, 1:37 PM
epriestley moved this task from Backlog to v2 on the Drydock board.
epriestley renamed this task from "Improve detection and recovery when resource are mangled outside of Drydock's control" to "Improve detection and recovery when resources are mangled outside of Drydock's control". Apr 10 2016, 10:07 PM

A specific subcase here is when the binding to an Almanac host has been disabled. We should possibly test this during Interface construction, treat it as a failure, then recover from it.
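A sketch of that subcase (hypothetical names; the real check would live wherever the host interface is constructed): treat a disabled Almanac binding like a vanished resource, so the normal recovery path kicks in instead of handing out interfaces that can no longer work:

  class ResourceGoneException(Exception):
      pass

  def new_host_interface(binding):
      if binding.is_disabled():
          # Treat a disabled Almanac binding as a failure of the resource
          # itself, so it gets released and recovered.
          raise ResourceGoneException(
              'Almanac binding %s is disabled' % binding.get_id())
      return binding.new_interface()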

I expect changes connected to T12145 to stop new leases from being handed out on these hosts, but existing long-lived leases may persist.