Page MenuHomePhabricator

Give Drydock leases an explicit ownership PHID
Closed, ResolvedPublic

Description

We're currently running into the following scenario with Drydock:

  • A build starts and grabs a lease, which allocates an EC2 instance.
  • At some point, someone restarts the daemons while the builds are still running.
  • The lease never gets released because the daemon that was using it is terminated.

There is currently no way to track a lease back to the thing that obtained it, and no way to determine whether or not that thing still requires the lease.

I'm proposing that we add an ownershipPHID to leases, where the target object implements LeaseOwnerInterface or something of that nature. This provides a method stillRequiresLease which is used to evaluate whether or not a lease is still required, based on the status of the object.

In the above scenario, the build would be the owner; and the build would be in a failed status, so it knows that it can release the lease (and we perhaps just run this logic inside the GC daemon?)

Event Timeline

hach-que raised the priority of this task from to Needs Triage.
hach-que updated the task description. (Show Details)
hach-que added a project: Drydock.
hach-que added subscribers: hach-que, epriestley.

I think there are two components to this:

First, figuring out what a lease is connected to. Leases have an ownerPHID since D1454, but it may not be getting populated very consistently. It should be consistently populated. I think this covers the general "why does this thing exist?" need.

In the second case, I plan to make Drydock work like the worker queue does: leases have a time limit, and expire automatically if not extended.

I'd broadly expect a Harbormaster build which needs a lease to work like this when it starts up:

  1. Check if it already has a valid lease stored somewhere.
  2. If not, build one, then cache it.

This gets us the right behavior on all sorts of restarts / failures / etc. There's no easy way for workers to be internally stateful right now but that's easy to add if it doesn't make sense to do it as part of build targets.

That is, if a lease owner truly vanishes, the cleanup mechanism is that the lease duration expires and it gets GC'd eventually. Then we never have to reach back up into applications to see if they still need leases, which is a question they probably can't easily answer most of the time anyway (specifically, I believe the answer is always "yes" if they haven't explicitly released the lease and are implemented correctly/consistently).

I'll leave this open until T9123 is closer to resolving and there's some implemented behavior here, but I don't think this needs anything new per se.

epriestley claimed this task.

I think everything is setting ownerPHID now.

Expiry behaviors are more specifically covered in T6569 so I'm just going to close this and let that task cover them.