Page MenuHomePhabricator

[harbormaster/core] Fix issue where exception during artifact save would result in orphaned lease
AbandonedPublic

Authored by hach-que on Jul 29 2015, 1:02 AM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Apr 25, 2:31 AM
Unknown Object (File)
Mon, Apr 22, 10:38 AM
Unknown Object (File)
Mon, Apr 22, 10:38 AM
Unknown Object (File)
Mon, Apr 22, 10:38 AM
Unknown Object (File)
Mon, Apr 22, 10:38 AM
Unknown Object (File)
Mon, Apr 22, 10:37 AM
Unknown Object (File)
Mon, Apr 22, 8:33 AM
Unknown Object (File)
Thu, Apr 4, 8:27 PM
Subscribers

Details

Reviewers
epriestley
Group Reviewers
Blessed Reviewers
Summary

If an exception occurs when saving the artifact (e.g. due to a duplicate key exception), then we need to release the lease otherwise it will be orphaned and hold resources open.

This fixes it so that the lease is released if needed.

Test Plan

Not tested.

Diff Detail

Event Timeline

hach-que retitled this revision from to [harbormaster/core] Fix issue where exception during artifact save would result in orphaned lease.
hach-que updated this object.
hach-que edited the test plan for this revision. (Show Details)
hach-que added a reviewer: epriestley.
hach-que edited edge metadata.

Move waitUntilActive outside try block to avoid masking the exception

epriestley edited edge metadata.

Not tested.


This pattern is inherently race-prone and unsafe. Another process can load the object between reload() and save(), then either our changes or the changes made by that process are lost.

setBrokenReason() isn't in HEAD and this use is upstream-controlled but untranslatable.

Resources and leases should tend to self-heal, and errors other than duplicate names should be sufficiently rare that we probably don't need to special case this, and duplicate names should theoretically only occur through error/misconfiguration.

(I'm not sure duplicate names should even be an error.)

We could provide a race-safe way to quick-release a resource, or build artifacts in multiple phases, but this particular case seems not worth resolving until other failure modes are more clearly understood. If this is the only case of this, I'd lean toward not handling this specially.

This revision now requires changes to proceed.Aug 24 2015, 6:43 PM

The exception I was seeing here was duplicate artifact keys. When an artifact happened to have a hash collision, the creation of the artifact would fail and the lease would be left open (because even though the build fails afterwards, the artifact isn't registered).

This would result in Drydock resources being kept open by leases that were no longer being used.

This would result in Drydock resources being kept open by leases that were no longer being used.

I think this is fine -- Drydock needs to be able to recover from this anyway, since the process holding the lease may also be struck by lightning. If this is rare and mostly the result of human error, it seems fine to leave leases around until they naturally expire.

As written, this patch can't fix this problem on its own -- even ignoring race issues -- because the reload() could throw and then we're right back to where we were before. So we either need to make the current behavior recoverable or make this patch way more complex anyway.

Leases don't naturally expire though and they'll be held open forever otherwise. Whenever artifact hash collisions occur (we see them pretty regularly at work), people would then need to remember to release the leases.

We have a problem right now: unused open leases don't expire. This is a general problem with Drydock leases as a whole, and affects all use cases, because pretty much anything can allocate a lease and then get hit by lightning or otherwise be unable to continue.

This change (a) does not prevent the problem in general, and does not prevent it even in this use case; (b) introduces a race on the lease which may introduce new -- and potentially more severe -- problems; and (c) would very likely be completely unnecessary if we fixed the problem.

Instead of doing this, I think we should fix the lease problem.

I'm pretty sure leases have an expiry field in the new Drydock now, even if they aren't used.