So I mentioned this on IRC, and I need some feedback on how @epriestley thinks this should work.
Basically at the moment Drydock assumes that resources do not close or go away unless it requests that they do so. In the real world this is not the case; spot instances on AWS can go away at any time, and more rarely, AWS can shut down or terminate machines for maintenance or hardware issues.
This scenario is also relevant for resources which depend on other resources (such as the working copy blueprint). In this case the host resource might be closed for any reason, and the working copy resource would remain open.
When this happens (and the underlying resource goes away), Drydock continues to try to lease against it, even though all future leases fail. We need some way for Drydock to recognise that the underlying resource has disappeared and close the resource as a result.
My idea is to check the resource status inside lease acquisition and allow lease acquisition to throw a ResourceGoneException or something of that nature. When this occurs, Drydock would catch the exception in the allocator worker, close the resource (breaking all of the leases open on it) and re-run the allocation operation for that lease (basically inserting a new task in the queue so it is reprocessed).
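As a rough sketch of the control flow I have in mind (in Python rather than PHP, purely for illustration; every class, method and status name here is made up for the example and none of it is Drydock's actual API):

```python
from collections import deque


class ResourceGoneException(Exception):
    """Hypothetical: raised when the underlying resource has disappeared."""


class Lease:
    def __init__(self, name):
        self.name = name
        self.status = "pending"
        self.resource = None


class Resource:
    def __init__(self, name):
        self.name = name
        self.alive = True       # stand-in for "does the real host still exist?"
        self.status = "open"
        self.leases = []

    def acquire(self, lease):
        # Check the real state of the resource as part of lease acquisition.
        if not self.alive:
            raise ResourceGoneException(self.name)
        lease.status = "active"
        lease.resource = self
        self.leases.append(lease)

    def close(self):
        # Closing the resource breaks every lease still open on it.
        self.status = "closed"
        for lease in self.leases:
            lease.status = "broken"
        self.leases = []


def allocate(lease, resources, queue):
    """One run of the (hypothetical) allocator worker for a single lease."""
    for resource in resources:
        if resource.status != "open":
            continue
        try:
            resource.acquire(lease)
            return True
        except ResourceGoneException:
            # The resource vanished underneath us: close it, then put the
            # lease back on the queue so allocation is re-run from scratch.
            resource.close()
            queue.append(lease)
            return False
    # A real allocator would try to build a new resource here.
    return False


# Demo: a spot instance disappears while it still holds a lease.
gone = Resource("spot-instance")
healthy = Resource("on-demand-host")
existing = Lease("existing")
gone.acquire(existing)
gone.alive = False              # AWS terminates the instance

new = Lease("new")
queue = deque([new])
while queue:
    allocate(queue.popleft(), [gone, healthy], queue)
```

In the demo, the first attempt hits the dead resource, which gets closed (breaking the existing lease), and the re-queued lease is then acquired on the healthy resource on the second pass.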
To me this solution seems roughly okay: it should be able to handle any type of resource and should be somewhat easy to make race-free. In addition, breaking the leases allows other systems such as Harbormaster to detect when a resource has gone away mid-build and, depending on the settings of "Lease Host", automatically restart the build if that happens.
@epriestley, thoughts?