This implements resource constraints and the correct locking mechanisms (with yielding when lease acquisition would otherwise block) to ensure that lease acquisition and resource allocation in Drydock are race-free.
Details
- Reviewers: hach-que
- Group Reviewers: Blessed Reviewers
- Maniphest Tasks: T2015: Implement Drydock
Tested this by adding sleep(10) to the canAllocateLease method of the preallocated host blueprint (to simulate the acquisition lock being held for a long time). Then I ran:
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do (bin/drydock lease --type host --attributes platform=linux &); done
and saw appropriate output:
june-linux-laptop:~/Projects/Phabricator/phabricator> for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do (bin/drydock lease --type host --attributes platform=linux &); done
june-linux-laptop:~/Projects/Phabricator/phabricator> Task yielded while acquiring 402...
< ... cut output ... >
Task yielded while acquiring 409...
Task yielded while acquiring 410...
Task yielded while acquiring 407...
Task yielded while acquiring 403...
Task yielded while acquiring 407...
Acquired Lease 411
Task yielded while acquiring 409...
Task yielded while acquiring 410...
Task yielded while acquiring 403...
Task yielded while acquiring 407...
Acquired Lease 401
Task yielded while acquiring 403...
Task yielded while acquiring 407...
Task yielded while acquiring 407...
Task yielded while acquiring 403...
Acquired Lease 409
Acquired Lease 410
Acquired Lease 407
Acquired Lease 403
Ran a Harbormaster build that did Lease Host while all of these leases were being allocated and saw the lease acquired in Harbormaster properly (with the build step yielding and continuing correctly when being picked back up again).
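For reference, the simulated slow acquisition was just a sleep() dropped into the blueprint, roughly like this (the exact canAllocateLease signature is from memory and may not match the diff):

  // Test hack only (not part of the diff): stall lease acquisition in the
  // preallocated host blueprint so the acquisition lock is held for a long
  // time.
  public function canAllocateLease(
    DrydockResource $resource,
    DrydockLease $lease) {
    sleep(10); // artificially widen the race window
    // ... original platform/attribute checks follow ...
  }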
Diff Detail
- Repository: rP Phabricator
- Branch: arcpatch-D10517
- Lint: Passed
  Severity | Location | Code | Message
  Advice | src/applications/drydock/blueprint/DrydockPreallocatedHostBlueprintImplementation.php:127 | XHP16 | TODO Comment
  Advice | src/applications/drydock/blueprint/DrydockWorkingCopyBlueprintImplementation.php:139 | XHP16 | TODO Comment
  Advice | src/applications/drydock/worker/DrydockAllocatorWorker.php:256 | XHP16 | TODO Comment
  Advice | src/applications/drydock/worker/DrydockAllocatorWorker.php:307 | XHP16 | TODO Comment
- Unit: Tests Passed
- Build Status: Buildable 7762
  Build 8556: [Placeholder Plan] Wait for 30 Seconds
  Build 8555: arc lint + arc unit
Event Timeline
Daemons need to yield if they can't continue, not sleep().
Inline comment on src/applications/drydock/blueprint/DrydockMinMaxTestBlueprintImplementation.php, lines 95–100 (On Diff #32275):
In particular, this can deadlock the queue.
@epriestley I just realised by merging down D10394 and deleting the test implementation, the sleep() calls are removed. The only thing left is the lock acquisition.
If I change the lock acquisition to catch the exception and yield (because the lock is obtained at the start, there is no state to keep here, so it's not dependent on D13832), would that make this suitable for acceptance?
Okay, I managed to make this yieldable without depending on D13832 (however, D13832 is still required for future blueprints that need to wait until a resource is in an open status before leasing against it).
This means all long-lived locks and sleeps have been removed from this code, and that this code is Ready for Review™.
To step back here, I still don't actually understand what problem you're trying to solve.
- What specific problem is this aimed at solving?
- How do I reproduce that problem at HEAD?
Broadly this solves all race conditions currently present in the Drydock allocator. These race conditions have been found through months and months of testing the Drydock code at a reasonably high scale.
If you wanted to reproduce the problems at HEAD, I'd suggest grabbing this patch and the EC2 changes, configuring a blueprint for AWS, and requesting 50 host leases. I'd also suggest running 50 taskmasters to ensure maximum parallelization of task execution. You will observe that Drydock at HEAD fails to acquire leases, overallocates resources, and doesn't release or clean up resources or leases when appropriate. Disk usage also increases continually after builds run, because there is no close-resource or release-lease hook until this patch.
You might also be able to replicate it with the Almanac blueprint, but it will be much harder because the window for the race conditions is smaller (though not non-existent).
This patch is basically the foundation for all other Drydock patches (it includes the other changes you requested to be merged in as well).
I can not maintain code I don't understand, and will not bring code upstream without understanding it.
Suppose I accept this change, and a month from now someone else sends me a patch and makes the same claims -- they ran your code in production for a long time, it had a lot of problems, they don't understand them or have a way to reproduce them, but this new patch definitely fixes them. I accept that patch (the arguments for accepting it are equal in strength to the arguments for accepting this) and it breaks production for you. And maybe for us, too. This is obviously silly and unsustainable.
I also just can't review the code if I don't know what it's doing, and it's not clear what this is doing or why.
Given enough time I can figure out what's racing here or convince myself that nothing races and that the error lies elsewhere, but this is making an enormous demand on my time to resolve a problem that no one but you has encountered, and which may be in third-party code you wrote or in your environment.
What's your expectation here? Do you expect me to just accept the code without understanding it because it works in your environment, and that should be good enough? Or do you expect me to figure out the underlying problem, even though it was too hard for you to figure out and you can actually observe it?
The race conditions this code fixes have been described in previous diffs I've sent through: in most cases with a pretty good description of the race condition that was encountered, or with descriptions of race conditions that were encountered and then resolved in a later arc diff (i.e. the race conditions that are present are detailed in the comments):
- D10204: Implement Amazon EC2 blueprint for Drydock
- D10304: [drydock/core] Implement resource constraints and correct locking mechanisms with yielding for Drydock
- D10394: [drydock/core] Abstract min-max blueprint and implement expiry logic
- D10409: [drydock/core] Mark leases as broken when failing to acquiring lease
- D10470: Prevent allocation race by allocating lease before leaving lock
- D10479: Various improvements and bug fixes to Drydock / Harbormaster
- D10511: Prevent allocator race condition where all resources are in "ALLOCATING" status
- D10513: Provide more detailed logging around why resources are / are not allocated
- D10517: Fix more race conditions within Drydock
- D10544: Increase Drydock worker time to 24 hours
- D10871: Set the status of DrydockResource to BROKEN instead of performing a hard-delete
I don't know what parts of this code need explaining. To me, it seems pretty straightforward, but that's because I wrote it. Things like not calling releaseLease if you're about to close the resource, or having to initialize a pending resource before you exit the lock so that other allocators will correctly match the new resource's attributes before it's fully initialized.
If something isn't clear as to why it's here, leave an inline and I can explain the entire backstory behind why a particular method exists and the context in which it's used.
I don't understand what problem this is even trying to solve. I'm not talking about not understanding the code and am not even looking at it yet.
If that problem is explained across 20 other revisions, please bring the explanations to this revision. A third party should be able to read this revision in isolation and understand what it is doing and why. They should not have to dig across 20 other tasks.
If this is solving a large number of problems, break it apart and solve them one at a time.
The problem is:
Drydock has race conditions in its allocator; those race conditions cause it to fail to lease when it otherwise shouldn't, to over- and under-allocate resources when it shouldn't, and to fail to appropriately clean up resources when they are no longer used.
This diff fixes those race conditions by modifying the allocator logic and adding additional abstract methods that appropriately set up and tear down the state of resources and leases when needed, so that the Drydock allocator allocates resources and acquires leases in the most optimal way possible.
Drydock has race conditions in its allocator
Where? What code races? How can I reproduce these issues against HEAD?
This diff fixes those race conditions
How can I verify this?
most optimal way possible
What convinces you that this is the most optimal way possible?
I'm just not sure where the disconnect is here.
Suppose we land this, and then I immediately revert it and send you a revision undoing these changes with the exact same summary and test plan, claiming that there are lots of races that can only be observed in my specialized environment with my particular workload, and there's no way for you to reproduce them, but that my diff definitely fixes them in the most optimal way possible. How would you respond?
I absolutely believe there are race conditions in this code and that you've observed their effects, but applying some locks without understanding or reproducing the issues is not a viable way to resolve races. We need to isolate and reproduce the races, understand what's racing, develop a minimal lock, apply that lock, and then verify it resolves the issue. Races are not magic, they just need careful analysis and deliberate care to understand and reproduce.
It sounds like your belief that this fixes the races is mostly rooted in having observed the code not race in production for a while. Are you confident that this code is race-free in all environments and workloads? How?
If we deploy this code and the next install reports that they're observing races, how would you respond, given that Drydock is now "race-free" and this is the "most optimal way possible" to fix the allocator? Would you assert that they're mistaken and haven't actually observed races? Or would you think that this change might have missed something? How would you fix the problem if you couldn't observe it and they were unable to give you any information about how to reproduce it?
Inspection reveals that D13751 has a race. Did you believe it had a race? Has that raced in production for you?
Suppose I suggest a change to this diff that affects the behavior in some minor way. How would you be certain that the change is still correct? Deploy it in production and run it for a while and see if any races occur?
I'll give some known scenarios that I remember at the end of this comment, but you can't reproduce these issues at HEAD because HEAD has no Drydock blueprint that allocates resources when leases are requested (the only usable one is the preallocated blueprint, which never hits 90% of the allocator code).
This diff fixes those race conditions
How can I verify this?
You can't. Or more accurately, you would need to set up a Phabricator instance with 150 taskmasters, request 50 leases in parallel at once (through Harbormaster or otherwise, as long as you're releasing them), and continue to do this for days if not weeks on end to isolate all of the race conditions. Some of them are the results of very subtle timing issues that have a very small opportunity to reproduce.
most optimal way possible
What convinces you that this is the most optimal way possible?
We've tested this code in a production environment for the last 6+ months (probably longer). Given the constraints set on blueprints, we've always observed the correct behaviour: Drydock leases to existing resources up to their ideal leases-per-resource limit, then allocates another resource, then continues to lease up to the limits until the maximum resource count is reached, and then it starts overleasing resources. It does this correctly even when the scenario is driven by 50 leases all being requested in parallel across more than 50 taskmasters.
I'm just not sure where the disconnect is here.
I think there's the expectation that all of the race conditions can be replicated in your average development environment. Some of the race conditions can be; namely those that are obviously wrong, but some of them require a large time investment to reproduce.
Suppose we land this, and then I immediately revert it and send you a revision undoing these changes with the exact same summary and test plan, claiming that there are lots of races that can only be observed in my specialized environment with my particular workload, and there's no way for you to reproduce them, but that my diff definitely fixes them in the most optimal way possible. How would you respond?
I would test the changes on the production PageUp Phabricator instance, find out that your revert breaks everything, and report it as a bug.
I absolutely believe there are race conditions in this code and that you've observed their effects, but applying some locks without understanding or reproducing the issues is not a viable way to resolve races. We need to isolate and reproduce the races, understand what's racing, develop a minimal lock, apply that lock, and then verify it resolves the issue. Races are not magic, they just need careful analysis and deliberate care to understand and reproduce.
I know they're not magic, but they are extremely hard to reproduce, especially when 99% of installs (including this upstream install) will never hit them.
It sounds like your belief that this fixes the races is mostly rooted in having observed the code not race in production for a while. Are you confident that this code is race-free in all environments and workloads? How?
The PageUp Phabricator production instance is probably the only instance in existence that has been using the Drydock infrastructure to this extent, with the frequency and scale required to replicate these issues. We have a very diverse set of blueprints and configurations (custom attributes, constraints, blueprints with different allocation times, blueprints with different resource types, blueprints that perform nested leasing, etc.). I don't foresee any other install reaching the level of use of Drydock that we have for quite some time (we have basically migrated our entire build and deployment process onto Drydock / Harbormaster).
If we deploy this code and the next install reports that they're observing races, how would you respond, given that Drydock is now "race-free" and this is the "most optimal way possible" to fix the allocator? Would you assert that they're mistaken and haven't actually observed races? Or would you think that this change might have missed something? How would you fix the problem if you couldn't observe it and they were unable to give you any information about how to reproduce it?
That would be extremely interesting to me, and I'd test the change on the production PageUp Phabricator instance to verify that it doesn't cause any regressions. I understand that race conditions are not something that you can "for certain" confirm that they're no longer present, because eyeballing the code is often not enough to detect them. But I do have a high level of confidence that Drydock is race free because of the scale we operate it at.
Inspection reveals that D13751 has a race. Did you believe it had a race? Has that raced in production for you?
That was a reasonably new diff, and has had significantly less testing than this code. I don't believe we've observed it race, but that's because the fix was an attempt to make sure leases are released and resources are closed if the artifact fails to load. Now that you've pointed it out however, I don't believe that the code actually solves that problem because the Drydock allocator will continue to acquire the lease in some scenarios (the fix only works if the Drydock allocator starts running after the lease has been broken, because the Drydock allocator will not operate on any leases that are not in a pending status).
Suppose I suggest a change to this diff that affects the behavior in some minor way. How would you be certain that the change is still correct? Deploy it in production and run it for a while and see if any races occur?
Yes, that's basically my strategy for testing Drydock allocator changes. Over the last 6+ months my strategy for race conditions here has basically been: Observe them in production, analyze the code to determine where the race condition might be occurring, make the fix locally in development, do some basic sanity checking to make sure I haven't drastically broken things (i.e. just obtain a single lease to make sure that still works), deploy it to production, wait and observe the results next time a deployment build is kicked off.
Race Condition:
The original Drydock code has no locking at all. Consider the scenario where two lease requests come in and are being processed in parallel by two taskmasters. There are no existing resources in this scenario, and a single Almanac blueprint is configured. The Almanac service is configured with one binding.
As each lease is processed, they'll both read the current list of resources. Because there's no order here, both allocator tasks will read the list of resources and observe that none exists. They'll then both start to allocate a resource.
When they both allocate a resource, the result will be two host resources pointing to the same Almanac binding, because there is only one binding available. Future leases will be distributed roughly evenly between both of these resources, and none of them will ever close.
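In code terms, the unsynchronized path looks roughly like this (a simplified sketch; the helpers are illustrative stand-ins for the allocator's behavior, not the actual method names, though head() is the real libphutil utility):

  // Both taskmasters run this concurrently with no lock held.
  $resources = $this->loadOpenResources($blueprint);   // illustrative helper
  if (!$resources) {
    // Race window: the second worker can reach this point before the first
    // worker's resource is saved, so each one creates a resource bound to
    // the single Almanac binding, and both resources persist forever.
    $resource = $this->allocateResource($blueprint, $lease);   // illustrative
  } else {
    $resource = head($resources);
  }
  $this->acquireLease($resource, $lease);   // illustrative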
Race Condition:
Consider the scenario where we lock around the resource loading and creation code, such that only one Drydock allocator can read the resource list and then create a resource in response. This appears to solve the initial problem, but in practice it does not: the resource is created in a PENDING status, while the next Drydock allocator is looking for resources in an OPEN status.
Thus resources are created when they shouldn't be.
Race Condition:
Consider the scenario where we then allow the Drydock allocator to obtain leases on pending resources. In this scenario, let's also assume that a resource has a limit of 5 leases, and that above this a new resource should be created.
When multiple leases come in, one lease causes the creation of the resource and starts allocating it outside of the lock. The other leases come in, see the pending resource, and immediately lease against it (assuming the blueprint's lease acquisition waits for the resource to move into an open status). By the time the first lease finishes allocating the resource and attempts to lease against it, all of the available lease slots have already been consumed, and the lease fails.
This is the reason for the section of code commented "Pre-emptively allocate the lease on the resource". This scenario is extremely difficult to reproduce in anything other than production.
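The mitigation is, roughly, the following (a sketch of the idea only; the helper name and lease fields approximate the patch rather than quote it):

  // Inside the allocation lock: create the pending resource and immediately
  // attach the requesting lease to it, so later allocators cannot consume
  // every lease slot on the resource before it finishes initializing.
  $resource = $this->createPendingResource($blueprint);   // illustrative helper
  $lease->setResourceID($resource->getID());
  $lease->save();
  // Leave the lock here; the slow, blueprint-specific allocation work (for
  // example, booting an EC2 host) happens afterwards, with this lease
  // already counted against the resource.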
Race Condition:
Consider the scenario where we create a resource in response to a lease, and we start allocating it straight away. Other leases come in and see the pending resource, but they don't match because the blueprint hasn't yet had a chance to initialize the attributes of the resource (because resource allocation occurs outside of the lock). This causes the other leases to start allocating new resources when they shouldn't.
This is the reason for initializePendingResource; it allows the attributes to be set on the pending resource before resource creation leaves the lock, allowing other leases to correctly match against it.
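A sketch of what a blueprint's initializePendingResource hook might do (the signature and the attribute accessors here are approximations for illustration; the real definition is whatever the diff declares, and the 'platform' attribute just mirrors the test plan):

  // Called while the allocator still holds the lock, before other workers
  // can see the new resource: copy across the attributes that other leases
  // will match on, so a half-built resource is still matchable.
  protected function initializePendingResource(
    DrydockResource $resource,
    DrydockLease $lease) {
    $resource->setAttribute('platform', $lease->getAttribute('platform'));
    $resource->save();
  }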
General Issue:
The original Drydock code does not support cleanup of leases or resources. When a build is finished with a lease, the files the build has stored in its lease directory are not cleaned up. The resource eventually fails by running out of disk space.
THIS IS NOT AN EXHAUSTIVE LIST. This is just the list of issues I remembered off the top of my head.
When they both allocate a resource, the result will be two host resources pointing to the same Almanac binding, because there is only one binding available.
This is a reasonable race condition which we should prevent. It is simple to understand and should be simple and straightforward to reproduce.
We can prevent this by doing global locking at blueprint scope within the blueprint, or optimistic locking by adding a uniqueness mechanism to resources. These approaches would let us hold narrower locks for a shorter period of time (or require no exclusive locking with optimistic locks), and require locks to be acquired much less often than a single global allocator lock.
For example, in the case where all applicable resources are allocated and leased, this diff requires workers to acquire a global lock before they can assess that and bail out. But it's fine to fail to acquire a resource because you're seeing a very-slightly-out-of-date view of the world and something just freed up: you can just retry again normally and get the resource if there's a free slot. This result is normal and indistinguishable from a request which legitimately encountered a fully-allocated state a moment earlier.
When a resource is fully allocated, requiring workers to acquire a lock before they can determine that they won't be able to proceed could make recovery from high-load conditions harder. It's desirable for workers to be able to determine that a resource is full and they should back off without needing to acquire a highly contested lock. With this implementation, they can't.
In this general vein, a single global lock creates unnecessary contention across resource types. Just saying "lock everything globally" is certainly safe, in the sense that it prevents race conditions, but provides a large amount of surface area for other minor problems to cascade into serious lock contention issues. Many operations which acquire this lock do not actually need to acquire or hold it, and will not interact with other lock holders during execution.
In particular, I believe it is valuable for the allocator to broadly have a cheap backoff/sanity check phase before the main allocator phase, where fully allocated requests and other unsatisfiable requests can cheaply bail out or back off without acquiring or holding locks. In the case of a full resource in high demand, we would expect to see many workers enter this phase and be able to make a cheap, lock-free determination that they should back off and wait for availability, rather than hammering the allocator or contending for locks.
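As a purely illustrative sketch of such a pre-check (the helper methods and limits are stand-ins, not existing Drydock code; only the yield exception is the real worker primitive):

  // Cheap, lock-free sanity check before doing any real allocator work. A
  // slightly stale read is fine here: a false "everything is full" answer
  // just means we back off and retry a little later.
  $max_resources = 5;   // hypothetical per-blueprint limit
  $max_leases    = 25;  // hypothetical: 5 leases per resource
  $open_resources = $this->countOpenResources($blueprint);   // illustrative helper
  $active_leases  = $this->countActiveLeases($blueprint);    // illustrative helper
  if ($open_resources >= $max_resources && $active_leases >= $max_leases) {
    // Fully allocated: back off without acquiring or contending for any lock.
    throw new PhabricatorWorkerYieldException(15);
  }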
Consider the scenario where we lock around the resource loading and creation code, such that only one Drydock allocator can read and then create a resource as a response
Yes, this locking strategy wouldn't be an effective one. Even if it was, it would still be a relatively contentious lock.
This is the reason for the section documented "Pre-emptively allocate the lease on the resource". This scenario is extremely difficult to reproduce in anything other than production.
I'd say this isn't really a race, and locking alone won't solve it. This is equivalent to a scenario where resource allocation simply fails: in both cases, we're left with no valid lease. The code needs to be able to recover. This should be easy to reproduce by making an allocator which fails consistently.
Provided that we can recover from this situation, it's OK if other requests steal all the leases -- we should be able to recover from that, too, following the same pathway.
Given correct recovery, grabbing the lease sooner is probably OK, but I wouldn't expect it to matter much in practice.
(There is a possibility that two allocators mutually steal each others' resources and deadlock, but even this should be recoverable and I suspect it isn't possible even in your environment.)
This is also a separate (and separable) concern from the first race, and should be in a separate change.
Consider the scenario where we create a resource in response to a lease, and we start allocating it straight away.
In HEAD, the blueprint is responsible for creating the resource and can populate it appropriately before it is saved. I think this problem is an artifact of bringing resource creation out of the blueprint, which is a change I don't understand. Leaving the blueprint responsible for creating the artifact seems like it solves this problem.
The original Drydock code does not support cleanup of leases or resources
This is obviously an issue, but totally separate and fully separable from locking concerns.
The global lock as it stands right now has a pretty small surface area; it just checks the state of the system, creates new records with some reasonably lightweight operations, and then leaves the actual resource allocation and lease acquisition to occur outside of the lock.
I'm not sure how we can reduce the scope of that lock without sacrificing correctness; even if we lock around reads to find the right resource to pick, by the time we go to do the writes it might have been consumed by something else.
As you say, we can fix this by continually retrying the Drydock operation until we get something we like, but that seems like it might run into scenarios where the request is never satisfied, and it reduces the ability to guarantee completion overall (right now, either the lease can be satisfied through at most the allocation of one resource, or not at all).
Consider the scenario where we lock around the resource loading and creation code, such that only one Drydock allocator can read and then create a resource as a response
Yes, this locking strategy wouldn't be an effective one. Even if it was, it would still be a relatively contentious lock.
This is the reason for the section documented "Pre-emptively allocate the lease on the resource". This scenario is extremely difficult to reproduce in anything other than production.
I'd say this isn't really a race, and locking alone won't solve it. This is equivalent to a scenario where resource allocation simply fails: in both cases, we're left with no valid lease. The code needs to be able to recover. This should be easy to reproduce by making an allocator which fails consistently.
Provided that we can recover from this situation, it's OK if other requests steal all the leases -- we should be able to recover from that, too, following the same pathway.
I think we'd need to fully separate the concepts of "allocate a resource" and "acquire a lease" in this case. We'd need to be able to allocate a resource and then potentially acquire a lease on a different resource, which does seem a bit weird to me. It also potentially means that one allocator might end up stuck allocating all of the resources for all the other workers (so rather than a consistent execution time, it might end up allocating several resources before it gets around to acquiring a lease). In this model, I think we should probably make the allocation of resources a separate task so that it scales across multiple taskmasters. This allows a lease acquisition to schedule the allocation of a resource, and then yield periodically until one becomes available (or schedule another resource allocation if we're back in a resource-starved state when it checks again).
(Again, though, to avoid the "allocate several resources before acquiring a lease" problem, we could pass a lease ID to the resource allocation task to ensure that the lease is acquired on the resource before it's made available to other allocators.)
Given correct recovery, grabbing the lease sooner is probably OK, but I wouldn't expect it to matter much in practice.
(There is a possibility that two allocators mutually steal each others' resources and deadlock, but even this should be recoverable and I suspect it isn't possible even in your environment.)
This is also a separate (and separable) concern from the first race, and should be in a separate change.
Consider the scenario where we create a resource in response to a lease, and we start allocating it straight away.
In HEAD, the blueprint is responsible for creating the resource and can populate it appropriately before it is saved. I think this problem is an artifact of bringing resource creation out of the blueprint, which is a change I don't understand. Leaving the blueprint responsible for creating the artifact seems like it solves this problem.
If the blueprint allocates the resource inside allocateResource, it means other workers can't lease against it or see it until it's fully ready (or explicitly saved by the blueprint). It also means that it won't count towards any resource limits on the blueprint because the resource creation occurs outside of a lock.
By placing the resource creation inside the lock, and pre-allocating against it, we ensure that other workers see a consistent state in terms of the resources allocated against a blueprint (even if they're not ready yet), and can start acquiring a lease on a resource that isn't ready yet (so they can use it as soon as it becomes available).
The original Drydock code does not support cleanup of leases or resources
This is obviously an issue, but totally separate and fully separable from locking concerns.
Yeah, I can probably pull that out of this diff. The reason it was kept in here is that this diff basically affects the API of the abstract blueprint implementation, and I wanted all of the changes to be consistent with the locking and consistency mechanisms being introduced.
A "sort-of-optimistic lock" solution to the locking problem might go like this:
- Add a new DrydockResourceSlot table.
- Rows have the form <blueprintPHID, slotKey>, with a unique key across those fields in order.
Blueprints which impose count-based limits would lock incrementing numeric slots by performing inserts:
INSERT INTO slot (blueprintPHID, slotKey) VALUES ("PHID-BLUE-abcd", "slot.1");
Blueprints which impose name-based limits (like the Almanac blueprint) would lock named slots identifying the locked resources:
INSERT INTO slot (blueprintPHID, slotKey) VALUES ("PHID-BLUE-abcd", "almanac.service.PHID-SERV-1234");
A blueprint could claim multiple slots if it needs to impose limits across multiple dimensions.
This approach eliminates explicit locking, lock contention across blueprints, and lock acquisition on requests which are doomed to failure. I believe it is powerful enough to satisfy the locking requirements of any scenario I can come up with. We would need to handle cleanup better first, to make sure slots are eventually freed.
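A sketch of how a blueprint might claim a slot under this scheme (DrydockResourceSlot is the hypothetical Lisk object from the proposal above, and the duplicate-key exception class name may vary by Phabricator version):

  // Attempt to claim a slot for this blueprint. The unique key on
  // <blueprintPHID, slotKey> makes the INSERT itself the lock: exactly one
  // concurrent claimant can succeed.
  private function acquireSlot($blueprint_phid, $slot_key) {
    $table = new DrydockResourceSlot();   // hypothetical storage object
    $conn_w = $table->establishConnection('w');
    try {
      queryfx(
        $conn_w,
        'INSERT INTO %T (blueprintPHID, slotKey) VALUES (%s, %s)',
        $table->getTableName(),
        $blueprint_phid,
        $slot_key);
      return true;
    } catch (AphrontDuplicateKeyQueryException $ex) {
      // The unique key rejected the row: another allocator holds this slot.
      return false;
    }
  }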
This sounds like a pretty good idea (side note: the quoting in Remarkup isn't quoting your code blocks properly).
When do these records get created? At the start of the worker allocation by calling acquireSlot or something on the blueprint implementation? And then we attempt to acquire a PhabricatorGlobalLock on the returned lock value?
Right now the code loads all blueprints and evaluates them all at once within the same lock; I believe this would need to change so that the code is more "per-blueprint" oriented: iterating through each blueprint, acquiring the slot on it, performing resource and lease checks, and then releasing the slot before exiting. The first blueprint to satisfy the request stops iteration.
Also how do we handle if the INSERT fails due to unique key collision (presumably because another allocator has a lock now)? Would we treat the insert failure as failure to acquire the lock and just yield instead?
(I believe that would also negate the need for PhabricatorGlobalLock as well)
Also how do we handle if the INSERT fails due to unique key collision (presumably because another allocator has a lock now)? Would we treat the insert failure as failure to acquire the lock and just yield instead?
Yeah (and, yeah, I expect we don't need a GlobalLock if this approach pans out).
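So the caller in the allocator would look something like this (a sketch, continuing the hypothetical acquireSlot above):

  // In the allocator worker: losing the slot race is not an error, just a
  // signal to try again shortly with a fresher view of the world.
  if (!$this->acquireSlot($blueprint->getPHID(), $slot_key)) {
    throw new PhabricatorWorkerYieldException(15);
  }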
I generally imagine the allocator as having two phases. The first "allocator" phase is Drydock doing general checks: do we have any blueprint which can possibly satisfy this request? Is it totally full? Does the request make sense? Which blueprints look plausible as candidates to hand the request to?
This first phase is very coarse, not locked, and it's OK for it to get the wrong result. Both false positives (we conclude we can allocate, but it won't actually work) and false negatives (we believe we can't allocate, even though we could have) are fine, as long as the false negatives are just timing issues and not real bugs. But if a request a split second earlier would have received a true negative, it's OK to give a slightly later request a false negative too and let it retry.
We're just looking to come out of it with a roughly correct result that is actually correct at least some of the time.
The second "blueprint" phase is where we actually make sure the resource is unique. It's handed some state of the world, but that state is just advisory. It might say "Almanac service 9 is free" or "slots 15 and 22 are free", but the blueprint needs to apply its own locks to control the actual acquisition of these resources. The "slot" approach might be a good way to do this that's pretty easy to use/understand/inspect and has good contention characteristics. The blueprint could also conceivably acquire a GlobalLock local to the blueprint, reload the world state, and then proceed from there. That would also be correct, but probably a little slower and a little harder to implement and understand in cases I can forsee than using slots would be. We could also use a hybrid approach if slots turn out to be poorly suited to something down the road.
It's fine if the actual allocation effort fails when the blueprint claimed it would succeed: this will happen sometimes no matter what (e.g., network fails just before we save the resource), so it's fine if it sometimes happens because something stole a resource between the time we read the world state and made it to allocation. We'll just retry and eventually move forward.
But basically we end up with the Allocator responsible for big picture strategic planning, and the Blueprints responsible for the specifics of resource acquisition.
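A very rough skeleton of that division of labor (all method names are illustrative, not existing code; only the worker exceptions and pht() are the standard primitives):

  // Phase 1 ("allocator"): advisory planning against a possibly stale view
  // of the world; no locks held, and wrong answers are tolerable.
  $blueprints = $this->selectCandidateBlueprints($lease);   // illustrative
  if (!$blueprints) {
    throw new PhabricatorWorkerPermanentFailureException(
      pht('No blueprint can ever satisfy this lease.'));
  }
  // Phase 2 ("blueprint"): each candidate applies its own narrow locks
  // (for example, slots) and attempts the actual acquisition.
  foreach ($blueprints as $blueprint) {
    if ($this->tryAcquireThroughBlueprint($blueprint, $lease)) {   // illustrative
      return;
    }
  }
  // Every candidate lost its race or was full: retry with fresher state.
  throw new PhabricatorWorkerYieldException(15);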
I'm sure there are a bunch of specific problems with the code as currently written (e.g., there's a bare load via the LiskDAO, so obviously this code is not particularly modern or mature) and it may need adjustments to work with this overall strategy, but I think pushing locking down into Blueprints is desirable.
I think the specifics of how locks work there are less important than scoping the locks narrowly, but the slot-based approach seems attractive to me since it's fairly flexible and easy to reason about and a lot more observable than GlobalLocks are.
The one case I can think of where it might not work well is if a blueprint has a very large resource limit on some kind of resource, like a 1TB disk that you'd like to partition into one million 1MB blocks, and it's common for leases to request several GB of disk. With slots, you'd need to do thousands of inserts to lock each megabyte you planned to use. In practice, maybe we can always break this kind of resource into coarser chunks, so you get 1GB blocks on your 1TB disk, and you eat up 3 slots for 3GB instead of 3,000 slots. This generally seems surmountable, though, and I think we can cross this bridge when we come to it.
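For example, with 1GB chunks a lease asking for roughly 3GB would claim three slot keys, along these lines (all names here are hypothetical, reusing the acquireSlot sketch from above):

  // Coarse-grained disk slots: claim one slot per 1GB chunk instead of one
  // per megabyte. Slot keys are just strings, so "disk.<n>" works fine.
  $total_chunks = 1024; // e.g. a 1TB disk divided into 1GB chunks
  $chunks_needed = (int)ceil($lease->getAttribute('disk-mb') / 1024);
  $claimed = array();
  for ($ii = 0; $ii < $total_chunks; $ii++) {
    if (count($claimed) == $chunks_needed) {
      break;
    }
    if ($this->acquireSlot($blueprint->getPHID(), 'disk.'.$ii)) {
      $claimed[] = 'disk.'.$ii;
    }
  }
  if (count($claimed) < $chunks_needed) {
    // Not enough free chunks right now: release the claimed slots and yield.
  }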
On the last point of retrying, how do you know when all attempts have failed? i.e. how does the system not end up retrying forever if there's no solvable solution to the lease?
When the allocator says "this thing is available" and then it's not, and we retry and the allocator again says "this thing is available", how do we terminate?
But basically we end up with the Allocator responsible for big picture strategic planning, and the Blueprints responsible for the specifics of resource acquisition.
I'm sure there are a bunch of specific problems with the code as currently written (e.g., there's a bare load via the LiskDAO, so obviously this code is not particularly modern or mature) and it may need adjustments to work with this overall strategy, but I think pushing locking down into Blueprints is desirable.
I think the specifics of how locks work there are less important than scoping the locks narrowly, but the slot-based approach seems attractive to me since it's fairly flexible and easy to reason about and a lot more observable than GlobalLocks are.
The one case I can think of where it might not work well is if a blueprint has a very large resource limit on some kind of resource, like a 1TB disk that you'd like to partition into one million 1MB blocks, and it's common for leases to request several GB of disk. With slots, you'd need to do thousands of inserts to lock each megabyte you planned to use. In practice, maybe we can always break this kind of resource into coarser chunks, so you get 1GB blocks on your 1TB disk, and you eat up 3 slots for 3GB instead of 3,000 slots. This generally seems surmountable, though, and I think we can cross this bridge when we come to it.
I think there might be other scaling issues at that end of the scale (like having millions of leases in the database and attempting to do any sort of reasonable queries on them).
In the general case, if the blueprint examines a world state and claims it will be able to allocate given that world state, and is never actually able to allocate in that state, I think it's bugged and it's correct to retry until someone notices the bug and fixes the blueprint.
For example, if you have an EC2 blueprint but gave it the wrong credentials or something, it's reasonable to queue up builds until the issue is corrected (after the issue is corrected, the operator can choose whether they want to dump the queue or let it complete).
More specifically, we can tailor behavior for particular failures, just like how Workers distinguish between Yield, PermanentFailure, and other types of failures. We can distinguish between failures caused by no valid blueprint existing (permanent failure? or heavy backoff) vs no free slots (backoff) vs a slot failing to lock (likely race; retry fairly aggressively) vs generic allocator failures because of things like network issues or bad configuration or bugs.
But if you give Drydock an EC2 blueprint and tell it to allocate some resources using the blueprint and then EC2 is hit by an earthquake and offline for 72 hours, it's basically correct/reasonable for jobs to queue up and retry endlessly in the meantime. If EC2 comes back up, the requested work will be completed. Otherwise, an operator can make a decision to abort the work if it's no longer relevant.
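In worker terms, that classification might look something like this (a sketch; the Drydock-specific exception types are hypothetical, while the worker exceptions are the standard ones):

  try {
    $this->allocateOrLease($lease);   // illustrative stand-in for the allocator body
  } catch (DrydockNoBlueprintException $ex) {
    // Hypothetical: no blueprint can ever satisfy this request.
    throw new PhabricatorWorkerPermanentFailureException($ex->getMessage());
  } catch (DrydockSlotLockException $ex) {
    // Hypothetical: lost a slot race; almost certainly transient, retry soon.
    throw new PhabricatorWorkerYieldException(5);
  } catch (Exception $ex) {
    // Network trouble, bad configuration, or bugs: back off more heavily.
    throw new PhabricatorWorkerYieldException(120);
  }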
Just a heads up, the proposed locking infrastructure sounds good, but I'm pressed for time on other projects at the moment, so I don't know when I'll get a chance to start building it (if upstream doesn't get to it first).
Sounds good. I'm not jumping on this right away either, but I'll yell if it looks like I'm headed toward it.
I think this is substantially covered in HEAD now. A few pieces (some of the constraint stuff and expiration) aren't fully in yet (e.g., see T6569) but I believe we're race-free now as long as blueprints do whatever locking and constraint management they need to.