Resolves T6074. This attempts to prevent an allocation race from occurring by saving the lease against the resource before we exit the lock.
Details
- Reviewers
epriestley - Group Reviewers
Blessed Reviewers - Maniphest Tasks
- T6074: Determine the best way to prevent the "Lost an allocation race" exception in Drydock
Tested that allocations still work inside Harbormaster locally, but can't be entirely sure that the issue is resolved until we leave this fix in production for a little while.
Diff Detail
- Repository
- rP Phabricator
- Branch
- prevent-allocation-race
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 2505 Build 2509: [Placeholder Plan] Wait for 30 Seconds
Event Timeline
This seems bad: we hold the lock for an arbitrarily long period of time, and may cycle and deadlock, if allocating requires leasing another resource of the same type. This is legal and permissible (the second resource of the same type can have different parameters, and this is fine; cycling on resource types does not mean that the allocation plan cycles as a whole).
I'd expect to just retry or yield if we lose a race.
I don't even really understand the root problem here. STATUS_ALLOCATING does not exist at HEAD and I think was added and then removed elsewhere? Does this actually repro against HEAD, or is it plausible that other changes intorduced this problem?
More broadly:
- There isn't even a lock at head so I don't really know what this code is doing.
- It's OK for us to lease as soon as other processes are allowed to lease. If they can lease earlier, it's OK for us to lease earlier too. In this version of the code, leasing before allocateResource() might be OK, for example. At HEAD, there's nowhere earlier to lease.
- It's potentially OK to not let anything lease until later, but I think this is undesirable (in bursty workloads, we can do a better job of satisfying resource demand with the current approach: preventing allocations will spin up more slowly or overallocate resources).
- It's not OK to hold a lock while waiting for a resource to come up.
- This can take arbitrarily long.
- This can cycle.
- If we're allowed to lease immediately after releasing the lock, it's probably OK to lease immediately before releasing the lock, but I'd be very hesitant to do this.
- I think letting the code race is basically desirable, and that we can just yield if we lose a race.
When I originally made this fix, I was as confused as you are, but it turns out allocateLease doesn't actually do any work. The only function it calls that's controllable by the derived implementation is shouldAllocateLease, and this function already has an explicit warning that checks in this function should be quick because the function may be called from a lock.
The lease acquisition work is always done inside acquireLease, which always resides outside of a lock.
Anyway, I'm going to merge this down into D10304 since it's related to ensuring Drydock has race-free allocation behaviour.