
Prevent allocation race by allocating lease before leaving lock
Abandoned · Public

Authored by hach-que on Sep 11 2014, 1:57 AM.
Tags
None

Details

Summary

Resolves T6074. This attempts to prevent an allocation race from occurring by saving the lease against the resource before we exit the lock.
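
Concretely, the change moves the lease save to inside the locked section, before the lock is released. A minimal sketch of the intended ordering, using hypothetical helper names rather than the real Drydock API:

  // Illustrative sketch only; the lock and lookup helpers are hypothetical,
  // not actual Drydock/Phabricator calls.
  $lock = acquireAllocationLock($blueprint);        // hypothetical
  try {
    $resource = findOrAllocateResource($blueprint); // hypothetical

    // The fix: bind and save the lease against the resource while we still
    // hold the lock, so a concurrent allocator cannot claim the same slot.
    $lease->setResourceID($resource->getID());
    $lease->save();
  } finally {
    $lock->release();                               // hypothetical
  }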

Test Plan

Tested that allocations still work inside Harbormaster locally, but I can't be entirely sure the issue is resolved until this fix has been running in production for a little while.

Diff Detail

Repository
rP Phabricator
Branch
prevent-allocation-race
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 2505
Build 2509: [Placeholder Plan] Wait for 30 Seconds

Event Timeline

hach-que retitled this revision to Prevent allocation race by allocating lease before leaving lock.
hach-que updated this object.
hach-que edited the test plan for this revision. (Show Details)
hach-que added a reviewer: epriestley.
epriestley edited edge metadata.

This seems bad: we hold the lock for an arbitrarily long time, and may cycle and deadlock if allocating requires leasing another resource of the same type. That is legal and permissible (the second resource of the same type can have different parameters, which is fine; a cycle on resource types does not mean the allocation plan cycles as a whole).
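
To make the cycling concern concrete (hypothetical names, not actual Drydock code): if bringing a resource up requires leasing another resource of the same type, and that lease needs the same per-type lock we are still holding, the allocator deadlocks on itself.

  // Hypothetical sketch of the cycle described above; none of these helpers
  // are the real Drydock API.
  function allocateWithLock($type, array $params) {
    $lock = acquireTypeLock($type);            // hold the per-type lock...
    $resource = buildResource($type, $params);

    if ($resource->needsBackingResource()) {
      // ...but this resource needs a lease on another resource of the same
      // type (with different parameters), which tries to take the lock we
      // are still holding: deadlock.
      $backing = allocateWithLock($type, $resource->getBackingParameters());
      $resource->attachBacking($backing);
    }

    $lease = leaseResource($resource);
    $lock->release();
    return $lease;
  }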

I'd expect to just retry or yield if we lose a race.

I don't even really understand the root problem here. STATUS_ALLOCATING does not exist at HEAD and I think was added and then removed elsewhere? Does this actually repro against HEAD, or is it plausible that other changes introduced this problem?

This revision now requires changes to proceed. (Aug 8 2015, 6:06 PM)

More broadly:

  • There isn't even a lock at HEAD, so I don't really know what this code is doing.
  • It's OK for us to lease as soon as other processes are allowed to lease. If they can lease earlier, it's OK for us to lease earlier too. In this version of the code, leasing before allocateResource() might be OK, for example. At HEAD, there's nowhere earlier to lease.
  • It's potentially OK to not let anything lease until later, but I think this is undesirable: in bursty workloads, the current approach does a better job of satisfying resource demand, while delaying leases means we either spin resources up more slowly or overallocate them.
  • It's not OK to hold a lock while waiting for a resource to come up.
    • This can take arbitrarily long.
    • This can cycle.
  • If we're allowed to lease immediately after releasing the lock, it's probably OK to lease immediately before releasing the lock, but I'd be very hesitant to do this.
  • I think letting the code race is basically desirable, and that we can just yield if we lose a race (see the sketch after this list).
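
A minimal sketch of the race-then-yield approach from the last point above, again with hypothetical helpers rather than the real Drydock classes (PhabricatorWorkerYieldException is assumed here to take a delay in seconds):

  // Hypothetical sketch: allocate optimistically and yield if we lose a race.
  $resource = findExistingResource($blueprint);   // hypothetical
  if (!$resource) {
    $resource = allocateResource($blueprint);     // hypothetical
  }

  try {
    // A concurrent worker may have taken the last free slot between the
    // check and this acquisition attempt.
    $lease = acquireLeaseOn($resource);           // hypothetical
  } catch (NoFreeSlotsException $ex) {            // hypothetical exception
    // We lost the race; rather than holding a lock to prevent this, yield
    // and let the worker queue retry the allocation shortly.
    throw new PhabricatorWorkerYieldException(15);
  }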

When I originally made this fix, I was as confused as you are, but it turns out allocateLease doesn't actually do any work. The only function it calls that's controllable by the derived implementation is shouldAllocateLease, and that function already carries an explicit warning that its checks should be quick because it may be called from within a lock.

The lease acquisition work is always done inside acquireLease, which is always called outside of a lock.
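
Put differently, the split looks roughly like this (simplified, hypothetical method bodies and signatures, not the exact Drydock interfaces):

  // Sketch of the division of labor described above; the helper methods
  // are hypothetical.
  class ExampleBlueprint {

    // May be called while a lock is held, so it should only perform quick,
    // in-memory checks (for example, counting open slots).
    public function shouldAllocateLease($resource, $lease) {
      return $this->countOpenSlots($resource) > 0;   // hypothetical helper
    }

    // Called outside of any lock; slow work (waiting for a host to come up,
    // opening connections, and so on) belongs here.
    public function acquireLease($resource, $lease) {
      $this->waitForResourceToComeUp($resource);     // hypothetical helper
      $this->bindLeaseToSlot($resource, $lease);     // hypothetical helper
    }
  }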

Anyway, I'm going to merge this down into D10304 since it's related to ensuring Drydock has race-free allocation behaviour.