
Plans: Drydock for normal software use cases where builds take more than 45 seconds
Open, NormalPublic

Description

See PHI115. See PHI272. Maybe see T12145.

  • Drydock should ship with a spot instance EC2 allocator which works properly and makes it clear how to write this kind of allocator.
  • Drydock should ship with a one-lease-per-host allocator or a mechanism for configuring this.
  • Drydock should ship with at least one reclaim-on-release resource.
  • PHI272 has some other reasonable use cases which upstream builtins and/or documentation should make clear.

See PHI272. See PHI299. Drydock blueprints use CustomField to support type-level fields, but should use EditField instead.

See PHI182. DrydockRepositoryOperationStatusView has hard-coded errors. These should be modular.

See T11693. That has a bunch of generally sane things although it's maybe somewhat speculative.

See T11694. Some of D16594 can likely come upstream.

See PHI270. This discusses adding "temporarily unavailable" or some other similar maintenance state to blueprints.

See PHI129. See T8153. This discusses better (somewhat semantic) logging on resources.


Recovery Issues:

  • See T10559. Unclear what this is, but I'll make some effort to reproduce it.
  • See T11495. This appears to be a failure to recover correctly when cloning from a staging area fails?
  • See PHI312. Drydock can attempt to lease a resource which is currently being reclaimed, and does not recover gracefully.

Errata:

  • Resources in BROKEN can receive commands. Leases in BROKEN can not. Probably just an oversight?
  • (T13073#235779) Can we remove setActivateWhenAcquired() on Lease and make this do the right thing automatically?
  • Some logs render with HTML in them when running bin/drydock lease.

Compatibility breaks:

  • Lease->releaseOnDestruction() is now Lease->setReleaseOnDestruction(bool). Passing true preserves identical behavior to the old call.
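
For existing callers, the migration is mechanical:

```php
// Old call:
$lease->releaseOnDestruction();

// New call; passing `true` preserves identical behavior:
$lease->setReleaseOnDestruction(true);
```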

Documentation:

  • Because we can throw leases back in the pool if they acquire on dead resources, acquireLease() must not cause side effects other than acquiring slot locks (or we need a mechanism for reversing these side effects).
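
As a sketch of that contract (the slot lock call shape is an assumption modeled on patterns elsewhere in Drydock; the key format is illustrative):

```php
// Sketch: acquireLease() may take slot locks, and nothing else. If the
// lease is later thrown back into the pool, slot locks are the only
// state that needs to be unwound.
public function acquireLease(
  DrydockBlueprint $blueprint,
  DrydockResource $resource,
  DrydockLease $lease) {

  // Permitted: slot locks are easy to release on recovery.
  $lease->needSlotLock('hoax.slot('.$resource->getPHID().')');

  // Not permitted (without a reversal mechanism): creating working
  // copies, writing to the host, or mutating any external service.
}
```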

See PHI570, which identifies a possible issue with recycling resources.

See PHI885, which identifies an issue with git fetch hanging during repository operations. We should add timeout behavior and possibly try to figure out if git fetch is "actually doing something" versus hung.
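
A sketch of what the timeout half might look like, assuming `ExecFuture` timeout support in libphutil (the path and duration are placeholders):

```php
// Sketch: give "git fetch" a hard timeout so a hung fetch can not stall
// a repository operation forever. Telling "slow but alive" apart from
// "hung" would need progress inspection; a flat timeout is the simple
// first step.
$future = new ExecFuture('git fetch origin');
$future->setCWD($working_copy_path);  // placeholder
$future->setTimeout(phutil_units('15 minutes in seconds'));

list($err, $stdout, $stderr) = $future->resolve();
if ($err) {
  throw new Exception(pht('git fetch failed or timed out: %s', $stderr));
}
```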

See PHI882, which requests Kubernetes support.

See PHI917 and T8153#242083. Currently, disabling a binding for an Almanac host stops new leases, but doesn't cause us to tear down the host resource. It should.

Event Timeline

epriestley created this task.
epriestley renamed this task from Plans: Drydock to Plans: Drydock for normal software use cases where builds take more than 45 seconds. (Feb 12 2018, 11:26 PM)
epriestley updated the task description.

In trying to plot a course through this, I'd like to try to sequence this so that existing issues get fixed first, then breaking changes happen afterward (maybe in the next release). Some of these changes (particularly, the CustomField-to-EditField change) will be significantly disruptive to existing custom code. (That change is also pretty straightforward and new stuff depends on it, so I'd probably start there if there was no compatibility concern.)

If possible, I don't want to add new blueprints (EC2) or blueprint behaviors (reclaim-on-release, fancier configuration, etc.) until after the CustomField/EditField swap: adding them first would make the field swap harder, while adding them afterward will help test the swap.

The most immediate issue is bad behavior when a "reclaim" races a lease acquisition (PHI312). I think the fix here is probably one of:

  • reclaim must acquire a lease first; or
  • leases that are outraced should recover gracefully.

I'm going to look at recovery behavior first. In general, resources may vanish at any time and we need to be robust against this: no amount of clever locking on our side can stop lightning from striking a server, AWS from reclaiming an instance, etc. The goal of locking is to keep our view of the world internally consistent, not to prevent recovery from ever being required.

PHI312, T8153, T11495, and T10559 suggest that recovery behavior may currently be quite poor in several cases.

One difficulty in working with Drydock is that reproducing scenarios tends to be difficult and to require a large amount of configuration, and many resources Drydock interacts with are necessarily fairly slow to build. When testing an EC2 allocator it's fine that it actually spins up instances, but when testing recovery behavior there's basically no quick way to set it up so it's repeatable from the CLI. All of the tasks above require substantial setup and there is no way for me to share a configuration with anyone else.

A possible approach for this that I'm considering is to write a "Synthetic Host" blueprint, where each "host" is just a row in MySQL that responds in a pre-programmed way when commands are run. This probably isn't very powerful for testing very complex behaviors where multiple resources interact with one another, but I think it may cover 80%-90% of things that anyone actually runs into today. I'm not sure if this entirely makes sense in the upstream or would be better as a separate "Drydock SDK" extension.
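
As a rough sketch of the shape this might take (the class name and scripted-response details are hypothetical; the hook signatures follow DrydockBlueprintImplementation):

```php
// Hypothetical sketch: each "host" is a database row, and operations
// respond in pre-programmed ways so failure scenarios are repeatable
// from the CLI in seconds instead of minutes against real hardware.
final class DrydockSyntheticHostBlueprintImplementation
  extends DrydockBlueprintImplementation {

  public function getBlueprintName() {
    return pht('Synthetic Hosts (Testing)');
  }

  public function canAllocateResourceForLease(
    DrydockBlueprint $blueprint,
    DrydockLease $lease) {
    // Always claim we can allocate; specific failures are injected via
    // blueprint configuration to script a reproduction case.
    return true;
  }

  public function allocateResource(
    DrydockBlueprint $blueprint,
    DrydockLease $lease) {
    // "Building" the host is just writing a row.
    return $this->newResourceTemplate($blueprint, pht('Synthetic Host'))
      ->allocateResource();
  }

  // acquireLease(), activateResource(), etc., would consult a scripted
  // response table ("fail activation once, then succeed") to make
  // recovery behavior repeatable.
}
```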

A related issue is that adding test coverage for this stuff is hard. I think this might be more trouble than it's worth to get working: Drydock relies on a functioning task queue to process work, so we'd have to stub out the entire daemon queue to write unit tests. If that only makes it slightly easier to fix a handful of bugs in the next several years, it probably isn't worth it. I'd like to try to attack this via "Synthetic Host" plus integration-test-esque scripts instead -- these behaviors are unlikely to stop working suddenly, and the majority of the value in making test cases repeatable is in making debugging easier.

Upshot: I'm going to start by seeing how a mocked host feels as a tool for building reproduction cases for PHI312/T8153/T11495/T10559, and then go from there.

When blueprints claim to be able to allocate resources but fail during allocation, we currently destroy the lease. I think this is likely too severe: consider an EC2 blueprint where the allocation call fails for some temporary reason (EC2 just fails, or we hit the account limit for instances, or whatever else).

A better behavior would probably be to log the failure to the blueprint, then yield the lease so it can retry. If a blueprint says "yes, I'll be able to allocate" and then can't actually allocate, I think we should assume the blueprint is implemented correctly and just encountered unforeseeable difficulty, not that the blueprint is so poorly implemented that all leases which touch it are eternally tainted and must be destroyed.
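
Mechanically, that might look like this sketch (the logging call is hypothetical; `PhabricatorWorkerYieldException` is the existing retry-later mechanism in the worker queue):

```php
// Sketch: when a blueprint that said it could allocate then fails,
// record the failure and yield the lease so it retries, instead of
// destroying it.
try {
  $resource = $blueprint_impl->allocateResource($blueprint, $lease);
} catch (Exception $ex) {
  // Hypothetical logging call: charge the failure to the blueprint,
  // not fatally to the lease.
  $this->logToBlueprint($blueprint, $ex);

  // Retry allocation in 15 seconds.
  throw new PhabricatorWorkerYieldException(15);
}
```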


When bin/drydock lease is used to acquire a lease and it fails during resource allocation, almost no information is provided to the user:

EXCEPTION: (Exception) Lease has already been destroyed!

Useful information is provided in the daemon log:

EXCEPTION: (PhutilProxyException) Task "3161734" encountered a permanent failure and was cancelled.
{>} (PhabricatorWorkerPermanentFailureException) Permanent failure while activating lease ("PHID-DRYL-6uadrvcml7thenrfzovo"): All blueprints failed to allocate a suitable new resource when trying to allocate lease "PHID-DRYL-6uadrvcml7thenrfzovo".
PhutilMissingSymbolException: Failed to load class or interface 'DrydockHoaxPHIDType': the class or interface 'DrydockHoaxPHIDType' is not defined in the library map for any loaded phutil library.

This may go hand-in-hand with the allocation failure behavior, but if fixing that doesn't fix the bin/drydock behavior, we should address it separately. I suspect bin/drydock should poll the relevant log tables while waiting for lease acquisition.

When logs span multiple lines, they are rendered as unreadable garbage messes:

[Screenshot: multi-line log output rendered as a garbled mess]


When you ^C a bin/drydock lease ... command while the lease is waiting, the lease is not released. We should attempt to release it. This may be slightly tricky to handle since most bin/* commands don't process interrupts.
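
One possible shape, using plain `pcntl` (the real implementation would hook into whatever signal handling bin/* scripts share; the release call mirrors the compatibility note above):

```php
// Sketch: trap ^C while "bin/drydock lease" waits, and release the
// lease instead of leaving it dangling.
declare(ticks=1);

pcntl_signal(SIGINT, function ($signo) use ($lease) {
  // Mark the lease to be released rather than orphaned, then exit with
  // the conventional interrupted-by-signal status.
  $lease->setReleaseOnDestruction(true);
  exit(128 + $signo);
});
```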

Leases and resources don't use standard header UI tags to indicate status:

[Screenshot: lease and resource page headers without status tags]

They should use header tags for consistency and ease of identifying object states.

Alright, next issue:

My Hoax blueprint does not skip activation, but also does not implement activateResource(). The resource is destroyed as a result, which seems correct. Conceivably, we could distinguish between "temporary activation failure" and "permanent activation failure" and recover from this more cheaply in some cases, but assuming that activation failure is permanent and throwing resources away completely seems pretty reasonable for now. So I think this behavior is okay.

However, this also causes the lease to fail permanently and be destroyed:

Trying to acquire an active lease on a pending resource. You can not immediately activate leases on resources which need time to start up.

I think this is also correct, since I left a setActivateWhenAcquired(true) call inside acquireLease() when I copy/pasted.

It might be nice to make the meaning and behavior of setActivateWhenAcquired() and setActivateWhenAllocated() more clear. They both mean "this resource/lease doesn't need to do any setup, so you can skip straight over the setup step". This error means "you tried to skip the setup step, but you can't".

Actually, it's not clear to me that we need this flag on Lease at all: I think Lease can always figure out whether it can skip setup or not by checking if the resource is in setup. This would simplify things.
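
In sketch form, the automatic decision is just a status check (the constant name is an assumption based on the resource states discussed here; `activateOnResource()` appears later in this task):

```php
// Sketch: the lease inspects its resource instead of relying on a
// setActivateWhenAcquired() flag set by the blueprint.
$can_skip_setup =
  ($resource->getStatus() === DrydockResourceStatus::STATUS_ACTIVE);
if ($can_skip_setup) {
  // The resource needs no further setup; activate immediately.
  $lease->activateOnResource($resource);
}
// Otherwise, wait for the resource to finish setup and activate first.
```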

I remain slightly unsure about this so I'm going to leave it for now, but tentatively I'd like to:

  • Make this mechanism do the right thing automatically for Lease.
  • Possibly rename this mechanism to something more clear for Resource, perhaps like setNoSetupRequired(true).

I just removed the setActivateWhenAcquired(true) for now without changing behavior. Next issue:

My blueprint still can't activate since I haven't implemented activateResource() yet, nor opted out of activation.

This kills the resource (fine), and then we try to lease it (fine). This permanently destroys the lease, which seems extreme. I'd prefer the lease return to the pool; it is reasonable for activation to fail for unforeseen reasons.

I am going to change the behavior so that hitting a resource in the wrong state during lease activation throws the lease back into the lease pool instead of breaking it. This is consistent with a scenario like this:

  • You lease a host.
  • The blueprint brings up a new EC2 host.
  • Setup for the blueprint discovers that the EC2 host came up without an IPv4 IP address (which I've seen multiple times, see T12316#213286).
  • The blueprint throws the host away.
  • The lease is now stuck acquiring a broken resource.

This isn't the lease's fault, and it can just go back in the pool. It's reasonable to believe that it will succeed in the future if a success path exists in allocation.

Note that there's another similar scenario:

  • You lease a host.
  • It comes up in EC2.
  • The lease acquires and activates.
  • Lightning strikes, destroying the host utterly.

In this case, the lease isn't going to get to go back into the pool. You'll get a build failure or whatever and operations staff will need to install lightning rods to protect the hardware; this broadly seems reasonable.

Slightly more broadly: if the lease starts actually activating, we can't be sure that we can undo its state in the general case, so we need to throw it away. But this failure (where the resource it has acquired doesn't activate) carries no such risk, and we can safely just throw it back in the pool. I think the only state we need to wipe out is its slot locks, which are easy to clean up.
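
So the recovery path is roughly this sketch (the slot lock release, status, and linking calls are assumptions modeled on names appearing elsewhere in Drydock):

```php
// Sketch: unwind an acquisition whose resource died before the lease
// started activating.

// 1. Drop the lease's slot locks -- per the above, the only state that
//    acquisition is allowed to create.
DrydockSlotLock::releaseLocks(array($lease->getPHID()));

// 2. Detach the lease from the dead resource and return it to the
//    pending pool.
$lease
  ->setResourcePHID(null)
  ->setStatus(DrydockLeaseStatus::STATUS_PENDING)
  ->save();

// 3. Yield so the worker retries acquisition later.
throw new PhabricatorWorkerYieldException(15);
```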

My HoaxBlueprint does not implement activateLease(). This causes leases to be broken and destroyed.

Per above, I think we can't be sure that we can safely do anything else: activateLease() implementations (like DrydockWorkingCopyBlueprintImplementation) may do a significant amount of work and set properties on leases. For now, I plan to retain this behavior and just keep an eye on it. If activateLease() methods in the wild routinely encounter temporary errors, we could set different expectations for this method, or let it signal a retry-able failure, or whatever else.

The upstream failures that can arise here are basically git commands breaking. It's very hard for us to know if that's temporary (network issue, you haven't added the credentials on the remote) or permanent (wrong remote, you added the wrong credentials locally).

So, for now, guidance will be:

  • Succeed and call $lease->activateOnResource(...) to activate the lease.
  • Or fail and throw to break and destroy the lease.
  • Yielding probably works, but ehhh.
  • There is no way to fail gently and return the lease to the pool. This may change if use cases arise.
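
A conforming `activateLease()` then looks roughly like this (the setup helper is hypothetical; `activateOnResource()` is quoted from the guidance above):

```php
// Sketch of the guidance: either succeed and activate, or throw and
// accept that the lease will be broken and destroyed. There is no
// gentle-failure path back to the pool.
public function activateLease(
  DrydockBlueprint $blueprint,
  DrydockResource $resource,
  DrydockLease $lease) {

  // Setup work, e.g. building a working copy. If a git command fails
  // here, we can not tell a temporary network blip apart from a
  // permanently wrong remote, so any throw is treated as permanent.
  $this->newWorkingCopy($lease);  // hypothetical helper

  // Succeed: activate the lease on the resource.
  $lease->activateOnResource($resource);
}
```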

After the patches above, I can't reproduce any bad behavior for T11495 or T10559.

I can still make resource reclamation and allocation race to some degree (PHI312), although half of this is no longer problematic: if the resource releases first, the lease goes back in the pool now.

But if the lease acquires first, it can be immediately released by the reclamation, which isn't much of an improvement.

To fix this, I plan to make lease acquisition hold the resource lock while actually linking the lease to the resource. If the lock fails, the lease will retry later, just as though a slot lock had failed.
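
Sketched with `PhabricatorGlobalLock` standing in for the actual resource lock (the linking call is illustrative):

```php
// Sketch: hold the resource lock while linking the lease to the
// resource, so acquisition can not race reclamation.
$lock = PhabricatorGlobalLock::newLock(
  'drydock.resource('.$resource->getPHID().')');

try {
  $lock->lock(15);
} catch (PhutilLockException $ex) {
  // Reclamation (or another acquisition) holds the lock; retry later,
  // just as though a slot lock had failed.
  throw new PhabricatorWorkerYieldException(15);
}

try {
  // Re-verify the resource is still alive, then link the lease to it.
  $lease->setResourcePHID($resource->getPHID())->save();
} finally {
  $lock->unlock();
}
```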

However, I identified a separate issue with slot locks that I'm going to fix first.

When a blueprint uses concurrency-limiting slot locks, it currently just returns a random lock if it can't find a slot it suspects is free. This makes "all slots look full, waiting for something to free up" ambiguous with actual collisions. It would probably be better to distinguish between these cases and make it more clear which one we're in.

Also, if you implement slot locks but don't call shouldLimitAllocatingPoolSize(), you can never hit the reclaim logic, which is a little unobvious. Some of these mechanisms should likely be more built-in, since it's easy to miss a step without any obvious indication that your blueprint won't work like you expect. See the sketch below.
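
For example, a blueprint that wants reclaim to work currently has to remember both halves; sketched here as an override (the text says "call", so treat the exact hook shape as an assumption):

```php
// Sketch: slot locks alone never trigger reclaim logic; the blueprint
// must also opt into pool size limiting, and forgetting this fails
// silently.
public function shouldLimitAllocatingPoolSize(
  DrydockBlueprint $blueprint) {
  return true;
}
```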

This is effectively paused until I'm more convinced that the stabilization changes really stabilized things -- I'm hoping to stabilize first, then work on improvements from there.