
Plans: Drydock for normal software use cases where builds take more than 45 seconds
Open, NormalPublic

Description

See PHI115. See PHI272. Maybe see T12145.

  • Drydock should ship with a spot instance EC2 allocator which works properly and makes it clear how to write this kind of allocator.
  • Drydock should ship with a one-lease-per-host allocator or a mechanism for configuring this.
  • Drydock should ship with at least one reclaim-on-release resource.
  • PHI272 has some other reasonable use cases which upstream builtins and/or documentation should make clear.

See PHI272. See PHI299. Drydock blueprints use CustomField to support type-level fields, but should use EditField instead.

See PHI182. DrydockRepositoryOperationStatusView has hard-coded errors. These should be modular.

See T11693. That has a bunch of generally sane things although it's maybe somewhat speculative.

See T11694. Some of D16594 can likely come upstream.

See PHI270. This discusses adding "temporarily unavailable" or some other similar maintenance state to blueprints.

See PHI129. See T8153. This discusses better (somewhat semantic) logging on resources.


Recovery Issues:

  • See T10559. Unclear what this is, but I'll make some effort to reproduce it.
  • See T11495. This appears to be a failure to recover correctly when cloning from a staging area fails?
  • See PHI312. Drydock can attempt to lease a resource which is currently being reclaimed, and does not recover gracefully.

Errata:

  • Resources in BROKEN can receive commands. Leases in BROKEN can not. Probably just an oversight?
  • (T13073#235779) Can we remove setActivateWhenAcquired() on Lease and make this do the right thing automatically?
  • Some logs render with HTML in them when running bin/drydock lease.

Compatibility breaks:

  • Lease->releaseOnDestruction() is now Lease->setReleaseOnDestruction(bool). Passing true preserves identical behavior to the old call.
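
For existing callers, the migration is mechanical:

```php
// Old call:
$lease->releaseOnDestruction();

// New call; passing `true` preserves identical behavior:
$lease->setReleaseOnDestruction(true);
```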

Documentation:

  • Because we can throw leases back in the pool if they acquire on dead resources, acquireLease() must not cause side effects other than acquiring slot locks (or we need a mechanism for reversing these side effects).
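
As a sketch of that contract (the slot lock call shape is an assumption modeled on patterns elsewhere in Drydock; the key format is illustrative):

```php
// Sketch: acquireLease() may take slot locks, and nothing else. If the
// lease is later thrown back into the pool, slot locks are the only
// state that needs to be unwound.
public function acquireLease(
  DrydockBlueprint $blueprint,
  DrydockResource $resource,
  DrydockLease $lease) {

  // Permitted: slot locks are easy to release on recovery.
  $lease->needSlotLock('hoax.slot('.$resource->getPHID().')');

  // Not permitted (without a reversal mechanism): creating working
  // copies, writing to the host, or mutating any external service.
}
```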

See PHI570, which identifies a possible issue with recycling resources.

See PHI885, which identifies an issue with git fetch hanging during repository operations. We should add timeout behavior and possibly try to figure out if git fetch is "actually doing something" versus hung.
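
A sketch of what the timeout half might look like, assuming `ExecFuture` timeout support in libphutil (the path and duration are placeholders):

```php
// Sketch: give "git fetch" a hard timeout so a hung fetch can not stall
// a repository operation forever. Telling "slow but alive" apart from
// "hung" would need progress inspection; a flat timeout is the simple
// first step.
$future = new ExecFuture('git fetch origin');
$future->setCWD($working_copy_path);  // placeholder
$future->setTimeout(phutil_units('15 minutes in seconds'));

list($err, $stdout, $stderr) = $future->resolve();
if ($err) {
  throw new Exception(pht('git fetch failed or timed out: %s', $stderr));
}
```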

See PHI882, which requests Kubernetes support.

See PHI917 and T8153#242083. Currently, disabling a binding for an Almanac host stops new leases, but doesn't cause us to tear down the host resource. It should.

Event Timeline

epriestley created this task.
epriestley renamed this task from Plans: Drydock to Plans: Drydock for normal software use cases where builds take more than 45 seconds. (Feb 12 2018, 11:26 PM)
epriestley updated the task description.

In trying to plot a course through this, I'd like to try to sequence this so that existing issues get fixed first, then breaking changes happen afterward (maybe in the next release). Some of these changes (particularly, the CustomField-to-EditField change) will be significantly disruptive to existing custom code. (That change is also pretty straightforward and new stuff depends on it, so I'd probably start there if there was no compatibility concern.)

If possible, I don't want to add new blueprints (EC2) or blueprint behaviors (reclaim-on-release, fancier configuration, etc.) until after the CustomField/EditField swap: adding them first would make the field swap harder, while adding them afterward will help test the swap.

The most immediate issue is bad behavior when a "reclaim" races a lease acquisition (PHI312). I think the fix here is probably one of:

  • reclaim must acquire a lease first; or
  • leases that are outraced should recover gracefully.

I'm going to look at recovery behavior first. In general, resources may vanish at any time and we need to be robust against this: no amount of clever locking on our side can stop lightning from striking a server, AWS from reclaiming an instance, etc. The goal of locking is to keep our view of the world internally consistent, not to prevent recovery from ever being required.

PHI312, T8153, T11495, and T10559 suggest that recovery behavior may currently be quite poor in several cases.

One difficulty in working with Drydock is that reproducing scenarios tends to be difficult and to require a large amount of configuration, and many resources Drydock interacts with are necessarily fairly slow to build. When testing an EC2 allocator it's fine that it actually spins up instances, but when testing recovery behavior there's basically no quick way to set it up so it's repeatable from the CLI. All of the tasks above require substantial setup and there is no way for me to share a configuration with anyone else.

A possible approach for this that I'm considering is to write a "Synthetic Host" blueprint, where each "host" is just a row in MySQL that responds in a pre-programmed way when commands are run. This probably isn't very powerful for testing very complex behaviors where multiple resources interact with one another, but I think it may cover 80%-90% of things that anyone actually runs into today. I'm not sure if this entirely makes sense in the upstream or would be better as a separate "Drydock SDK" extension.
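
As a rough sketch of the shape this might take (the class name and scripted-response details are hypothetical; the hook signatures follow DrydockBlueprintImplementation):

```php
// Hypothetical sketch: each "host" is a database row, and operations
// respond in pre-programmed ways so failure scenarios are repeatable
// from the CLI in seconds instead of minutes against real hardware.
final class DrydockSyntheticHostBlueprintImplementation
  extends DrydockBlueprintImplementation {

  public function getBlueprintName() {
    return pht('Synthetic Hosts (Testing)');
  }

  public function canAllocateResourceForLease(
    DrydockBlueprint $blueprint,
    DrydockLease $lease) {
    // Always claim we can allocate; specific failures are injected via
    // blueprint configuration to script a reproduction case.
    return true;
  }

  public function allocateResource(
    DrydockBlueprint $blueprint,
    DrydockLease $lease) {
    // "Building" the host is just writing a row.
    return $this->newResourceTemplate($blueprint, pht('Synthetic Host'))
      ->allocateResource();
  }

  // acquireLease(), activateResource(), etc., would consult a scripted
  // response table ("fail activation once, then succeed") to make
  // recovery behavior repeatable.
}
```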

A related issue is that adding test coverage for this stuff is hard. I think this might be more trouble than it's worth to get working: Drydock relies on a functioning task queue to process work, so we'd have to stub out the entire daemon queue to write unit tests. If that only makes it slightly easier to fix a handful of bugs in the next several years, it probably isn't worth it. I'd like to try to attack this via "Synthetic Host" plus integration-test-esque scripts instead -- these behaviors are unlikely to stop working suddenly, and the majority of the value in making test cases repeatable is in making debugging easier.

Upshot: I'm going to start by seeing how a mocked host feels as a tool for building reproduction cases for PHI312/T8153/T11495/T10559, and then go from there.

When blueprints claim to be able to allocate resources but fail during allocation, we currently destroy the lease. I think this is likely too severe: consider an EC2 blueprint where the allocation call fails for some temporary reason (EC2 just fails, or we hit the account limit for instances, or whatever else).

A better behavior would probably be to log the failure to the blueprint, then yield the lease so it can retry. If a blueprint says "yes, I'll be able to allocate" and then can't actually allocate, I think we should assume the blueprint is implemented correctly and just encountered unforeseeable difficulty, not that the blueprint is so poorly implemented that all leases which touch it are eternally tainted and must be destroyed.
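
Mechanically, that might look like this sketch (the logging call is hypothetical; `PhabricatorWorkerYieldException` is the existing retry-later mechanism in the worker queue):

```php
// Sketch: when a blueprint that said it could allocate then fails,
// record the failure and yield the lease so it retries, instead of
// destroying it.
try {
  $resource = $blueprint_impl->allocateResource($blueprint, $lease);
} catch (Exception $ex) {
  // Hypothetical logging call: charge the failure to the blueprint,
  // not fatally to the lease.
  $this->logToBlueprint($blueprint, $ex);

  // Retry allocation in 15 seconds.
  throw new PhabricatorWorkerYieldException(15);
}
```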


When bin/drydock lease is used to acquire a lease and it fails during resource allocation, almost no information is provided to the user:

EXCEPTION: (Exception) Lease has already been destroyed!

Useful information is provided in the daemon log:

EXCEPTION: (PhutilProxyException) Task "3161734" encountered a permanent failure and was cancelled.
{>} (PhabricatorWorkerPermanentFailureException) Permanent failure while activating lease ("PHID-DRYL-6uadrvcml7thenrfzovo"): All blueprints failed to allocate a suitable new resource when trying to allocate lease "PHID-DRYL-6uadrvcml7thenrfzovo".
PhutilMissingSymbolException: Failed to load class or interface 'DrydockHoaxPHIDType': the class or interface 'DrydockHoaxPHIDType' is not defined in the library map for any loaded phutil library.

This may go hand-in-hand with the allocation failure behavior, but if fixing that doesn't fix the bin/drydock behavior, we should address it separately. I suspect bin/drydock should poll the relevant log tables while waiting for lease acquisition.

When logs span multiple lines, they are rendered as unreadable garbage messes:

[Screenshot: multi-line log output rendered as a garbled mess]


When you ^C a bin/drydock lease ... command while the lease is waiting, the lease is not released. We should attempt to release it. This may be slightly tricky to handle since most bin/* commands don't process interrupts.
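
One possible shape, using plain `pcntl` (the real implementation would hook into whatever signal handling bin/* scripts share; the release call mirrors the compatibility note above):

```php
// Sketch: trap ^C while "bin/drydock lease" waits, and release the
// lease instead of leaving it dangling.
declare(ticks=1);

pcntl_signal(SIGINT, function ($signo) use ($lease) {
  // Mark the lease to be released rather than orphaned, then exit with
  // the conventional interrupted-by-signal status.
  $lease->setReleaseOnDestruction(true);
  exit(128 + $signo);
});
```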

Leases and resources don't use standard header UI tags to indicate status:

[Screenshot: lease and resource page headers without status tags]

They should use header tags for consistency and ease of identifying object states.

Alright, next issue:

My Hoax blueprint does not skip activation, but also does not implement activateResource(). The resource is destroyed as a result, which seems correct. Conceivably, we could distinguish between "temporary activation failure" and "permanent activation failure" and recover from this more cheaply in some cases, but assuming that activation failure is permanent and throwing resources away completely seems pretty reasonable for now. So I think this behavior is okay.

However, this also causes the lease to fail permanently and be destroyed:

Trying to acquire an active lease on a pending resource. You can not immediately activate leases on resources which need time to start up.

I think this is also correct, since I left a setActivateWhenAcquired(true) call inside acquireLease() when I copy/pasted.

It might be nice to make the meaning and behavior of setActivateWhenAcquired() and setActivateWhenAllocated() more clear. They both mean "this resource/lease doesn't need to do any setup, so you can skip straight over the setup step". This error means "you tried to skip the setup step, but you can't".

Actually, it's not clear to me that we need this flag on Lease at all: I think Lease can always figure out whether it can skip setup or not by checking if the resource is in setup. This would simplify things.
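
In sketch form, the automatic decision is just a status check (the constant name is an assumption based on the resource states discussed here; `activateOnResource()` appears later in this task):

```php
// Sketch: the lease inspects its resource instead of relying on a
// setActivateWhenAcquired() flag set by the blueprint.
$can_skip_setup =
  ($resource->getStatus() === DrydockResourceStatus::STATUS_ACTIVE);
if ($can_skip_setup) {
  // The resource needs no further setup; activate immediately.
  $lease->activateOnResource($resource);
}
// Otherwise, wait for the resource to finish setup and activate first.
```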

I remain slightly unsure about this so I'm going to leave it for now, but tentatively I'd like to:

  • Make this mechanism do the right thing automatically for Lease.
  • Possibly rename this mechanism to something more clear for Resource, perhaps like setNoSetupRequired(true).

I just removed the setActivateWhenAcquired(true) for now without changing behavior. Next issue:

My blueprint still can't activate since I haven't implemented activateResource() yet, nor opted out of activation.

This kills the resource (fine), and then we try to lease it (fine). This permanently destroys the lease, which seems extreme. I'd prefer the lease return to the pool; it is reasonable for activation to fail for unforeseen reasons.

I am going to change the behavior so that hitting a resource in the wrong state during lease activation throws the lease back into the lease pool instead of breaking it. This is consistent with a scenario like this:

  • You lease a host.
  • The blueprint brings up a new EC2 host.
  • Setup for the blueprint discovers that the EC2 host came up without an IPv4 IP address (which I've seen multiple times, see T12316#213286).
  • The blueprint throws the host away.
  • The lease is now stuck acquiring a broken resource.

This isn't the lease's fault, and it can just go back in the pool. It's reasonable to believe that it will succeed in the future if a success path exists in allocation.

Note that there's another similar scenario:

  • You lease a host.
  • It comes up in EC2.
  • The lease acquires and activates.
  • Lightning strikes, destroying the host utterly.

In this case, the lease isn't going to get to go back into the pool. You'll get a build failure or whatever and operations staff will need to install lightning rods to protect the hardware; this broadly seems reasonable.

Slightly more broadly: if the lease starts actually activating, we can't be sure that we can undo its state in the general case, so we need to throw it away. But this failure (where the resource it has acquired doesn't activate) carries no such risk, and we can safely just throw it back in the pool. I think the only state we need to wipe out is its slot locks, which are easy to clean up.
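
So the recovery path is roughly this sketch (the slot lock release, status, and linking calls are assumptions modeled on names appearing elsewhere in Drydock):

```php
// Sketch: unwind an acquisition whose resource died before the lease
// started activating.

// 1. Drop the lease's slot locks -- per the above, the only state that
//    acquisition is allowed to create.
DrydockSlotLock::releaseLocks(array($lease->getPHID()));

// 2. Detach the lease from the dead resource and return it to the
//    pending pool.
$lease
  ->setResourcePHID(null)
  ->setStatus(DrydockLeaseStatus::STATUS_PENDING)
  ->save();

// 3. Yield so the worker retries acquisition later.
throw new PhabricatorWorkerYieldException(15);
```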

My HoaxBlueprint does not implement activateLease(). This causes leases to be broken and destroyed.

Per above, I think we can't be sure that we can safely do anything else: activateLease() implementations (like DrydockWorkingCopyBlueprintImplementation) may do a significant amount of work and set properties on leases. For now, I plan to retain this behavior and just keep an eye on it. If activateLease() methods in the wild routinely encounter temporary errors, we could set different expectations for this method, or let it signal a retry-able failure, or whatever else.

The upstream failures that can arise here are basically git commands breaking. It's very hard for us to know if that's temporary (network issue, you haven't added the credentials on the remote) or permanent (wrong remote, you added the wrong credentials locally).

So, for now, guidance will be:

  • Succeed and call $lease->activateOnResource(...) to activate the lease.
  • Or fail and throw to break and destroy the lease.
  • Yielding probably works, but ehhh.
  • There is no way to fail gently and return the lease to the pool. This may change if use cases arise.
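
A conforming `activateLease()` then looks roughly like this (the setup helper is hypothetical; `activateOnResource()` is quoted from the guidance above):

```php
// Sketch of the guidance: either succeed and activate, or throw and
// accept that the lease will be broken and destroyed. There is no
// gentle-failure path back to the pool.
public function activateLease(
  DrydockBlueprint $blueprint,
  DrydockResource $resource,
  DrydockLease $lease) {

  // Setup work, e.g. building a working copy. If a git command fails
  // here, we can not tell a temporary network blip apart from a
  // permanently wrong remote, so any throw is treated as permanent.
  $this->newWorkingCopy($lease);  // hypothetical helper

  // Succeed: activate the lease on the resource.
  $lease->activateOnResource($resource);
}
```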

After the patches above, I can't reproduce any bad behavior for T11495 or T10559.

I can still make resource reclamation and allocation race to some degree (PHI312), although half of this is no longer problematic: if the resource releases first, the lease goes back in the pool now.

But if the lease acquires first, it can be immediately released by the reclamation, which isn't much of an improvement.

To fix this, I plan to make lease acquisition hold the resource lock while actually linking the lease to the resource. If the lock fails, the lease will retry later, just as though a slot lock had failed.
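
Sketched with `PhabricatorGlobalLock` standing in for the actual resource lock (the linking call is illustrative):

```php
// Sketch: hold the resource lock while linking the lease to the
// resource, so acquisition can not race reclamation.
$lock = PhabricatorGlobalLock::newLock(
  'drydock.resource('.$resource->getPHID().')');

try {
  $lock->lock(15);
} catch (PhutilLockException $ex) {
  // Reclamation (or another acquisition) holds the lock; retry later,
  // just as though a slot lock had failed.
  throw new PhabricatorWorkerYieldException(15);
}

try {
  // Re-verify the resource is still alive, then link the lease to it.
  $lease->setResourcePHID($resource->getPHID())->save();
} finally {
  $lock->unlock();
}
```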

However, I identified a separate issue with slot locks that I'm going to fix first.

When a blueprint uses concurrency-limiting slot locks, it currently just returns a random lock if it can't find a slot it suspects is free. This makes "all slots look full, waiting for something to free up" ambiguous with actual collisions. It would probably be better to distinguish between these cases and make it more clear which one we're in.

Also, if you implement slot locks but don't call shouldLimitAllocatingPoolSize(), you can never hit the reclaim logic, which is a little unobvious. Some of these mechanisms should likely be more built-in, since it's easy to miss a step without any obvious indication that your blueprint won't work like you expect. See the sketch below.
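
For example, a blueprint that wants reclaim to work currently has to remember both halves; sketched here as an override (the text says "call", so treat the exact hook shape as an assumption):

```php
// Sketch: slot locks alone never trigger reclaim logic; the blueprint
// must also opt into pool size limiting, and forgetting this fails
// silently.
public function shouldLimitAllocatingPoolSize(
  DrydockBlueprint $blueprint) {
  return true;
}
```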

This is effectively paused until I'm more convinced that the stabilization changes really stabilized things -- I'm hoping to stabilize first, then work on improvements from there.