Unprototype Drydock (v1)
Closed, Resolved (Public)

Authored By

	epriestley
	Aug 24 2015, 4:43 PM

Description

Drydock is a resource allocation system for hardware and software. It is mostly an infrastructure component which supports other applications, not an application that normal users are expected to interact with much.

The primary use case for Drydock is creating, managing, and destroying repository working copies for build systems. In particular, these are the short-term use cases:

(T9123) Harbormaster should be able to ask Drydock to give it a working copy containing an arbitrary commit, then run build processes in that working copy.
(T182) Differential should be able to ask Drydock to give it a working copy so it can commit a revision.

In the long term, Drydock will be able to build resources incrementally: you tell it how to allocate hosts and other hardware resources, and it manages pools of hardware and software to satisfy these requests.

For v1, the focus is on enabling T9123 + T182 by allocating working copies, not incremental resource construction or hardware resource management. Roughly, this means:

Hardware is in static, pre-allocated pools in Almanac.
Push as much dynamic/incremental allocation to later versions as possible.

Revisions and Commits

rP Phabricator
	D14349	rPa763f9510e76 Add some Drydock documentation plus "Test Configuration" for repository…
	D14334	rPc059149eb98e Remove Drydock host resource limits and give working copies simple limits
	D14274	rPac7edf54afe4 Fix bad counting in SQL when enforcing Drydock allocator soft limits
	D14272	rP083a321dad1b Fix an issue where newly created Drydock resources could be improperly acquired
	D14237	rP4d5278af1148 Put Drydock build steps into their own group in Harbormaster
	D14236	rPee937e99fb9a Fix unbounded expansion of allocating resource pool
	D14235	rPde2bbfef7d14 Allow PhabricatorWorker->queueTask() to take full $options
	D14234	rP4cf1270ecdd8 In Harbormaster, make sure artifacts are destroyed even if a build is aborted
	D14224	rPbb4667cb8490 Fix WorkingCopy step to read correct commit variables
	D14215	rPc95fcb8970ca Add a little Drydock documentation
	D14214	rP449617692489 Add staging area support to Harbormaster/Drydock + various fixes
	D14213	rPd4a0b1c8709b Remove names from Drydock resources
	D14212	rPb219bcfb3d70 Improve error and exception handling for Drydock leases
	D14211	rPe589d152310a Improve error and exception handling for Drydock resources
	D14210	rP6b775e609053 Add more Drydock log types and some additional logging
	D14202	rP4ac82be5ed22 Merge the DrydockLease workers into a single worker
	D14201	rP91e5ca0ee28c Merge the DrydockResource workers into a single worker
	D14198	rP8bf59050247d Add Drydock log types and more logging
	D14197	rP06f927250290 Garbage collect Drydock logs after 30 days
	D14196	rP2ef5b5321d1f Move Drydock logs to PHIDs and increased structure
	D14194	rP9d997df9643b Reset Drydock git working copies better
	D14180	rP33be8f719ff3 Allow WorkingCopy resources to have multiple working copies
	D14178	rP9b29d46e60f3 Make Drydock lease infrastructure more nimble
	D14177	rPcd2dd2a08f81 Give visual feedback when a Drydock resource or lease is releasing
	D14161	rPd735c7adf2d5 Allow Harbormaster to run commands on Drydock working copies
	D14160	rP284fe0fe51ce Allow Harbormaster to lease working copies from Drydock
	D14158	rP64ed97103993 Show recent active leases on Drydock resource detail
	D14157	rP3b2f4c258f1b Show recent active resources on Drydock blueprint detail, with link to all
	D14156	rPb441e8b81e31 Allow Drydock blueprints to be disabled
	D14155	rP1491269b72e4 Modernize Drydock SearchEngine implementations
	D14154	rPb71ce90b9cc1 Straighten out Drydock policies for Resources
	D14153	rPe117ace8c7fb Convert Drydock lease and resource constants to strings
	D14151	rPc6aade439283 Give Drydock leases a resourcePHID instead of a resourceID
	D14150	rP309aadc595a1 Rename Drydock Lease STATUS_EXPIRED to STATUS_DESTROYED
	D14147	rPfcb6d1e2faa5 Strip some obsolete code out of Drydock
	D14144	rP1f311d64c608 Give Drydock resources and leases a real "destroy" lifecycle phase
	D14143	rP789df89c84b5 Add a command queue to Drydock to manage lease/resource release

Related Objects

Status	Assigned	Task
Duplicate	None	T7869 Support CircleCI webhooks for Test results (so that one can run unit tests asynchronously)
Resolved	epriestley	T9456 Evaluate upstream support for third-party build systems
Resolved	epriestley	T9123 Build Phabricator in Harbormaster (v2)
Open	None	T6008 Editing files and contributing changes via web
Open	epriestley	T182 Commit into repository directly from differential
Resolved	epriestley	T9252 Unprototype Drydock (v1)
Resolved	epriestley	T9253 Implement a Drydock blueprint for leasing hosts from a static, pre-built Almanac pool
Resolved	epriestley	T9431 Deploy an "sbuild" tier outside of the cluster
Resolved	None	T9519 Design the Drydock Blueprint selection mechanism

Event Timeline

epriestley added a commit: rPb71ce90b9cc1: Straighten out Drydock policies for Resources. Sep 24 2015, 4:56 PM

epriestley added a commit: rP1491269b72e4: Modernize Drydock SearchEngine implementations.

epriestley added a commit: rPb441e8b81e31: Allow Drydock blueprints to be disabled. Sep 24 2015, 5:18 PM

epriestley closed subtask T9253: Implement a Drydock blueprint for leasing hosts from a static, pre-built Almanac pool as Resolved. Sep 24 2015, 7:22 PM

epriestley added a revision: D14157: Show recent active resources on Drydock blueprint detail, with link to all. Sep 24 2015, 8:34 PM

epriestley added a revision: D14158: Show recent active leases on Drydock resource detail. Sep 24 2015, 8:51 PM

epriestley added a commit: rP3b2f4c258f1b: Show recent active resources on Drydock blueprint detail, with link to all. Sep 24 2015, 8:52 PM

epriestley added a commit: rP64ed97103993: Show recent active leases on Drydock resource detail. Sep 24 2015, 10:29 PM

epriestley added a revision: D14160: Allow Harbormaster to lease working copies from Drydock. Sep 24 2015, 11:33 PM

epriestley added a revision: D14161: Allow Harbormaster to run commands on Drydock working copies. Sep 25 2015, 12:16 AM

epriestley added a commit: rP284fe0fe51ce: Allow Harbormaster to lease working copies from Drydock. Sep 25 2015, 12:29 AM

epriestley added a commit: rPd735c7adf2d5: Allow Harbormaster to run commands on Drydock working copies. Sep 25 2015, 5:43 PM

epriestley mentioned this in 2015 Week 39 (Very Late September). Sep 26 2015, 1:35 PM

These things are now complete:

Resources and leases have a real destruction phase.
Resources now have sensible policies.
Leases now have formal expiration behaviors.
Resource and lease statuses are now consistent (and are now strings).
Leases now have a resourcePHID.
Blueprints can be disabled, preventing them from allocating new resources or acquiring new leases.
Various UI improvements.

These things remain:

Logging is still mostly untouched and way under the level it should be at.
Yields / temporary failures / permanent failures are still very coarse.
Lease policies are still a bit odd.
Resources still do not have expiration behaviors.
Security landscape isn't documented yet.
All the direct writes for phase changes are still non-transactional.
Recovery/retry behavior is pretty good if failures happen right away, but not as good if something allocates and then breaks later.

Broadly, T7399 has progressed far enough to let code run on the sbuild tier without fear that we're potentially leaking live cluster key material. I configured blueprints and a build plan on this host and we can now successfully execute "builds":

https://secure.phabricator.com/harbormaster/build/9161/

These "builds" have a lot of weird stuff going on still: for example, we queue a working copy lease, it acquires about 500ms later, then we spend 14500ms waiting around for no reason and 500ms doing the "build". However, this is easy to fix by tweaking yield heuristics or letting the allocator awaken the Harbormaster worker after allocation (which might be trivial).
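To make the timing concrete: the numbers above are consistent with the worker yielding for a fixed interval of roughly 15 seconds (an inference from 500ms + 14500ms, not something stated in this task). A minimal sketch, using hypothetical function names that are not Drydock or Harbormaster APIs, comparing idle time under a fixed yield with a short, doubling yield:

```php
<?php

/**
 * Return how long (in ms) a worker idles after a lease becomes ready,
 * given the times at which the worker wakes up to check on it.
 * All names and numbers here are illustrative only.
 */
function idle_after_ready($ready_at_ms, array $wake_times_ms) {
  foreach ($wake_times_ms as $wake_at_ms) {
    if ($wake_at_ms >= $ready_at_ms) {
      return $wake_at_ms - $ready_at_ms;
    }
  }
  // The worker never woke up after the lease was ready.
  return null;
}

/**
 * Build wake times for a yield that starts at $first_ms and doubles
 * after every wakeup, capped at $max_ms, for $count wakeups.
 */
function doubling_wake_times($first_ms, $max_ms, $count) {
  $times = array();
  $now = 0;
  $yield = $first_ms;
  for ($ii = 0; $ii < $count; $ii++) {
    $now += $yield;
    $times[] = $now;
    $yield = min($yield * 2, $max_ms);
  }
  return $times;
}

// Lease is ready at 500ms in both cases.
$fixed = idle_after_ready(500, array(15000, 30000));
$doubling = idle_after_ready(500, doubling_wake_times(250, 15000, 8));
```

Having the allocator wake the Harbormaster worker directly, the other option mentioned above, would avoid polling altogether; this sketch only covers the heuristic variant.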

The major blockers for T9123 (building Phabricator in Phabricator) on the Drydock side are:

Blueprint/resource selection, per above. We don't want normal builds running on the saux (higher-trust) tier, but there's currently no way to prevent it. This doesn't block T9123 but does block T182.
Running arc is harder for us than for other projects: we can't just put arc on the host in $PATH when building libphutil or arcanist, since it needs to run the version of arc being tested. We also want to use libphutil and arcanist at HEAD of master, even when running tests on phabricator -- not just whatever was last deployed to the box. I don't have a concrete plan for this yet. I think it probably takes the form of letting WorkingCopy blueprints be collections of working copies instead of single working copies.

epriestley mentioned this in T9123: Build Phabricator in Harbormaster (v2). Sep 26 2015, 4:00 PM

epriestley added a revision: D14177: Give visual feedback when a Drydock resource or lease is releasing. Sep 28 2015, 1:34 PM

epriestley added a revision: D14178: Make Drydock lease infrastructure more nimble. Sep 28 2015, 2:28 PM

epriestley added a revision: D14180: Allow WorkingCopy resources to have multiple working copies. Sep 28 2015, 3:41 PM

epriestley added a commit: rPcd2dd2a08f81: Give visual feedback when a Drydock resource or lease is releasing. Sep 28 2015, 4:35 PM

epriestley added a commit: rP9b29d46e60f3: Make Drydock lease infrastructure more nimble.

epriestley added a commit: rP33be8f719ff3: Allow WorkingCopy resources to have multiple working copies.

> Resources still do not have expiration behaviors.

This is fixed, with some caveats about range of capabilities in D14176.

> we queue a working copy lease, it acquires about 500ms later, then we spend 14500ms waiting around for no reason and 500ms doing the "build".

This is fixed, and we now do acquire + activate + "build" + release in ~1-2 seconds for libphutil/ on this host.

> I think it probably takes the form of letting WorkingCopy blueprints be collections of working copies instead of single working copies.

This part is implemented now, although I haven't figured out how users are going to configure it.

> Blueprint/resource selection, per above.

I have some ideas on this but nothing concrete yet.

> Logging is still mostly untouched and way under the level it should be at.
> Yields / temporary failures / permanent failures are still very coarse.
> Recovery/retry behavior is pretty good if failures happen right away, but not as good if something allocates and then breaks later.

This stuff is still highly sketchy and probably up next.

epriestley mentioned this in T2015: Implement Drydock. Sep 29 2015, 3:01 PM

J5lx added a subscriber: J5lx. Sep 29 2015, 9:07 PM

baylisscg added a subscriber: baylisscg. Sep 30 2015, 4:41 AM

epriestley added a revision: D14194: Reset Drydock git working copies better. Sep 30 2015, 1:26 PM

epriestley added a revision: D14196: Move Drydock logs to PHIDs and increased structure. Sep 30 2015, 2:43 PM

epriestley added a commit: rP9d997df9643b: Reset Drydock git working copies better. Sep 30 2015, 2:45 PM

epriestley added a revision: D14197: Garbage collect Drydock logs after 30 days. Sep 30 2015, 2:52 PM

epriestley added a revision: D14198: Add Drydock log types and more logging. Sep 30 2015, 4:28 PM

epriestley added a revision: D14201: Merge the DrydockResource workers into a single worker. Sep 30 2015, 7:44 PM

epriestley added a revision: D14202: Merge the DrydockLease workers into a single worker. Sep 30 2015, 9:42 PM

epriestley added a revision: D14210: Add more Drydock log types and some additional logging. Oct 1 2015, 12:03 PM

epriestley added a revision: D14211: Improve error and exception handling for Drydock resources. Oct 1 2015, 12:46 PM

epriestley added a revision: D14212: Improve error and exception handling for Drydock leases. Oct 1 2015, 12:57 PM

epriestley added a revision: D14213: Remove names from Drydock resources. Oct 1 2015, 1:20 PM

epriestley added a commit: rP2ef5b5321d1f: Move Drydock logs to PHIDs and increased structure. Oct 1 2015, 3:06 PM

epriestley added a commit: rP06f927250290: Garbage collect Drydock logs after 30 days. Oct 1 2015, 3:09 PM

epriestley added a commit: rP8bf59050247d: Add Drydock log types and more logging.

epriestley added a commit: rP91e5ca0ee28c: Merge the DrydockResource workers into a single worker.

epriestley added a commit: rP4ac82be5ed22: Merge the DrydockLease workers into a single worker.

epriestley added a commit: rP6b775e609053: Add more Drydock log types and some additional logging. Oct 1 2015, 3:11 PM

epriestley added a commit: rPe589d152310a: Improve error and exception handling for Drydock resources.

epriestley added a commit: rPb219bcfb3d70: Improve error and exception handling for Drydock leases.

epriestley added a commit: rPd4a0b1c8709b: Remove names from Drydock resources. Oct 1 2015, 3:13 PM

epriestley removed a subtask: T7399: Fully separate live credentials from development repositories. Oct 1 2015, 4:09 PM

epriestley added a revision: D14214: Add staging area support to Harbormaster/Drydock + various fixes. Oct 1 2015, 6:21 PM

epriestley added a revision: D14215: Add a little Drydock documentation. Oct 1 2015, 7:28 PM

epriestley added a commit: rP449617692489: Add staging area support to Harbormaster/Drydock + various fixes. Oct 1 2015, 11:55 PM

epriestley added a commit: rPc95fcb8970ca: Add a little Drydock documentation.

epriestley added a revision: D14224: Fix WorkingCopy step to read correct commit variables. Oct 2 2015, 1:36 PM

epriestley added a commit: rPbb4667cb8490: Fix WorkingCopy step to read correct commit variables. Oct 2 2015, 1:37 PM

@epriestley I just tried to start reimplementing some of my patches on top of HEAD and I've run into a problem:

I need to yield within allocateResource, because the blueprint needs to wait for the IP address to be assigned to the host, but we can't call sleep. However, the allocateResource method doesn't have a resource because it creates one, and presumably if I yield, there's no guarantee that the allocator will continue in the same place?

Allocate, but don't setActivateWhenAllocated(). You'll get a callback to activateResource() later. Check for an IP. If you have one, set it on the resource and call activateResource() to finish activation. If you don't have one yet, throw a yield and you'll get another call later. Repeat until you get an IP. Does that sound approximately reasonable?

e.g.

public function activateResource(
  DrydockBlueprint $blueprint,
  DrydockResource $resource) {

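  // Identifier stored on the resource when it was allocated.
  $ec2_key = $resource->getAttribute('key-in-ec2');

  // hey_ec2_is_there_an_ip_yet() is a placeholder for the real EC2 lookup.
  $ip = hey_ec2_is_there_an_ip_yet($ec2_key);
  if (!$ip) {
    // No IP yet: yield so this method is called again later. The 15 second
    // delay is an arbitrary example value.
    throw new PhabricatorWorkerYieldException(15);
  }

  // Record the IP and finish activation.
  $resource
    ->setAttribute('ip', $ip)
    ->activateResource();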
}
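The same retry pattern can be exercised outside Phabricator. The following is a self-contained sketch, not Drydock code: `SketchYieldException`, `sketch_activate()`, `sketch_run_until_active()` and the polling closure are hypothetical stand-ins for the worker yield, the blueprint's `activateResource()`, the task queue, and the EC2 lookup.

```php
<?php

// Hypothetical stand-in for the exception a worker throws to yield.
final class SketchYieldException extends Exception {}

/**
 * Stand-in for activateResource(): record the IP if the provider has
 * assigned one, otherwise yield so the caller retries later.
 */
function sketch_activate(array &$resource, callable $lookup_ip) {
  $ip = $lookup_ip($resource['key']);
  if ($ip === null) {
    throw new SketchYieldException('No IP yet.');
  }
  $resource['ip'] = $ip;
  $resource['status'] = 'active';
}

/**
 * Stand-in for the task queue: call activation again after each yield,
 * up to $max_attempts. Returns the number of attempts used.
 */
function sketch_run_until_active(
  array &$resource,
  callable $lookup_ip,
  $max_attempts) {

  for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
    try {
      sketch_activate($resource, $lookup_ip);
      return $attempt;
    } catch (SketchYieldException $ex) {
      // A real queue would reschedule the task instead of looping.
    }
  }
  throw new Exception('Resource did not activate in time.');
}

// Fake provider: the IP shows up on the third poll.
$polls = 0;
$lookup = function ($key) use (&$polls) {
  $polls++;
  return ($polls >= 3) ? '10.0.0.7' : null;
};

$resource = array('key' => 'i-example', 'status' => 'pending');
$attempts = sketch_run_until_active($resource, $lookup, 5);
```

The point of the pattern is that activation keeps no state between calls: everything it needs is read from the resource's attributes, so it does not matter where the allocator resumes.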

Couple of issues I've seen so far:

If a WorkingCopy build step is restarted while getting a working copy, it doesn't clean up the lease. This is because we don't emit an artifact until the very end. We either need to emit the artifact sooner or have a separate cleanup step for other target resources. I'm inclined to just emit the artifact sooner. The build won't move forward until the build step completes, anyway, so it's OK that there's no formal "incomplete artifact" state.
If we try to run two concurrent builds, the WorkingCopy blueprint is currently fine with bringing up an unlimited number of resources, but hosts are currently limited to one lease. This can give us resources which will never activate, since they're waiting for a host indefinitely. These limits don't make sense as-is anyway, but this interaction is sort of subtle and may need some finesse to resolve.
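The first issue can be illustrated with a toy model. This is a sketch only: `SketchTarget` and the functions below are hypothetical and much simpler than Harbormaster's real artifact handling. It shows why cleanup misses a lease when the artifact is emitted at the end of the step, and finds it when the artifact is emitted before waiting.

```php
<?php

// Hypothetical, simplified model of a build target and its artifacts.
final class SketchTarget {
  public $artifacts = array();
  public $released = array();

  public function emitArtifact($lease_id) {
    $this->artifacts[] = $lease_id;
  }

  // Cleanup only knows about leases that were emitted as artifacts.
  public function releaseArtifacts() {
    foreach ($this->artifacts as $lease_id) {
      $this->released[] = $lease_id;
    }
    $this->artifacts = array();
  }
}

/**
 * Run a working copy step. $emit_early controls whether the lease is
 * recorded before or after the (interruptible) wait for activation.
 */
function sketch_run_step(
  SketchTarget $target,
  $lease_id,
  $emit_early,
  $interrupted) {

  if ($emit_early) {
    $target->emitArtifact($lease_id);
  }
  if ($interrupted) {
    // The step was restarted or aborted while waiting for the lease.
    throw new Exception('Step interrupted.');
  }
  if (!$emit_early) {
    $target->emitArtifact($lease_id);
  }
}

function sketch_restart_and_cleanup($emit_early) {
  $target = new SketchTarget();
  try {
    sketch_run_step($target, 'lease-1', $emit_early, true);
  } catch (Exception $ex) {
    // Fall through to cleanup, as an abort or restart would.
  }
  $target->releaseArtifacts();
  return $target->released;
}

$late = sketch_restart_and_cleanup(false);  // Lease is never released.
$early = sketch_restart_and_cleanup(true);  // Lease is released.
```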

epriestley mentioned this in 2015 Week 40 (Very Early October). Oct 3 2015, 12:07 PM

gabe added a subscriber: gabe. Oct 4 2015, 12:17 AM

epriestley added a revision: D14234: In Harbormaster, make sure artifacts are destroyed even if a build is aborted. Oct 4 2015, 5:56 PM

epriestley added a commit: rP4cf1270ecdd8: In Harbormaster, make sure artifacts are destroyed even if a build is aborted. Oct 5 2015, 12:58 PM

benstiglitz added a subscriber: benstiglitz. Oct 5 2015, 3:01 PM

epriestley added a revision: D14235: Allow PhabricatorWorker->queueTask() to take full $options. Oct 5 2015, 3:29 PM

epriestley added a revision: D14236: Fix unbounded expansion of allocating resource pool. Oct 5 2015, 4:07 PM

epriestley added a commit: rPde2bbfef7d14: Allow PhabricatorWorker->queueTask() to take full $options. Oct 5 2015, 4:46 PM

epriestley added a revision: D14237: Put Drydock build steps into their own group in Harbormaster. Oct 5 2015, 6:17 PM

thoughtpolice awarded a token. Oct 5 2015, 10:32 PM

epriestley added a commit: rPee937e99fb9a: Fix unbounded expansion of allocating resource pool. Oct 5 2015, 10:59 PM

epriestley added a commit: rP4d5278af1148: Put Drydock build steps into their own group in Harbormaster.

We are now building all of the repositories, and all revisions submitted by members of Community.

This stuff is now fixed:

Harbormaster and Drydock now guarantee destruction of leases despite aborts/releases.
We no longer degrade if there is a burst of requests, but see substantial discussion in D14236 about refining this in the future.
Logging is better, although still needs some work.
I'm probably not going to make state-change writes transactional in v1 since logging does a pretty reasonable job of covering that now.
Error handling and distinguishing between temporary and permanent failures is greatly improved. It will still take some time to stabilize, but recent issues have been about cleaning up edge cases, not fundamental mishandling of error states.
We're better about dealing with some kinds of resource breaks after activation. These breaks are hard to encounter in the upstream today (all reasonable breaks require operator intervention to resolve anyway) so I don't expect to make this too much more robust in the short term.
There's a tiny bit of documentation.

This stuff still needs work:

Per above, log observability is better but still isn't great.
Documentation is still mostly nonexistent.
Blueprint/resource selection stuff still doesn't meaningfully exist.
A bunch of limits (mostly, see D14236) are hard-coded and set to nonsense values (usually "1").
Lease policies are still a bit odd (although maybe they're just always going to be a bit odd?)

epriestley mentioned this in T9519: Design the Drydock Blueprint selection mechanism. Oct 6 2015, 11:24 AM

axel.hooge added a subscriber: axel.hooge. Oct 6 2015, 7:53 PM

wienczny added a subscriber: wienczny. Oct 7 2015, 12:35 AM

jasonfsmitty added a subscriber: jasonfsmitty. Oct 7 2015, 11:09 AM

epriestley mentioned this in T731: Allow revisions to have alternate acceptance conditions. Oct 8 2015, 7:23 PM

vhbit added a subscriber: vhbit. Oct 9 2015, 8:04 AM

epriestley mentioned this in 2015 Week 41 (Early October). Oct 10 2015, 11:47 AM

epriestley moved this task from Preflight to Paused on the Prioritized board. Oct 10 2015, 1:30 PM

epriestley mentioned this in Blog Post: Development Notes (2015 Week 41). Oct 10 2015, 1:33 PM

epriestley added a revision: D14272: Fix an issue where newly created Drydock resources could be improperly acquired. Oct 14 2015, 12:02 PM

epriestley added a revision: D14274: Fix bad counting in SQL when enforcing Drydock allocator soft limits. Oct 14 2015, 12:26 PM

epriestley added a commit: rP083a321dad1b: Fix an issue where newly created Drydock resources could be improperly acquired. Oct 14 2015, 1:16 PM

epriestley added a commit: rPac7edf54afe4: Fix bad counting in SQL when enforcing Drydock allocator soft limits.

Progress here:

There's now a little bit more documentation.
Blueprint selection feels reasonable for v1 generally (see T9519 for discussion).
I think after D14272 + D14274 the allocator behaves correctly in production (on this one install, in a very limited role, etc). It's a little early to say that I actually fixed all the bugs, but the behavior appeared nearly correct before and the effects of the bugs those changes fixed were pretty straightforward.

Overall, except for the stuff fixed above, things have been working well for a while. Drydock now handles multiple task types (revision builds, commit builds, lands) across multiple pools (saux, sbuild) and seems to be functioning as designed. By all appearances, we could dump as much hardware into these pools as we wanted and scale until MySQL eventually falls over as a coordination server.

Stuff I'm still looking at:

Logs will probably get a little more work, although they've been not-terrible for the last few issues I've hit.
I'll continue fleshing out the documentation; it's about halfway to where it probably should be for an unprototype.
Not really concerned about lease policies for now, probably a v2+ thing if we deal with it.
I need to move the hard-coded limits of 1 to config. That's easy, but I want to think about what it will look like in v2/v3 and try to make sure we're moving in that direction rather than somewhere we'll need to migrate away from later.
Resource cleanup/release has a big manual component for now but that actually feels fine in practice today. This isn't scalable in the long term but I'm not overly concerned about solving it completely for v1.
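As a sketch of what moving the hard-coded limit into configuration could look like: the `limit` key and the function below are hypothetical, chosen for illustration, and say nothing about how the real allocator enforces its limits (see D14236 and D14274 for that).

```php
<?php

/**
 * Decide whether a blueprint may allocate another resource.
 *
 * $config is the blueprint's configuration; a missing or null 'limit'
 * means "no limit". $active_count is the number of resources that
 * currently count against the limit.
 */
function sketch_can_allocate(array $config, $active_count) {
  $limit = isset($config['limit']) ? $config['limit'] : null;
  if ($limit === null) {
    return true;
  }
  return ($active_count < (int)$limit);
}

// Today's behavior, expressed as configuration instead of a constant.
$one_at_a_time = array('limit' => 1);
$may_start_first = sketch_can_allocate($one_at_a_time, 0);
$may_start_second = sketch_can_allocate($one_at_a_time, 1);
```

Treating an absent value as "no limit" keeps existing blueprints working unchanged if such a setting were introduced later; that default is a design assumption of this sketch.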

jra3 added a subscriber: jra3. Oct 14 2015, 8:50 PM

epriestley mentioned this in 2015 Week 42 (Mid October). Oct 17 2015, 11:06 AM

epriestley added a revision: D14334: Remove Drydock host resource limits and give working copies simple limits. Oct 25 2015, 6:12 PM

epriestley closed subtask T9519: Design the Drydock Blueprint selection mechanism as Resolved. Oct 26 2015, 7:38 PM

epriestley added a commit: rPc059149eb98e: Remove Drydock host resource limits and give working copies simple limits.

epriestley added a revision: D14349: Add some Drydock documentation plus "Test Configuration" for repository automation. Oct 27 2015, 5:08 PM

epriestley added a commit: rPa763f9510e76: Add some Drydock documentation plus "Test Configuration" for repository…. Oct 27 2015, 6:04 PM

I'm going to close this out: I think we're generally in good shape here, we're now hosting builds and doing server-side lands in the upstream, and I don't anticipate much more Drydock-specific work in this iteration.

I'm not actually unprototyping Drydock yet (and may not for a while) since we also have to unprototype Almanac for it to be useful, and they both interact with Phacility. I want to let it stabilize for a while before we try to do that integration.

eadler added a subscriber: eadler. Dec 16 2015, 8:16 PM

urzds added a subscriber: urzds. Jul 12 2017, 11:12 AM

