Deploy Drydock in the Phacility cluster
Closed, Resolved (Public)

Description

We've run Drydock in production for this host for 2-3 months now without running into issues. While parts of it (particularly some UI components) are still rough, we have substantial evidence that it works as designed.

We have customer interest in deploying it in the Phacility cluster, in a "bring your own hosts" configuration like the one we use in the upstream. Specifically, customers would do this:

  • Launch their own build hosts somewhere (e.g., in EC2, or a proprietary datacenter, or a closet in their office).
  • Configure an Almanac service pool pointing at the hosts they've brought up.
  • Drydock now runs builds on the third-party hardware.

This is currently not possible because Drydock is still a prototype. There are no real technical blockers within Drydock preventing this from happening, but there are some other blockers elsewhere. Here's a rough pathway toward support:

Stuff Already in Pipeline: We have some work already in the pipeline that I want to complete before pursuing this. Notably:

  • Subprojects/milestones in projects (see T10010, etc).
  • Completing work in Diffusion to make callsigns optional (see T4245).

Almanac: To specify bring-your-own-hosts, instance administrators need to interact with Almanac. Almanac is also currently a prototype. Unprototyping Almanac creates some problems in the cluster because we also use Almanac to configure instances, and it's important that instances not be able to interact with Almanac in a way that allows them to destroy themselves. In particular:

I don't think any of this is particularly difficult, but we definitely have to do the lock stuff before we can move forward.

Drydock: Drydock is probably OK to unprototype more or less in its current state, with appropriate caveats. One issue is that we should get rid of all the defunct Harbormaster build plans first, but this is minor.

Harbormaster: I think the biggest stumbling block for replacing something like CircleCI with Harbormaster + Drydock may ultimately just be UI roughness in Harbormaster. This stuff isn't trivial, but can be improved by leaps and bounds with a relatively small amount of effort.

Event Timeline

epriestley edited projects, added Almanac (v2); removed Almanac.

Quick summary here:

  • Subprojects/milestones are largely complete (but see T10349 for some followups).
  • Callsigns are now optional, although some related work remains (see T4245).

So this is now moving forward, but will still be interleaved with work elsewhere to some degree. I'm tentatively aiming to at least complete T10411 and deploy new APIs this Saturday. I may need to move through that somewhat slowly, across multiple deployment periods: the old APIs are currently live in the cluster, and I want to remove them before unprototyping.

New APIs are going out tomorrow morning (Saturday, February 27). I expect to unprototype Almanac shortly after that, provided things go relatively smoothly, so it will become available the following week (Saturday, March 5).

I believe Almanac is generally in a reasonable release state today as an infrastructure application -- simple, usable, does what it needs to do, no technical debt or significant UI/UX issues. I would roughly expect the next future iteration to focus on adding monitoring to devices and services (T7338, primary motivation is making the Phacility cluster easier to monitor for the upstream) and some future iteration beyond that to finish up the write APIs and bulk up custom ServiceType extensions. I don't expect to pursue these for some time unless we do a big push on making clusters deployable by other installs (likely in connection with T4209 or T4292), or possibly "custom clusters" as an upstream SAAS offering.

Between now and March 5, I'm going to focus on usability and setup issues in Harbormaster and Drydock. I'll file a subtask and collect these, but T10447 is a good recent example of a rough edge in this vein ("Land Revision" would show a generic error on rejection by commit hook, instead of a more useful error with the hook message). I want to try to fix as much of this stuff as we can before unprototyping. We're still going to have a bit of a usability/polish gap between our current offering and established builds-as-a-service providers, at least for a while. But I think we can make that gap much smaller than it is today by spending a week polishing up reachable rough edges, so that by the release we're in a "rougher UI, but more powerful backend" situation instead of a "how do I even use this?" situation.

epriestley edited projects, added Drydock (v3); removed Drydock.

This is detailed in T10463, but the API changes seem to have deployed with only a few minor rough patches, so we're currently on track to deploy everything else next Saturday (March 5).

Quick update here ahead of deployment tomorrow:

We've made reasonable progress, but this has been a bit of a slow week and things aren't really where I want them to be yet. I currently expect to unprototype Almanac and Drydock, but they'll come with some caveats and usability issues (mostly minor things like T10508 and T9493) that won't get buttoned up in this release. Basically, stuff will work, but if you take a stab at configuring it and don't make much headway it may be easiest to just wait a little longer for some more rough edges to get smoothed over.

I'll provide more detailed guidance in the changelog and release notes.

Drydock and Almanac are now available in the cluster. Here's a rough guide to configuring them -- I'll turn this into something formal once things work a little more smoothly.

Overview

This is a rough guide for "Bring Your Own Hardware" builds in the Phacility cluster. The same general principles also apply outside of the cluster.

  • Create a build bot account.
  • Configure a build host.
  • Add the host to Almanac.
  • Set up Drydock to use the host.
  • Write a build plan in Harbormaster.

Create a Bot

Like everything else, builds will authenticate as some real user account when interacting with Phabricator -- builds don't have a magic skeleton key which gives them universal access. Among other things, this means that an attacker who compromises a build server doesn't get the keys to the castle. To start with, we'll create a bot user which builds will operate as.

Go into the People application and create a new "Bot User". On this install, we use @builder, but you can name this whatever you want (eventually, you may want several accounts in order to separate permissions).

On the bot's account page, go to Manage > Edit Settings > SSH Public Keys and add or generate a public key. Let's call this keypair "builder.key / builder.pub".
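If you'd rather generate the keypair locally and paste the public half into the form, something like this sketch should do it. It assumes OpenSSH's ssh-keygen is installed; KEY_DIR is a hypothetical scratch location, and the empty passphrase is an assumption that suits unattended builds.

```shell
# Generate a dedicated keypair for the bot in a scratch directory.
# Assumes OpenSSH's ssh-keygen; KEY_DIR is a hypothetical location.
KEY_DIR="${KEY_DIR:-$(mktemp -d)}"

# -N '' sets an empty passphrase so builds can run unattended.
ssh-keygen -q -t rsa -b 4096 -N '' -C 'builder' -f "$KEY_DIR/builder.key"

# ssh-keygen writes the public half to builder.key.pub; rename it to match
# the "builder.key / builder.pub" naming used in this guide.
mv "$KEY_DIR/builder.key.pub" "$KEY_DIR/builder.pub"

# Print the public key so it can be pasted into the SSH Public Keys panel.
cat "$KEY_DIR/builder.pub"
```

Keep builder.key private; only builder.pub goes into Phabricator.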

Make sure permissions are set so the bot can access any repositories you want it to be able to build (for example, you may need to add it to projects). It's going to clone them using the keypair you just configured. The clone operation will be logged in as the bot, so the bot needs to have access to the repositories.

Configure a Host

See Also: Drydock Blueprints: Hosts

Get a host on the internet somehow, and install whatever software your builds need in order to run. Drydock doesn't have fancy autoconfiguration features that can build an environment for you (at least, not yet).

Create a system user account that you want Drydock to log in as (you might also name this builder, although you don't have to). Put builder.key from the previous step on this account as its ~/.ssh/id_rsa or whatever (for now, there's no option to force operations to use a particular key other than whatever the system default is).
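Installing the key might look roughly like the sketch below. The function name install_identity and its arguments are hypothetical, not part of Drydock; it only copies a private key into place as the account's default identity and tightens permissions, since OpenSSH ignores private keys that other users can read.

```shell
# Install a private key as the default SSH identity for an account.
# Usage: install_identity <private-key-file> <home-directory>
# (Hypothetical helper for illustration; not a Drydock command.)
install_identity() {
  key_file="$1"
  home_dir="$2"

  mkdir -p "$home_dir/.ssh" || return 1
  chmod 700 "$home_dir/.ssh" || return 1

  # Drydock currently uses whatever the system default key is, so the
  # key goes in the default location.
  cp "$key_file" "$home_dir/.ssh/id_rsa" || return 1

  # OpenSSH refuses to use private keys readable by other users.
  chmod 600 "$home_dir/.ssh/id_rsa"
}
```

Run as the build account, this would be something like `install_identity builder.key "$HOME"`.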

At this point, you should be able to git clone repositories from Phabricator while logged in as the bot system account (for example, git clone ssh://whatever.phacility.com/... should work). If it doesn't, the most likely issue is that the bot doesn't actually have permission to access the repository, but hopefully the error message will be illuminating.

If clones work, you're in good shape so far.

You should also be able to run make or arc unit or whatever you expect Drydock to run now, in a cloned working copy.
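A quick way to confirm the build tools are actually on the account's PATH is a small preflight check like this sketch. The function name require_tools is hypothetical, and the tool list is whatever your builds need.

```shell
# Report any build tools that are not on PATH for the current account.
# Usage: require_tools <tool>...
# Returns non-zero if at least one tool is missing.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_tools git make arc` while logged in as the build account.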

Add an authorized key to the builder account that Phabricator can use to connect to the machine. We'll call this login.key / login.pub. You can use the same keypair as above if you want, although ideally they should be different. Add login.pub to ~/.ssh/authorized_keys -- you should now be able to SSH to the host as the builder account using login.key.
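Adding the login key could be sketched as follows. The helper name authorize_key is hypothetical; it appends a public key to authorized_keys only if that exact line is not already present, and sets the permissions sshd typically expects.

```shell
# Append a public key to an account's authorized_keys, skipping duplicates.
# Usage: authorize_key <public-key-file> <home-directory>
# (Hypothetical helper for illustration.)
authorize_key() {
  pub_file="$1"
  home_dir="$2"
  auth_file="$home_dir/.ssh/authorized_keys"

  mkdir -p "$home_dir/.ssh" || return 1
  chmod 700 "$home_dir/.ssh" || return 1
  touch "$auth_file" || return 1
  chmod 600 "$auth_file" || return 1

  # -x matches whole lines, -F treats patterns as fixed strings, and
  # -f reads the pattern (the key line) from the public key file.
  if ! grep -qxF -f "$pub_file" "$auth_file"; then
    cat "$pub_file" >> "$auth_file"
  fi
}
```

Run as the build account, this would be something like `authorize_key login.pub "$HOME"`, after which `ssh -i login.key builder@<host>` should log in.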

Add login.key as a credential in the Passphrase application so we can use it later.

Before we move on, create the /var/drydock/ directory and make sure the builder account can write to it.
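One way to check this step is a small sketch like the one below. The function name prepare_drydock_root is hypothetical; it creates the directory if it can and then verifies that the current user can write to it.

```shell
# Create the Drydock working directory (if possible) and confirm that the
# current user can write to it. Defaults to the path used in this guide.
# Usage: prepare_drydock_root [directory]
prepare_drydock_root() {
  root_dir="${1:-/var/drydock}"

  mkdir -p "$root_dir" || return 1

  if [ ! -w "$root_dir" ]; then
    echo "not writable by $(id -un): $root_dir" >&2
    return 1
  fi
}
```

On most hosts the build account can't create directories under /var, so you would typically create /var/drydock as root and chown it to the build account first, then run `prepare_drydock_root` as that account to confirm it is writable.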

Add the Host to Almanac

See Also: Almanac User Guide

Now we're going to tell Phabricator that the machine exists, and where to find it.

Go to Almanac and create a new Network called "The Internet" or similar.

In Almanac, add a new Device called build001.mycompany.com or whatever else you want. Add an Interface and type in the IP address and port of the host. You don't need to (and should not) add any SSH keys. When you're done, the UI should look something like this, except the network should be "The Internet":

Screen Shot 2016-03-05 at 6.56.51 AM.png (1×1 px, 154 KB)

Now, add a new Service with type Drydock: Resource Pool, called buildpool.mycompany.com or similar. Add a Binding between the service and the device you just created. The service should look something like this:

Screen Shot 2016-03-05 at 6.58.34 AM.png (1×1 px, 123 KB)

That's it for Almanac.

Create Drydock Blueprints

See Also: Drydock User Guide, Drydock Blueprints

Now, we're going to tell Phabricator that it can use the host to perform builds.

Go to the Drydock application, and create a new Blueprint with type "Almanac Hosts". Set these values:

  • Name: choose a name like "Build Pool".
  • Almanac Services: Select the service you created above (buildpool.mycompany.com).
  • Credential: Select the credential you created above (login.key).

It should look something like this:

Screen Shot 2016-03-05 at 7.02.49 AM.png (809×1 px, 94 KB)

Now, create a second Blueprint with type "Working Copy". Set these values:

  • Name: choose a name like "Working Copies".
  • Use Blueprints: Select the blueprint you created above ("Build Pool").
  • Limit: You may want to set some reasonable concurrency value here by guessing how many builds your hardware can run simultaneously. Choosing a larger value allows Drydock to allocate more idle resources (working copies that are ready to start a build) and active resources (working copies that are actively performing a build).

It should look something like this:

Screen Shot 2016-03-05 at 7.08.40 AM.png (912×1 px, 101 KB)

After you create it, you'll see a warning in the UI that it needs an authorization to use the previous blueprint:

Screen Shot 2016-03-05 at 7.09.22 AM.png (54×426 px, 13 KB)

To resolve this, click the link to go to the first blueprint ("Build Pool"), then find the authorization in the "Active Authorizations" table.

Screen Shot 2016-03-05 at 7.10.26 AM.png (600×1 px, 98 KB)

Click the name, then "Approve Authorization":

Screen Shot 2016-03-05 at 7.10.42 AM.png (600×1 px, 89 KB)

The second blueprint ("Working Copies") should now have a checkmark:

Screen Shot 2016-03-05 at 7.11.16 AM.png (67×335 px, 11 KB)

Write a Build Plan

Now, go to Harbormaster > Manage Build Plans > Create Build Plan. Pick a name, then "Add Build Step". We're going to create two build steps: one will build a working copy, and the second will run some build command inside it.

First, create a "Lease Working Copy" step. Set these values:

  • Name: Pick a name like "Check out a copy of the repository"
  • Artifact Name: Something like repository.
  • Use Blueprints: Pick the "Working Copies" blueprint from the previous step. Make sure you pick the right blueprint! This control allows you to select an invalid blueprint right now; that will be fixed by T10508.

The step should look something like this:

Screen Shot 2016-03-05 at 7.15.41 AM.png (876×1 px, 111 KB)

Now, create a "Drydock: Run Command" step. This will run your actual build. For now, we'll just run git show to prove that things are working. Set these values:

  • Name: Pick a name like "Run the build"
  • Command: For now, just use git show unless you're feeling ambitious.
  • Drydock Lease: Use the Artifact Name from the previous step (like repository).

It should look something like this:

Screen Shot 2016-03-05 at 7.16.47 AM.png (876×1 px, 111 KB)

You may have noticed earlier that the first step needs a Drydock authorization. (This will be shown on the plan detail page after T10522.) Go back to Drydock and go to the "Working Copies" blueprint, then authorize the build plan.

When you're done, the build plan should look something like this:

Screen Shot 2016-03-05 at 7.21.37 AM.png (891×1 px, 140 KB)

Run a Test Build

Click Run Plan Manually and enter the identifier for a commit like abcdef1234 (not a revision like D123!). If the stars align, a build will run. This first build may take a while because it needs to check out the repository, but Drydock will be able to recycle the same working copy for future builds, so it should run in a few seconds if you run it again.

If things work, you should be able to reload the page in a few moments (depending on how large the repository is) and see something like this:

Screen Shot 2016-03-05 at 7.34.19 AM.png (799×1 px, 131 KB)

Clicking into the build will show you the actual log:

Screen Shot 2016-03-05 at 7.34.30 AM.png (1×1 px, 187 KB)

Next Steps

If that works, you're done with the configuration part. Next steps might be:

  • Run a real build command instead of git show. If you want to run arc unit, see T5821 for the state of the world.
  • Add a rule in Herald to automatically run builds when new commits are pushed.
  • If you want to build revisions in addition to commits, configure "Staging Areas" (see "Change Handoff" in Harbormaster User Guide), then write a Herald rule.
  • If you want to activate "Land Revision", configure staging areas first, then follow this guide: Drydock User Guide: Repository Automation.
  • Let us know what isn't working well and what needs improvement. You can check tasks in Harbormaster and Drydock for known issues.

It didn't work for me.

exception 'PhabricatorWorkerPermanentFailureException' with message 'Permanent failure while activating lease ("PHID-DRYL-oluks7j5hat2te7seh3p"): All blueprints failed to allocate a suitable new resource when trying to allocate lease "PHID-DRYL-oluks7j5hat2te7seh3p".
    - Exception: Lock 'ph:phabric-6rILUDHKjHKu:drydock.resource:ckvQ.Nn667WP' has already been locked by this process.' in /var/www/html/phabricator/phabricator/src/applications/drydock/worker/DrydockLeaseUpdateWorker.php:821
Stack trace:
#0 /var/www/html/phabricator/phabricator/src/applications/drydock/worker/DrydockLeaseUpdateWorker.php(50): DrydockLeaseUpdateWorker->breakLease(Object(DrydockLease), Object(PhutilAggregateException))
#1 /var/www/html/phabricator/phabricator/src/applications/drydock/worker/DrydockLeaseUpdateWorker.php(26): DrydockLeaseUpdateWorker->handleUpdate(Object(DrydockLease))  
#2 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(122): DrydockLeaseUpdateWorker->doWork()
#3 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(161): PhabricatorWorker->executeTask()
#4 /var/www/html/phabricator/phabricator/src/applications/drydock/storage/DrydockLease.php(391): PhabricatorWorker::scheduleTask('DrydockLeaseUpd...', Array, Array)
#5 /var/www/html/phabricator/phabricator/src/applications/drydock/storage/DrydockLease.php(168): DrydockLease->scheduleUpdate()
#6 /var/www/html/phabricator/phabricator/src/applications/harbormaster/step/HarbormasterLeaseWorkingCopyBuildStepImplementation.php(77): DrydockLease->queueForActivation()
#7 /var/www/html/phabricator/phabricator/src/applications/harbormaster/worker/HarbormasterTargetWorker.php(64): HarbormasterLeaseWorkingCopyBuildStepImplementation->execute(Object(HarbormasterBuild), Object(HarbormasterBuildTarget))
#8 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(122): HarbormasterTargetWorker->doWork()
#9 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(161): PhabricatorWorker->executeTask()
#10 /var/www/html/phabricator/phabricator/src/applications/harbormaster/engine/HarbormasterBuildEngine.php(88): PhabricatorWorker::scheduleTask('HarbormasterTar...', Array, Array)
#11 /var/www/html/phabricator/phabricator/src/applications/harbormaster/worker/HarbormasterBuildWorker.php(25): HarbormasterBuildEngine->continueBuild()
#12 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(122): HarbormasterBuildWorker->doWork()
#13 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(161): PhabricatorWorker->executeTask()
#14 /var/www/html/phabricator/phabricator/src/applications/harbormaster/storage/HarbormasterBuildable.php(192): PhabricatorWorker::scheduleTask('HarbormasterBui...', Array, Array)
#15 /var/www/html/phabricator/phabricator/src/applications/harbormaster/management/HarbormasterManagementBuildWorkflow.php(106): HarbormasterBuildable->applyPlan(Object(HarbormasterBuildPlan), Array, 'PHID-APPS-Phabr...')
#16 /var/www/html/phabricator/libphutil/src/parser/argument/PhutilArgumentParser.php(410): HarbormasterManagementBuildWorkflow->execute(Object(PhutilArgumentParser))
#17 /var/www/html/phabricator/libphutil/src/parser/argument/PhutilArgumentParser.php(303): PhutilArgumentParser->parseWorkflowsFull(Array)
#18 /var/www/html/phabricator/phabricator/scripts/setup/manage_harbormaster.php(21): PhutilArgumentParser->parseWorkflows(Array)
#19 {main}

HEAD is b6bf0f6a3b4971db173e782577a70c6cd5ddfbb0

File a bug report please.

At least some installs have been using Drydock in production for a while. This could be made easier -- and this task still has the best guide for it -- but I think nothing actionable here remains.