
Harbormaster builds and Drydock leases are stuck
Closed, Resolved (Public)

Description

Running Phabricator at commit 3bccb0d. For some reason today our Harbormaster builds got stuck, and cannot be resumed or aborted:

  • The "All Buildables" list in Harbormaster has about 20 builds that are all in the "Building" state, stuck on the "Lease Host" build step. Aborting a build puts the "Lease Host" step in the "Pausing" state, but it never gets out of that.
  • The "Active Leases" list in Drydock has the same number of leases in a "Pending" state. Releasing a lease puts it in the "Releasing" state and adds a "release" command to the "Commands" list, but the lease never leaves this state to actually be released.

A consequence of these stuck builds is that all our 16 daemons are stuck on HarbormasterTargetWorker tasks, with a growing queue of pending tasks that will never be processed, including commit parsers and search indexers. This basically renders Phabricator useless to us, until we can clear up those stuck builds or remove the HarbormasterTargetWorker tasks from the daemons so they can process tasks again.
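To illustrate why a handful of stuck builds can starve everything else, here is a minimal sketch of a fixed worker pool. It is a simplified, hypothetical model and not Phabricator's actual scheduler: the function name `simulate` and the assumption that a stuck task holds its worker forever are mine.

```python
from collections import deque


def simulate(num_workers, stuck_tasks, normal_tasks, rounds):
    """Return (completed normal tasks, tasks still queued).

    Assumed model: each free worker takes the next queued task. A stuck
    task occupies its worker forever; a normal task finishes in one round.
    """
    queue = deque(["stuck"] * stuck_tasks + ["normal"] * normal_tasks)
    busy_forever = 0
    completed = 0
    for _ in range(rounds):
        free_workers = num_workers - busy_forever
        for _ in range(free_workers):
            if not queue:
                break
            task = queue.popleft()
            if task == "stuck":
                busy_forever += 1  # this worker never becomes free again
            else:
                completed += 1
    return completed, len(queue)


# 16 workers, 20 stuck tasks queued ahead of 100 normal tasks.
print(simulate(16, 20, 100, 10))
```

In this model, once the number of stuck tasks at the head of the queue reaches the number of workers, no later task (commit parsing, search indexing) is ever picked up, which matches what we are seeing.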

Can you advise me on:

  1. Where to find more info/logging about why the pending Drydock leases don't get through, and why they can't be released?
  2. How to force-remove the stuck builds (and the pending Drydock leases) so our daemons can continue working?

Thanks!

Event Timeline

That commit is from October 8th, have you tried updating to HEAD?

I have, a week or so ago. However, around October 12 "Blueprint Authorizations" were added to Drydock and those changes broke our entire build setup. There was no documentation on how to work with these Authorizations at the time (or I couldn't find it), so I rolled back to the last commit before those changes.

I was planning to upgrade to HEAD to see if that fixes these pending leases, but I'd rather first understand what's going wrong and how to prevent it from happening again. That's why I came here for help, first :-).

Thanks, I'll read those. However, I'd still like to know why Drydock leases suddenly stopped working, why I can't cancel them to free up the daemons, and why there's no way to cancel or remove the build tasks from the daemons either. They just keep retrying forever.

You'll want to update to HEAD if you need support. Please please please.

Drydock is a prototype and all of the Harbormaster build steps which interact with it are prototypes. Much of this code has been substantially rewritten in the last two weeks. See:

https://secure.phabricator.com/book/phabricator/article/prototypes/

https://secure.phabricator.com/book/phabricator/article/support/

If you'd like us to walk through debugging a complex workflow in an out-of-date version of Phabricator, remotely, without reproduction steps, see Consulting.

Upgrading Phabricator to the latest HEAD, and running ./bin/storage upgrade results in this error:

Storage is up to date. Use 'storage status' for details.
Verifying database schemata...
Found no adjustments for schemata.

Target                                            Error
phabricator_drydock.drydock_lease.authorizingPHID Missing
phabricator_drydock.drydock_authorization         Missing
phabricator_drydock.drydock_repositoryoperation   Missing

 SCHEMATA ERRORS

The schemata have errors (detailed above) which the adjustment workflow can
not fix.

If you are not developing Phabricator itself, report this issue to the
upstream.

Looks like there are some missing columns. Should I manually add those?

Alright, I've managed to upgrade Phabricator and fix the storage schemata issues. I've also reconfigured Almanac, Drydock, and Harbormaster to reflect the latest changes regarding blueprint authorizations. My Harbormaster build step is authorized on the Drydock blueprint, so that looks good.

However, when manually running this new Harbormaster build plan, it immediately errors with this message:

exception 'InvalidArgumentException' with message 'Argument 1 passed to DrydockLease::setAllowedBlueprintPHIDs() must be of the type array, null given, called in /opt/phabricator/src/applications/harbormaster/step/HarbormasterLeaseWorkingCopyBuildStepImplementation.php on line 51 and defined' in /opt/libphutil/src/error/PhutilErrorHandler.php:200
Stack trace:
#0 /opt/phabricator/src/applications/drydock/storage/DrydockLease.php(399): PhutilErrorHandler::handleError(4096, 'Argument 1 pass...', '/opt/phabricato...', 399, Array)
#1 /opt/phabricator/src/applications/harbormaster/step/HarbormasterLeaseWorkingCopyBuildStepImplementation.php(51): DrydockLease->setAllowedBlueprintPHIDs(NULL)
#2 /opt/phabricator/src/applications/harbormaster/worker/HarbormasterTargetWorker.php(64): HarbormasterLeaseWorkingCopyBuildStepImplementation->execute(Object(HarbormasterBuild), Object(HarbormasterBuildTarget))
#3 /opt/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(122): HarbormasterTargetWorker->doWork()
#4 /opt/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php(171): PhabricatorWorker->executeTask()
#5 /opt/phabricator/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php(22): PhabricatorWorkerActiveTask->executeTask()
#6 /opt/libphutil/src/daemon/PhutilDaemon.php(183): PhabricatorTaskmasterDaemon->run()
#7 /opt/libphutil/scripts/daemon/exec/exec_daemon.php(125): PhutilDaemon->execute()
#8 {main}

Is this due to misconfiguration on my part, or perhaps something I broke in the storage upgrade?

Basically, you also need to set the repository to be cloned. The UI is kind of misleading, and the exception is not handled correctly when allow_phid is null.
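As a rough illustration of the failure mode (not the upstream code, and not necessarily how it will be fixed), the sketch below shows a strict setter that rejects a null value, plus a caller that normalizes an unset field to an empty list first. The `Lease` class, `build_lease`, and the example PHID are hypothetical stand-ins written in Python.

```python
from typing import List, Optional


class Lease:
    """Hypothetical stand-in for a lease object with a strict setter."""

    def __init__(self) -> None:
        self.allowed_blueprint_phids: List[str] = []

    def set_allowed_blueprint_phids(self, phids: List[str]) -> "Lease":
        # Mirror the strictness of an array type hint: reject non-lists.
        if not isinstance(phids, list):
            raise TypeError(
                "phids must be a list, got %s" % type(phids).__name__)
        self.allowed_blueprint_phids = phids
        return self


def build_lease(configured_phids: Optional[List[str]]) -> Lease:
    # Assumption: an unset form field arrives as None. Normalize it to an
    # empty list before calling the strict setter.
    return Lease().set_allowed_blueprint_phids(configured_phids or [])


print(build_lease(None).allowed_blueprint_phids)
print(build_lease(["PHID-EXAMPLE-1"]).allowed_blueprint_phids)
```

Calling the setter directly with `None` raises, which is the analogue of the exception pasted above; going through `build_lease` does not.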

Thanks, I set "Also Clone" to a random repository and that solves the exception I pasted. I got my builds working again. For future generations: this is how I configured Almanac/Drydock/Harbormaster now:

  1. Add an Almanac Network/Device/Service per these docs
  2. Create a "builder" blueprint in Drydock, of the type "Almanac Hosts", using the "builder" Almanac Service from the previous step.
  3. Create a "repo" blueprint in Drydock, of the type "Working Copy", with the "builder" blueprint from the previous step selected under "Use Blueprints". In our case it has a limit of 1000 (each separate repository that is built will create a new resource, so this limit needs to be at least equal to the number of repositories that you will do builds for)
  4. Create a "Run Tests" build plan in Harbormaster, with the following build steps:
    1. "Drydock: Lease Working Copy", with "Artifact Name" set to "repo" (we'll need that name in the following build step), and "Use Blueprints" set to the "repo" blueprint from step 3. Set "Also Clone" to a random repository, as we don't actually need that (but it prevents the error pasted above, as mentioned by @tycho.tatitscheff).
    2. "Drydock: Run Command", with "Command" set to the actual build command. In our case this is a binary that will clone and run our Ruby/Go builds isolated in Docker containers. We pass in a few build variables to tell the binary what to build, and which build PHID to report status to. Set "Drydock Lease" in this build step to "repo", to reference the Working Copy from the previous build step.
  5. Run this new "Run Tests" build plan manually to test it out, and see it successfully build a commit from a repository.
  6. Create a Herald rule that listens for new commits globally, and executes the "Run build plans" action when it's triggered. Reference the build plan from step 4 here.
  7. Push a commit to any repository and watch Herald trigger a new build with Harbormaster, using the Drydock resources created above.
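The part of this setup that is easiest to get wrong is the artifact name: the "Drydock Lease" field of the second build step has to match the "Artifact Name" of the first. The sketch below models the two build steps as plain data and checks that reference. It is only an illustration; `check_plan` and the `produces`/`uses` keys are hypothetical names, not anything Harbormaster exposes.

```python
def check_plan(steps):
    """Return a list of problems found in an ordered list of build steps.

    Each step is a dict. "produces" names an artifact the step creates and
    "uses" names an artifact it expects from an earlier step.
    """
    problems = []
    available = set()
    for index, step in enumerate(steps, start=1):
        needed = step.get("uses")
        if needed is not None and needed not in available:
            problems.append(
                "step %d (%s) uses unknown artifact %r"
                % (index, step["name"], needed))
        produced = step.get("produces")
        if produced is not None:
            available.add(produced)
    return problems


# The "Run Tests" plan from step 4, reduced to its artifact wiring.
run_tests_plan = [
    {"name": "Drydock: Lease Working Copy", "produces": "repo"},
    {"name": "Drydock: Run Command", "uses": "repo"},
]

print(check_plan(run_tests_plan))
```

An empty list means every step only refers to artifacts created by an earlier step; swapping the two steps, or misspelling "repo" in either one, produces a problem entry.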

@epriestley I used to be able to lease a Drydock host with the "Lease Host" build step that is now listed under "Prototypes" in the "Add Build Step" page of Harbormaster. However, adding this build step now doesn't create an Authorization Request in Drydock. Is that by design, i.e. is the "Working Copy/Run Command" sequence I listed above the correct way to use Drydock hosts in Harbormaster now?
Because of how our build system works, I actually only need a host to run a command on. I don't need the Drydock Working Copy, but including that in the build steps currently seems to be the only way to get a build plan with the correct Authorizations in Drydock.

@marceldegraaf: I wrote up the first part of how to configure Drydock (Almanac + Drydock Almanac-based host blueprint) here: Q179: How do you configure an Almanac based Drydock resource. I have not found time to write the second part (Drydock working copy blueprint + Harbormaster + Herald). Tell me if you want to detail the last part in a Ponder question, or I can do it later when I have time.

Reopened, since the main concern was more the installation of T9669 and there is still a pending question.

Thanks for reopening @tycho.tatitscheff. Before I document anything officially I'd like to wait for @epriestley's answer on T9671#142468, to make sure I'm not Doing It Wrong™ :-).

"Lease Host" is not currently expected to work at all. It hasn't been updated for any of the changes, including authorizations. "Lease Working Copy" (under "Drydock") should.

Using "Lease Working Copy" to get a host and then ignoring the working copy will do a little more work than you need, but is probably viable in general.

"Lease Host" will probably be updated so it works (and moved under "Drydock") but I don't have a specific timeline on this since it isn't necessary for any of the immediate goals we're pursuing (all of which operate on working copies, not raw hosts). "Lease Host" also previously had some specialized behavior (IIRC, it created a unique directory for you as a work area). It has less of this behavior now, and I suspect most of this behavior won't return to Host blueprints -- for example, we might introduce a new type of resource like a "work area" instead, which might lease a host and then create a working directory on it.

The way artifact types work right now also makes it difficult for the "Run Command" step to operate on either a host artifact or a working copy artifact, even though both implement a CommandInterface and will work if they can get past the UI. This likely needs to be expanded.
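A loose sketch of the distinction, in Python rather than the real code: a check keyed on the artifact's type rejects a host even though it could run commands, while a check keyed on the capability accepts both. All class and function names here are hypothetical and only meant to illustrate the idea.

```python
class HostArtifact:
    def get_command_interface(self):
        return "host-command-interface"


class WorkingCopyArtifact:
    def get_command_interface(self):
        return "working-copy-command-interface"


class UriArtifact:
    """An artifact with no way to run commands."""


def accepts_by_type(artifact):
    # Type-based check: only one artifact class is accepted.
    return isinstance(artifact, WorkingCopyArtifact)


def accepts_by_capability(artifact):
    # Capability-based check: anything offering a command interface passes.
    return callable(getattr(artifact, "get_command_interface", None))


for artifact in (HostArtifact(), WorkingCopyArtifact(), UriArtifact()):
    print(type(artifact).__name__,
          accepts_by_type(artifact),
          accepts_by_capability(artifact))
```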

"Run Command" is not currently expected to work at all and is likely to be removed (in favor of "Drydock: Run Command", which is substantially equivalent).

I will probably also rename these slightly for consistency and put "Drydock" in all the step names.

marceldegraaf claimed this task.

Thanks for your reply, the detailed description really helps. The build system works now, and even though the "Lease Working Copy" step doesn't really make sense for us, I'm really glad everything works :-).

D14368 should fix the issue where you're required to set a dummy "Also Clone" repository.