
Harbormaster builds fail to allocate resource
Closed, Invalid (Public)

Description

Current versions:

phabricator
    3aed39b8b072c8bfa89b3f45183dfa126600ff1d (Wed, May 18) 
arcanist
    2234c8cacc21ce61c9c10e8e5918b6a63cc38fc8 (Mon, May 16) 
phutil
    b25e0477b280ca3e8345bb97cd55e95bcb5023ec (Wed, May 11)

I followed the instructions at https://secure.phabricator.com/T10246#163438 to manually run a simple git show for a recent commit. The build attempts to activate a lease, which fails continuously until the build is aborted.

The relevant error seems to be "fatal: could not read Username for 'http://phabricator.<HOSTNAME_MASKED>': No such device or address"

I can log in to the build server using the private key. In fact, the builder creates a working copy directory but fails to clone into it.

builder@builder:~$ ls -la /var/drydock/workingcopy-92/repo/
total 8
drwxrwxr-x 2 builder builder 4096 May 18 16:02 .
drwxrwxr-x 3 builder builder 4096 May 18 16:02 ..
builder@builder:~$

Build Log:

exception 'PhabricatorWorkerPermanentFailureException' with message 'Lease "PHID-DRYL-2pvhwujlbsollyzaeipd" never activated.' in /usr/local/phabricator/phabricator/src/applications/harbormaster/step/HarbormasterLeaseWorkingCopyBuildStepImplementation.php:91
Stack trace:
#0 /usr/local/phabricator/phabricator/src/applications/harbormaster/worker/HarbormasterTargetWorker.php(64): HarbormasterLeaseWorkingCopyBuildStepImplementation->execute(Object(HarbormasterBuild), Object(HarbormasterBuildTarget))
#1 /usr/local/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(122): HarbormasterTargetWorker->doWork()
#2 /usr/local/phabricator/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php(171): PhabricatorWorker->executeTask()
#3 /usr/local/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php(22): PhabricatorWorkerActiveTask->executeTask()
#4 /usr/local/phabricator/libphutil/src/daemon/PhutilDaemon.php(183): PhabricatorTaskmasterDaemon->run()
#5 /usr/local/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php(125): PhutilDaemon->execute()
#6 {main}

Blueprint log:

Resource activation failed: [CommandException] Command failed with error #128!

COMMAND
ssh '-o' 'LogLevel=quiet' '-o' 'StrictHostKeyChecking=no' '-o' 'UserKnownHostsFile=/dev/null' '-o' 'BatchMode=yes' -l 'xxxxx' -p '22' -i 'xxxxx' '<BUILDER_IP_MASKED>' -- 'git clone -- '\''http://phabricator.<HOSTNAME_MASKED>/diffusion/48/testrepo.git'\'' '\''/var/drydock/workingcopy-92/repo/testrepo/'\'''

STDOUT
(empty)

STDERR
Cloning into '/var/drydock/workingcopy-92/repo/testrepo'...
fatal: could not read Username for 'http://phabricator.<HOSTNAME_MASKED>': No such device or address

Event Timeline

This sounds like an issue with either git clone or the SSH client installed on the build machine itself, and not Phabricator. I don't believe upstream offers support for configuring build agents (they don't offer support for environment issues on machines that host Phabricator, let alone environment issues on machines that Phabricator uses for builds).

You can potentially get more information by manually running the command that Phabricator uses to connect to the machine, adding -vvv to see why the command is failing.
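As a rough illustration of that suggestion, the Python sketch below takes a command string copied from the blueprint log and turns it into an argument list suitable for a manual, verbose run. It makes several assumptions: the function name `make_verbose` is invented for this example, the host, URL and key path in the sample are placeholders, and the `-l` and `-i` values have to be supplied by hand because the log shows them only as 'xxxxx'. It drops `-o LogLevel=quiet` so that the option cannot suppress the diagnostics that `-vvv` asks for.

```python
import shlex


def make_verbose(logged_command, user, key_path):
    """Rewrite a logged ssh command for manual, verbose debugging.

    `user` and `key_path` replace the masked values after -l and -i.
    Everything after `--` (the remote command) is copied unchanged.
    """
    argv = shlex.split(logged_command)
    out = [argv[0], "-vvv"]  # request maximum client-side diagnostics
    i = 1
    while i < len(argv):
        arg = argv[i]
        if arg == "--":
            out.extend(argv[i:])  # keep the remote command verbatim
            break
        nxt = argv[i + 1] if i + 1 < len(argv) else None
        if arg == "-o" and nxt == "LogLevel=quiet":
            i += 2  # drop the option that silences ssh output
            continue
        if arg == "-l" and nxt is not None:
            out += ["-l", user]
            i += 2
            continue
        if arg == "-i" and nxt is not None:
            out += ["-i", key_path]
            i += 2
            continue
        out.append(arg)
        i += 1
    return out


if __name__ == "__main__":
    # Placeholder values; substitute the command from your own blueprint log.
    logged = (
        "ssh '-o' 'LogLevel=quiet' '-o' 'BatchMode=yes' -l 'xxxxx' -p '22' "
        "-i 'xxxxx' '192.0.2.10' -- 'git clone -- "
        "'\\''http://phabricator.example.com/diffusion/48/testrepo.git'\\'' "
        "'\\''/var/drydock/workingcopy-92/repo/testrepo/'\\'''"
    )
    print(shlex.join(make_verbose(logged, "builder", "/path/to/private_key")))
```

The printed line can be pasted into a shell on the Phabricator host. Running it as the same user that runs the daemons keeps the environment as close as possible to what Drydock sees.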

I'll leave it to @chad or @epriestley to actually close this task.

I have a somewhat similar problem when I try to start a working copy job:

Resource activation failed: [CommandException] Command failed with error #255!

COMMAND
ssh '-o' 'LogLevel=quiet' '-o' 'StrictHostKeyChecking=no' '-o' 'UserKnownHostsFile=/dev/null' '-o' 'BatchMode=yes' -l 'xxxxx' -p '22' -i 'xxxxx' '10.10.30.120' -- 'git clone -- '\''ssh://xxxxx@phabricator-eng.{hostname-masked}:2222/diffusion/BAR/b.git'\'' '\''/var/drydock/workingcopy-532/repo/b/'\'''

STDOUT
(empty)

STDERR
(empty)

I double-checked the documentation to confirm that I set everything up as written. Manually cloning works fine, and ssh on the machine works fine.
I'm not sure how I can debug this to find out whether my configuration is off or whether Phabricator is not fully implemented in that regard.
I tried manually running the command, but 'xxxxx' is unknown (who would have guessed? :P) and I'm not sure where this information comes from or if it's just the Phabricator masking some information.

Any help or pointers are welcome.

The -l xxxxx is your configured SSH user; the -i xxxxx is the path to your configured SSH private key (it's written out to a temporary file).

So the information is just masked, but actually present in the real command call? The ssh man page states what the options before the xxxxxs mean; I'm just not sure whether the call is right when the information is masked. E.g. the clone call... does Phabricator call with git@phabricator... or with the username provided for the ssh connection? Anyway, is there a log where the real command is written? Why is it even masked here?

If we masked with ***** instead of xxxxx, do you think that would have been more intuitive?

I would say, definitely. 'xxxxx' looks more like a placeholder to me, but that might be a personal preference.

Nevertheless, I'm still not sure how to debug this error message / my configuration.
Is there a way to test/run the Drydock Host Blueprint and get more information than just "the command failed"?

As a second note: It would be nice to give Drydock a timeout for acquiring a resource (e.g. try connecting to a host/service for 15 min and then fail) rather than letting it run for a few days. Or at least add a way in a Drydock blueprint to reliably cancel the acquisition attempts for resources.
Not sure if I missed something, but releasing leases seems to be unreliable for canceling the execution of a blueprint.
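To make the requested behaviour concrete, here is a minimal Python sketch of a bounded wait: poll until a condition holds or a deadline passes, then give up. This is not Drydock's code or API. The name `wait_until_active`, the `is_active` callback and the polling interval are all hypothetical and only illustrate the "try for 15 minutes, then fail" idea.

```python
import time


def wait_until_active(is_active, timeout_s, poll_s=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `is_active()` until it returns True or `timeout_s` elapses.

    Returns True if the resource became active in time, False otherwise.
    `clock` and `sleep` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if is_active():
            return True
        if clock() >= deadline:
            return False  # give up instead of retrying indefinitely
        sleep(poll_s)


if __name__ == "__main__":
    # Hypothetical check that never succeeds; gives up after about 3 seconds.
    print(wait_until_active(lambda: False, timeout_s=3, poll_s=1))
```

A caller that receives False could then mark the lease as failed and surface the last error, which is roughly what the request above asks for.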

This isn't an actionable bug report because it doesn't include reproduction instructions. Reports MUST include reproduction instructions; see Contributing Bug Reports and Providing Reproduction Steps.

We don't provide general setup and configuration support because issues in this vein often require a tremendous amount of time and energy to resolve and boil down to environmental issues that help exactly one user. See Support Resources for what we do provide, and for other ways to get help.

You can likely file the timeout issue as a legitimate, actionable report since it should be reasonable to provide reproduction steps. It probably falls under T8153, but having it filed as a specific test case will help make sure it gets resolved as part of that task (that task is prioritized and should happen soon).

There should be no distinction between Drydock running git clone and you running git clone (while logged in as the same user), but it sounds like one works and the other doesn't. There is no secret hidden magic, we're just SSH'ing to the host and running git clone. git clone works fine in the upstream environments and other user environments, so this strongly suggests there is some sort of environmental problem in your configuration. You need to identify it and reduce it to repeatable reproduction steps before we can move forward.
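One environmental difference worth ruling out when comparing the two runs: ssh does not allocate a terminal when it is given a remote command without -t, so git cannot prompt for credentials the way it can in an interactive login shell. The "could not read Username" error in the original report is consistent with that, though this is an inference from the log rather than a confirmed diagnosis. The Python sketch below approximates the no-prompt condition locally by setting `GIT_TERMINAL_PROMPT=0` and closing stdin. The function name `clone_without_prompts` and the `run` parameter are made up for this example, and the exact error wording git prints may differ from the one in the blueprint log.

```python
import os
import subprocess


def clone_without_prompts(url, dest, run=subprocess.run):
    """Run `git clone` with terminal prompting disabled and no stdin.

    If the clone only works when git is able to ask for a username or
    password, it should fail here much as it does in a non-interactive run.
    `run` is injectable so the call can be inspected without running git.
    """
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")
    return run(
        ["git", "clone", "--", url, dest],
        env=env,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Placeholder URL and path; substitute the values from your own log.
    result = clone_without_prompts(
        "http://phabricator.example.com/diffusion/48/testrepo.git",
        "/tmp/testrepo-check",
    )
    print(result.returncode)
    print(result.stderr)
```

Credential helpers and askpass programs are not disabled by this variable, so a success here does not prove the Drydock run will succeed. A failure, however, points at missing non-interactive credentials for the repository URL.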

@epriestley I tried to recreate the problem I had a month ago so that I could write a task (bug report) for it, but I wasn't able to reproduce the issue of Drydock trying to allocate resources indefinitely. Have you already added a timeout to Drydock leases?