
Drydock doesn't delete working copies
Closed, Invalid (Public)

Description

We started using Drydock about a month ago and have run into some problems lately. Some builds fail with this exception:

exception 'PhabricatorWorkerPermanentFailureException' with message 'Lease "PHID-DRYL-rgwlvlruch5tzjdzkbgy" never activated.' in /var/www/html/phabricator/phabricator/src/applications/harbormaster/step/HarbormasterLeaseWorkingCopyBuildStepImplementation.php:91
Stack trace:
#0 /var/www/html/phabricator/phabricator/src/applications/harbormaster/worker/HarbormasterTargetWorker.php(64): HarbormasterLeaseWorkingCopyBuildStepImplementation->execute(Object(HarbormasterBuild),   Object(HarbormasterBuildTarget))
#1 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php(122): HarbormasterTargetWorker->doWork()
#2 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php(171): PhabricatorWorker->executeTask()
#3 /var/www/html/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php(22): PhabricatorWorkerActiveTask->executeTask()
#4 /var/www/html/phabricator/libphutil/src/daemon/PhutilDaemon.php(184): PhabricatorTaskmasterDaemon->run()
#5 /var/www/html/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php(127): PhutilDaemon->execute()
#6 {main}

On the build machine I found 155008 working copies, even though we do not launch many builds (20 builds/day at most). It is not even easy to delete all the folders at once: rm -fr * fails with "The total size of the argument and environment lists 2.7MB exceeds the operating system limit of 2MB.", but this is not really a problem.
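The argument-list limit is hit because the shell expands * into one enormous command line. A minimal sketch of a workaround, assuming the working copies all sit directly under one directory (the function name and the `root` argument are hypothetical, not part of Drydock), is to walk the directory and delete entries one at a time:

```python
import os
import shutil


def remove_working_copies(root):
    """Delete every entry directly under `root`, one at a time.

    Listing the directory in-process avoids building one huge argument
    list, which is what makes the shell glob fail.
    """
    # Collect the listing first so nothing is deleted while iterating.
    with os.scandir(root) as entries:
        paths = [(e.path, e.is_dir(follow_symlinks=False)) for e in entries]

    removed = 0
    for path, is_dir in paths:
        if is_dir:
            shutil.rmtree(path)   # recursively remove a working copy
        else:
            os.remove(path)       # stray file or symlink
        removed += 1
    return removed
```

The function returns the number of entries removed, which can be compared against the count observed on the machine.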

Our Phabricator version is: 0bb5dd88c87d9031656cb572298789dd6ffa430e

Event Timeline

epriestley added a subscriber: epriestley.

I can't reproduce this.

sbuild001.phacility.net (which runs builds for this host) has 10 working copies (with a configured limit of 64).

saux001.phacility.net (which runs repository operations for this host) has 5 working copies (also with a configured limit of 64).

Both of these hosts have been in production for many months.

To continue, we need reproduction instructions we can follow to replicate the problem. See Providing Reproduction Steps for help.

I also don't know how to reproduce the problem. I followed the guide found in T10246 to set up Drydock. I reported the bug hoping you could help me diagnose the problem. How do you limit the number of working copies?

Sorry, we don't offer help diagnosing problems. See Support Resources for a list of what we do and do not offer help with.

Feel free to file a new bug report with reproduction steps if you're able to come up with valid reproduction steps per Contributing Bug Reports / Providing Reproduction Steps.

Reproduction steps:

  1. Create a repository that observes another repository.
  2. Configure Phabricator to serve that repository over SSH.
  3. Configure Drydock as shown in the guide found in T10246#163309.
  4. Don't give the builder user SSH access to the observed repository. The builder user has access to the repository hosted by Phabricator and can clone from it. Alternatively, you can come up with anything else that makes working copy creation fail.

Now every attempt to create a working copy ends in an error. A new attempt is made every 15s, and the previous working copy is not deleted.
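As a rough check of how fast this adds up, here is a back-of-the-envelope sketch. It assumes a single retry loop that runs continuously and leaves exactly one directory behind per attempt; neither assumption is confirmed in this task.

```python
RETRY_INTERVAL_SECONDS = 15      # retry interval reported in this task
OBSERVED_DIRECTORIES = 155008    # count reported on the build machine

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: one leaked directory per failed attempt, retried nonstop.
leaked_per_day = SECONDS_PER_DAY // RETRY_INTERVAL_SECONDS
days_to_reach_observed = OBSERVED_DIRECTORIES / leaked_per_day

print(leaked_per_day)                    # 5760
print(round(days_to_reach_observed, 1))  # 26.9
```

Under those assumptions, the observed count corresponds to a little under four weeks of continuous retries.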

joshuaspence renamed this task from DryDock doesn't delete working copies to Drydock doesn't delete working copies. Nov 29 2016, 12:30 AM
joshuaspence added a subscriber: joshuaspence.

I'm unable to reproduce this by following the instructions provided, at least after changes in T13073. Here's what I did:

  • I gave the remote user access to the main repository, but not to the staging repository.
  • I built several working copies in a row, attempting to fetch refs from the staging repository.

I saw Drydock build one working copy resource (by fetching the main repository) and then re-use that resource for subsequent attempts. This is the expected behavior and consistent with the behavior we observe on this host.

If you're still able to reproduce this after upgrading to the changes in T13073, feel free to file a new report with more specific reproduction instructions on the Discourse community forum.

The company I work for doesn't use Phabricator anymore. If I recall correctly, the issue happens when an error occurs while creating the working copy. Drydock creates a copy, 'copy-123' for example, and if an error occurs during its creation (for example, the repository can't be fetched), the build is stopped and the folder 'copy-123' is not deleted. Drydock will retry after 15s and create another copy, 'copy-124'. So you can end up with millions of folders after a couple of hours.
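The cleanup that appears to be missing in this scenario can be sketched as follows. This is an illustration of the general pattern only, not Drydock's actual implementation; `create_working_copy` and the `populate` callback are hypothetical names.

```python
import os
import shutil


def create_working_copy(root, name, populate):
    """Create `root/name` and fill it by calling `populate(path)`.

    If populating fails (for example, the fetch is denied), the
    half-built directory is removed before the error propagates,
    so a retry does not leave an orphan behind.
    """
    path = os.path.join(root, name)
    os.mkdir(path)
    try:
        populate(path)
    except Exception:
        # Remove the partial working copy, then re-raise the original error.
        shutil.rmtree(path, ignore_errors=True)
        raise
    return path
```

With this shape, a failed 'copy-123' is gone before 'copy-124' is attempted, so the number of directories stays bounded by the number of successful copies.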

If I understand correctly, no error occurred during working copy creation in your reproduction steps. You have made many changes since then, so perhaps the issue was solved.