Page MenuHomePhabricator

Unable to release global Phabricator lock used by Drydock
Closed, ResolvedPublic

Description

I have no idea how my Phabricator instance got into this state, but I seem unable to correct it. I'm pretty sure this is because I went and release a lease or resource while daemons were doing some work?

Basically the global lock "ph:phabric-6rILUDHKjHKu:drydock.resource:d_d.uVkX3iPn" is being held. Drydock refuses to actually allocate any resources until it can get the global lock. In the daemon logs, I see this:

Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000] [2016-06-20 18:24:52] EXCEPTION: (PhutilProxyException) Error while executing Task ID 1520716. {>} (PhutilLockException) ph:phabric-6rILUDHKjHKu:drydock.resource:d_d.uVkX3iPn at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:124]
Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000] arcanist(head=master, ref.master=c13e5a629535), phabricator(head=master, ref.master=f9a58fafba0d, custom=3), phutil(head=master, ref.master=fb1e159d3640)
Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000]   #0 <#2> PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:168]
Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000]   #1 <#2> PhutilLock::lock(integer) called at [<phabricator>/src/applications/drydock/worker/DrydockResourceUpdateWorker.php:20]
Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000]   #2 <#2> DrydockResourceUpdateWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:122]
Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000]   #3 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:171]
Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000]   #4 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000]   #5 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:184]
Daemon 302977 STDE [Mon, 20 Jun 2016 18:24:52 +1000]   #6 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:127]
Daemon 302977 FAIL [Mon, 20 Jun 2016 18:24:52 +1000] Process exited with error 255.
Daemon 302977 WAIT [Mon, 20 Jun 2016 18:24:52 +1000] Waiting to restart process.
Daemon 302978 INIT [Mon, 20 Jun 2016 18:24:54 +1000] Starting process.
Daemon 302977 INIT [Mon, 20 Jun 2016 18:24:57 +1000] Starting process.
Daemon 302978 DONE [Mon, 20 Jun 2016 18:25:15 +1000] Process exited normally.
Daemon 302977 DONE [Mon, 20 Jun 2016 18:26:04 +1000] Process exited normally.
Daemon 302979 INIT [Mon, 20 Jun 2016 18:26:05 +1000] Starting process.
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:11 +1000] [2016-06-20 18:26:11] EXCEPTION: (PhutilProxyException) Error while executing Task ID 1520716. {>} (PhutilLockException) ph:phabric-6rILUDHKjHKu:drydock.resource:d_d.uVkX3iPn at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:124]
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:12 +1000] arcanist(head=master, ref.master=c13e5a629535), phabricator(head=master, ref.master=f9a58fafba0d, custom=3), phutil(head=master, ref.master=fb1e159d3640)
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:12 +1000]   #0 <#2> PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:168]
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:12 +1000]   #1 <#2> PhutilLock::lock(integer) called at [<phabricator>/src/applications/drydock/worker/DrydockResourceUpdateWorker.php:20]
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:12 +1000]   #2 <#2> DrydockResourceUpdateWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:122]
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:12 +1000]   #3 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:171]
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:12 +1000]   #4 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:12 +1000]   #5 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:184]
Daemon 302979 STDE [Mon, 20 Jun 2016 18:26:12 +1000]   #6 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:127]
Daemon 302979 FAIL [Mon, 20 Jun 2016 18:26:12 +1000] Process exited with error 255.
Daemon 302979 WAIT [Mon, 20 Jun 2016 18:26:12 +1000] Waiting to restart process.
Daemon 302980 INIT [Mon, 20 Jun 2016 18:26:14 +1000] Starting process.
Daemon 302979 INIT [Mon, 20 Jun 2016 18:26:17 +1000] Starting process.

I tried releasing it with bin/storage shell, with:

MariaDB [(none)]> SELECT RELEASE_LOCK('ph:phabric-6rILUDHKjHKu:drydock.resource:d_d.uVkX3iPn');
+-----------------------------------------------------------------------+
| RELEASE_LOCK('ph:phabric-6rILUDHKjHKu:drydock.resource:d_d.uVkX3iPn') |
+-----------------------------------------------------------------------+
|                                                                     0 |
+-----------------------------------------------------------------------+
1 row in set (0.00 sec)

However that says it's releasing no locks?

I have no idea how to fix this issue, and none of my attempts to manually release the lock through MySQL seem to be working.

Event Timeline

So I manually set the resource to a destroyed status, cleared out drydock_command and saw Drydock attempt to start allocating a new resource. However, now I'm getting:

Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000] [2016-06-20 18:33:00] EXCEPTION: (PhutilProxyException) Error while executing Task ID 1520724. {>} (PhutilLockException) ph:phabric-6rILUDHKjHKu:drydock.resource:uAG3lO2imamy at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:124]
Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000] arcanist(head=master, ref.master=c13e5a629535), phabricator(head=master, ref.master=f9a58fafba0d, custom=3), phutil(head=master, ref.master=fb1e159d3640)
Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000]   #0 <#2> PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:168]
Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000]   #1 <#2> PhutilLock::lock(integer) called at [<phabricator>/src/applications/drydock/worker/DrydockResourceUpdateWorker.php:20]
Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000]   #2 <#2> DrydockResourceUpdateWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:122]
Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000]   #3 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:171]
Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000]   #4 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000]   #5 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:184]
Daemon 302981 STDE [Mon, 20 Jun 2016 18:33:00 +1000]   #6 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:127]
Daemon 302981 FAIL [Mon, 20 Jun 2016 18:33:00 +1000] Process exited with error 255.
Daemon 302981 WAIT [Mon, 20 Jun 2016 18:33:00 +1000] Waiting to restart process.
Daemon 302983 INIT [Mon, 20 Jun 2016 18:33:02 +1000] Starting process.
Daemon 302981 INIT [Mon, 20 Jun 2016 18:33:05 +1000] Starting process.
Daemon 302981 DONE [Mon, 20 Jun 2016 18:33:22 +1000] Process exited normally.
Daemon 302984 INIT [Mon, 20 Jun 2016 18:34:27 +1000] Starting process.

which is presumably the lock for that resource? How can this possibly fail if there's no work been done on it before?

hach-que claimed this task.

So I managed to resolve this eventually. It looks like the underlying cause of all the problems was that the SSH daemon on the build server itself had got into a weird state where it would accept connections, but not initiate an actual SSH session.

After I cleared out all resources, leases, slot locks and commands from Drydock, restarted the daemons and restarted the SSH daemon on the build server, things started working again.

I'm going to mark this as resolved, although I still think there might be some value in a CLI command to clear any global locks (or just better logging in Drydock).