Page MenuHomePhabricator

Lock is not locked by this process
Closed, WontfixPublic

Description

I'm seeing a lot of these errors in our phd logs:

Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] [2015-03-07 10:41:31] EXCEPTION: (PhutilProxyException) Error while updating the "rPP" repository. {>} (CommandException) Command failed with error #255!
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] COMMAND
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] '/usr/src/phabricator/bin/repository' update '--no-discovery' -- 'PP'
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] 
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] STDOUT
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] (empty)
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] 
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] STDERR
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] [2015-03-07 10:41:31] EXCEPTION: (Exception) Lock 'global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:PP is not locked by this process! at [<phutil>/src/filesystem/PhutilLock.php:195]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] arcanist(head=master, ref.master=c36b4ceb1849), phabricator(head=master, ref.master=85f4bdc2ac74), phlab(head=master, ref.master=4689d11dbe69), phutil(head=master, ref.master=eaa6ebecef8e)
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]   #0 PhutilLock::unlock() called at [<phabricator>/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php:102]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]   #1 PhabricatorRepositoryManagementUpdateWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:396]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]   #2 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:292]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]   #3 PhutilArgumentParser::parseWorkflows(array) called at [<phabricator>/scripts/repository/manage_repositories.php:22]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]  at [<phutil>/src/future/exec/ExecFuture.php:416]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100] arcanist(head=master, ref.master=c36b4ceb1849), phabricator(head=master, ref.master=85f4bdc2ac74), phlab(head=master, ref.master=4689d11dbe69), phutil(head=master, ref.master=eaa6ebecef8e)
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]   #0 phlog(PhutilProxyException) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:364]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]   #1 PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture(PhabricatorRepository, ExecFuture, integer) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:199]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]   #2 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:179]
Daemon 4800 STDE [Sat, 07 Mar 2015 10:41:32 +1100]   #3 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:119]

Event Timeline

joshuaspence raised the priority of this task from to Needs Triage.
joshuaspence updated the task description. (Show Details)
joshuaspence added a project: Daemons.
joshuaspence added a subscriber: joshuaspence.

I'm seeing this on a build from March 12. A repository stopped updating on Friday and the daemon logs have been printing out this message ever few minutes and it is still occuring on Monday. No change after phd restart. diffusion/edit says 'Initializing Working Copy Daemons are initializing the working copy.'

I'm now seeing this on a second repository. After some digging it appears to be a magical mysql GET_LOCK and not something simple on the file system I can just clear.

D12253 "fixes" this in the sense that it likely reduces it to T6958.

The root cause is very likely to be a stray copy of the daemon running somewhere.

The root cause is very likely to be a stray copy of the daemon running somewhere.

FWIW I tried ./bin/phd stop --> ps aux | grep php and don't see anything.

Does this show anything?

mysql> SELECT IS_USED_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:PP');

Specifically, if that returns an integer, it's a connection ID.

You can use SHOW FULL PROCESSLIST to get very slightly more information about the connection ID. Possibly, that will show you it's on oldbackup23.mycompany.com and you'll be like "oh, I forgot I left a daemon running on that other machine" and go shut it down. If it shows it on the local machine and you're connecting over TCP, you can use netstat -t or maybe lsof to figure out which PID is connected to that port (there might be an easier way to do this), then use ps to figure out what process owns that PID.

mysql> SELECT IS_USED_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:TEMPORALANOMALY');
+--------------------------------------------------------------------------------------------------+
| IS_USED_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:TEMPORALANOMALY') |
+--------------------------------------------------------------------------------------------------+
|                                                                                             NULL |
+--------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT IS_USED_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRA');
+----------------------------------------------------------------------------------------+
| IS_USED_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRA') |
+----------------------------------------------------------------------------------------+
|                                                                                   NULL |
+----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Hrrm. Maybe wait for D12253 and see if that provides any clarity. It's possible it will give us a much better and more specific message, rather than just degrading into T6958.

Updated log message with D12253 applied and daemons restarted

[01-Apr-2015 20:05:21] [2015-04-01 20:05:21] EXCEPTION: (PhutilProxyException) Error while updating the "rHYDRA" repository. {>} (CommandException) Command failed with error #255!
COMMAND
'/home/web/sites/phabricator/phabricator/bin/repository' update  -- 'HYDRA'

STDOUT
(empty)

STDERR
[2015-04-01 20:05:21] EXCEPTION: (PhutilLockException) global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRA at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:94]
arcanist(head=f51f55a63cbceec95319d5236292970c7918866e, ref.master=9e357d3fd0ca), phabricator(head=debug-T7484, ref.master=d10e25098c67, ref.debug-T7484=f3536b1e4007), phutil(head=6a1f18b4d2e3240483ecdc134a624650c17ca422, ref.master=702046d23645)
  #0 PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:167]
  #1 PhutilLock::lock() called at [<phabricator>/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php:60]
  #2 PhabricatorRepositoryManagementUpdateWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:396]
  #3 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:292]
  #4 PhutilArgumentParser::parseWorkflow... (81 more bytes) ... at [<phutil>/src/future/exec/ExecFuture.php:416]
[01-Apr-2015 20:05:21] arcanist(head=f51f55a63cbceec95319d5236292970c7918866e, ref.master=9e357d3fd0ca), phabricator(head=debug-T7484, ref.master=d10e25098c67, ref.debug-T7484=f3536b1e4007), phutil(head=6a1f18b4d2e3240483ecdc134a624650c17ca422, ref.master=702046d23645)
[01-Apr-2015 20:05:21]   #0 ExecFuture::resolvex() called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:357]
[01-Apr-2015 20:05:21]   #1 PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture(PhabricatorRepository, ExecFuture, integer) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:199]
[01-Apr-2015 20:05:21]   #2 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:179]
[01-Apr-2015 20:05:21]   #3 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:119]

I'm not sure I find the new log message that much more different or illuminating.

Okay, that is the lock failing to acquire. Does this work?

mysql> SELECT GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRA', 0);

The only expected failure mode for lock acquisition is something else holding the lock, so I'm not sure what's up.

Possibly:

MySQL 5.7.5 and later enforces a maximum length on lock names of 64 characters. Previously, no limit was enforced.

$ php -r "echo strlen('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRA');"
70

So maybe the unexpected failure mode is "lock name too long".

It's also possible that this:

Previously, no limit was enforced.

...means "previously, locks over 64 characters long all collided with one another".

mysql> SELECT GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRA', 0);
+---------------------------------------------------------------------------------------+
| GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRA', 0) |
+---------------------------------------------------------------------------------------+
|                                                                                     1 |
+---------------------------------------------------------------------------------------+
# mysql --version
mysql  Ver 14.14 Distrib 5.6.17, for Linux (x86_64) using  EditLine wrapper

Can you see if this fixes it (bringing lock name length under 64 characters):

diff --git a/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php b/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWo
index ba57873..387b0a8 100644
--- a/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php
+++ b/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php
@@ -54,7 +54,7 @@ final class PhabricatorRepositoryManagementUpdateWorkflow
     $callsign = $repository->getCallsign();
 
     try {
-      $lock_name = get_class($this).':'.$callsign;
+      $lock_name = 'repository.update:'.$callsign;
       $lock = PhabricatorGlobalLock::newLock($lock_name);
 
       $lock->lock();

With that patch the daemon log is clean and the repositories show recent updates.

If it was just the 64 character lock limit I'm confused why it was always the same repositories getting unlucky.

Does this patch also fix it?

diff --git a/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php b/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWo
index ba57873..387b0a8 100644
--- a/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php
+++ b/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php
@@ -54,7 +54,7 @@ final class PhabricatorRepositoryManagementUpdateWorkflow
     $callsign = $repository->getCallsign();
 
     try {
-      $lock_name = get_class($this).':'.$callsign;
+      $lock_name = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:'.$callsign;
       $lock = PhabricatorGlobalLock::newLock($lock_name);
 
       $lock->lock();

(For any enormously long lock name).

If it's still broken under that change, that points at the 64 character limit pretty convincingly.

If that also fixes it, that points at something possibly holding the lock (maybe?) -- or is generally spooky.

The order which we update repositories is deterministic enough that I'd expect things to usually end up locking in the same order, and they'll keep retrying regularly (every 15 seconds) after a failure, so it's not particularly surprising to me that the failures wouldn't be randomly distributed if the issue is a 64-character lock uniqueness limit, although I would expect us to see way more failures all the time since 64 characters ends before the callsign part and we haven't changed the lock name. Maybe the behavior is specific to your MySQL sub-version.

I can't reproduce on 5.6.21.

You could also definitively confirm that it's a length limit by doing this in two different windows:

mysql> SELECT GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRX', 0);
mysql> SELECT GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRY', 0);

(That is, lock "HYDRX" in one, and "HYDRY" in the other -- using non-callsigns in both cases to make sure we don't hit any "real" locks by accident.)

If those collide (second lock returns "0"), that's very compelling evidence that the lock name length is the issue.

  • Shorten patch: Works
  • Spooky Patch: Works
  • Roll back to no patch: Confirmed still broken.

casper2.jpg (444×539 px, 58 KB)

Window 1:

mysql> SELECT GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRX', 0);
+---------------------------------------------------------------------------------------+
| GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRX', 0) |
+---------------------------------------------------------------------------------------+
|                                                                                     1 |
+---------------------------------------------------------------------------------------+

Window 2:

mysql> SELECT GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRY', 0);
+---------------------------------------------------------------------------------------+
| GET_LOCK('global:phabricator:PhabricatorRepositoryManagementUpdateWorkflow:HYDRY', 0) |
+---------------------------------------------------------------------------------------+
|                                                                                     1 |
+---------------------------------------------------------------------------------------+

I guess you could try this? Maybe?

diff --git a/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php b/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWo
index ba57873..39df560 100644
--- a/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php
+++ b/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php
@@ -54,7 +54,7 @@ final class PhabricatorRepositoryManagementUpdateWorkflow
     $callsign = $repository->getCallsign();
 
     try {
-      $lock_name = get_class($this).':'.$callsign;
+      $lock_name = get_class($this).'.'.$callsign;
       $lock = PhabricatorGlobalLock::newLock($lock_name);
 
       $lock->lock();

But I'm just fishing at this point. No clue what's going on.

I'll pursue these changes, since we reasonably believe they are correct/necessary under MySQL 5.7.5:

  • I'll throw from newLock() if the lock name is more than 64 characters long.
  • I'll shorten the lock name.

That should accidentally fix it without actually establishing what the problem is, but hopefully no one else will ever experience whatever problem you're encountering.

iiam

No errors with that patch either...

This comment was removed by cburroughs.

:-(

We are seeing this again with both D12263 and D12253 applied.

[08-Apr-2015 14:14:00] [2015-04-08 14:14:00] EXCEPTION: (PhutilProxyException) Error while updating the "rSEQUOIA" repository. {>} (CommandException) Command failed with error #255!
COMMAND
'/home/web/sites/phabricator/phabricator/bin/repository' update  -- 'SEQUOIA'

STDOUT
(empty)

STDERR
[2015-04-08 14:14:00] EXCEPTION: (PhutilLockException) phabric-6rILUDHKjHKu-g:repository.update:75 at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:107]
arcanist(head=f51f55a63cbceec95319d5236292970c7918866e, ref.master=9e357d3fd0ca), phabricator(head=debug-T7484, ref.master=d10e25098c67, ref.debug-T7484=09c757119b24), phutil(head=6a1f18b4d2e3240483ecdc134a624650c17ca422, ref.master=702046d23645)
  #0 PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:167]
  #1 PhutilLock::lock() called at [<phabricator>/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php:59]
  #2 PhabricatorRepositoryManagementUpdateWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:396]
  #3 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:292]
  #4 PhutilArgumentParser::parseWorkflows(array) called at [<phabr... (55 more bytes) ... at [<phutil>/src/future/exec/ExecFuture.php:416]
[08-Apr-2015 14:14:00] arcanist(head=f51f55a63cbceec95319d5236292970c7918866e, ref.master=9e357d3fd0ca), phabricator(head=debug-T7484, ref.master=d10e25098c67, ref.debug-T7484=09c757119b24), phutil(head=6a1f18b4d2e3240483ecdc134a624650c17ca422, ref.master=702046d23645)
[08-Apr-2015 14:14:00]   #0 ExecFuture::resolvex() called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:357]
[08-Apr-2015 14:14:00]   #1 PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture(PhabricatorRepository, ExecFuture, integer) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:199]
[08-Apr-2015 14:14:00]   #2 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:179]
[08-Apr-2015 14:14:00]   #3 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:119]

Spooky problem has not occurred in the few minutes since I rebased off of e0473aa70265a222d440c7e2e2680d2d1959ab20. Will continue to watch.

Occurring on yet another repository after the latest rebase:

17-Apr-2015 13:25:29] [2015-04-17 13:25:29] EXCEPTION: (PhutilProxyException) Error while updating the "rBOOST" repository. {>} (CommandException) Command failed with error #255!
COMMAND
'/home/web/sites/phabricator/phabricator/bin/repository' update  -- 'BOOST'

STDOUT
(empty)

STDERR
[2015-04-17 13:25:29] EXCEPTION: (PhutilLockException) phabric-6rILUDHKjHKu-g:repository.update:54 at [<phabricator>/src/infrastructure/util/PhabricatorGlobalLock.php:107]
arcanist(head=f2a3fdf5e393de1fd9767f46434004848f97a34c, ref.master=9e357d3fd0ca), phabricator(head=c550a09e4ba38c953d3e4fcaa50d268563732ceb, ref.master=d10e25098c67), phutil(head=0a4415cb590b628026b83198102a4402b7b27a08, ref.master=702046d23645)
  #0 PhabricatorGlobalLock::doLock(double) called at [<phutil>/src/filesystem/PhutilLock.php:167]
  #1 PhutilLock::lock() called at [<phabricator>/src/applications/repository/management/PhabricatorRepositoryManagementUpdateWorkflow.php:59]
  #2 PhabricatorRepositoryManagementUpdateWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:396]
  #3 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:292]
  #4 PhutilArgumentParser::parseWorkflows(array) called at [<phabri... (54 more bytes) ... at [<phutil>/src/future/exec/ExecFuture.php:416]

Not sure this is fixed yet.

cburroughs claimed this task.

I have not seen this since moving our install a month ago. I don't have the logs from the old server to verify if that was before or after D12574. Maybe the old server was haunted? ¯\_(ツ)_/¯

jhurwitz added a subscriber: jhurwitz.

We think we just ran into this today? But the problem mysteriously resolved itself after 2 hours, so we can't be much help in reproing.

I can write a lock log for this but it's going to add a huge amount of complexity and basically amount to implementing locks ourselves from scratch. However, I don't see any other way forward.

We have not had any poltergeist sightings recently.

1228708185_4.jpg (240×320 px, 18 KB)

eadler added a project: Restricted Project.Aug 5 2016, 5:23 PM
epriestley claimed this task.

This got a little more work when clustering was written but it's essentially a log of two-year-old ghost sightings at this point and not actionable.

iiam