
PhabricatorTaskmasterDaemon 100% CPU load
Closed, Wontfix (Public)

Description

I am having issues with the background daemons, which are causing over 100% CPU load.

When I start the daemons (./bin/phd start), at first nothing seems wrong. But occasionally the load spikes to 100% for longer periods, usually more than 15 minutes.
Then after a while it disappears, only to reappear at some seemingly random time later.

The command used to spawn the process with this load is:

php /var/vhosts/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorTaskmasterDaemon --load-phutil-library=/var/vhosts/phabricator/arcanist/src --load-phutil-library=/var/vhosts/phabricator/phabricator/src --log=/var/vhosts/phabricator/logs/daemons.log --

When I strace the daemon while the load is high, I see an endless loop of these:

19444 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
19444 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
19444 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
19444 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
19444 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
19444 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
19444 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
19444 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
19444 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
19444 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
19444 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
19444 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
19444 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
19444 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
19444 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
19444 rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0

I have 149,508 of these lines captured in less than 10 seconds, which explains the high CPU.

Now, as far as I know, the number 8 refers to a file descriptor for this process. But when I view /proc/19444/fd/ there is no entry 8:

0 lr-x------ 1 root root 64 Mar 17 14:08 0 -> pipe:[857844016]
0 l-wx------ 1 root root 64 Mar 17 14:08 1 -> pipe:[857844017]
0 l-wx------ 1 root root 64 Mar 17 14:08 2 -> pipe:[857844018]
0 lrwx------ 1 root root 64 Mar 17 14:08 3 -> socket:[776376581]
0 lrwx------ 1 root root 64 Mar 17 14:08 4 -> socket:[776376691]
0 lrwx------ 1 root root 64 Mar 17 14:08 5 -> socket:[857844034]
0 lrwx------ 1 root root 64 Mar 17 14:08 6 -> socket:[857844036]

I have tried increasing some sleep() calls in the source code to slow down the number of calls, but had no success, and I have not been able to locate where exactly this is happening. It does not seem to occur until I add an SVN repository to a fresh install. Even when Phabricator reports the repository as fully imported, and no commits are being made, the problem appears.

I am running the latest versions of arcanist, libphutil and phabricator from the Git repository on a DELL PowerEdge R310 running CentOS 5.10 x64.
All storage is on local disks.

The repositories are "Hosted elsewhere" and loaded over HTTP, though actually hosted on the same machine.

Any hints or tips on how to resolve or debug this issue?

Event Timeline

fpruis raised the priority of this task from to Needs Triage.
fpruis updated the task description.
fpruis added a project: Daemons.
fpruis added a subscriber: fpruis.
chad renamed this task from "PhabricatorTaskmasterDaemon 100% CPU load" to "PhabricatorTaskmasterDaemon 100% CPU load". Mar 17 2014, 4:40 PM

Try stopping the daemons, then run:

bin/phd debug taskmaster

...in the foreground. That will print out a lot of high-level trace information. If that process spikes CPU, let us know what it was logging around that time?

Checking the daemon console (/daemon/ on the web) to see which task is currently leased when CPU spikes may also be instructive.

Thank you for your reply epriestley.

I executed your debug command and right from the start it's bingo:

debug taskmaster output:

2014-03-18 8:54:34 AM [STDE] <<< [11] <query> 13,144,677 us
>>> [10] <event> daemon.didLogMessage <listeners = 2>
>>> [11] <query> INSERT INTO `daemon_logevent` (`logID`, `logType`, `message`, `epoch`) VALUES ('118', 'STDE', '<<< [11] <query> 13,144,677 us', '1395129274')
<<< [11] <query> 688 us
<<< [10] <event> 1,412 us

And there it stopped for about 1 minute with the CPU at 100%, then continued with:

>>> [12] <event> daemon.didHeartbeat <listeners = 2>
>>> [13] <query> UPDATE `daemon_log` SET `daemon` = 'PhabricatorTaskmasterDaemon', `host` = 'web01.esites', `pid` = '20602', `argv` = '[\"PhabricatorTaskmasterDaemon\",\"--load-phutil-library=\\/var\\/vhosts\\/phabricator\\/arcanist\\/src\",\"--load-phutil-library=\\/var\\/vhosts\\/phabricator\\/phabricator\\/src\",\"--trace\",\"--verbose\",\"--phd=\\/var\\/vhosts\\/phabricator\\/pid\"]', `status` = 'run', `id` = '118', `dateCreated` = '1395129260', `dateModified` = '1395129381' WHERE `id` = 118
<<< [13] <query> 55,302 us
<<< [12] <event> 56,360 us

Again this paused for about 1 or 2 minutes, spiking the CPU to 100%. After this one it continued with various query output (looks like commits being read) and lower CPU.

During the pauses, the /daemon/ page told me the following daemon is active:

Daemon 118 PhabricatorTaskmasterDaemon
Tue, Mar 18, 8:54 AM
This daemon is running.

I sometimes notice other pauses in the debug output, for example after this:

<VERB> PhabricatorTaskmasterDaemon Working on task 946102 (PhabricatorRepositoryCommitOwnersWorker)...
>>> [29] <connect> phabricator_repository
<<< [29] <connect> 706 us
>>> [30] <query> SELECT * FROM `repository_commit` WHERE `id` = 239107
<<< [30] <query> 531 us
>>> [31] <query> SELECT * FROM `repository` r  WHERE (r.id IN (10)) ORDER BY r.id DESC
<<< [31] <query> 150 us
>>> [32] <query> SELECT * FROM `repository_commit` WHERE repositoryID = 10 AND commitIdentifier = '22171'
<<< [32] <query> 813 us
>>> [33] <query> SELECT c.*, p.path pathName, t.path targetPathName,
          i.commitIdentifier targetCommitIdentifier
        FROM `repository_pathchange` c
          LEFT JOIN `repository_path` p ON c.pathID = p.id
          LEFT JOIN `repository_path` t ON c.targetPathID = t.id
          LEFT JOIN `repository_commit` i ON c.targetCommitID = i.id
        WHERE c.commitID = 239107 AND isDirect = 1
>>> [133] <event> daemon.didLogMessage <listeners = 2>
>>> [134] <query> INSERT INTO `daemon_logevent` (`logID`, `logType`, `message`, `epoch`) VALUES ('119', 'STDE', '>>> [0] <connect> phabricator_config\n<<< [0] <connect> 3,336 us\n>>> [1] <query> SELECT * FROM `config_entry` WHERE namespace = \'default\' AND isDeleted = 0 \n<<< [1] <query> 237 us\n>>> [2] <connect> phabricator_worker\n<<< [2] <connect> 566 us\n>>> [3] <query> SELECT id, leaseOwner FROM `worker_activetask` WHERE (leaseOwner IS NULL) ORDER BY id ASC LIMIT 1\n<<< [3] <query> 376 us\n>>> [4] <query> UPDATE `worke
<<< [134] <query> 84,172 us
<<< [133] <event> 87,821 us

This also took about a minute, but did not hog the CPU; it was just idle. So no problems there.

But then right afterward the same block as before reappeared and again hogged the CPU at 100%, this time for multiple minutes:

2014-03-18 9:09:18 AM [STDE] <<< [33] <query> 24,068,933 us
  >>> [135] <event> daemon.didLogMessage <listeners = 2>
  >>> [136] <query> INSERT INTO `daemon_logevent` (`logID`, `logType`, `message`, `epoch`) VALUES ('119', 'STDE', '<<< [33] <query> 24,068,933 us', '1395130158')
  <<< [136] <query> 1,448 us
  <<< [135] <event> 2,157 us

And immediately after, the same again, with another long pause and high CPU.

The only things that stand out to me there are these:

<<< [11] <query> 13,144,677 us
<<< [33] <query> 24,068,933 us

...which suggests the queries took 13 seconds and 24 seconds to run, respectively. Those are very slow; I would expect them to take a few milliseconds. These are also a bit slow:

<<< [134] <query> 84,172 us
<<< [13] <query> 55,302 us

It looks like this query was the 24-second query:

SELECT c.*, p.path pathName, t.path targetPathName,
          i.commitIdentifier targetCommitIdentifier
        FROM `repository_pathchange` c
          LEFT JOIN `repository_path` p ON c.pathID = p.id
          LEFT JOIN `repository_path` t ON c.targetPathID = t.id
          LEFT JOIN `repository_commit` i ON c.targetCommitID = i.id
        WHERE c.commitID = 239107 AND isDirect = 1

If you run that query manually (first, use phabricator_repository), is it also slow?

Generally, the daemons just seem to be running normally, so I don't have any real idea what's going on or why they're using all that CPU.

The query itself doesn't seem to be the problem:

Query took 0.0017 sec

The database is 28G in size, with 10G in repository_pathchange spread over ~113,019,282 rows.

It also wouldn't have explained the rt_sigprocmask calls I saw in my strace.
Also, it is the php process that is causing the load; mysql is idle most of the time.

To me it seems like some loop trying to read a stream that should be in blocking mode, but isn't.
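
For illustration, a standalone sketch (not Phabricator code) of that kind of failure mode: polling a non-blocking pipe in a tight loop never waits, so the loop spins at full CPU until the subprocess exits.

<?php

// Hypothetical reproduction sketch, not Phabricator code: read a
// non-blocking pipe in a tight loop. Nothing ever blocks, so the loop
// spins at 100% CPU for the lifetime of the subprocess.
$spec = array(
  0 => array('pipe', 'r'),
  1 => array('pipe', 'w'),
  2 => array('pipe', 'w'),
);

$pipes = array();
$proc = proc_open('sleep 5', $spec, $pipes);
stream_set_blocking($pipes[1], false);

$iterations = 0;
$status = proc_get_status($proc);
while ($status['running']) {
  fread($pipes[1], 8192);          // returns '' immediately instead of waiting
  $iterations++;
  $status = proc_get_status($proc);
}

printf("Polled %d times while waiting for the subprocess.\n", $iterations);
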
If all else fails I could try installing xdebug or xhprof to better pinpoint what is going on?

You should be able to install xhprof and then run:

bin/phd debug taskmaster --trace --xprofile out.xprofile

That should generate an out.xprofile file when the process is ^C'd.

I've generated an xhprof output file from launch up until execution pauses. I ^C'd it when it was about to continue.
The time seems to be spent calling proc_close(), called from PhutilExecPassthru::execute().

See http://www.e-sites.nl/phabricator.xhprof (40kb) for the entire trace.

I am running PHP 5.3.28

Oh, sorry, my fault: the profile is misleading -- the phd process is actually being profiled, not the daemon, which we execute as a subprocess. I'll see if I can adjust the --xprofile flag to either propagate to the subprocess, or add a flag to run the daemon in the parent process.

I think you can get a proper profile like this (I'm on 5.5.8 and xhprof segfaults immediately when built against 5.5.8, which makes this difficult for me to test properly locally):

phabricator/ $ ./scripts/daemon/phd-daemon 'PhabricatorTaskmasterDaemon' --trace --verbose --phd='/var/tmp/phd/pid' --xprofile out.xprofile

You were right, that works.

This shows stream_select() being the culprit, called from Future::waitForSockets().
Entire trace file here: http://www.e-sites.nl/phabricator.2.xhprof

I also noticed two new things about the situation:

  1. top shows the 100% CPU load under system CPU time, while I would expect it under user CPU time.
  2. the resident memory size grows from 900M to 4.1G during the 100% load

I am trying to debug ./libphutil/src/future/Future.php, where the stream_select() call is:

$ok = stream_select(
  $read_list,
  $write_list,
  $exceptfds,
  $timeout,
  $timeout_usec);

During the high load $ok is (int) 0.
$timeout is a float (e.g. 0.99919239192) while stream_select() expects an integer. I suspect the intention was to pass $timeout_sec instead, but changing this does not resolve the issue.
Even when I pass NULL as the timeout (and discard timeout_usec) the high CPU load persists, although strace now shows endless calls to brk() instead of rt_sigprocmask().
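
For reference, a minimal standalone sketch (not the actual Future.php code) of the call shape stream_select() expects: the float timeout split into whole seconds plus the leftover microseconds, both integers.

<?php

// Standalone sketch, not Future.php: split a float timeout into the integer
// seconds/microseconds pair that stream_select() expects.
$spec = array(
  0 => array('pipe', 'r'),
  1 => array('pipe', 'w'),
  2 => array('pipe', 'w'),
);

$pipes = array();
$proc = proc_open('sleep 2', $spec, $pipes);
list($stdin, $stdout, $stderr) = $pipes;

$timeout = 0.99919239192;                                   // the float seen while debugging
$timeout_sec  = (int)$timeout;                              // whole seconds
$timeout_usec = (int)(($timeout - $timeout_sec) * 1000000); // remaining microseconds

$read   = array($stdout, $stderr);
$write  = array();
$except = array();

$ok = stream_select($read, $write, $except, $timeout_sec, $timeout_usec);
var_dump($ok);   // int(0) here: the timeout elapsed with no streams ready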

I am starting to worry something is wrong at a deeper level on my system. But hopefully you are able to continue your much appreciated help.

Ah, I think you're right that we intended to pass $timeout_sec instead of $timeout. I'll fix that (D8567), but I wouldn't expect it to affect behavior (it sounds like it didn't, and I can't trigger any behavioral changes in testing).

This is definitely starting to look like something might be wrong at a deeper level. If you pass 1, do you experience the same behavior? You can also use a simple script like this to test the behavior of stream_select() in a more isolated environment, if that's helpful:

<?php

$spec = array(
  0 => array('pipe', 'r'),
  1 => array('pipe', 'w'),
  2 => array('pipe', 'w'),
);

$pipes = array();
$proc = proc_open('sleep 30', $spec, $pipes);
list($stdin, $stdout, $stderr) = $pipes;

$read = array($stdout, $stderr);
$write = array();
$except = array();

$select_start = microtime(true);
stream_select($read, $write, $except, 0, 80000);
$select_end = microtime(true);

printf(
  "select() took %d us to return.\n",
  ($select_end - $select_start) * 1000000);

...with something like that, if you change the timeout_sec to 60, do you experience CPU spin while we're waiting for stream_select()?

Also, this is a bit of a shot in the dark, but if you send SIGHUP to the phd-daemon process, it should dump a stack trace into phabricator_backtrace_<pid> in a system-dependent temporary directory which you can find with this command:

php -r 'echo sys_get_temp_dir();'

A vaguely plausible scenario where that could theoretically tell us something useful is if we're receiving signals while inside stream_select(), dropping into a signal handler, and then doing something expensive which doesn't show up on strace. I'm not really sure this could happen, or that sending the process another signal could detect it (I don't offhand have a nuanced understanding of how signal handlers will reenter while we're inside a signal handler), and signaling could disrupt things and change behavior, but it's fairly easy to try and might give us a bit more information.

Setting the timeout to 1 or 60 in libphutil/src/future/Future.php:146 does not make any difference, although the brk() and rt_sigprocmask() calls in strace now both seem to occur.

Your code as-is:

select() took 80057 us to return.

With $timeout set to 60:

select() took 30001078 us to return.

During this time there was no abnormal CPU usage, and strace only showed a select(); no brk() nor rt_sigprocmask().
Seems good to me.

Could it have anything to do with mod_pcntl?

Hmm, how about this version? The theory is that either the ticks directive or signal handling might be the issue.

<?php

declare(ticks = 1);

$signals = array(
  SIGINT => 'SIGINT',
  SIGHUP => 'SIGHUP',
  SIGTERM => 'SIGTERM',
  SIGUSR1 => 'SIGUSR1',
  SIGUSR2 => 'SIGUSR2',
  SIGCHLD => 'SIGCHLD',
);

foreach ($signals as $signal => $signame) {
  pcntl_signal($signal, 'handle_signal');
}

$spec = array(
  0 => array('pipe', 'r'),
  1 => array('pipe', 'w'),
  2 => array('pipe', 'w'),
);

$pipes = array();
$proc = proc_open('sleep 30', $spec, $pipes);
list($stdin, $stdout, $stderr) = $pipes;

$read = array($stdout, $stderr);
$write = array();
$except = array();

$select_start = microtime(true);
stream_select($read, $write, $except, 30, 0);
$select_end = microtime(true);

printf(
  "select() took %d us to return.\n",
  ($select_end - $select_start) * 1000000);

function handle_signal($signo) {
  global $signals;

  $name = $signo;
  foreach ($signals as $signal => $signame) {
    if ($signal == $signo) {
      $name = $signame;
      break;
    }
  }

  echo "Got signal {$name} ({$signo})!\n";
}

That one doesn't reproduce the issue either.
The CPU stays idle and strace shows a single select() while waiting; it finishes with:

select() took 29999794 us to return.

Anything with blocking/non-blocking maybe?

Here's a non-blocking version, similar to what we do in the real code:

<?php

declare(ticks = 1);

$signals = array(
  SIGINT => 'SIGINT',
  SIGHUP => 'SIGHUP',
  SIGTERM => 'SIGTERM',
  SIGUSR1 => 'SIGUSR1',
  SIGUSR2 => 'SIGUSR2',
  SIGCHLD => 'SIGCHLD',
);

foreach ($signals as $signal => $signame) {
  pcntl_signal($signal, 'handle_signal');
}

$spec = array(
  0 => array('pipe', 'r'),
  1 => array('pipe', 'w'),
  2 => array('pipe', 'w'),
);

$pipes = array();
$proc = proc_open('sleep 30', $spec, $pipes);
list($stdin, $stdout, $stderr) = $pipes;

stream_set_blocking($stdin, false);
stream_set_blocking($stdout, false);
stream_set_blocking($stderr, false);

$read = array($stdout, $stderr);
$write = array();
$except = array();

$select_start = microtime(true);
stream_select($read, $write, $except, 30, 0);
$select_end = microtime(true);

printf(
  "select() took %d us to return.\n",
  ($select_end - $select_start) * 1000000);

function handle_signal($signo) {
  global $signals;

  $name = $signo;
  foreach ($signals as $signal => $signame) {
    if ($signal == $signo) {
      $name = $signame;
      break;
    }
  }

  echo "Got signal {$name} ({$signo})!\n";
}

Just as a sanity check, I double-checked that we aren't seeing anything like this ourselves, but the taskmasters on this box have been running for 12+ hours and each has less than a minute of CPU time used, idle CPUs, low RAM consumption, etc.

                                                    START   TIME
ec2-user 31086  0.0  0.2 217632 18396 ?        S    Mar18   0:48 php /core/lib/phabricator/scripts/daemon/phd-daemon PhabricatorTaskmasterDaemon --daemonize --log=/var/tmp/phd/log/daemons.log --phd=/var/tmp/phd/pid
ec2-user 31101  0.0  0.2 217632 18324 ?        S    Mar18   0:47 php /core/lib/phabricator/scripts/daemon/phd-daemon PhabricatorTaskmasterDaemon --daemonize --log=/var/tmp/phd/log/daemons.log --phd=/var/tmp/phd/pid
ec2-user 31117  0.0  0.2 217632 18328 ?        S    Mar18   0:47 php /core/lib/phabricator/scripts/daemon/phd-daemon PhabricatorTaskmasterDaemon --daemonize --log=/var/tmp/phd/log/daemons.log --phd=/var/tmp/phd/pid
ec2-user 31143  0.0  0.2 217632 18376 ?        S    Mar18   0:47 php /core/lib/phabricator/scripts/daemon/phd-daemon PhabricatorTaskmasterDaemon --daemonize --log=/var/tmp/phd/log/daemons.log --phd=/var/tmp/phd/pid
                                                      ^      ^
                                                      |      |
                      12+ hours of runtime  ----------+      |
                <1 minute of time used      -----------------+

Yes, we are getting close.

I have taken the code above and modified it to be more like how the daemon runs it: with a timeout of (int) 0, a timeout_usec of for example 999905 (as seen during debugging), and inside a loop, as Future::resolve() does.
When run, I see what seems normal:

select() took 999950 us to return.
select() took 1000955 us to return.
select() took 1000918 us to return.
...

This continues for 30 seconds under normal CPU load, and then suddenly..

select() took 4 us to return.
select() took 3 us to return.
select() took 3 us to return.
...

And the CPU spikes to 100%.

As soon as the "sleep 30" finishes, the script goes rogue.
This confirms my earlier observation that the script was trying to read a file descriptor that did not exist; though in this test case I obviously have nothing built in to break the loop, the symptoms are the same.
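
Roughly, the modified test looks like this (a reconstruction from the description above; deliberately no check for subprocess exit):

<?php

// Roughly the modified test described above (a sketch): the earlier
// non-blocking example, with stream_select() inside a loop and a
// 0 sec / 999905 usec timeout, and no check for subprocess exit.
$spec = array(
  0 => array('pipe', 'r'),
  1 => array('pipe', 'w'),
  2 => array('pipe', 'w'),
);

$pipes = array();
$proc = proc_open('sleep 30', $spec, $pipes);
list($stdin, $stdout, $stderr) = $pipes;

stream_set_blocking($stdin, false);
stream_set_blocking($stdout, false);
stream_set_blocking($stderr, false);

while (true) {
  // stream_select() may modify these arrays, so rebuild them each pass.
  $read   = array($stdout, $stderr);
  $write  = array();
  $except = array();

  $select_start = microtime(true);
  stream_select($read, $write, $except, 0, 999905);
  $select_end = microtime(true);

  printf(
    "select() took %d us to return.\n",
    ($select_end - $select_start) * 1000000);
}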

Hmm, I'm not convinced that's related. If we loop with select() on bad file descriptors it's not surprising that the program spins, but in resolve() we'll check to see if the process exited after the select() returns, which should prevent us from looping with bad descriptors.

You can run commands using ExecFuture in a more isolated way with a script like this (the scripts/ directory is libphutil/scripts/):

<?php

require_once 'scripts/__init_script__.php';

execx('sleep 5');

Does that experience the same issues?

Nope, that one works correctly.
I see one select() that pauses until the sleep is finished, and no CPU usage.

When I echo the microtime inside Future::waitForSockets() and run your script from above, I see it output once every second:

1395240787.1607
1395240788.1608
1395240789.1607
1395240790.1607
1395240791.1607
...

When I run the daemon I get an endless stream:

1395240743.7532
1395240743.7534
1395240743.7638
1395240743.764
1395240743.7645
1395240743.765
1395240743.7651
1395240743.7655
...

Maybe a difference is that "sleep" exits nicely while the task the daemon executes does not, causing resolve() not to detect this properly?
//edit As discovered later: the task exits with an exception because of an expired lease.

ExecFuture::isReady() is the method used to detect process exit, so you could try putting print/echo in there. If you added echo "isready\n";, I'd expect to see something like:

1395240743.7532
isready
1395240743.7534
isready

...etc. You could then conceivably add more "echo" to that method and figure out what's happening in terms of detecting subprocess exit. You can also figure out which command is being run by echoing $this->command.

(It's also possible to end up in waitForSockets() via FutureIterator::next(), but this is rarer.)

When I add the echo inside ExecFuture::isReady(), the output becomes this:

isready
1395651103.4699
isready
isready
isready
1395651103.471
isready
1395651104.4709
isready
isready
isready
1395651104.472
isready
1395651105.4719
isready
isready
isready
1395651105.473
...

The number of microtime lines has been reduced to 2 per second, but CPU load is still at 100%.

Echoing $this->command inside isReady() gives me the same response every time:

exec ./exec_daemon.php 'PhabricatorTaskmasterDaemon' '--trace' '--load-phutil-library=/var/vhosts/phabricator/arcanist/src' '--load-phutil-library=/var/vhosts/phabricator/phabricator/src' '--verbose' --

When I manually spawn this process, the last thing it outputs is:

SELECT c.*, p.path pathName, t.path targetPathName,
         i.commitIdentifier targetCommitIdentifier
       FROM `repository_pathchange` c
         LEFT JOIN `repository_path` p ON c.pathID = p.id
         LEFT JOIN `repository_path` t ON c.targetPathID = t.id
         LEFT JOIN `repository_commit` i ON c.targetCommitID = i.id
       WHERE c.commitID = 239107 AND isDirect = 1

This is shown a few lines higher:

<VERB> PhabricatorTaskmasterDaemon Working on task 946102 (PhabricatorRepositoryCommitOwnersWorker)...

Then the output hangs indefinitely with the CPU at 100%. Any tips on how to profile this script?

I have run the task manually.
In DiffusionPathChangeQuery::executeQuery() at line 55, the query (the one in my post above) returns 2,470,422 rows and uses over 4 gigabytes of memory.
The call to isort() on this array takes 25 seconds and costs 930MB of additional memory. Even more time is then spent in the three foreach() loops that come next, taking additional gigabytes of memory.
During this the CPU is around 98% most of the time, only sometimes hitting 100% for short periods.
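
For reference, a sketch of this kind of time/memory instrumentation with synthetic data (usort() standing in for libphutil's isort(); the real input is the 2,470,422-row result of the path-change query above):

<?php

// Sketch with synthetic data: measure the time and memory cost of sorting
// a large array of rows. usort() stands in for libphutil's isort().
$rows = array();
for ($i = 0; $i < 500000; $i++) {
  $rows[] = array('id' => mt_rand());
}

$t0 = microtime(true);
$m0 = memory_get_usage(true);

$sorted = $rows;   // isort() returns a new array, so copy before sorting
usort($sorted, function($a, $b) {
  return $a['id'] - $b['id'];
});

printf(
  "sort took %.2f s and %.1f MB extra\n",
  microtime(true) - $t0,
  (memory_get_usage(true) - $m0) / 1048576);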

The number of rows is legitimate. This commit was an import of a large Magento Enterprise webshop.

When I leave this to run for a while, the script halts with this exception:

EXCEPTION: (Exception) Trying to update Task 882410 (PhabricatorRepositoryCommitOwnersWorker) after lease expiration! at [/var/vhosts/phabricator/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:62]
  #0 PhabricatorWorkerActiveTask::checkLease() called at [/var/vhosts/phabricator/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:24]
  #1 PhabricatorWorkerActiveTask::setLeaseDuration(60) called at [/var/vhosts/phabricator/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:145]
  #2 PhabricatorWorkerActiveTask::executeTask() called at [/var/vhosts/phabricator/phabricator/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:19]
  #3 PhabricatorTaskmasterDaemon::run() called at [/var/vhosts/phabricator/libphutil/src/daemon/PhutilDaemon.php:85]
  #4 PhutilDaemon::execute() called at [/var/vhosts/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php:118]

Could this be what is crashing the task without isReady() noticing?

I have added a "return" on the first line of the checkLease() function in ./src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php. The daemon has been behaving better now.

root     27746  0.0  0.1 259392 20224 pts/0    S    08:25   0:03 php /var/vhosts/phabricator/phabricator/scripts/daemon/phd-daemon PhabricatorRepositoryPullLocalDaemon --daemonize --log=/var/vhosts/phabricator/logs/daemons.log --phd=/var
root     27747  0.4  0.1 263380 28628 ?        Ss   08:25   1:46  \_ php /var/vhosts/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorRepositoryPullLocalDaemon --load-phutil-library=/var/vhosts/phabricator/arcanist/sr
root     27749  0.0  0.1 259392 20216 pts/0    S    08:25   0:00 php /var/vhosts/phabricator/phabricator/scripts/daemon/phd-daemon PhabricatorGarbageCollectorDaemon --daemonize --log=/var/vhosts/phabricator/logs/daemons.log --phd=/var/vh
root     27750  0.0  0.1 261640 26772 ?        Ss   08:25   0:00  \_ php /var/vhosts/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorGarbageCollectorDaemon --load-phutil-library=/var/vhosts/phabricator/arcanist/src -
root     27754  0.0  0.1 259392 20192 pts/0    S    08:25   0:00 php /var/vhosts/phabricator/phabricator/scripts/daemon/phd-daemon PhabricatorTaskmasterDaemon --daemonize --log=/var/vhosts/phabricator/logs/daemons.log --phd=/var/vhosts/p
root     27755  0.0  0.2 274200 39972 ?        Ss   08:25   0:05  \_ php /var/vhosts/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorTaskmasterDaemon --load-phutil-library=/var/vhosts/phabricator/arcanist/src --load-

// edit
5 days later it is still behaving; though the runtime is still a lot higher than yours, the CPU hasn't spiked anymore.

root     27746  0.0  0.1 259392 20268 ?        S    Mar26   1:07 php /var/vhosts/phabricator/phabricator/scripts/daemon/phd-daemon PhabricatorRepositoryPullLocalDaemon --daemonize --log=/var/vhosts/phabricator/logs/daemons.log --phd=/var
root     27747  0.4  0.1 263380 28680 ?        Ss   Mar26  31:18  \_ php /var/vhosts/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorRepositoryPullLocalDaemon --load-phutil-library=/var/vhosts/phabricator/arcanist/sr
root     27749  0.0  0.1 259392 20268 ?        S    Mar26   0:07 php /var/vhosts/phabricator/phabricator/scripts/daemon/phd-daemon PhabricatorGarbageCollectorDaemon --daemonize --log=/var/vhosts/phabricator/logs/daemons.log --phd=/var/vh
root     27750  0.0  0.1 261896 26924 ?        Ss   Mar26   0:00  \_ php /var/vhosts/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorGarbageCollectorDaemon --load-phutil-library=/var/vhosts/phabricator/arcanist/src -
root     27754  0.0  0.1 259392 20248 ?        S    Mar26   0:12 php /var/vhosts/phabricator/phabricator/scripts/daemon/phd-daemon PhabricatorTaskmasterDaemon --daemonize --log=/var/vhosts/phabricator/logs/daemons.log --phd=/var/vhosts/p
root     27755  0.0  1.2 439076 204540 ?       Ss   Mar26   2:13  \_ php /var/vhosts/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorTaskmasterDaemon --load-phutil-library=/var/vhosts/phabricator/arcanist/src --load-

The problem seems to be that the exception thrown by checkLease() is not being properly handled?

That is too bad. Although my change does fix the problem, checkLease() is now completely disabled, and I am not sure what its purpose is.

I do recommend handling the exception thrown by checkLease(). My issue should happen to anyone who has a single task exceed its lease time.
There is already a comment in your code in PhabricatorWorkerActiveTask::executeTask() saying:

// We do this outside of the try .. catch because we don't have permission
// to release the lease otherwise.
$this->checkLease();

Seems this isn't handled elsewhere yet.
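
For illustration, a standalone sketch of the kind of handling being suggested (hypothetical code, not Phabricator's actual worker implementation; FakeTask stands in for PhabricatorWorkerActiveTask):

<?php

// Hypothetical sketch, not Phabricator code: FakeTask stands in for
// PhabricatorWorkerActiveTask. The point is only the control flow: catch
// the lease-expiration exception instead of letting it escape the worker.
class FakeTask {
  private $leaseExpires;

  public function __construct($lease_seconds) {
    $this->leaseExpires = time() + $lease_seconds;
  }

  public function checkLease() {
    if (time() > $this->leaseExpires) {
      throw new Exception('Trying to update task after lease expiration!');
    }
  }

  public function executeTask() {
    sleep(2);             // pretend the task is slow, as with the huge commit
    $this->checkLease();  // this is what throws on long-running tasks
  }
}

$task = new FakeTask(1);  // lease shorter than the work takes
try {
  $task->executeTask();
} catch (Exception $ex) {
  // Fail or re-queue the task (ideally with a longer lease) and move on,
  // rather than letting the taskmaster churn on it.
  echo 'Task failed: '.$ex->getMessage()."\n";
}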

Either way, thank you for all your efforts!

I know it's an old thread, but I am encountering what is, as far as I can tell, exactly the same issue. I'm on the latest version of trunk (~Oct 26) with PHP 5.5.9 on Linux 3.13.0. I also observed the issue on the May 30 version.

Pieces that are the same:

  • taskmaster daemon
  • strace shows the process spinning in an identical rt_sigprocmask block/unblock loop (about 500 per second)
    • rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) followed by rt_sigprocmask(SIG_SETMASK, [], NULL, 8)
    • note the initial post mistakenly attributes 8 as a file descriptor, but in the rt_sigprocmask call it's just a struct size; there are no fds involved
  • 100% usage of a CPU core
  • unbounded memory usage growth (it grows steadily and I've not seen it cap out)

I observe the issue if I start the daemons normally (bin/phd start) or if I launch the taskmaster daemon explicitly (bin/phd launch taskmaster).
Notably, though, I do *not* observe the behavior if I start the daemon under debug (bin/phd debug taskmaster).

I have not yet gotten xhprof working, nor have I tried to apply the change above that neuters checkLease().