
`phd-daemon` crash after automatic restart following config change
Closed, Resolved · Public

Description

In T10786, I mentioned seeing a phd-daemon crash on our QA box, which led me to want to use process supervision. @epriestley expressed interest in getting more details. We just observed another crash, but this time on our production phab instance. I tried to grab what log information I could, but unfortunately there doesn't appear to have been a stack trace from the actual crash.

The rough timeline appears to be:

  • at 17:19 UTC, a bunch of `No such daemon` errors appear in /var/tmp/phd/log/daemons.log.
  • at some point after this, phd-daemon disappears
  • I have daemontools set to poll the output of phd status every 5 minutes (see the sketch after this list)
  • at roughly 17:24 UTC, daemontools notices that there are no daemons running, and executes a phd start
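
For reference, the check amounts to something like the following sketch (illustrative only; the paths, and the assumption that phd status exits nonzero when no daemons are running, are mine rather than taken verbatim from the actual script):

<?php
// Hypothetical daemontools check script (a sketch, not the actual
// script from this install). Under daemontools, "supervise" re-runs
// the script when it exits, so check-then-sleep gives a 5-minute poll.

// Assumption: "phd status" exits nonzero when no daemons are running.
exec('/path/to/phabricator/bin/phd status', $output, $err);

if ($err !== 0) {
  // No daemons found: start them back up.
  exec('/path/to/phabricator/bin/phd start');
}

sleep(300);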

I'm including all of daemons.log around the time window (log attached). I didn't find anything else that was relevant in other logs.

Here is the version information of our install:

{
  "phabricator": "8b56e0082bff4807ea1ea275741d5f145804e0ee",
  "libphutil": "76425eaa812572cc02487db79f2dd43d93e3085f",
  "arcanist": "1439aaa871837031faa1ef26b81f1fb08e4a41e7"
}

Please let me know if there is any other data that you would like, or any more instrumentation that I can add.

Thanks!

Event Timeline

In https://<your.install.com>/config/history/, is there a configuration change of any kind shortly before 17:19 UTC?

If you send the process a SIGHUP, does it exit completely?

If you make a configuration change, does the process exit a minute or two later?

The diagnostically useful information from the trace is that this:

PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]

...means we entered this block:

if ($should_reload) {
  $this->didReceiveReloadSignal(SIGHUP);
  $should_reload = false;
}

The $should_reload flag gets set if an extension tells us to reload the daemons.

Normally, the only module that is loaded is PhabricatorDaemonOverseerModule, which checks for configuration changes every 10 seconds and says "yes, reload the daemons" when it sees one.

So I'd suspect configuration changed, the module observed the change and told the overseer to reload, and then the reload failed (?) and killed the overseer (??) for some as-yet-unknown reason. But if that was the root cause, we should be able to reproduce it on-demand and get to the bottom of it.
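
As a rough illustration of that mechanism, a module looks something like this (a sketch only; it assumes the PhutilDaemonOverseerModule extension point behaves as described above, and the marker-file check is a hypothetical stand-in for the real module's configuration-version comparison):

<?php
// Illustrative overseer module (a sketch, not Phabricator's actual
// PhabricatorDaemonOverseerModule). The overseer periodically calls
// shouldReloadDaemons(); a "true" return is what ultimately sets
// $should_reload in the block quoted above.
final class ExampleConfigWatchModule extends PhutilDaemonOverseerModule {

  private $lastVersion;

  public function shouldReloadDaemons() {
    // Hypothetical stand-in for "did configuration change?": watch the
    // mtime of a marker file. The real module compares configuration
    // versions instead.
    $version = @filemtime('/var/tmp/phd/config.marker');

    if ($this->lastVersion === null) {
      $this->lastVersion = $version;
      return false;
    }

    if ($version !== $this->lastVersion) {
      $this->lastVersion = $version;
      return true; // tell the overseer to reload the daemons
    }

    return false;
  }
}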

Hmm, this is pretty interesting. There was a configuration change right at the time of impact:

[Screenshot: Screen Shot 2016-04-14 at 2.23.47 PM.png (139×773 px, 32 KB), showing the configuration change history around the time of impact]

@dylan was testing a syntax highlighting change by adding it via the UI. Here are the contents of his change:

[Screenshot: Screen Shot 2016-04-14 at 2.24.02 PM.png (148×889 px, 12 KB), showing the contents of the new syntax highlighting config]

The timestamps line up with phd-daemon receiving a restart signal at the time that Dylan added this new config.

However, when I manually re-add this config, I don't see lines like these appear in the log (the following are all from the first incident earlier today):

areitz@phabricator1.nw:~$ grep didReceiveReloadSignal /var/tmp/phd/log/daemons.log 
[14-Apr-2016 17:19:24 UTC]   #9 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]
[14-Apr-2016 17:19:33 UTC]   #10 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]
[14-Apr-2016 17:19:33 UTC]   #9 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]

Further, sending a HUP to phd-daemon doesn't appear to do anything:

areitz@phabricator1.nw:~$ ps -ef |grep phd-daemon
phd       1991     1  0 17:24 ?        00:00:36 php ./phd-daemon
areitz    5761 30176  0 21:40 pts/1    00:00:00 grep --color=auto phd-daemon

areitz@phabricator1.nw:~$ sudo -u phd kill -HUP 1991

areitz@phabricator1.nw:~$ date
Thu Apr 14 21:40:34 UTC 2016

areitz@phabricator1.nw:~$ grep didReceiveReloadSignal /var/tmp/phd/log/daemons.log 
[14-Apr-2016 17:19:24 UTC]   #9 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]
[14-Apr-2016 17:19:33 UTC]   #10 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]
[14-Apr-2016 17:19:33 UTC]   #9 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]

At this point, it's a little bit puzzling that I can't reproduce this issue.

The log messages themselves are a long-standing bug: I think they come from a race condition which I had believed only produced log noise, with no other ill effects.

It's possible that you got "lucky" with the original change (i.e., lost a race, cascading into the death of the daemon somehow) and everything is working normally when you try to reproduce things.

I'm planning to do some work on the daemons soon in conjunction with T10756, so I can take a deeper look at this and might be able to reproduce the issue, or suggest some change which can force the race to occur so we can reproduce it more reliably on your machine.

I'd also planned to actually fix the race, but maybe I should avoid doing that until we figure this out. Of course, if the problem doesn't reproduce when things work normally, maybe just fixing it and not finding a root cause is fine.

epriestley renamed this task from `phd-daemon` crash on our production phab instance to `phd-daemon` crash after automatic restart following config change. (Apr 15 2016, 1:07 PM)

D15975 should fix the `(Exception) No such daemon "39084:6ra7uz"!` error. I'm not sure whether it will do anything for stability, but it should reduce noise. I didn't uncover anything interesting in resolving it (it wasn't really even a race, just a fairly straightforward set of state management bugs).

@scode has a possibly-somewhat-similar scenario (albeit with no apparent correlation to configuration changes) that I'm also looking into, but we're likely some ways away from narrowing it down into a reproduction case.

Q402 has a possibly-related situation which may be easier to reproduce. I suspect that may be some unrelated environmental issue, but maybe they're in the same vein.

D15977 improves some behaviors slightly and adds more extensive logging for signal handling. I don't think I caught anything with real meat to it.

It likely partially resolves the issue in Q402 (I did manage to reproduce some flavor of that one), but it also points a finger at this being signal-related.

There's a very small possibility that D16110 fixed a race condition which was causing this, but that seems like quite a stretch.

epriestley claimed this task.

We were never able to reproduce this convincingly and, as far as we know, the reporting install no longer uses Phabricator, so I'm going to close this out.