
`phd-daemon` crash after automatic restart following config change
Closed, Resolved · Public

Description

In T10786, I mentioned seeing a phd-daemon crash on our QA box, which led me to want to use process supervision. @epriestley expressed interest in getting more details. We just observed another crash, but this time on our production phab instance. I tried to grab what log information I could, but unfortunately there doesn't appear to have been a stack trace from the actual crash.

The rough timeline appears to be:

  • at 17:19 UTC, a bunch of `No such daemon` errors appear in /var/tmp/phd/log/daemons.log.
  • at some point after this, phd-daemon disappears
  • I have daemontools set to poll the output of phd status every 5 minutes (see the sketch after this list)
  • at roughly 17:24 UTC, daemontools notices that there are no daemons running, and executes a phd start
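
For reference, the check amounts to something like the following sketch (illustrative only; the paths, and the assumption that phd status exits nonzero when no daemons are running, are mine rather than taken verbatim from the actual script):

<?php
// Hypothetical daemontools check script (a sketch, not the actual
// script from this install). Under daemontools, "supervise" re-runs
// the script when it exits, so check-then-sleep gives a 5-minute poll.

// Assumption: "phd status" exits nonzero when no daemons are running.
exec('/path/to/phabricator/bin/phd status', $output, $err);

if ($err !== 0) {
  // No daemons found: start them back up.
  exec('/path/to/phabricator/bin/phd start');
}

sleep(300);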

I'm including all of daemons.log around the time window (log attached). I didn't find anything else that was relevant in other logs.

Here is the version information of our install:

{
  "phabricator": "8b56e0082bff4807ea1ea275741d5f145804e0ee",
  "libphutil": "76425eaa812572cc02487db79f2dd43d93e3085f",
  "arcanist": "1439aaa871837031faa1ef26b81f1fb08e4a41e7"
}

Please let me know if there is any other data that you would like, or any more instrumentation that I can add.

Thanks!

Event Timeline

In https://<your.install.com>/config/history/, is there a configuration change of any kind shortly before 17:19 UTC?

If you send the process a SIGHUP, does it exit completely?

If you make a configuration change, does the process exit a minute or two later?

The diagnostically useful information from the trace is that this:

PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]

...means we entered this block:

if ($should_reload) {
  $this->didReceiveReloadSignal(SIGHUP);
  $should_reload = false;
}

The $should_reload flag gets set if an extension tells us to reload the daemons.

Normally, the only module that is loaded is PhabricatorDaemonOverseerModule, which checks for configuration changes every 10 seconds and says "yes, reload the daemons" when it sees one.

So I'd suspect configuration changed, the module observed the change and told the overseer to reload, and then the reload failed (?) and killed the overseer (??) for some as-yet-unknown reason. But if that was the root cause, we should be able to reproduce it on-demand and get to the bottom of it.
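
As a rough illustration of that mechanism, a module looks something like this (a sketch only; it assumes the PhutilDaemonOverseerModule extension point behaves as described above, and the marker-file check is a hypothetical stand-in for the real module's configuration-version comparison):

<?php
// Illustrative overseer module (a sketch, not Phabricator's actual
// PhabricatorDaemonOverseerModule). The overseer periodically calls
// shouldReloadDaemons(); a "true" return is what ultimately sets
// $should_reload in the block quoted above.
final class ExampleConfigWatchModule extends PhutilDaemonOverseerModule {

  private $lastVersion;

  public function shouldReloadDaemons() {
    // Hypothetical stand-in for "did configuration change?": watch the
    // mtime of a marker file. The real module compares configuration
    // versions instead.
    $version = @filemtime('/var/tmp/phd/config.marker');

    if ($this->lastVersion === null) {
      $this->lastVersion = $version;
      return false;
    }

    if ($version !== $this->lastVersion) {
      $this->lastVersion = $version;
      return true; // tell the overseer to reload the daemons
    }

    return false;
  }
}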

Hmm, this is pretty interesting. There was a configuration change right at the time of impact:

[Screenshot: Screen Shot 2016-04-14 at 2.23.47 PM.png (139×773 px, 32 KB), showing the configuration change history around the time of impact]

@dylan was testing a syntax highlighting change by adding it via the UI. Here are the contents of his change:

[Screenshot: Screen Shot 2016-04-14 at 2.24.02 PM.png (148×889 px, 12 KB), showing the contents of the new syntax highlighting config]

The timestamps line up with phd-daemon receiving a restart signal at the time that Dylan added this new config.

However, when I manually re-add this config, I don't see lines like these appear in the log (the following are all from the first incident earlier today):

areitz@phabricator1.nw:~$ grep didReceiveReloadSignal /var/tmp/phd/log/daemons.log 
[14-Apr-2016 17:19:24 UTC]   #9 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]
[14-Apr-2016 17:19:33 UTC]   #10 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]
[14-Apr-2016 17:19:33 UTC]   #9 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]

Further, sending a HUP to phd-daemon doesn't appear to do anything:

areitz@phabricator1.nw:~$ ps -ef |grep phd-daemon
phd       1991     1  0 17:24 ?        00:00:36 php ./phd-daemon
areitz    5761 30176  0 21:40 pts/1    00:00:00 grep --color=auto phd-daemon

areitz@phabricator1.nw:~$ sudo -u phd kill -HUP 1991

areitz@phabricator1.nw:~$ date
Thu Apr 14 21:40:34 UTC 2016

areitz@phabricator1.nw:~$ grep didReceiveReloadSignal /var/tmp/phd/log/daemons.log 
[14-Apr-2016 17:19:24 UTC]   #9 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]
[14-Apr-2016 17:19:33 UTC]   #10 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]
[14-Apr-2016 17:19:33 UTC]   #9 PhutilDaemonOverseer::didReceiveReloadSignal(integer) called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:203]

At this point, it's a little bit puzzling that I can't reproduce this issue.

The log messages themselves are a long-standing bug: I think they come from a race condition which I had believed only produced log noise, with no other ill effects.

It's possible that you got "lucky" with the original change (i.e., lost a race, cascading into the death of the daemon somehow) and everything is working normally when you try to reproduce things.

I'm planning to do some work on the daemons soon in conjunction with T10756, so I can take a deeper look at this and might be able to reproduce the issue, or suggest some change which can force the race to occur so we can reproduce it more reliably on your machine.

I'd also planned to actually fix the race, but maybe I should avoid doing that until we figure this out. Of course, if the problem doesn't reproduce when things work normally, maybe just fixing it and not finding a root cause is fine.

epriestley renamed this task from `phd-daemon` crash on our production phab instance to `phd-daemon` crash after automatic restart following config change. (Apr 15 2016, 1:07 PM)

D15975 should fix the `(Exception) No such daemon "39084:6ra7uz"!` error. I'm not sure whether it will do anything for stability, but it should reduce noise. I didn't uncover anything interesting in resolving it (it wasn't really even a race, just a fairly straightforward set of state management bugs).

@scode has a possibly-somewhat-similar scenario (albeit with no apparent correlation to configuration changes) that I'm also looking into, but we're likely some ways away from narrowing it down into a reproduction case.

Q402 has a possibly-related situation which may be easier to reproduce. I suspect that may be some unrelated environmental issue, but maybe they're in the same vein.

D15977 improves some behaviors slightly and adds more extensive logging for signal handling. I don't think I caught anything with real meat to it.

It likely partially resolves the issue in Q402 (I did manage to reproduce some flavor of that one), but it also points a finger at this being signal-related.

There's a very small possibility that D16110 fixed a race condition which was causing this, but that seems like quite a stretch.

epriestley claimed this task.

We were never able to reproduce this convincingly and, as far as we know, the reporting install no longer uses Phabricator, so I'm going to close this out.