
Start daemons that should be running but aren't
Closed, Invalid (Public)

Description

I've recently come across a glitch where a daemon (generally the Taskmaster) sometimes stops running while the others continue. I suspect it gets killed due to memory constraints in the fairly resource-limited environment where I currently run our small instance.

The web UI shows a warning that not all daemons are running. However, bin/phd start does not resolve the issue, because some other daemons are still running. The Overseer doesn't seem to detect or fix this itself either, though it continues to run.

I have already rolled some monitoring based on the output of bin/phd status which can detect if no daemons are running and automatically start them/notify me. However, detecting the 'not all daemons that should be running are running' case is more difficult.
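A minimal sketch of how the harder "some daemons missing" case might be detected from bin/phd status. This is not the monitoring script described above. It assumes that the status output mentions each running daemon by class name, and the names in EXPECTED_DAEMONS are an assumed list for a small install; both helpers are hypothetical.

```python
import subprocess

# Assumption: the daemon class names this install is expected to run.
EXPECTED_DAEMONS = (
    "PhabricatorTaskmasterDaemon",
    "PhabricatorTriggerDaemon",
    "PhabricatorRepositoryPullLocalDaemon",
)


def missing_daemons(status_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in the status text.

    Assumption: a daemon that is listed by `bin/phd status` has its class
    name somewhere in the output, so a plain substring test is enough.
    """
    return [name for name in expected if name not in status_output]


def read_phd_status(phd_path="bin/phd"):
    """Run `bin/phd status` and return whatever it printed to stdout."""
    result = subprocess.run(
        [phd_path, "status"],
        capture_output=True,
        text=True,
        check=False,  # the exit status is not interpreted here
    )
    return result.stdout
```

Calling missing_daemons(read_phd_status()) from a cron job and alerting on a non-empty list would cover the partial-failure case, provided the assumed names match what the install actually prints.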

Ideally, Phabricator would look after itself in this circumstance, i.e. the Overseer would detect and reinstate killed or unexpectedly dead instances. Failing that, it would help if this situation could at least be detected from the command line (not just the web UI), so that external means could be used to fix it up. Preferably, running bin/phd start would resolve the issue without requiring bin/phd restart.

phabricator
    8f7983a5be3a56db5b79dc7c3a0eb470f1d7ca02 (Sat, Mar 25) (branched from b4effdf26c3e7d5de0d010cf14626c5d8d404e04 on origin) 
arcanist
    60aaee0ed3f5a1e4384ac7d7f2efd2c64cecbe44 (Sat, Mar 25) (branched from d1db9a72b552151613a918e3d49fa72433387a68 on origin) 
phutil
    b133c277014868d476f08b4ebecde2ea795509e4 (Sat, Mar 25) (branched from c0bc116bedc895fd617799a13549f8707edfd3fb on origin)

Event Timeline

I made a diff (D17780) that adds bin/phd check, which runs the same setup check as the web UI, writes the result to the console, and exits with an indicative status. This at least allows the circumstance to be detected, and I can then fix the problem with bin/phd restart. This might be good enough. Even though using bin/phd start or having Phabricator self-repair through the Overseer would be better, the problem is likely too rare to warrant work on more complex options.
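A rough sketch of how external monitoring could consume such a command. It assumes the bin/phd check subcommand from D17780 is present on the install and that an exit status of 0 means the check passed; the diff only promises "an indicative status", so the exact codes are an assumption. The function names and the alert hook are hypothetical.

```python
import subprocess


def daemons_healthy(command=("bin/phd", "check"), run=subprocess.run):
    """Return True when the check command exits with status 0.

    Assumption: `bin/phd check` (proposed in D17780) exits with 0 when the
    daemon setup check passes and with a non-zero status otherwise.
    """
    completed = run(list(command), capture_output=True, text=True, check=False)
    return completed.returncode == 0


def monitor_once(check=daemons_healthy, alert=print):
    """Run one health check; call `alert` with a message if it fails."""
    if check():
        return True
    alert("phd check reported a problem; investigate before restarting")
    return False
```

The runner and the alert function are injectable so the logic can be exercised without a Phabricator install. The sketch only alerts and deliberately does not call bin/phd restart.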

It is intentional that daemons shut down when they aren't doing anything. See T12298. They will be restarted automatically when work becomes ready.

Right now on this install, every daemon on all four hosts is hibernating. This is normal:

Screen Shot 2017-04-23 at 10.44.15 AM.png (540×700 px, 112 KB)

It is not expected that the UI shows a warning, but I can't reproduce this after D17397. For example, this install doesn't show a warning although no daemon processes are currently alive.

I've adjusted my monitoring to just alert me instead of restarting the daemons when there's an issue, so if/when this happens again I can investigate more fully and provide more information. The code from D17397 had definitely landed when I experienced this, as I saw it in the source code when I investigated. I've upgraded to current stable now.

One thing I noticed: all three daemons (Taskmaster, Trigger, PullLocal) are currently listed as "Waiting" on my install, and they also show up in the output of phd status. When this problem occurred, I didn't look at the Daemons app in the web UI, but I did notice that Taskmaster was not listed in the phd status output. I'm guessing that behaviour is not normal and perhaps provides a little insight into what's going on here.

We aren't going to implement a bin/phd start-missing-daemon command.

If there's a bug which causes daemons to die, file a reproducible bug report. It's not clear that any of the behavior described here is a bug.

Agreed. I haven't experienced the problem since I upgraded, so I think it was related to an earlier fix, even if it wasn't the identified fix (which should have already been in my install when I did have the problems). There's nothing that needs to be addressed here.