
Show a setup warning when only some daemons are running
Closed, Invalid (Public)

Description

Yesterday all of our taskmaster daemons died. I didn't notice for a few hours, until people started complaining about not receiving Phabricator emails.

We have a setup check for "daemons not running", but it didn't seem to fire in this case, maybe because some of the daemons (the non-taskmaster ones) were still running.
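
To make the partial-failure case concrete, here is a minimal sketch of a check that reports which expected daemon classes have no running process. It is not how Phabricator's setup check works. The function name `missing_daemons` is hypothetical, and the sketch assumes each daemon's class name appears on its command line in the process listing.

```shell
# Sketch only: report which expected daemon classes have no running process.
# Reads a process listing (for example from `ps auxww`) on stdin; the class
# names passed as arguments are whatever this install expects to be running.
missing_daemons() {
  listing=$(cat)
  for class in "$@"; do
    # Assumption: the daemon class name appears on the process command line,
    # so a fixed-string match on the name is enough. Lines that mention
    # "grep" are dropped so the check does not match itself.
    if ! printf '%s\n' "$listing" | grep -v grep | grep -qF -- "$class"; then
      printf '%s\n' "$class"
    fi
  done
}
```

Running `ps auxww | missing_daemons PhabricatorTaskmasterDaemon` prints the class name if no taskmaster is running and prints nothing otherwise, so a non-empty result could be used to raise a warning even while other daemons are still up.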

Event Timeline

joshuaspence raised the priority of this task from to Needs Triage.
joshuaspence updated the task description.
joshuaspence added a project: Daemons.
joshuaspence added a subscriber: joshuaspence.
eadler added a project: Restricted Project. (Feb 23 2016, 7:30 AM)
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.
epriestley added a subscriber: epriestley.

This "shouldn't" be possible: the overseer should restart them if they fail. And if the overseer dies, all the other daemons should die.

I also haven't seen other reports of this in nearly a year, and we haven't experienced it here or in the cluster.

I'd rather fix whatever the root problem is, but this task doesn't have enough details to proceed. In particular, if I run bin/phd start locally and then start picking off taskmasters with kill, I can't get rid of them; each one comes back with a new PID a few seconds later:

epriestley@orbital ~/dev/phabricator $ ps auxww | grep -i taskmaster | grep -v grep
epriestley      32486   0.0  0.2  2620392  32188   ??  Ss    2:33PM   0:00.13 php /Users/epriestley/dev/core/lib/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorTaskmasterDaemon -l local
epriestley@orbital ~/dev/phabricator $ kill 32486
epriestley@orbital ~/dev/phabricator $ sleep 6
epriestley@orbital ~/dev/phabricator $ ps auxww | grep -i taskmaster | grep -v grep
epriestley      32550   0.2  0.2  2591720  32024   ??  Ss    2:33PM   0:00.12 php /Users/epriestley/dev/core/lib/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorTaskmasterDaemon -l local
epriestley@orbital ~/dev/phabricator $ kill 32550
epriestley@orbital ~/dev/phabricator $ sleep 6
epriestley@orbital ~/dev/phabricator $ ps auxww | grep -i taskmaster | grep -v grep
epriestley      32573   0.2  0.2  2591720  32060   ??  Ss    2:33PM   0:00.12 php /Users/epriestley/dev/core/lib/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorTaskmasterDaemon -l local
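
For anyone trying to reproduce the failure, the manual check above can be scripted. This is a rough sketch: the helper names `taskmaster_pids` and `new_pids` are hypothetical, and it assumes `ps auxww`-style output with the PID in the second column, as in the transcript.

```shell
# Sketch only: print the PID (second column of `ps auxww` output) of every
# taskmaster process in the listing read from stdin.
taskmaster_pids() {
  grep -i taskmaster | grep -v grep | awk '{ print $2 }'
}

# Print each PID from the second list ($2) that is absent from the first ($1).
# Both arguments are whitespace-separated PID lists.
new_pids() {
  for pid in $2; do
    case " $(echo $1) " in
      *" $pid "*) ;;                    # PID already existed before the kill
      *) printf '%s\n' "$pid" ;;        # PID is new, so a daemon was restarted
    esac
  done
}
```

To use it, capture `before=$(ps auxww | taskmaster_pids)`, kill those PIDs, sleep for a few seconds, capture `after` the same way, and run `new_pids "$before" "$after"`. Empty output would mean nothing restarted the taskmasters, which is the condition this task describes.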

If you can figure out how to reproduce this, I'm happy to look into it in greater depth, but it seems like this might be a spooky ghosts sort of issue?