Page MenuHomePhabricator

Differentiate "Waiting" from "Restarting after an error" on the daemon console
Open, LowPublic

Description

Currently if a daemon encounters an error, it will restart itself. While it's restarting you see a little "waiting" icon on the daemon console. You also always see this little waiting icon while the daemon is just sitting there not doing anything at all. These two statuses should be differentiated so that you can look at the daemon console and know that something is wrong.

@epriestley can you help walk Mike through your thoughts on this, I'd like him to attempt to make the actual changes if you think they're appropriate.

Event Timeline

yelirekim created this task.

The daemon log (at /daemon/) shows status icons for running (or recently-known) daemons. This stuff lives in PhabricatorDaemonLogViewController. The status PhabricatorDaemonLog::STATUS_WAIT represents two (slightly-more-than-two) different scenarios:

  1. The daemon recently noticed it didn't have any work to do, so it went to sleep for a while to conserve RAM.
  2. The daemon hit an unexpected, serious error and exited. We're waiting a few moments to restart it. This backoff is, broadly, so that if a daemon gets stuck in a loop where it causes some kind of side effect (sending mail, etc) and then fails, you wake up to a couple hundred messages/side-effects to clean up instead of a couple million. (In some cases errors may also be load-related, and backing off may fix the problem.)
  3. This is really a sub-case of (2): if a daemon failed to respond to a heartbeat, we may assume that it has hung and kill it. From there, it's like it hit a serious error and exited on its own.

But why do we "conserve RAM"?

In the SAAS cluster we (Phacility) operate to make money (a kind of legacy bitcoin), the bottleneck for running lots of small instances is available RAM for the daemons. We experimented with giving away free hosted instances for a while, and our hardware cost for free instances was dominated by RAM usage on daemon hosts. Almost all free instances are completely unused, so we can put tens of thousands of them on each web or db node, but (with the current architecture) each instance must have dedicated daemons running even if no one is using it. We were able to get from about ~100 instances/host to ~400 instances/host by making daemons exit when they don't have any work to do. See T12298 for discussion. See also T12298#213181 for a brief mention of this particular issue.

The "free instances" experiment was ultimately unsuccessful and we no longer offer free instances.

However, we're still saddled with this very complicated mess in the daemons where they try to go to sleep when they don't have any work to do. (Internally, this is called "Hibernating".) This feature is complex and no longer serves any use case for any install, and never served any use case outside of SAAS hosting.

One possible way to fix this UI is to remove the hibernation mode completely so that situation (1) never occurs and the existing icon always means we're in (2). This is probably the right fix. However, I spent a lot of time making hibernation work so I'm hesitant to throw it away because I'm very emotionally attached to it. (To some degree, adding more code to support it digs this hole even deeper, but I expect the size of the change here will be small.)

Another possible fix might be to try to answer "is something wrong?" better. For example, we could just check the error log and see if there were any errors in the last 15 minutes, then show either:

No errors recently.

...or:

Recent errors, check bin/phd log for details.

We can't show the actual errors in the web UI since they have a tendency to include policy-violating details, secret key material, password reset links, ASCII selfies of a personal nature, etc., but we can show that they exist.

(In our own operations we currently use daemon queue length as a proxy for "are the daemons healthy?", but this is a less accurate proxy if you have long-running jobs -- most often, builds.)


In any case, the proposed change probably puts us in a better place, even if it is later followed by taking hibernation out behind the shed and ending its suffering and/or additional changes aimed at making operations questions easier to answer.

Unfortunately, I think this is a difficult starter task, in the sense that it's hard to explain how to accomplish just enough so that you can figure out the actual implementation and feel a sense of accomplishment and learn something about how the system works. I'll go through the broad strokes:

  • The daemons live in libphutil/ but all the monitoring and logging stuff lives in phabricator/. Among other reasons, one goal of this separation is to improve daemon overseer process stability by reducing how much code overseers depend on.
  • The libphutil/ part and the phabricator/ part are linked through a very old "event" system. This is a pattern we used for hooks in ~2011 that looks a lot like hooks/events in other languages, but which we ultimately moved away from. Essentially all modern modularity is achieved through runtime class discovery, which turned out to be much easier to work with and a better fit for pretty much every problem we actually run into.
  • In Phabricator, the part of the code that responds to these events is PhabricatorDaemonEventListener. It receives events from the daemons and writes logs into the database.
  • In libphutil/, the part of the code that sends events is in PhutilDaemonHandle. Note that "PhutilDaemonHandle" represents a sort of "slot" in a daemon pool, not necessarily a single subprocess. Over its lifetime, one DaemonHandle may manage many different subprocesses, restarting them after they die or hibernate. It's always in charge of either zero subprocesses (hibernating, waiting, starting up) or one subprocess (running), though.

I think you can make these changes to implement the desired behavior, all in phabricator/ (which is generally safer to touch, since it has display/monitoring code rather than process management code):

  • In PhabricatorDaemonLog, add a new constant like STATUS_HIBERNATING.
  • In PhabricatorDaemonLogViewController, add an icon and label for this status.
  • In PhabricatorDaemonEventListener->handleLogEvent(), change this logic:
switch ($type) {
  case 'WAIT':
    $current_status = PhabricatorDaemonLog::STATUS_WAIT;
    break;
  default:
    $current_status = PhabricatorDaemonLog::STATUS_RUNNING;
    break;
}

We're going to do this instead:

  • When we see a WAIT ("the DaemonHandle is going to restart the process in a bit"), we're going to check whether we most recently saw a 'DONE' or 'FAIL' event.
  • If we saw a 'DONE', the process exited with no error code, so this is STATUS_HIBERNATING.
  • If we saw a 'FAIL' (or didn't see a 'DONE' since the last 'WAIT'), assume this is STATUS_WAIT.

We could also look for 'ZZZZ' and 'HANG' but I think they'll only make things more complicated. I expect we see these sequences:

  • Daemon goes into hibernation: ZZZZ, DONE, WAIT.
  • Daemon exits with an error: FAIL, WAIT.
  • Daemon killed by hang detection: HANG, WAIT.
  • Daemon has no work but is the 2..Nth member of a pool, so it exits completely instead of hibernating: DONE, EVENT_WILL_EXIT.

So it should be sufficient to say "DONE, WAIT = hibernate; anything else = wait".

I think we can do this like this, roughly:

private $sawDone = array();

...

if ($type === 'DONE') {
  $this->sawDone[$daemon_id] = true;
}

if ($type === 'WAIT') {
  if (isset($this->sawDone[$daemon_id])) {
    $status = HIBERNATE;
  } else {
    $status = WAIT;
  }
  unset($this->sawDone[$daemon_id]);
} else {
  $status = RUNNING;
}

To test this:

  • To make a daemon hibernate, run bin/phd start with nothing in queue and wait for about 15 seconds.
  • To make a daemon exit with an error, add throw new Exception(...) to PhabricatorWorkerActiveTask->executeTask() above $worker->executeTask(), restart the daemons, and comment on something. Or you can poison a specific worker instead, like FeedPublisherWorker.

If that doesn't work, you might try adding some logging to PhabricatorDaemonEventListener->handleLogEvent() to see if the sequence of events you're receiving matches my expectations or not. It's hard to log out of this context because normal tools (var_dump(), phlog(), throw new Exception(), etc) won't necessarily end up anywhere very helpful, but you can do this to create an ad-hoc logfile:

Filesystem::appendFile('/tmp/daemon-events.log', "...\n");

It may be useful to log the Daemon ID, since bin/phd start will send multiple processes through this code simultaneously. You may be able to use bin/phd debug task instead to simplify the situation somewhat.