Support for process supervision for phabricator daemons
Closed, ResolvedPublic

Description

I'm currently working on splitting our phab daemons out across multiple machines, per:

https://secure.phabricator.com/book/phabricator/article/managing_daemons/#multiple-machines

As part of that work, I have discovered that it is possible that phd-daemon process can die, leaving our daemon processes unsupervised. Thus, we would like to have process supervision/monitoring for the phabricator daemon processes. We currently use daemontools for this, but unfortunately daemontools assumes that it will be doing the daemonizing.

And thus we arrive at the feature request. Currently the only way to make a phabricator daemon run in the foreground is to use the debug argument to phd. But this also routes all log data to stdout, which isn't what we want. Instead, something like phd launch --foreground <daemon type> seems ideal. This will allow the supervisor to handle the daemonizing, and keep logging going to the usual place.

I have a rough patch that adds this functionality, that I'd be happy to contribute.

areitz created this task.Apr 12 2016, 10:11 PM
areitz added a subscriber: eadler.Apr 12 2016, 11:06 PM

Hey @eadler, thanks for looking into this. I feel this ticket is different from T4181. The former covers starting phd at boot, this ticket covers using process supervisors (like daemontools) which monitor processes on an ongoing basis.

jshirley reopened this task as Open.Apr 12 2016, 11:12 PM
jshirley added a subscriber: jshirley.

Reopening, this is definitely different from T4181

epriestley added a subscriber: epriestley.EditedApr 13 2016, 2:08 AM

This is the same as T4181, sort of, except that merged somewhere else and isn't the same as itself, if you will.

The documentation section you refer to is out of date, and will be removed soon by D15689. It will vanish completely in less than 24 hours, when the documentation regenerates. T10751 discusses the pathway toward high availability. T10756 is the current state of the world for daemons.

Phabricator's daemon supervisor has many features which other supervisors generally do not (and, in some cases, can not) have, because it can talk directly to the daemons and to Phabricator. These features include:

  • autoscaling of daemon pools;
  • daemon hang detection;
  • restarting after configuration changes;
  • centralized logging;
  • custom delays before restart after failure;
  • graceful restart signal handling and other custom signal handling.

These features largely can not be made to work if the daemons are managed by a different supervisor process.

Beyond these features, phd start knows which daemons to launch, and this changes over time. The Trigger daemon is new in the last year, and replaces two daemons that existed before it. Although I don't currently have any plans to change the composition of the daemon pool in the near future, it has changed once or twice a year for the past few years. (The next likely change is the introduction of a Bot daemon as part of T10364 or future work.)

I don't think I've seen phd-daemon processes exit in production a very, very long time -- probably more than two years? Maybe more than three? In T7811, about a year ago, there was a memory pressure issue on a production host which might have seen us lose some processes to the OOM killer, but I don't recall any other cases where it wasn't a symptom of a different, more severe issue offhand. I can't recall reports of this from other installs in a long time either, except T8884, which sounds like a different issue than the one you describe (phd-daemon itself did not exit) and which I can not reproduce. The expectation is that these processes should not die. If you've found a way to get them killed, I'd strongly prefer to fix whatever is causing them to die over supporting foregrounding.

That said, this request comes up frequently enough that it may be something I should just give up on and allow users to make tradeoffs to use a supervisor they are comfortable with over the upstream supervisor.

One reason I've resisted this is that most users making this request seem to simply be motivated purely by the comfort of using something they're familiar with (they have not experienced any issues with the daemons). Many of the features the upstream overseer provides were put in place to reduce our support burden, by stopping us from receiving reports where the root cause is strict administrator error, like not restarting the daemons after a configuration change. Before we implemented these features we regularly received reports rooted in these kind of user errors, and receive almost none now.

Ultimately, none of the features our supervisor provides are strictly required, and it's possibly easier for everyone if we just say "do not do this; you void the warranty by doing so and we will not support you; here's how to do it if you're certain you want to continue". I'd be warmer on this possibility if we had a way to detect that you've switched overseers when you file a bug report so we can dodge the "I changed config and didn't restart the daemons" support load that the existence of such a flag will expose us to again, but I can't come up with one offhand.

Basically, this is a --shoot-myself-in-the-foot flag for many less-experienced or less-thorough administrators, who were empirically unable to manage the daemons reliably when instructions like "remember to restart the daemons after making configuration changes" were provided in text and on-screen instructions rather than implemented automatically by the supervisor.

At a minimum, I can write all of this up and explain why we don't have such a flag. There are some measures we can take to make it explicit that you're voiding the warranty, like disabling the flag in the upstream and requiring you to fork to enable it, although that's less clean here than it is with mechanisms like applying final to everything.

Do you have any additional information about how the phd-daemon overseer process can exit abnormally?

areitz closed this task as Resolved.Apr 14 2016, 12:50 AM
areitz claimed this task.

@epriestley thanks for the feedback. I've read through the new docs, and I think have a section explaining phd-daemon, and incorporating some of the notes from T4181 (about setting a startup script for your OS) could be really useful.

In general, there will come a point where the phabricator code base has to interact with the greater world. Even if you shipped a pre-cooked AMI, there would still be some need for operators to have it integrate with their monitoring systems, for example. At Stripe, we've standardized on daemontools for process control, and we have built expertise & tooling around that. And I'll switch to just having phd launch at boot, but I still have to consider how to monitor the daemons.

I think either supporting non-phd process monitors, or documenting the phabricator daemon philosophy with some examples of how to run it in production, should be sufficient.

I didn't look into why phd-daemon died on my test box over the weekend. If I see this happen again, I'll try and dig up more information for you.

Thanks!

@areitz FWIW, we use monit's ability to search for a process name. This isn't ideal but its the best we've come up with without forking the daemon code.

Yeah, I'm familiar with monit, and I think it will work decently well for this situation. Unfortunately, the current environment standardized on daemontools instead of monit.

I just had another phd-daemon crash, this time in production. I'll file what little details I have as a separate issue.

scode added a subscriber: scode.May 5 2016, 4:32 PM