
Version daemons more clearly in daemon console so it's clear when `phd reload` has taken effect
Open, Normal, Public

Description

I try to update our organization's Phabricator install every week, in some capacity. This typically requires restarting the daemons. Our organization also runs builds and performance analysis scripts which take 8+ hours in some cases. Recently, many of these builds have been running at once, at virtually all times. This leaves me hunting for times of day when I hope none, or as few as possible, of these builds are running. It is not possible for us to make the individual build steps in these plans idempotent with respect to their place in line, meaning that "resuming" a step in a build process will never be a sufficient fix.

I'm aware that this is probably an extremely un-fun thing to fix, but it's also an extremely un-fun problem for me, since I find myself needing to balance "how many people am I going to piss off by interrupting their builds" against "we need to release updates or extensions to Phabricator" against "I would like to go to sleep tonight".

Event Timeline

You can use bin/phd reload -- instead of restart -- to instruct daemons to exit and restart only after completing their current tasks, with no time limit on task completion. The semantics on this are necessarily a little fuzzy, but basically:

$ bin/phd reload
$ # wait for a long time
$ # maybe the new code is live now??

You can still end up in trouble if there's a schema change or something and the old jobs try to complete against a new schema, but most of the time this will probably work fine. You could try this to start with -- I would guess it will reduce tension between these concerns to acceptable levels on its own, and then maybe there's some additional specific stuff we could improve later on.

I'll start doing this. I guess it makes me a little uncomfortable, since the first thing I typically do once the daemons restart is make sure nothing is broken, but then again, nothing has ever been broken during a deployment. So maybe I will assume I am as smart as our past deployments have indicated I am, issue the reload, go to sleep, and then check in the morning.

One thing we could do to improve this is version daemons in /daemon/, so your "go to sleep" routine could be more like:

  • Check /daemon/, make sure PullLocal + Trigger + at least some of the Taskmasters have restarted and are running the new code.
  • Once they have, check daemon logs / etc for general normalcy.
  • Feel reasonably confident that things went through OK, and worst case is probably that whatever was holding up the remaining Taskmasters fails later but that's not really any worse than just killing it, and on average way better, and what option do you have anyway, and if they want their build results reliably maybe they should make their builds not take 8 hours.

You can probably do this anyway, but you can't really tell if things actually updated or not right now.
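To make the checklist above concrete, here is a minimal sketch of the check that versioned daemons would allow. Everything in it is an assumption for illustration: the `DaemonRecord` type, its `code_version` field, and both helper functions are hypothetical, and the daemon console does not expose this data today (which is the point of this task).

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class DaemonRecord:
    # Hypothetical record; not Phabricator's actual daemon schema.
    name: str          # e.g. "PullLocal", "Trigger", "Taskmaster"
    pid: int
    code_version: str  # assumed: version of the code the daemon loaded at start


def split_by_version(
    daemons: Iterable[DaemonRecord], expected_version: str
) -> Tuple[List[DaemonRecord], List[DaemonRecord]]:
    """Partition daemons into (running new code, still running old code)."""
    current: List[DaemonRecord] = []
    stale: List[DaemonRecord] = []
    for daemon in daemons:
        if daemon.code_version == expected_version:
            current.append(daemon)
        else:
            stale.append(daemon)
    return current, stale


def core_daemons_updated(
    current: Iterable[DaemonRecord],
    required: Tuple[str, ...] = ("PullLocal", "Trigger"),
) -> bool:
    """True if every required daemon and at least one Taskmaster is updated."""
    names = {daemon.name for daemon in current}
    return all(name in names for name in required) and "Taskmaster" in names
```

If `core_daemons_updated` returns true, the remaining entries in the stale list are the long-running Taskmasters that have not yet reached a point where they can exit.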

epriestley renamed this task from Gracefully handle builds when restarting daemons to Version daemons more clearly in daemon console so it's clear when `phd reload` has taken effect. Apr 25 2016, 9:53 PM
epriestley triaged this task as Normal priority.
epriestley added a project: Daemons.

Maybe also reasonable would be something like this:

$ bin/phd reload --and-then-wait-and-then-show-status
<10 seconds pass>
PullLocal daemon reloaded OK.
Trigger daemon reloaded OK.
3x Taskmaster daemon reloaded OK.
1 Taskmaster (pid 123) has not finished doing whatever it is doing yet (it is working on task 234, and has been for 6h 12m).

I don't know if we really need that if the daemon console works a little better, but that's plausible to build at least.
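As a rough sketch of how such a summary could be assembled, the following groups reloaded daemons by name and lists the ones still busy. The `DaemonState` type, its fields, and the exact wording of the output lines are hypothetical; no such flag or report exists in `bin/phd` as far as this thread establishes.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, List, Optional


@dataclass
class DaemonState:
    # Hypothetical snapshot of one daemon after a reload request.
    name: str
    pid: int
    reloaded: bool
    task_id: Optional[int] = None          # task a busy daemon is working on
    busy_for: Optional[timedelta] = None   # how long it has been working


def format_duration(delta: timedelta) -> str:
    """Render a duration as hours and minutes, e.g. '6h 12m'."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


def reload_report(states: Iterable[DaemonState]) -> List[str]:
    """Build one line per reloaded daemon type, then one per busy daemon."""
    states = list(states)
    lines: List[str] = []
    # Counter keeps first-seen order, so daemon types appear in input order.
    counts = Counter(s.name for s in states if s.reloaded)
    for name, count in counts.items():
        prefix = f"{count}x " if count > 1 else ""
        lines.append(f"{prefix}{name} daemon reloaded OK.")
    for s in states:
        if s.reloaded:
            continue
        detail = ""
        if s.task_id is not None and s.busy_for is not None:
            detail = (
                f" (working on task {s.task_id}"
                f" for {format_duration(s.busy_for)})"
            )
        lines.append(
            f"{s.name} (pid {s.pid}) has not finished its current task yet{detail}."
        )
    return lines
```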

We could also make bin/phd restart push anything it disrupts to the head of the queue explicitly if you only care about execution order stability, but we've seen other cases where a task is literally just running build.exe --long --hard for 12 hours and killing it at all is a problem, even if it resumes immediately.
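The "push to the head of the queue" idea can be illustrated with a toy in-memory queue. This is only a sketch of the ordering behavior: the real worker queue is not a Python `deque`, and `requeue_at_head` is a hypothetical helper name. As noted above, it does nothing for tasks that cannot tolerate being killed at all.

```python
from collections import deque
from typing import Deque, Iterable


def requeue_at_head(queue: Deque[int], disrupted: Iterable[int]) -> Deque[int]:
    """Put disrupted task IDs back at the front, keeping their relative order.

    deque.extendleft() inserts items one at a time at the left, which
    reverses them, so the input is reversed first to compensate.
    """
    queue.extendleft(reversed(list(disrupted)))
    return queue


# Tasks 2 and 3 were running when the restart hit; 5, 6, 7 were still waiting.
pending = deque([5, 6, 7])
requeue_at_head(pending, [2, 3])
```

After the call, the interrupted tasks run before anything that was queued behind them, so execution order is the same as it would have been without the restart.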