
"Phabricator Daemons Need Restarting" is too difficult to understand/debug
Closed, Resolved · Public

Description

I get this warning, even though the daemons have been started in a Docker instance and thus are always running the exact same code as the web server.

Event Timeline

hach-que raised the priority of this task to Needs Triage.
hach-que updated the task description.
hach-que added a project: Phabricator.
hach-que added subscribers: hach-que, epriestley.

It might be helpful to add a status indicator to the daemon console, so users who report issues with this warning can go to the web UI and see which daemon(s) are running with out-of-date configuration. Then we can figure out whether we have an issue in the checksumming method or an issue elsewhere.

Just to clarify for anyone else hitting this issue, the warning is about the configuration being different, not the code. In T5957, the underlying issue was a different PHABRICATOR_ENV.
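
If you want to verify this on your own install, you can compare the environment a running daemon actually has against what the web server is configured with. A rough sketch (the process name, web server, and paths are illustrative, and reading /proc generally requires root):

    # Environment of a running daemon process:
    tr '\0' '\n' < /proc/$(pgrep -f phd-daemon | head -n 1)/environ | grep PHABRICATOR_ENV
    # Versus whatever the web server sets, e.g. a SetEnv directive in an
    # Apache vhost:
    grep -r PHABRICATOR_ENV /etc/apache2/sites-enabled/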

I'll try to indicate an out-of-date configuration on the main page:

https://secure.phabricator.com/daemon/ (probably with yellow instead of green)

and then on a details page, e.g.

https://secure.phabricator.com/daemon/log/15407/

I'll list the specific hashes if they differ; if they match, just something like "they are the same!".

Additionally, T5957 offered the idea that sorting the keys might reduce false positives here; I'll probably just apply that, since it seems theoretically sound to me. (Especially with something like T4018 still to do...)
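
The false positive looks something like this (just sketching the key-order idea with jq; the file names are made up, and this is not literally how the checksum is computed):

    # Same settings, different key order: raw checksums differ.
    md5sum web-local.json daemon-local.json
    # Sort keys before hashing and the comparison becomes order-insensitive.
    jq -S . web-local.json | md5sum
    jq -S . daemon-local.json | md5sum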

T2374 might play a role here, too, in some cases.

This setup issue seems to have disappeared now. I think it was caused by the Docker instance not running bin/phd stop to shut down the daemons (it just TERMs, then KILLs, all processes before shutdown), so for about 6 minutes Phabricator still thought those out-of-date daemons were running.
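
For anyone else running daemons under Docker, an entrypoint along these lines avoids the unclean shutdown (a sketch; it assumes Phabricator lives at /srv/phabricator and phd logs to the default /var/tmp/phd/log):

    #!/bin/sh
    # Stop the daemons cleanly when Docker sends TERM, instead of letting
    # the processes be TERMed/KILLed individually.
    shutdown() {
        /srv/phabricator/bin/phd stop
        exit 0
    }
    trap shutdown TERM INT
    /srv/phabricator/bin/phd start
    # Keep PID 1 alive while the daemons run in the background.
    tail -f /var/tmp/phd/log/daemons.log &
    wait $!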

(Feel free to re-open this task if you want to make it a catch-all for out-of-date daemon stuff, but this issue is at least resolved for me)

Reopening this as a catch-all since we're still seeing some users hit this and haven't improved the diagnostics yet.

epriestley renamed this task from Getting "Phabricator Daemons Need Restarting", even though they are started in a Docker instance to "Phabricator Daemons Need Restarting" is too difficult to understand/debug. (Aug 26 2014, 11:58 PM)

@joshuaspence hit this last night, but it also had a legitimate reason (local.json differed between daemons and web).

Wow, sorry about the support impact here. :/ Working on this now...

Oh, don't worry about it! My thinking on this is that these (or at least, some of them) were probably all horrible bugs waiting to strike at a future date, and we're catching them now in a low-severity way where I can just say "oh check this ticket" instead of "sorry nothing works and we are the worst".

I'm not sure if D10367 makes this easy enough to debug... Any other ideas for enhancements here?

I think the pull request above is the only remaining unsolved mystery -- unless you're hitting this, @cburroughs? I'll follow up on the PR.

My only other ideas are:

  • Making the explanatory text milder and more detailed might help. Users seem fairly concerned, and the text is relatively direct ("Need Restarting"). We could relax the language, e.g., "Daemons May Need to be Restarted" / "It looks like you've updated configuration recently. The daemons won't pick up the changes until you restart them. ... blah blah blah". Not sure if we really need this, though.
  • Adding a section like "If you intentionally run the daemons with a different configuration, you can safely ignore this" and making the "ignore" operation easier to find (mentioned in T4331, but we don't have a great place to stick an "ignore" button) might help too. I've seen some users have difficulty finding the ignore action, and it's a bit buried. I can take a stab at this and see if I can fix the T4331 thing too.

Oh, another thought is that we could checksum only the database configuration, or checksum the different configurations separately. This would have fixed @joshuaspence's issue by letting us raise a more narrow message ("Local configuration differs").

In theory, I guess we could also show the affected daemons directly in the error message.

Nope, just hanging out and following interesting-looking tickets before our next upgrade.

Yeah, project issues are a pretty cool place to hang out.

I think a primary point of confusion is that this is measuring configuration state and not "the code is different", where the latter seems to be much more important? (I read the error message as something to do with code, because it says "out of date configuration" and not "mismatched configuration")

If a user upgrades, but there were no new configuration options upstream, will they still get the error message? Using the version numbers / Git hashes detected on the All Settings config page seems like a much better way of driving a "Daemons Are Out Of Date" setup issue, vs. the current behaviour which is more "Daemons Configuration Doesn't Match Web Server" (which no amount of restarting will directly fix).
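
For reference, the version hashes that page shows can also be read straight from the working copies, which makes it easy to check whether the daemon host and the web host are on the same commit (assuming the usual layout with all three repositories under /srv):

    for repo in libphutil arcanist phabricator; do
      git -C "/srv/$repo" log -1 --format="$repo %h %ci"
    done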

where the latter seems to be much more important?

No users have hit issues with the code being different that I've seen, but quite a few have hit issues with config being different.

If a user upgrades, but there were no new configuration options upstream, will they still get the error message?

No.

"Daemons Configuration Doesn't Match Web Server" (which no amount of restarting will directly fix).

When you make a configuration change via the web UI, the daemons don't get the new value until they are restarted. That's what this warning is about, and restarting will fix it immediately.

The most common issue is that a user configures mail via the web UI, and uses bin/mail send-test to verify that their configuration is correct, but normal mail still doesn't work. This is because the daemons are locked to the config values they had when they started, and don't see the new mail configuration. Restarting them makes them load the new configuration and fixes the issue.
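
In command form, the confusing sequence looks something like this (the recipient and message file are illustrative):

    # Mail config was just changed via the web UI. This works, because
    # bin/mail loads configuration fresh on every run:
    ./bin/mail send-test --to alice < message.txt
    # ...but mail sent by the daemons still uses the old values until:
    ./bin/phd restart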

where the latter seems to be much more important?

No users have hit issues with the code being different that I've seen, but quite a few have hit issues with config being different.

Oh okay (I've only ever hit issues because I forgot to run bin/phd restart while modifying the code, but that is probably not a usual scenario).

If a user upgrades, but there were no new configuration options upstream, will they still get the error message?

No.

"Daemons Configuration Doesn't Match Web Server" (which no amount of restarting will directly fix).

When you make a configuration change via the web UI, the daemons don't get the new value until they are restarted. That's what this warning is about, and restarting will fix it immediately.

The most common issue is that a user configures mail via the web UI, and uses bin/mail send-test to verify that their configuration is correct, but normal mail still doesn't work. This is because the daemons are locked to the config values they had when they started, and don't see the new mail configuration. Restarting them makes them load the new configuration and fixes the issue.

Oh, I didn't know that happened :/... I figured the daemons would load config from MySQL as requested, but I'm guessing that's too much of a performance hit?

Oh, I didn't know that happened :/... I figured the daemons would load config from MySQL as requested, but I'm guessing that's too much of a performance hit?

In the general case, it's not possible to change some configuration without restarting the process. Offhand, for example, if a library is removed from load-libraries, there's no way to unload it without restarting. Other options like phd.pid-directory would be complicated to change, too.

The vast majority of configuration could be changed at runtime safely and without complications, but it's hard to know where the edge cases are or what's going to break if things get swapped.

If this warning went on for a while and kept producing false positives and generally confusing everyone, we could maybe look at whitelisting reloadable configuration. The bulk of issues here have come from mail-related configuration, with a small trickle from other kinds of configuration.

There's also a performance/polling issue, but the big thing is just the peril of swapping config at runtime and having no way to be sure exactly what that will do.

Could there be some sort of watchdog daemon which does nothing except signal the daemons to restart when the config changes?

It would need to be really lightweight and not load any additional libraries, I guess?
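
Something like this, maybe (a very rough sketch: it only watches the on-disk local.json, so configuration changed via the web UI, which lives in the database, would need a different trigger; the install path is illustrative):

    #!/bin/sh
    # Hypothetical watchdog: restart the daemons whenever the local
    # configuration file changes.
    ROOT=/srv/phabricator
    last=$(md5sum "$ROOT/conf/local/local.json")
    while sleep 60; do
      cur=$(md5sum "$ROOT/conf/local/local.json")
      if [ "$cur" != "$last" ]; then
        "$ROOT/bin/phd" restart
        last=$cur
      fi
    done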

chad triaged this task as Normal priority. (Sep 2 2014, 8:34 PM)
chad edited projects, added Daemons; removed Phabricator.

Not sure if this is new, but with HEAD, toggling 'ignore' on a setup issue triggers a setup issue for 'Phabricator Daemons Need Restarting'.

Whoops... Toggling "ignore" does change the config though, so isn't this warning showing up correctly?

I guess I'm thinking the whitelist option might make sense here, as I feel like this is generally confusing everyone? Of course, it also seems like this is the sort of thing we should just totally skip anyway and push forward with phacility.com; I am not convinced there is any way to actually build this such that users do not either 1) shoot themselves in the foot with config issues or 2) feel confusion when we try to help them not shoot themselves in the foot.

Setup issues are cached; toggling ignore is probably just clearing the cache.

I want to try massaging this a little bit more before we give up on it -- I think 100% of the issues so far have been real configuration issues, I just didn't expect so many users to be running with different configuration on web vs daemons. Let me take a shot at a couple of the things from earlier and see how they feel.

Whoops... Toggling "ignore" does change the config though, so isn't this warning showing up correctly?

Oh, durr. Your explanation is correct. I thought we stored this somewhere custom for some reason.

Maybe this is good now, but if not, I think your outstanding ideas are:

  • we could checksum only the database configuration, or checksum the different configurations separately. This would have fixed @joshuaspence's issue by letting us raise a more narrow message ("Local configuration differs").
  • In theory, I guess we could also show the affected daemons directly in the error message.
  • maybe more in your dome

I haven't seen more issues from users on this recently, so I think we either survived the initial backlash or writing a novel helped.

Personally, I think the different title / issue message cleared up a lot of confusion, because it's now very clear what the problem is when the issue pops up.

I have this issue now. I did a git pull an hour ago on arcanist, libphutil and phabricator, then restarted Apache and the Phabricator daemons, but the warning is still there. I also did bin/phd stop and checked that no daemons were left; they were indeed all gone. There are no errors in the daemons.log file. One thing I checked is the APC cache (I have APC enabled), but its cache is cleared when Apache is restarted or reloaded. I have no idea what to do about it. Ignoring the warning doesn't feel right; it must be caused by something, and that should be fixed.