Page MenuHomePhabricator

"bin/phd" may try to kill processes it does not own if the daemon PID logfiles on disk are out of date
Closed, ResolvedPublic

Description

See T13320. Here's roughly what I observed:

  • Process 12345 was a daemon process.
  • The host underwent a ninja reboot.
  • After the host came back up, process 12345 was a system process.
  • bin/phd start still attempted to kill it, based on the out-of-date PID log on disk.

The easiest fix is probably to make the "stop existing processes" check the name of the processes it is stopping, although I suspect this whole flow could use updates in general.

Event Timeline

epriestley triaged this task as Normal priority.Jun 20 2019, 3:20 PM
epriestley created this task.

This is probably somewhat entangled with T13253 and/or T11037 and this whole thing is kind of a mess.

Since we already have code to examine the output of ps auxwww or similar and decide if a process is a daemon or not, I'm probably going to try to just bandaid this for now.

I have a patch for this which basically says "don't try to kill any process which doesn't look like a daemon process".

Actually testing this is proving challenging because our view of process state isn't very observable (this overlaps a lot with T11037). This code is also generally just old and does weird stuff like frequently updating the PID file with subprocess status.

I'm not sure we really need the PID files at all. The functions they serve are pretty much:

  1. identify which processes are Phabricator daemons; and
  2. identify which daemons are owned by the current instance.

But they're not actually useful for (1) because of this case (the PID becoming owned by something else after a restart), and PID files on disk aren't exactly reliable anyway so we don't really depend on them even when things are working okay.

So we also already have other code which ignores the PID files and just uses process titles to find processes which look like daemons. This is generally more reliable/accurate.

Since modern instanced daemons have labels anyway (./phd-daemon -l turtles, where "turtles" is the instance name) we can accomplish (2) by just looking at the process list, too.

So I'm inclined to say we should do this:

  • Remove the PID files completely.
  • Find running daemons only by examining the process list.
  • Find daemons for the current instance only by examining the process list.

This fixes the root issue and generally simplifies things a lot, provided I'm not missing some secret reason to retain PID files.

epriestley added a revision: Restricted Differential Revision.Jun 20 2019, 10:33 PM

provided I'm not missing some secret reason to retain PID files.

Having now done (most of?) the legwork, I'm left sort of puzzled why anything uses PID files at all.

Obviously, pattern matching the process list is really really dumb. But it's not obviously more dumb to me than all the cleanup issues with PID files -- as far as I know, there's no reliable way to get rid of the file on process exit or host restart, and there are about 200 million hits on StackOverflow for "clean up stale PID file" or "stale PID file after reboot", which suggests I'm not missing some obvious trick to making PID files reliable.

And, e.g., the httpd manual suggests this is a good pattern:

kill -TERM `cat /usr/local/apache2/logs/httpd.pid`

...but that file can contain the PID of anything after an abrupt restart and this does not actually seem like a very robust pattern.

We can race between ps auxwww and kill -TERM too (or between any PID observation and signaling that PID), but if the sequence of events is "observe PID, someone pulls the power cable out of the wall, signal PID" we're at least robust against mis-signaling. In the general case, I don't think there's any primitive we can use to guarantee identifying and signaling a process is atomic.

This feels like the kind of thing that made sense in 1970 when system reliability was bottlenecked on shooing moths away from your vacuum tubes or something and has stuck around mostly through inertia.

iiam

epriestley added a comment.EditedJun 20 2019, 10:36 PM

I've also never seen anyone use kill -TERM `cat /path/to/pidfile` in real life over some flavor of pkill, which is basically the same thing as "pattern match the process titles".

epriestley added a revision: Restricted Differential Revision.Jun 20 2019, 10:44 PM

I just ran into this for the first time:

~ $ cd / quack quack quack quack quack
/ $ echo $?
0

Look out! A moth is going to get the vacuum tubes!

epriestley added a commit: Restricted Diffusion Commit.Jun 24 2019, 6:30 PM

cd provides "at least once" delivery guarantees.

cd provides "at least once" delivery guarantees.

ohyou


I deployed this stuff to secure and all the commands seem to work. I imagine there may be some rough edges with "GentooLaris vCustom" or whatever, but hopefully we're on a more promising pathway now.

epriestley added a commit: Restricted Diffusion Commit.Jun 28 2019, 1:31 PM
epriestley closed this task as Resolved.Jun 29 2019, 6:56 PM
epriestley claimed this task.

This stuff is all deployed, now.