Page MenuHomePhabricator

Make "PhutilProcessQuery" detection of overseer processes more robust
ClosedPublic

Authored by epriestley on Jun 25 2019, 12:51 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Dec 24, 3:52 AM
Unknown Object (File)
Thu, Dec 12, 4:38 AM
Unknown Object (File)
Sun, Dec 8, 2:52 PM
Unknown Object (File)
Fri, Dec 6, 6:22 AM
Unknown Object (File)
Dec 4 2024, 1:36 PM
Unknown Object (File)
Nov 30 2024, 4:05 PM
Unknown Object (File)
Nov 28 2024, 5:25 PM
Unknown Object (File)
Nov 28 2024, 4:42 PM
Subscribers
None

Details

Summary

Ref T13321. Since "just use the process list" is looking a whole lot more attractive than PID files, improve the robustness of the query when identifying overseers.

Currently, we could incorrectly identify processes like "grep phd-daemon" as overseers.

To fix this, don't just look for "\bphd-daemon\b" anywhere in the process title: require the first argument be a path that ends in "phd-daemon", or the first two arguments end in "php phd-daemon".

Test Plan
  • Added unit tests, ran unit tests. Ran bin/phd status, saw it identify processes correctly.
  • Ran sleep 60/phd-daemon.
  • Before: bin/phd status reported this as a daemon.
  • After: bin/phd status no longer believes this is a daemon.

Diff Detail

Repository
rPHU libphutil
Branch
daemon1
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 23060
Build 31655: Run Core Tests
Build 31654: arc lint + arc unit

Event Timeline

This whole series of changes is motivated by "a PID file was out of date because of a reboot, and we tried to kill some other random process", right? Was there anything else that PID file tracking wasn't doing correctly?

Wouldn't it be easier to just store the host boot time epoch alongside the PID so we can tell if our PID files are out of date?

src/filesystem/PhutilProcessQuery.php
40–41

Comment out of date.

82–84

Part of me keeps wanting to say "PID tracking is what you're supposed to do here for exactly these reasons", but since we're only here because PID tracking didn't work very well, I guess there's nothing wrong with trying something else.

I was brainstorming/googling other solutions for this, and one (possibly bad) idea I had was starting the daemons with a magic ENV variable, like SECRET_PHAB_COOKIE='🍩' or something. That would give us the benefit of not relying on regex/string parsing, and possibly insulating us a little more from weird shell gotchas like the above.

On the third hand, the worst damage this could do if we got it really wrong would be either failing to start some daemons, or killing an arbitrary process that happened to look sufficiently like a daemon, both of which we could probably figure out and recover from pretty easily.

91–94

See above >.<

src/filesystem/PhutilProcessRef.php
34

Is the HHVM binary also called php? What if someone uses a start_phabricator.go file that does fork/exec to start the daemons?

src/filesystem/__tests__/PhutilProcessRefTestCase.php
9

"correctly"

12

Obviously this isn't robust in the face of an attacker who is deliberately trying to get clever with argv. Also note that you can (in a non-portable way) overwrite an existing process's argv.

That said, I don't think there's a practical attack here? If an attacker already controls a sandboxed process and wants to kill it, they don't need to get tricky by making a Phabricator bin script send a signal; they can just crash the process themselves.

This whole series of changes is motivated by "a PID file was out of date because of a reboot, and we tried to kill some other random process", right? Was there anything else that PID file tracking wasn't doing correctly?

The issues with PID files I can come up with offhand are:

  • The require configuration (or convention) of a PID directory.
  • The PID directory has to exist or the user running the process has to be able to create it.
  • The PID file must not exist.
  • The disk can be full, or read-only. Or become full later, including while the daemon is running.
    • Any of these policy/disk state problems can pass any effort we make to test them early, then fail after the process daemonizes, making it difficult to report the failure to the user.
  • The user running bin/phd status has to be able to list and read the PID directory. This may not be the same user as the user the daemons are running as.
  • Each instance needs a separate PID directory, or we would have to restructure PID files around instancing.
  • Users or other processes can delete (or edit) the PID files without terminating the related processes, or write files over them.
    • Possibly, an adversarial user could edit a PID file to make bin/phd restart (or some other control flow) signal a process they could not otherwise signal. So, conceptually, all the policy stuff above can fail by being too open, not just by being too closed.
  • There's no way to guarantee PID files are removed when processes exit. Abrupt restarts are one example, but here's /core/run/admin/phd on admin001 with a random smattering of phantom daemons from 2017 onward:
$ ls -alh
total 96K
drwxrwxr-x 2 ubuntu ubuntu 4.0K Jun 25 11:57 .
drwxrwxr-x 3 ubuntu ubuntu 4.0K Feb 12  2015 ..
-rw-r--r-- 1 ubuntu ubuntu  366 Mar 25  2017 daemon.1337
-rw-r--r-- 1 ubuntu ubuntu  367 Mar 23 15:45 daemon.15877
-rw-r--r-- 1 ubuntu ubuntu  366 Mar  5  2017 daemon.16024
-rw-r--r-- 1 ubuntu ubuntu  367 Mar 18  2017 daemon.16096
-rw-r--r-- 1 ubuntu ubuntu  367 Mar 18  2017 daemon.16898
-rw-r--r-- 1 ubuntu ubuntu  367 Mar 18  2017 daemon.17923
-rw-r--r-- 1 ubuntu ubuntu  367 Mar 18  2017 daemon.20368
-rw-r--r-- 1 ubuntu ubuntu  347 Feb 25  2017 daemon.21106
-rw-r--r-- 1 ubuntu ubuntu  367 Jul 14  2017 daemon.23895
-rw-r--r-- 1 ubuntu ubuntu  171 Feb 28  2017 daemon.28282
-rw-r--r-- 1 ubuntu ubuntu  367 Sep 16  2017 daemon.28880
-rw-r--r-- 1 ubuntu ubuntu  367 Aug 25  2018 daemon.30442
-rw-r--r-- 1 ubuntu ubuntu 1.5K Jun 25 19:07 daemon.31386
-rw-r--r-- 1 ubuntu ubuntu  365 Feb 26  2017 daemon.4177
-rw-rw-r-- 1 ubuntu ubuntu  346 Mar  1  2015 daemon.5449
-rw-r--r-- 1 ubuntu ubuntu  365 Mar  5  2017 daemon.5492
-rw-r--r-- 1 ubuntu ubuntu  366 Apr 28  2017 daemon.5977
-rw-r--r-- 1 ubuntu ubuntu  179 Feb 28  2017 daemon.6627
-rw-r--r-- 1 ubuntu ubuntu  179 Feb 28  2017 daemon.7767
-rw-r--r-- 1 ubuntu ubuntu  366 Mar 18  2017 daemon.7913
-rw-r--r-- 1 ubuntu ubuntu  365 Mar 18  2017 daemon.808
-rw-r--r-- 1 ubuntu ubuntu  365 Mar  5  2017 daemon.8532

Basically, there are a lot of ways that the state on disk and the state in ps can fall out of sync. Although using process titles has problems too, it seems like it's a huge step forward on essentially every dimension here. A wide class of failure conditions arising from disk/ps disagreeing just can't happen anymore.

src/filesystem/PhutilProcessQuery.php
82–84

I think the ENV thing is a reasonable approach if we run into some concrete/hard problem with the args thing. My concerns are:

  • In the most-general case, same problem as args: I don't think we can unambiguously parse ENV out of ps since it looks like spaces are unescaped, and I'm not sure there's a fully-correct way to read it on macOS where we don't have /proc. That said, we can control ENV better than the process title.
  • ENV propagates to subprocesses, so we need some logic like "process has the secret ENV thing set AND is named phd-daemon" or "...and has no living (?) parents" and/or try really hard to stop the magic from getting into the subprocess environments. Not impossible, but adds more complexity, I think.
src/filesystem/PhutilProcessRef.php
34

The "php" comes from the #!/usr/bin/env php shebang line in the phd-daemon file. I believe you'd have to aggressively try to inject misbehavior here to break anything. For example, if I symlink quack to php and then use #!/usr/bin/env quack, the process comes up named "quack" -- so I suspect this is robust even in the face of php -> php5 or similar. If not, we may need to add more patterns here, but we can cross that bridge when we come to it.

Although I'm not sure my testing is exhaustive, I can't find any way to change the process title under macOS as long as some parent process runs bin/phd start. Some stuff I tried:

  • sudo ...
  • exec sudo ...
  • exec ...
  • nohup ... &
  • wrapper script
  • wrapper script that forks first
  • Setting phd.user to some other user, then running bin/phd start as a different user so it does a sudo internally.

These all ultimately produce a process with the same name.

It's definitely possible there's some other trick here, but I wasn't able to find one anywhere near the beaten path.

Obviously, you can do something other than bin/phd start to start a daemon process that has a different process title, but I think it's fair to say "if you don't use bin/phd start to start the process, you should not expect bin/phd status/restart/etc to manage it".

src/filesystem/__tests__/PhutilProcessRefTestCase.php
12

Yeah, I don't think we have to worry about an adversary writing a process which they trick us into sending a signal to, or an adversary who can edit process titles having difficulty sending a signal.

This approach stops an adversary who can edit PID files from tricking us into sending signals to arbitrary processes they do not control.

  • Fix typo.
  • Fix out-of-date comment/logic around overseer tests.
  • The disk can be full, or read-only. Or become full later, including while the daemon is running.
    • Any of these policy/disk state problems can pass any effort we make to test them early, then fail after the process daemonizes, making it difficult to report the failure to the user.

I actually kind of liked this "feature". It's nice to test some of the requirements of an actual working daemon process on startup (disk writeable, not full, etc).

Basically, there are a lot of ways that the state on disk and the state in ps can fall out of sync. Although using process titles has problems too, it seems like it's a huge step forward on essentially every dimension here. A wide class of failure conditions arising from disk/ps disagreeing just can't happen anymore.

Yeah, I suppose most of my skepticism comes from not knowing any other examples of long-loved processes parsing ps output instead of using PID files.

Maybe there's an attack here for stuff like the daemon logging? I don't think there's anything sensitive in the IPC messages we're sending now, but if this code gets used everywhere, I could imagine an attacker trying to spoof the overseer to make daemon processes log stuff to an attacker-controlled process.

This revision is now accepted and ready to land.Jun 25 2019, 8:29 PM

Oh, and this might be useful, but is probably less portable than just using ps.

Yeah, I suppose most of my skepticism comes from not knowing any other examples of long-loved processes parsing ps output instead of using PID files.

I suspect this may just be a "Unix Ritual" like grep x || thing and x | gzip > file which is in very common use and seems like it must be a good idea because convention is so strong, but looks more and more unsuitable for operation at scale upon closer examination.

Maybe there's an attack here for stuff like the daemon logging?

I think everywhere else we already have a better (actually reliable) ways to find process information. We know a process is our child because we ran fork() or execv() (through some PHP API wrapper). We know a process is our parent from getppid(). Presumably these are always correct (hopefully).

All the signaling/logging/message-passing we do other than bin/phd restart/reload is along these "strongly authenticated" channels, where we know a PID is legitimate because it came from an actual syscall handing us a PID back.

Beyond signals, we communicate up and down from overseer to daemon through stdin/stdout, which (hopefully?) nothing can hijack.

(Also, I need to write it up (T6994) but "attackers on the same host" is explicitly not part of our security model today, notably because we run git subprocesses that may have credentials in URIs visible in ps and there's absolutely nothing we can do about this as far as I know. Obviously, no reason to open up holes where we don't need to, but, formally, an adversary with shell access has already beaten the model.)

pgrep

It's turtles all the way down. Computers don't work and all conventions and toolsets are fundamentally broken:

ubuntu@secure001:/core$ './script with spaces.sh'  # Runs "sleep".
ubuntu@secure001:/core$ pgrep spaces
ubuntu@secure001:/core$ pgrep 'script with spaces'
ubuntu@secure001:/core$ pgrep pace  
ubuntu@secure001:/core$ pgrep script
1335
ubuntu@secure001:/core$ pgrep rip 
1335

It's turtles all the way down. Computers don't work and all conventions and toolsets are fundamentally broken:

computers were a mistake.jpg (480×640 px, 61 KB)