Page MenuHomePhabricator

consider allowing SIGTERM for ExecFuture timeout
Closed, ResolvedPublic

Description

ExecFuture::resolveKill uses proc_terminate, both when we really want to kill
the inferior process (e.g. from the destructor), but also when we just want to time
out.

This is problematic when trying to impose a timeout on a process that itself spawns
children. Because proc_terminate doesn't kill children, and because SIGKILL doesn't
give the child process a chance to clean its children for us, I end up with a bunch of
stuff still running.

Event Timeline

pieter assigned this task to epriestley.
pieter raised the priority of this task from to Needs Triage.
pieter updated the task description. (Show Details)
pieter added a project: libphutil.
pieter added a subscriber: pieter.

I guess the title here just focuses on one potential solution; one alternative would be
to try harder to kill the children's group while sticking with SIGKILL.

My unix is a bit fuzzy, but I think it's hard to clean up the whole tree -- the subprocess is probably in the same process group as we are, so I think if we kill the process group we'll kill ourselves. Maybe there's some standard solution to this which doesn't require a cooperating subprocess. Otherwise, I think we'd need to have a wrapper script which would call posix_setsid(), then invoke the command, effectively sandboxing the process tree (there might also be a race between launching that process and it calling posix_setsid(), and losing the race might mean killing ourselves). We can probably test of a kill will kill us by comparing our pgid to the subprocess pgid. Let me do the 2 minutes of legwork to at least approximately verify that I'm not completely crazy here.

Broadly, the way we deal with this case for the daemons is by having the daemon be a cooperating subprocess which uses posix_setsid(), TERM'ing and then KILL'ing the pgid, and resolving races with a heartbeat signal (at least in theory -- I'm not sure the code actually checks for this right now. It would be trivial to add, though).

A softer and simpler (but less reliable) alternative would be to add resolveTerm() and setTermTimeout() or similar, which would do a term -> wait -> kill sort of thing. In the setTermTimeout() case the semantics seem unambiguous (set it at 15 seconds, set kill at 30 seconds, and you're done) but in the case of resolveTerm() it's a little fuzzier -- it should probably be resolveTerm($time_to_wait_before_killing_instead), maybe.

Here's a test script:

<?php

require_once 'scripts/__init_script__.php';

$sleep = new ExecFuture('sleep 60');
$sleep->start();

$this_pid = getmypid();
$sub_pid  = $sleep->getPID();

$this_pgid = posix_getpgid($this_pid);
$sub_pgid = posix_getpgid($sub_pid);

var_dump(array(
  'this.pid' => $this_pid,
  'sub.pid' => $sub_pid,
  'this.pgid' => $this_pgid,
  'sub.pgid' => $sub_pgid,
));

echo "ABOUT TO TERM PGID...\n";
execx('kill -TERM -%d', $sub_pgid);
sleep(1);
echo "STILL ALIVE\n";

Here's the output when run:

array(4) {
  ["this.pid"]=>
  int(45554)
  ["sub.pid"]=>
  int(45555)
  ["this.pgid"]=>
  int(45554)
  ["sub.pgid"]=>
  int(45554)
}
ABOUT TO TERM PGID...
Terminated: 15

So that seems approximately consistent with my understanding of how this stuff works (killing the pgid kills us, too).

So I think that leaves us with two viable ways forward that I can see:

  • Use a coordinating subprocess to sandbox the child process so the process group can be signaled safely. This would be something like setImmuneToNormalWeapons(true), called before start(), which would make us run sandbox.php <command> instead of <command>, then use process group semantics when killing. The race can be resolved by checking that the PGID kill is safe. This won't work on Windows, but can just raise an error for now and eventually do whatever the Windows equivalent is.
  • Add methods for TERM semantics.

Both seem reasonable to me, I'm not sure which is a better fit for your use case.

@epriestley: The TERM option will probably help me for the time being.

Regarding the group semantics: I'd add that the posix_setsid might make it harder for someone
to kill arc + all exec'd child processes, since they now have different group ids. It might be wise
to make sure that CTRL + C still works as intended, say.

With normal Zend destructor semantics, I think we'd be OK, since __destroy() in ExecFuture invokes resolveKill() on running futures, which would do a pgid-aware kill as the script exited.

(With weird HHVM destructor semantics, who knows.)

From T10610, this can reasonably occur in arc unit subprocesses.

From elsewhere, this can reasonably occur when git fetch or similar VCS subprocesses are terminated by hang detection.

Changes associated with T11592 probably fix this in practice. The new behavior is transparent: subprocesses get SIGTERM at the timeout, and SIGKILL a little while later if they fail to exit.

If we run into real cases where we're still not getting subprocesses -- but could get them by using a cooperating subprocess which switches PGID, and then terminating the PGID -- we can look at doing that. This is not too difficult but I'd like evidence that such processes legitimately exist in the wild.