
`repository update` may be killed after timing out without annihilating `git fetch` subprocesses
Closed, Resolved · Public

Description

This is almost certainly a subproblem of T4551, and they may end up being the same issue in practice, but this one is causing concrete problems, while T4551 was largely hypothetical.

We kill `repository update` after 4 hours (initial clones) or 15 minutes (subsequent updates), but we do so by SIGKILL'ing the subprocess directly, which doesn't wipe out grandchildren. If `git fetch` hangs with any regularity, we'll end up with a lot of idle subprocesses sitting around.

One fix is to use SIGTERM instead, then SIGKILL after a short delay. This is likely to resolve the issue in practice.
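In rough terms, the TERM-then-KILL escalation might look like this (a minimal Python sketch for illustration only; the actual code lives in PHP in libphutil, and `terminate_gracefully` is a hypothetical name):

```python
import signal
import subprocess

def terminate_gracefully(proc, grace=60):
    """Send SIGTERM, then SIGKILL if the process outlives the grace window."""
    proc.terminate()  # SIGTERM: give the subprocess a chance to clean up
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: the subprocess ignored or swallowed TERM
        proc.wait()
    return proc.returncode

# A well-behaved subprocess exits promptly on SIGTERM:
proc = subprocess.Popen(["sleep", "300"])
print(terminate_gracefully(proc, grace=5))
```

On POSIX, a process killed by a signal reports a negative return code (the negated signal number), so callers can still distinguish a graceful exit from an escalated kill.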

A better fix is to sandbox and annihilate the process group more aggressively, but this is a little more involved and might be better held for later as part of T4551. It's probably not necessary for `git fetch` subprocesses (which are likely always well-behaved) but may be necessary in the future for `arc unit` subprocesses (which run arbitrary user code and may swallow SIGTERM).
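A sketch of the process-group approach, again in Python purely to illustrate the POSIX mechanics (`run_in_group` and `annihilate_group` are hypothetical names, not Phabricator code):

```python
import os
import signal
import subprocess
import time

def run_in_group(argv):
    # start_new_session=True makes the child call setsid(), so it and all of
    # its descendants land in a fresh process group we can signal as a unit.
    return subprocess.Popen(argv, start_new_session=True)

def annihilate_group(proc, grace=5):
    """TERM the whole process group; KILL it if anything survives the grace window."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()

# A shell that spawns a grandchild: signaling the group takes out both,
# which SIGKILL'ing only the shell would not.
proc = run_in_group(["sh", "-c", "sleep 300 & wait"])
time.sleep(0.5)  # let the grandchild start
annihilate_group(proc)
print(proc.returncode)
```

This is exactly the grandchild problem from the description: killing only the direct child leaves the backgrounded `sleep` orphaned, while signaling the group reaps the whole tree.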

Event Timeline

Part of this is likely a change to the semantics of `setTimeout()`. Current uses are:

  • `diffusion.blame`, caller-specified.
  • `diffusion.internalgitrawdiffquery`, hard-coded at 30 seconds.
  • `DiffusionCommitHookEngine`, hard-coded at 15 minutes ("enormous changes").
  • `DiffusionFileFutureQuery`, ultimately caller-specified.
  • Meme transforms, hard-coded at 10 seconds.
  • Thumb transforms, hard-coded at 60 seconds.
  • `PullLocalDaemon`, the case in question, hard-coded at 15 minutes and 4 hours.
  • Some test cases in libphutil.

For the ones that aren't caller-specified, I think we can pretty safely use a window of min(timeout, 60) seconds between the TERM and the KILL.

I'm not exactly sure what we should do in the case of a very short caller-specified timeout.

  • If the caller specifies a timeout of 3 seconds, when do we TERM/KILL?
  • If they specify a timeout of 60 seconds, is it really OK to take up to 120 seconds to return?
  • Generally, when they say "X", is it more important to try to return within X seconds or to let the command run for X seconds?

I'm inclined to say that the semantics of `setTimeout(X)` are "let the command run for X seconds", and that the actual time it takes the future to resolve is X + min(X, 60), i.e. up to twice as long for short timeouts. For example:

  • If you set the timeout to 3, we TERM after 3 seconds and KILL after 6 seconds. The command gets to run for 3 seconds, but we may not return for 6.
  • If you set the timeout to 60, we TERM after 60 seconds and KILL after 120 seconds.
  • If you set the timeout to 1234000, we TERM after 1234000 seconds and KILL after 1234060 seconds.
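The proposed deadlines reduce to a one-line formula; sketched as a hypothetical helper (not real Phabricator code):

```python
GRACE_MAX = 60  # upper bound on the TERM-to-KILL window, per the proposal

def signal_deadlines(timeout):
    """Return (term_after, kill_after) in seconds for a given timeout."""
    return (timeout, timeout + min(timeout, GRACE_MAX))

print(signal_deadlines(3))        # (3, 6)
print(signal_deadlines(60))       # (60, 120)
print(signal_deadlines(1234000))  # (1234000, 1234060)
```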

There's always going to be call/command overhead, so users shouldn't expect the behavior to be exactly "return within X seconds" in any case, although having 60 really mean "up to 120" does feel a little weird. Maybe the magic number should be a bit shorter (like 15 or 30 seconds). But as long as subprocesses respond to TERM, which they almost always should, we'll generally hit the timeout pretty much exactly.