
`repository update` may be killed after timing out without annihilating `git fetch` subprocesses
Closed, Resolved · Public

Description

This is almost certainly a subproblem of T4551, and they may end up being the same issue in practice, but this one is causing concrete problems, while T4551 was largely hypothetical.

We kill `repository update` after 4 hours (initial clones) or 15 minutes (subsequent updates), but we do so by SIGKILL'ing the subprocess directly, which doesn't wipe out grandchildren. If `git fetch` hangs with any regularity, we'll end up with a lot of idle subprocesses sitting around.

One fix is to use SIGTERM instead, then SIGKILL after a short delay. This is likely to resolve the issue in practice.
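In rough terms, the TERM-then-KILL escalation might look like this (a minimal Python sketch for illustration only; the actual code lives in PHP in libphutil, and `terminate_gracefully` is a hypothetical name):

```python
import signal
import subprocess

def terminate_gracefully(proc, grace=60):
    """Send SIGTERM, then SIGKILL if the process outlives the grace window."""
    proc.terminate()  # SIGTERM: give the subprocess a chance to clean up
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: the subprocess ignored or swallowed TERM
        proc.wait()
    return proc.returncode

# A well-behaved subprocess exits promptly on SIGTERM:
proc = subprocess.Popen(["sleep", "300"])
print(terminate_gracefully(proc, grace=5))
```

On POSIX, a process killed by a signal reports a negative return code (the negated signal number), so callers can still distinguish a graceful exit from an escalated kill.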

A better fix is to sandbox and annihilate the process group more aggressively, but this is a little more involved and might be better held for later as part of T4551. It's probably not necessary for `git fetch` subprocesses (which are likely always well-behaved) but may be necessary in the future for `arc unit` subprocesses (which run arbitrary user code and may swallow SIGTERM).
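A sketch of the process-group approach, again in Python purely to illustrate the POSIX mechanics (`run_in_group` and `annihilate_group` are hypothetical names, not Phabricator code):

```python
import os
import signal
import subprocess
import time

def run_in_group(argv):
    # start_new_session=True makes the child call setsid(), so it and all of
    # its descendants land in a fresh process group we can signal as a unit.
    return subprocess.Popen(argv, start_new_session=True)

def annihilate_group(proc, grace=5):
    """TERM the whole process group; KILL it if anything survives the grace window."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        proc.wait()

# A shell that spawns a grandchild: signaling the group takes out both,
# which SIGKILL'ing only the shell would not.
proc = run_in_group(["sh", "-c", "sleep 300 & wait"])
time.sleep(0.5)  # let the grandchild start
annihilate_group(proc)
print(proc.returncode)
```

This is exactly the grandchild problem from the description: killing only the direct child leaves the backgrounded `sleep` orphaned, while signaling the group reaps the whole tree.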

Event Timeline

Part of this is likely a change to the semantics of `setTimeout()`. Current uses are:

  • `diffusion.blame`, caller-specified.
  • `diffusion.internalgitrawdiffquery`, hard-coded at 30 seconds.
  • `DiffusionCommitHookEngine`, hard-coded at 15 minutes ("enormous changes").
  • `DiffusionFileFutureQuery`, ultimately caller-specified.
  • Meme transforms, hard-coded at 10 seconds.
  • Thumb transforms, hard-coded at 60 seconds.
  • `PullLocalDaemon`, the case in question, hard-coded at 15 minutes and 4 hours.
  • Some test cases in libphutil.

For the ones that aren't caller-specified, I think we can pretty safely use a window of min(timeout, 60) seconds between the TERM and the KILL.

I'm not exactly sure what we should do in the case of a very short caller-specified timeout.

  • If the caller specifies a timeout of 3 seconds, when do we TERM/KILL?
  • If they specify a timeout of 60 seconds, is it really OK to take up to 120 seconds to return?
  • Generally, when they say "X", is it more important to try to return within X seconds or to let the command run for X seconds?

I'm inclined to say that the semantics of `setTimeout(X)` are "let the command run for X seconds", and that the actual time it takes the future to resolve is X + min(X, 60), i.e. up to twice as long for short timeouts. For example:

  • If you set the timeout to 3, we TERM after 3 seconds and KILL after 6 seconds. The command gets to run for 3 seconds, but we may not return for 6.
  • If you set the timeout to 60, we TERM after 60 seconds and KILL after 120 seconds.
  • If you set the timeout to 1234000, we TERM after 1234000 seconds and KILL after 1234060 seconds.
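The proposed deadlines reduce to a one-line formula; sketched as a hypothetical helper (not real Phabricator code):

```python
GRACE_MAX = 60  # upper bound on the TERM-to-KILL window, per the proposal

def signal_deadlines(timeout):
    """Return (term_after, kill_after) in seconds for a given timeout."""
    return (timeout, timeout + min(timeout, GRACE_MAX))

print(signal_deadlines(3))        # (3, 6)
print(signal_deadlines(60))       # (60, 120)
print(signal_deadlines(1234000))  # (1234000, 1234060)
```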

There's always going to be call/command overhead, so users shouldn't expect the behavior to be exactly "return within X seconds" in any case, although having 60 really mean "up to 120" does feel a little weird. Maybe the magic number should be a bit shorter (like 15 or 30 seconds). But as long as subprocesses respond to TERM, which they almost always should, we'll generally hit the timeout pretty much exactly.