
Set niceness levels of background processes: git diff sometimes appears to thrash disk I/O and CPU from background worker
Closed, Invalid · Public

Description

We're experiencing an issue where occasionally (perhaps on bad commits), the git diff commands that Phabricator runs during background processing thrash the disk I/O and CPU, and basically prevent the server from doing any other work while that's occurring. For example, this is the New Relic graph of server activity when this occurs:

[Attached image: New Relic graph of server activity during one of these incidents (36 KB)]

To prevent this, I'd like to be able to set the I/O niceness of the processes Phabricator spawns to be lower than normal; i.e. I'd like Phabricator to run the Git processes it spawns under ionice at a reduced priority (for example, best-effort class at level 4). While this will make Git diff processing take longer, it should ensure that other processes get access to disk I/O first, and hopefully keep the server able to serve other traffic while it occurs.
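As a rough illustration, a wrapper along these lines could do it (this is only a sketch; the wrapper name and the specific priority values are my own assumptions, not anything Phabricator ships):

  #!/bin/sh
  # git-lowprio: run git at reduced CPU and I/O priority.
  # "nice -n 19" lowers the CPU scheduling priority as far as possible;
  # "ionice -c 2 -n 7" puts the process in the best-effort I/O class at its
  # lowest priority level, so other processes get disk access first.
  exec nice -n 19 ionice -c 2 -n 7 git "$@"

Phabricator would then need to invoke Git through a wrapper like this (or prefix the commands it spawns with nice/ionice itself), which is essentially what this request is asking for.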

I didn't file this as a bug report because I don't have reproduction steps (it seems to occur at random, but running kill -KILL $(pidof git) fixes it).

Event Timeline

For users of the Docker container, there is now a workaround available (enabled by default), which ensures that Git processes are automatically killed if CPU wait (I/O wait) averages above 50% over the last minute. The commit that enables this functionality is available to view here: https://github.com/hach-que-docker/phabricator/commit/0e77c36f694605371e85d420c4aecfbd3023ec5a
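For reference, a rough sketch of this kind of monitor (not the actual code in the commit above; the threshold, sampling interval, and use of vmstat/pkill are assumptions for illustration):

  #!/bin/sh
  # Kill runaway git processes when I/O wait stays high.
  while true; do
      # "vmstat 60 2" prints a since-boot sample and then a sample averaged
      # over the last 60 seconds; column 16 ("wa") is the I/O wait percentage
      # on typical Linux vmstat output.
      wait_pct=$(vmstat 60 2 | tail -1 | awk '{print $16}')
      if [ "${wait_pct:-0}" -ge 50 ]; then
          pkill -KILL -x git || true
      fi
  done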

To disable this behaviour going forward, users of the Docker container can set the environment variable DISABLE_IOMONITOR=true on the container.

Once the Docker index has built the latest image, users can update Phabricator and pick up this functionality by running docker pull hachque/phabricator.
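For example (the container name and run options here are placeholders; use whatever volumes, ports, and names your existing setup requires):

  # Update to the latest image.
  docker pull hachque/phabricator

  # Recreate the container with the I/O monitor disabled, if desired.
  docker run -d --name phabricator -e DISABLE_IOMONITOR=true hachque/phabricator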

epriestley added a subscriber: epriestley.

We haven't ever experienced this, nor have we seen reports of users other than you experiencing it. This ("basically prevent the server from doing any other work") doesn't sound like a description of how properly configured servers are generally expected to work. My expectation of modern systems is that they have sophisticated schedulers which are largely fair in allocating resources and produce generally reasonable behavior, and this is consistent with my experience.

I'm not sure exactly what your chart is showing, but it looks like CPU is almost totally unused (and that most CPU is used by java, which we do not distribute). I would expect serving web requests from a properly configured host to require effectively zero disk I/O (the tiny number of disk reads we perform should be cached), and for the system to schedule disk I/O fairly so that one I/O heavy process does not completely stall a machine when other processes with tiny I/O requirements are running. If high I/O causes a CPU stall, that sounds like a (severe) scheduler bug to me. Perhaps you should report this issue to the Linux kernel maintainers instead.

I don't see any evidence that you've tried your proposed solution and it actually fixed your problem. Instead, you implemented a completely different, much hackier solution.

We don't have enough information to investigate or understand this issue. We have no way to determine if a nice-based solution actually fixes it, and not even any evidence that it would.

Even if we did have a way to completely reproduce and investigate what you're seeing, it would be impossible to prioritize because no other users experience it, and we and many other installs run large production services without CPU stalls from I/O. For example, here is the CPU graph on repo001, which runs a whole lot of git operations using essentially vanilla upstream code without ever seeing CPU stalls:

[Attached image: CPU graph for repo001 (196 KB)]

Please do not file requests like this, which vaguely describe an operational issue occurring only in your environment and guess at a possible fix. We cannot ever move forward with this kind of report.

Separately, I managed to grab this output from top while the server was impacted, but this now looks like a different issue, since the CPU time is being reported against the taskmasters themselves:

jrhodes@Websites:~> top -c
top - 23:18:04 up 2 days,  3:01,  1 user,  load average: 428.05, 369.03, 313.97
Tasks: 880 total,   3 running, 873 sleeping,   0 stopped,   4 zombie
%Cpu(s): 69.7 us, 28.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  2.0 si,  0.0 st
KiB Mem:   3519096 total,  3425748 used,    93348 free,      316 buffers
KiB Swap:        0 total,        0 used,        0 free.   352160 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
10918 2000      20   0  284408  62740      0 R 42.27 1.783  27:59.77 php /srv/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorTaskmasterDaemon
11236 2000      20   0  282828  61360      0 R 41.96 1.744  18:48.82 php /srv/phabricator/libphutil/scripts/daemon/exec/exec_daemon.php PhabricatorTaskmasterDaemon
   30 root      20   0       0      0      0 D 5.047 0.000  10:50.43 [kswapd0]
36945 jrhodes   20   0   14640   1608    288 R 0.946 0.046   0:01.22 top -c

I tried getting a backtrace, but then the issue seemed to go away. I'm going to try decreasing the container's maximum CPU shares again to try to prevent it from impacting other services. I really don't understand why these problems have started appearing over the last couple of weeks (beyond the usual upgrades I do).
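For reference, lowering a running container's relative CPU weight can be done with docker update (the container name and the value 512 here are placeholders; the default weight is 1024):

  # Give the Phabricator container a smaller share of CPU under contention,
  # so other services on the host are favoured by the scheduler.
  docker update --cpu-shares 512 phabricator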