Page MenuHomePhabricator

Mercurial repository updates cause high CPU usage
Closed, InvalidPublic

Description

Version Information :

phabricator: a4a9485612839072a46e715d1ce47c29eddcb134 (Mon, Jan 16) 
arcanist: ade25facfdf22aed1c1e20fed3e58e60c0be3c2b (Thu, Jan 5) 
phutil: 9d85dfab0f532d50c2343719e92d574a4827341b (Fri, Jan 13) 
php: 5.6.29, with apcu
apache: 2.4.25
mercurial: 3.8.4

Steps leading to the problem:

  • installed Phabricator on host (vserver container on a dedicated Linux server, 32-bit userland in case)
  • created a project and 6 users
  • created and enabled a hosted Mercurial repository
  • copied existing repository to the right place (couldn't push because of the whole bundle2 issue discussed elsewhere https://secure.phabricator.com/T9548) and made sure it was imported (process took about an hour, repo is about 5MB and 2800 commits)
  • started pushing to it

Observed behaviour:
On an otherwise idle server, right after the first push, CPU usage went from ~1% to a steady 15% (no other pushes was made since).

We also tried hosting the repo elsewhere and have Phabricator mirror it, but the end result was the same in terms of CPU usage. Discussion in the general chat room seems to indicate that this is not normal.

I tried using repository update <repo> --trace. All I saw was 70 calls to hg log, targeting heads (normal and closed) revisions. Each call appear to require about 1.4s user CPU time, for a total of 12.5s user time (8 CPUs on the server). When called manually, mercurial command line is nowhere near as slow as this (0.090s on average).

This raises several questions:

  • Why would mercurial calls be so slow when called by phabricator? This seems to be the heart of the problem at hand.
  • On a self-hosted repo, since a push triggers an update by phabricator, what are the automatic (time-driven) updates for?
  • Can time-driven updates be disabled, as means to mitigate the issue I'm seeing? Or can the frequency be manually adjusted?

Thanks :)

Event Timeline

We need specific reproduction steps to move forward, and "a repository with about 2800 commits" isn't sufficiently specific.

Please provide the URI of a repository which reproduces the problem which we can clone in a test environment.

Unless you can make this task private to a group of people who would be prepared to sign a NDA in French, that's not going to happen.

What would work though, is to create a repo where I could import an existing, open source project, and see how it fares. I'm thinking maybe TortoiseHg. Would that work?

An existing, open source project is perfectly fine, provided you first verify that it reproduces the problem.

(A synthetic repository which you create by running hg commit in a loop, or instructions which allow us to build such a synthetic repository, are also fine.)

Noted. I'll try and reproduce the problem with a different repository.

The repository at https://bitbucket.org/LeCoyote/brokentest is enough to see unusual CPU usage. Of course, it is nowhere near as bad as the other one, but the CPU is still coasting a noticeable 2 or 3% above average, which seems disproportionate for such a small repo. Also, when calling the update manually:

$ time ./bin/repository update 3
Updated repository "cieltest".

real    0m0.934s
user    0m2.500s
sys     0m0.268s

(edited after creating dummy branches, total user time jumped from .8s to 2.5s)

Here's what I did to try to reproduce this:

  • I reused the host I recently built in T12112 (debian001.epriestley.com), see that task for setup steps.
  • I installed hg:
$ sudo apt-get install mercurial
  • I created a new, empty Mercurial repository, and a working copy directory for the repository.
  • I configured phabricator.base-uri and diffusion.allow-http-auth.
  • I set a VCS password in Settings.
  • I ran bin/repository update 1 to verify that the config was in good shape.
  • I configured phd.user to admin.
  • I allowed www-data to sudo to hg in /etc/sudoers.
www-data ALL=(admin) SETENV: NOPASSWD: /usr/bin/hg
  • I cloned the provided test repository:
$ hg clone https://bitbucket.org/LeCoyote/brokentest
  • I pushed the repository to the server:
$ hg push http://debian001.epriestley.com/diffusion/1/test-repository/
pushing to http://debian001.epriestley.com/diffusion/1/test-repository/
http authorization required for http://debian001.epriestley.com/diffusion/1/test-repository/
realm: Phabricator Repositories
user: admin
password: 
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 12 changesets with 12 changes to 12 files (+9 heads)
  • I started the daemons on the server:
$ ./bin/phd start
  • I waited for the repository to fully import:

Screen Shot 2017-01-19 at 1.55.21 PM.png (1×1 px, 202 KB)

  • I monitored load on the server for a few minutes and saw load below 1 with the CPU 99% idle:
$ while [ true ]; do date; uptime; mpstat -P 0 | tail -n2; echo; echo; sleep 15; done;
Thu Jan 19 22:00:43 UTC 2017
 22:00:43 up 2 days,  4:14,  1 user,  load average: 1.17, 0.69, 0.36
10:00:43 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:00:43 PM    0    0.06    0.00    0.02    0.02    0.00    0.00    0.01    0.00    0.00   99.89


Thu Jan 19 22:00:58 UTC 2017
 22:00:58 up 2 days,  4:14,  1 user,  load average: 0.91, 0.66, 0.35
10:00:58 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:00:58 PM    0    0.06    0.00    0.02    0.02    0.00    0.00    0.01    0.00    0.00   99.89


Thu Jan 19 22:01:13 UTC 2017
 22:01:13 up 2 days,  4:14,  1 user,  load average: 0.71, 0.62, 0.35
10:01:13 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:01:13 PM    0    0.06    0.00    0.02    0.02    0.00    0.00    0.01    0.00    0.00   99.89


Thu Jan 19 22:01:28 UTC 2017
 22:01:28 up 2 days,  4:15,  1 user,  load average: 0.55, 0.59, 0.34
10:01:28 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:01:28 PM    0    0.06    0.00    0.02    0.02    0.00    0.00    0.01    0.00    0.00   99.89

...

Thu Jan 19 22:02:28 UTC 2017
 22:02:28 up 2 days,  4:16,  1 user,  load average: 0.20, 0.48, 0.32
10:02:28 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
10:02:28 PM    0    0.06    0.00    0.02    0.02    0.00    0.00    0.01    0.00    0.00   99.89

I am unable to reproduce the issue you described by following the steps provided. To move forward:

  • provide steps which allow us to reproduce this issue in a test environment (perhaps by correcting mistakes I made in following the reproduction process, or providing additional information); or
  • we will close this report after a few days if we are still unable to reproduce it.

Note that the test host above (debian001.epriestley.com) is a t2.micro, so the CPU is an underclocked potato. I would expect the host to be enormously sensitive to CPU utilization issues.

I was afraid this would happen. However, short of giving you access to our server, there is no way I can exactly describe our setup: there are fare too many variables. Perhaps we could narrow it down a bit? What could be a factor here? The obvious non-standard features of our setup are :

  • linux-vserver
  • 32bit userland
  • LVM on soft raid
  • daemon user = web user

That said, I can't understand why the same user can run hg log with negligible CPU impact, and run the same hg log through phabricator with significant performance impact. Clearly phabricator does something in addition to calling mercurial that is causing the problem, and I don't know what it could be. That's where your input would help :-)

Also @epriestley what's the output for time ./bin/repository update 1 on your repo? Our system is a Xeon W3520 so the CPU itself shouldn't be a problem. We're going to try and install phabricator elsewhere as a last resort, but it won't help in telling what's wrong with this setup.

I'd be happy to work with you one-on-one to help troubleshoot issues in your environment, but we can't offer that kind of support for free. This process almost always takes up a large amount of our time and very rarely uncovers any real bugs in Phabricator or helps anyone except the single user experiencing a configuration problem. See Support Resources for more discussion of the kinds of free and paid support we offer. If you'd like to move forward with one-on-one configuration and environment support, see Consulting.

We are certainly not going to engage into paid consulting without knowing how long it might take to solve the problem, given your hourly rate :)
I would have appreciated at least a tiny hint on how to solve this ourselves instead, so I guess I'm going to dive into the code and implement a workaround somewhere. For example, tinkering with the $smart_wait calculation might be a start.

Sounds good. I'm going to close this since it doesn't seem to be moving toward becoming a bug report which we can accept upstream.

Could you at least give me the time output for the broken/test repository update on your test system, so that I have a baseline for comparison later? Thanks!

What'd be the fee for that 1 second command and a 5 second copy/paste?

Obviously, we can't just hand out estimates for free. Converting units and multiplying numbers together takes away valuable time and resources we could be using to develop Phabricator.

And if I gave you a precise estimate you could be sneaky and just divide it by $1,500/hr to get the answer without paying for it!

Erm… I was just asking for time repository update on the repository you have already cloned, hence the 15 seconds. If I understand this correctly, you seem to indicate that it would still be too much. I'm having a hard time understanding, but if you're adamant about this, then there is nothing left to say :)