Improve Phacility repository import performance
Open, Normal, Public

Description

@Mnkras reports elsewhere:

I started up a Phacility test instance to test something ... the import seems to be taking an abnormally long time

The repository in question has 19K commits, but has been importing for more than 10 hours and is ~10% complete, for an effective import rate of 3 commits / minute. This is unreasonably slow.

Some immediate issues:

  • Because repo hosts are >80% full (T12217), essentially all memory on the repository hosts is consumed and overseers are no longer scaling up taskmaster pools. Thus, instances are effectively locked to a single taskmaster.
  • There are some errors in the error log for this instance:
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000] [2017-02-20 05:06:13] ERROR 8: Undefined offset: 1 at [/core/lib/phabricator/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:219]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000] arcanist(head=stable, ref.master=fad85844314b, ref.stable=a232e95a18f3), libcore(), phabricator(head=stable, ref.master=f844280d360c, ref.stable=3e10f69570c0), phutil(head=stable, ref.master=9128fb722671, ref.stable=d0ebdaf3ab70), services(head=stable, ref.master=b5cef1ac31ff, ref.stable=198eb67bd822)
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #0 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:181]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #1 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:131]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000] [2017-02-20 05:06:13] EXCEPTION: (InvalidArgumentException) Argument 1 passed to PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture() must be an instance of PhabricatorRepository, null given, called in /core/lib/phabricator/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php on line 221 and defined at [<phutil>/src/error/PhutilErrorHandler.php:200]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000] arcanist(head=stable, ref.master=fad85844314b, ref.stable=a232e95a18f3), libcore(), phabricator(head=stable, ref.master=f844280d360c, ref.stable=3e10f69570c0), phutil(head=stable, ref.master=9128fb722671, ref.stable=d0ebdaf3ab70), services(head=stable, ref.master=b5cef1ac31ff, ref.stable=198eb67bd822)
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #0 PhutilErrorHandler::handleError(integer, string, string, integer, array) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:433]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #1 PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture(NULL, ExecFuture, integer) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:221]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #2 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:181]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #3 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:131]
  • CPU on the host is pretty much all getting burned up anyway, although I think this may be coupled with memory pressure rather than an independent bottleneck.

Event Timeline

There's also a bunch of the usual free-tier shenanigans on this host -- free instances with 1 user importing multiple large repositories -- but that's to be expected.

@epriestley The import has finished, but some commits show as importing still?

My impression is that there is a lot of room for optimization in the RepositoryPullLocalDaemon. I've wanted to tackle that for a while because it causes a fair delay between when a commit goes into git and when it actually shows up in Diffusion.

I can't remember off the top of my head, but I'm pretty sure there's some low-hanging fruit in there.

Starting with repository update:

This repository uses pull requests, so it has about 3,200 refs and takes about 1.5s to list:

$ time git ls-remote https://github.com/concrete5/concrete5 | wc -l
3164
real	0m1.670s

In production, a git fetch origin '+refs/*:refs/*' --prune takes about 2.25s:

$ PHABRICATOR_INSTANCE=test-3uoifmrptebb time /core/lib/phabricator/bin/repository update --trace rC
...
>>> [29] <exec> $ git fetch origin '+refs/*:refs/*' --prune
<<< [29] <exec> 2,257,513 us

We also run git for-each-ref three times, at a cost of approximately 400ms each.
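
As a rough sketch of what a shared cache for these callers might look like (the RefListCache class and the working-copy path here are hypothetical, not the actual Phabricator API), the idea is to run git for-each-ref once per update and memoize the parsed result, so the second and third call sites pay nothing:

<?php
// Hypothetical sketch: memoize the output of `git for-each-ref` for the
// duration of a single repository update, so the three call sites pay the
// ~400ms subprocess cost only once.
final class RefListCache {
  private $refs = null;

  // Returns a map of refname => commit hash.
  public function getRefs($repository_path) {
    if ($this->refs === null) {
      $raw = shell_exec(
        'git -C '.escapeshellarg($repository_path).
        ' for-each-ref --format="%(refname) %(objectname)"');
      $this->refs = array();
      foreach (explode("\n", trim((string)$raw)) as $line) {
        if ($line === '') {
          continue;
        }
        list($name, $hash) = explode(' ', $line, 2);
        $this->refs[$name] = $hash;
      }
    }
    return $this->refs;
  }
}

// Usage: all call sites within one update share the same cache object.
$cache = new RefListCache();
$heads = $cache->getRefs('/var/repo/rC');   // illustrative path; runs git once
$again = $cache->getRefs('/var/repo/rC');   // served from memory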

Internally, we use a commit cache to track repository heads. The cache holds 2048 entries, but this repository has about 3,200 refs, so they do not all fit. Because the cache evicts in insertion order, like a ring buffer (FIFO), each entry is pushed out again before it is looked up, so every ref triggers a cache miss.

The size of each cache entry is approximately 40 bytes (this is 40 "PHP Bytes", so probably a lot more than 40 real bytes, but likely less than 10x more), so we can freely use a much larger cache size.
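
To make the eviction problem concrete, here is a toy model in plain PHP (not the real commit cache implementation): with FIFO eviction, a cache smaller than the working set degenerates to a 0% hit rate when the refs are scanned in order, while a cache larger than the ref count hits on every pass after the first.

<?php
// Toy FIFO cache: capacity smaller than the working set thrashes completely
// under a sequential scan, because each key is evicted before it is seen again.
function fifo_hit_rate($capacity, $working_set, $passes) {
  $cache = array();          // key => true, insertion-ordered
  $hits = 0;
  $lookups = 0;
  for ($p = 0; $p < $passes; $p++) {
    for ($key = 0; $key < $working_set; $key++) {
      $lookups++;
      if (isset($cache[$key])) {
        $hits++;
        continue;
      }
      if (count($cache) >= $capacity) {
        // Evict the oldest entry, like a ring buffer.
        reset($cache);
        unset($cache[key($cache)]);
      }
      $cache[$key] = true;
    }
  }
  return $hits / $lookups;
}

echo fifo_hit_rate(2048, 3200, 3), "\n";  // 0: every lookup misses
echo fifo_hit_rate(4096, 3200, 3), "\n";  // ~0.67: misses only on the first pass

This is also why capping bulk fills at the cache size matters: filling more entries than the cache can hold just evicts the earlier entries of the same fill, so the extra work is wasted.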

During import, we poll repositories at the maximum rate (once every 15 seconds). This repository is also active enough to get maximum polling after import. Each update takes approximately 4s, so roughly a quarter of every polling window goes to updating this one repository, even when nothing new has been pushed.

Here are some things we could do to improve this:

  • Use git ls-remote and compare the result against a locally cached copy of the ref list to detect whether we need to fetch at all (see the sketch after this list). In theory, a no-op git fetch should cost about the same as git ls-remote, but in practice it appears that it does not.
  • Identify callers to git for-each-ref and have them use a cache.
  • Increase the size of the commit cache.
  • When bulk-filling the commit cache, refuse to fill more entries than the size of the cache.
  • While a repository is importing, we could do a reduced-cost update which just checks for import progress but does not perform a fetch. It's unclear if this is worthwhile to pursue.
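
A minimal sketch of the first idea, assuming we persist a fingerprint of the last-seen ref list next to the working copy (the fingerprint file and its location are invented for this example; a real implementation would store this somewhere more durable):

<?php
// Sketch only: skip the expensive `git fetch` when `git ls-remote` shows the
// remote refs have not changed since the last update.
function update_repository($local_path, $remote_uri) {
  $fingerprint_file = $local_path.'/.ref-fingerprint';  // invented path

  // List the remote refs (~1.7s for this repository) and fingerprint the
  // result. Sort so the fingerprint does not depend on the order in which
  // the remote happens to return refs.
  $raw = shell_exec('git ls-remote '.escapeshellarg($remote_uri));
  $lines = explode("\n", trim((string)$raw));
  sort($lines);
  $fingerprint = sha1(implode("\n", $lines));

  $previous = @file_get_contents($fingerprint_file);
  if ($previous === $fingerprint) {
    // Nothing changed upstream; skip the ~2.25s fetch entirely.
    return false;
  }

  // Something changed: do the real fetch, then record the new fingerprint.
  shell_exec(
    'git -C '.escapeshellarg($local_path).
    " fetch origin '+refs/*:refs/*' --prune");
  file_put_contents($fingerprint_file, $fingerprint);
  return true;
}

When nothing has changed upstream, this trades the ~2.25s fetch for a single ~1.7s ls-remote.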

Of these, I believe only the cache changes are straightforward, and they are not an overwhelming fraction of the runtime.

@epriestley The import has finished, but some commits show as importing still?

I can't reproduce this; perhaps it resolved itself?

$ PHABRICATOR_INSTANCE=test-3uoifmrptebb /core/lib/phabricator/bin/repository importing R1 R2
No importing commits found.

Ah, okay. This is the relevant part:

This commit has been deleted in the repository: it is no longer reachable from any branch, tag, or ref.

We stop importing commits if we detect that they have become unreachable, which may strand commits in a partially imported state. The UI could be more clear about this.

The commit cache should be a fairly large performance win.

@epriestley All of the other commits with the message "This commit has been deleted in the repository: it is no longer reachable from any branch, tag, or ref." have imported, so I don't think the reachability code is working correctly. Until the last few weeks I wasn't seeing these "deleted" commits show up at all, but now I see them all the time.

It is expected that some deleted commits will import and others will not.

If a commit finishes importing before it is deleted, it stays imported. For example: push a branch, wait for the commits on it to import, then delete the branch.

If a commit is deleted before it finishes importing, the import is aborted. For example: push a branch, then delete it without waiting for the commits on it to import.

Can you explain in more detail why you think the code isn't working correctly? For example, what is the hash of a commit which you believe to be reachable but which you think is incorrectly marked as unreachable, or vice versa?

The reason I think something isn't working correctly: my dashboard lists recent commits, and since I updated my install a few weeks ago the list often looks like this:

pasted_file (470 KB)

Previously, I never saw these deleted commits show up, and they tend to be pretty useless or duplicates.

If you don't want to see unreachable commits, use a query which hides them:

Screen Shot 2017-02-25 at 8.31.55 PM.png (172×468 px, 32 KB)

That works. I'm curious when this changed, as I didn't need to filter until the last few weeks.

Likely somewhere in T9430 / T10978, which rewrote a big chunk of the audit code.

(That said, my expectation is that behaviors around unreachable commits have not changed -- D17279 added the "Unreachable" option to solve this problem.)

Cool, now I don't feel so bad since I know you only recently added the option 😄

Thanks!

While the git ls-remote change isn't really motivated as a performance improvement, it does seem to have reduced CPU usage by a measurable amount (deployed on the morning of 3/18), maybe 15%:

Screen Shot 2017-03-21 at 7.08.23 AM.png (162 KB)

There's no appreciable change in disk I/O or network I/O that I can see from the graphs.

While the git ls-remote change isn't really motivated as a performance improvement, it does seem to have reduced CPU usage by a measurable amount (deployed on the morning of 3/18), maybe 15%:

I am seeing a similar drop in CPU use on Wikimedia's Phabricator host; it looks closer to 10%, but still significant.