Improve Phacility repository import performance
Open, Normal, Public

Description

@Mnkras reports elsewhere:

I started up a Phacility test instance to test something ... the import seems to be taking an abnormally long time

The repository in question has 19K commits, but has been importing for more than 10 hours and is ~10% complete, for an effective import rate of 3 commits / minute. This is unreasonably slow.

Some immediate issues:

  • Because repo hosts are >80% full (T12217), essentially all memory on the repository hosts is consumed and overseers are no longer scaling up taskmaster pools. Thus, instances are effectively locked to a single taskmaster.
  • There are some errors in the error log for this instance:
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000] [2017-02-20 05:06:13] ERROR 8: Undefined offset: 1 at [/core/lib/phabricator/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:219]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000] arcanist(head=stable, ref.master=fad85844314b, ref.stable=a232e95a18f3), libcore(), phabricator(head=stable, ref.master=f844280d360c, ref.stable=3e10f69570c0), phutil(head=stable, ref.master=9128fb722671, ref.stable=d0ebdaf3ab70), services(head=stable, ref.master=b5cef1ac31ff, ref.stable=198eb67bd822)
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #0 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:181]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #1 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:131]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000] [2017-02-20 05:06:13] EXCEPTION: (InvalidArgumentException) Argument 1 passed to PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture() must be an instance of PhabricatorRepository, null given, called in /core/lib/phabricator/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php on line 221 and defined at [<phutil>/src/error/PhutilErrorHandler.php:200]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000] arcanist(head=stable, ref.master=fad85844314b, ref.stable=a232e95a18f3), libcore(), phabricator(head=stable, ref.master=f844280d360c, ref.stable=3e10f69570c0), phutil(head=stable, ref.master=9128fb722671, ref.stable=d0ebdaf3ab70), services(head=stable, ref.master=b5cef1ac31ff, ref.stable=198eb67bd822)
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #0 PhutilErrorHandler::handleError(integer, string, string, integer, array) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:433]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #1 PhabricatorRepositoryPullLocalDaemon::resolveUpdateFuture(NULL, ExecFuture, integer) called at [<phabricator>/src/applications/repository/daemon/PhabricatorRepositoryPullLocalDaemon.php:221]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #2 PhabricatorRepositoryPullLocalDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:181]
Daemon 1 STDE [Mon, 20 Feb 2017 05:06:13 +0000]   #3 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:131]
  • CPU on the host is pretty much all getting burned up anyway, although I think this may be coupled with memory pressure rather than an independent bottleneck.

Event Timeline

There's also a bunch of the usual free-tier shenanigans on this host -- free instances with 1 user importing multiple large repositories -- but that's to be expected.

@epriestley The import has finished, but some commits show as importing still?

My impression is that there is a lot of room for optimization in the RepositoryPullLocalDaemon. I've wanted to tackle that for a while because it causes a fair delay between when a commit goes into git and when it actually shows up in Diffusion.

I can't remember off the top of my head, but I'm pretty sure there's some low-hanging fruit in there.

Starting with repository update:

This repository uses pull requests, so it has about 3,200 refs and takes about 1.5s to list:

$ time git ls-remote https://github.com/concrete5/concrete5 | wc -l
3164
real	0m1.670s

In production, a git fetch origin '+refs/*:refs/*' --prune takes about 2.25s:

$ PHABRICATOR_INSTANCE=test-3uoifmrptebb time /core/lib/phabricator/bin/repository update --trace rC
...
>>> [29] <exec> $ git fetch origin '+refs/*:refs/*' --prune
<<< [29] <exec> 2,257,513 us

We also run git for-each-ref three times, at a cost of approximately 400ms each.
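
As a rough sketch of what a shared cache for these callers might look like (the RefListCache class and the working-copy path here are hypothetical, not the actual Phabricator API), the idea is to run git for-each-ref once per update and memoize the parsed result, so the second and third call sites pay nothing:

<?php
// Hypothetical sketch: memoize the output of `git for-each-ref` for the
// duration of a single repository update, so the three call sites pay the
// ~400ms subprocess cost only once.
final class RefListCache {
  private $refs = null;

  // Returns a map of refname => commit hash.
  public function getRefs($repository_path) {
    if ($this->refs === null) {
      $raw = shell_exec(
        'git -C '.escapeshellarg($repository_path).
        ' for-each-ref --format="%(refname) %(objectname)"');
      $this->refs = array();
      foreach (explode("\n", trim((string)$raw)) as $line) {
        if ($line === '') {
          continue;
        }
        list($name, $hash) = explode(' ', $line, 2);
        $this->refs[$name] = $hash;
      }
    }
    return $this->refs;
  }
}

// Usage: all call sites within one update share the same cache object.
$cache = new RefListCache();
$heads = $cache->getRefs('/var/repo/rC');   // illustrative path; runs git once
$again = $cache->getRefs('/var/repo/rC');   // served from memory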

Internally, we use a commit cache to track repository heads. The cache holds 2048 entries, but this repository has about 3,200 refs, so they do not all fit. Because the cache evicts in insertion order, like a ring buffer (FIFO), each entry is pushed out again before it is looked up, so every ref triggers a cache miss.

The size of each cache entry is approximately 40 bytes (this is 40 "PHP Bytes", so probably a lot more than 40 real bytes, but likely less than 10x more), so we can freely use a much larger cache size.
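
To make the eviction problem concrete, here is a toy model in plain PHP (not the real commit cache implementation): with FIFO eviction, a cache smaller than the working set degenerates to a 0% hit rate when the refs are scanned in order, while a cache larger than the ref count hits on every pass after the first.

<?php
// Toy FIFO cache: capacity smaller than the working set thrashes completely
// under a sequential scan, because each key is evicted before it is seen again.
function fifo_hit_rate($capacity, $working_set, $passes) {
  $cache = array();          // key => true, insertion-ordered
  $hits = 0;
  $lookups = 0;
  for ($p = 0; $p < $passes; $p++) {
    for ($key = 0; $key < $working_set; $key++) {
      $lookups++;
      if (isset($cache[$key])) {
        $hits++;
        continue;
      }
      if (count($cache) >= $capacity) {
        // Evict the oldest entry, like a ring buffer.
        reset($cache);
        unset($cache[key($cache)]);
      }
      $cache[$key] = true;
    }
  }
  return $hits / $lookups;
}

echo fifo_hit_rate(2048, 3200, 3), "\n";  // 0: every lookup misses
echo fifo_hit_rate(4096, 3200, 3), "\n";  // ~0.67: misses only on the first pass

This is also why capping bulk fills at the cache size matters: filling more entries than the cache can hold just evicts the earlier entries of the same fill, so the extra work is wasted.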

During import, we poll repositories at the maximum rate (once every 15 seconds). This repository is also active enough to get maximum polling after import. Each update takes approximately 4s, so roughly a quarter of every polling window goes to updating this one repository, even when nothing new has been pushed.

Here are some things we could do to improve this:

  • Use git ls-remote and compare the result against a locally cached copy of the ref list to detect whether we need to fetch at all (see the sketch after this list). In theory, a no-op git fetch should cost about the same as git ls-remote, but in practice it appears that it does not.
  • Identify callers to git for-each-ref and have them use a cache.
  • Increase the size of the commit cache.
  • When bulk-filling the commit cache, refuse to fill more entries than the size of the cache.
  • While a repository is importing, we could do a reduced-cost update which just checks for import progress but does not perform a fetch. It's unclear if this is worthwhile to pursue.
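
A minimal sketch of the first idea, assuming we persist a fingerprint of the last-seen ref list next to the working copy (the fingerprint file and its location are invented for this example; a real implementation would store this somewhere more durable):

<?php
// Sketch only: skip the expensive `git fetch` when `git ls-remote` shows the
// remote refs have not changed since the last update.
function update_repository($local_path, $remote_uri) {
  $fingerprint_file = $local_path.'/.ref-fingerprint';  // invented path

  // List the remote refs (~1.7s for this repository) and fingerprint the
  // result. Sort so the fingerprint does not depend on the order in which
  // the remote happens to return refs.
  $raw = shell_exec('git ls-remote '.escapeshellarg($remote_uri));
  $lines = explode("\n", trim((string)$raw));
  sort($lines);
  $fingerprint = sha1(implode("\n", $lines));

  $previous = @file_get_contents($fingerprint_file);
  if ($previous === $fingerprint) {
    // Nothing changed upstream; skip the ~2.25s fetch entirely.
    return false;
  }

  // Something changed: do the real fetch, then record the new fingerprint.
  shell_exec(
    'git -C '.escapeshellarg($local_path).
    " fetch origin '+refs/*:refs/*' --prune");
  file_put_contents($fingerprint_file, $fingerprint);
  return true;
}

When nothing has changed upstream, this trades the ~2.25s fetch for a single ~1.7s ls-remote.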

Of these, I believe only the cache changes are straightforward, and they are not an overwhelming fraction of the runtime.

@epriestley The import has finished, but some commits show as importing still?

I can't reproduce this; perhaps it resolved itself?

$ PHABRICATOR_INSTANCE=test-3uoifmrptebb /core/lib/phabricator/bin/repository importing R1 R2
No importing commits found.

Ah, okay. This is the relevant part:

This commit has been deleted in the repository: it is no longer reachable from any branch, tag, or ref.

We stop importing commits if we detect that they have become unreachable, which may strand commits in a partially imported state. The UI could be more clear about this.

The commit cache should be a fairly large performance win.

@epriestley All of the other commits with the message "This commit has been deleted in the repository: it is no longer reachable from any branch, tag, or ref." have imported, so I don't think the reachability code is working correctly. Until the last few weeks I wasn't seeing these "deleted" commits show up at all, but now I see them all the time.

It is expected that some deleted commits will import and others will not.

If a commit finishes importing before it is deleted, it stays imported. For example: push a branch, wait for the commits on it to import, then delete the branch.

If a commit is deleted before it finishes importing, the import is aborted. For example: push a branch, then delete it without waiting for the commits on it to import.

Can you explain in more detail why you think the code isn't working correctly? For example, what is the hash of a commit which you believe to be reachable but which you think is incorrectly marked as unreachable, or vice versa?

The reason I think something isn't working correctly: my dashboard lists recent commits, and since I updated my install a few weeks ago the list often looks like this:

pasted_file (470 KB)

Previously, I never saw these deleted commits show up, and they tend to be pretty useless or duplicates.

If you don't want to see unreachable commits, use a query which hides them:

Screen Shot 2017-02-25 at 8.31.55 PM.png (172×468 px, 32 KB)

That works. I'm curious when this changed, as I didn't need to filter until the last few weeks.

Likely somewhere in T9430 / T10978, which rewrote a big chunk of the audit code.

(That said, my expectation is that behaviors around unreachable commits have not changed -- D17279 added the "Unreachable" option to solve this problem.)

Cool, now I don't feel so bad since I know you only recently added the option 😄

Thanks!

While the git ls-remote change isn't really motivated as a performance improvement, it does seem to have reduced CPU usage by a measurable amount (deployed on the morning of 3/18), maybe 15%:

Screen Shot 2017-03-21 at 7.08.23 AM.png (162 KB)

There's no appreciable change in disk I/O or network I/O that I can see from the graphs.

While the git ls-remote change isn't really motivated as a performance improvement, it does seem to have reduced CPU usage by a measurable amount (deployed on the morning of 3/18), maybe 15%:

I am seeing a similar drop in CPU use on Wikimedia's Phabricator host; it looks closer to 10%, but still significant.