Initial import of Git repositories with many branches is grossly inefficient
Closed, Resolved (Public)

Assigned To

Authored By

	epriestley
	Feb 11 2014, 10:13 PM

Description

This hasn't come up often, but a user with a Git repository that has ~1.5K branches and ~55K commits saw initial discovery take a very long time. The way the algorithm works, we do O(branches * commits) work before beginning inserts: about 1,500 * 55,000 = 82.5 million commits' worth of work in this case. In particular, the cache sits a level above this section of the code, so it doesn't help here.

Fixing this is potentially somewhat tricky, but I think we can use the working set like a cache. If that pans out, that should be an easy fix.
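
To make the cost difference concrete, here is a minimal Python sketch. It is not Phabricator's actual discovery code (which is PHP); the graph layout and every function name below are hypothetical, chosen only for illustration. It compares walking each branch's ancestry from scratch against sharing one "seen" set across all branches, which is the rough idea behind treating the working set like a cache.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory commit graph: commit id -> list of parent ids.
Graph = Dict[str, List[str]]


def walk(graph: Graph, head: str, seen: Set[str]) -> int:
    """Walk ancestry from `head`, skipping commits already in `seen`.

    Returns the number of commits newly visited by this call.
    """
    visited = 0
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        visited += 1
        stack.extend(graph[commit])
    return visited


def discover_per_branch(graph: Graph, heads: List[str]) -> int:
    """Naive approach: every branch starts with an empty `seen` set,
    so shared history is re-walked once per branch."""
    return sum(walk(graph, head, set()) for head in heads)


def discover_shared(graph: Graph, heads: List[str]) -> int:
    """Shared approach: one `seen` set for all branches, so each
    commit is visited at most once in total."""
    seen: Set[str] = set()
    return sum(walk(graph, head, seen) for head in heads)


def linear_history_with_branches(commits: int, branches: int) -> Tuple[Graph, List[str]]:
    """Build a toy repository: one linear trunk of `commits` commits,
    plus `branches` branch heads that each add one commit on the tip."""
    graph: Graph = {"c0": []}
    for i in range(1, commits):
        graph[f"c{i}"] = [f"c{i - 1}"]
    tip = f"c{commits - 1}"
    heads = []
    for b in range(branches):
        graph[f"b{b}"] = [tip]
        heads.append(f"b{b}")
    return graph, heads


if __name__ == "__main__":
    graph, heads = linear_history_with_branches(commits=100, branches=10)
    print("per-branch visits:", discover_per_branch(graph, heads))
    print("shared-set visits:", discover_shared(graph, heads))
```

On this toy graph (100 trunk commits, 10 branches), the per-branch walk visits 1,010 commits while the shared-set walk visits 110, one per commit in the graph. The gap grows with the number of branches that share history.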

A workaround is to set "Track Only" to one branch, discover, fix "Track Only", then start the daemons. This discovers one branch first, then the rest can hit the cache.

Revisions and Commits

rP Phabricator
	D8374	rPd86bb086cac3 Reduce initial discovery from O(branches * commits) to O(commits)

Related Objects

Status	Assigned	Task
Open	epriestley	T13290 Clean up remaining "Autoclose" behaviors related to "One Revision, Many Commits"
Open	None	T4453 When multiple commits correspond to a single revision, Differential updates to show one of them arbitrarily
Resolved	epriestley	T2683 Improve the performance of Diffusion browse views
Open	epriestley	T4455 Cache parent relationships between commits
Resolved	chang888	T4414 Initial import of Git repositories with many branches is grossly inefficient

Event Timeline

epriestley created this task. Feb 11 2014, 10:13 PM

epriestley raised the priority of this task from to Normal.

epriestley updated the task description.

epriestley added a project: Diffusion.

epriestley added a subscriber: epriestley.

Here's the workaround in detail:

1. Stop the daemons with phd stop.
2. Go to Diffusion > Repository > Edit Repository > Branches. Set "Track Only" to "master", or some other main branch which contains most of the commits in the repository.
3. Run bin/repository discover X, where "X" is the callsign of the repository. You can add --trace and --verbose for more details.
4. When that exits (it should take 5-10 minutes for 30K commits), go clear "Track Only". (Optionally, you can skip this and the next couple of steps if you don't care about tracking all your branches.)
5. Run bin/repository discover X again. It should finish quickly this time (a few minutes), since it's only discovering new commits that are not on "master".
6. Restart the daemons, and it should be clear sailing after that.

This is likely easy for us to fix in the upstream, but we probably won't get to it for a couple of days.

Works! Thanks so much for your help!

Great, glad to hear it! (I'm going to keep this open until we can fix the root issue so the workaround isn't necessary.)

epriestley edited this Maniphest Task. Feb 19 2014, 4:17 AM

bitglue added a subscriber: bitglue. Feb 27 2014, 10:34 PM

I can confirm this issue. I have about 3500 commits and 170 branches, and was running out of memory on a 1GB VM.

I used a slightly different workaround, which I'm guessing accomplishes about the same thing, but you can do it entirely by pointing and grunting at your browser.

When creating the repository, say "don't create it yet -- edit some settings".
Set to track only master.
Then activate it.
Wait for the import to finish.
Set to track all branches.

That's roughly equivalent, but it will incorrectly fire side effects for commits not on master (normally, they are suppressed in importing repositories). In most cases, that's probably a no-op, but if you were importing a repository full of commits with messages like "Fixes T123", all your tasks would get closed, and if someone had written a Herald rule to be notified of every commit, they'd get a barrage of email. The more convoluted workaround just avoids these cases.

epriestley edited this Maniphest Task. Feb 28 2014, 8:23 PM

epriestley edited this Maniphest Task. Feb 28 2014, 9:02 PM

Closed by commit rPd86bb086cac3.
