Initial import of Git repositories with many branches is grossly inefficient
Closed, Resolved, Public

Description

This hasn't come up too often, but a user with a Git repository that has ~1.5K branches and ~55K commits needed a very long time for initial discovery. The way the algorithm works, we do on the order of 1,500 * 55,000 commits' worth of work in this case before beginning inserts. In particular, the cache sits a level above this section of the code, so it doesn't help here.

Fixing this is potentially somewhat tricky, but I think we can use the working set like a cache. If that pans out, that should be an easy fix.
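To illustrate the cost, here is a minimal sketch in Python, not the actual Diffusion code. It assumes a toy history stored as a dict from commit to parents; the names make_repo, discover_naive, and discover_shared are hypothetical. It compares walking each branch head independently with walking all heads against one shared working set.

```python
def make_repo(trunk_len, n_branches):
    """Toy history: a linear trunk plus branches that each add one commit on the tip."""
    parents = {"c0": []}
    for i in range(1, trunk_len):
        parents[f"c{i}"] = [f"c{i - 1}"]
    tip = f"c{trunk_len - 1}"
    heads = [tip]
    for b in range(n_branches):
        parents[f"b{b}"] = [tip]
        heads.append(f"b{b}")
    return parents, heads


def discover_naive(parents, heads):
    """Walk every head's full history on its own; return commits visited."""
    visits = 0
    for head in heads:
        seen = set()  # forgotten after each head, so shared history is rewalked
        stack = [head]
        while stack:
            commit = stack.pop()
            if commit in seen:
                continue
            seen.add(commit)
            visits += 1
            stack.extend(parents[commit])
    return visits


def discover_shared(parents, heads):
    """Same walk, but one working set is shared by all heads (assumed fix)."""
    seen = set()
    visits = 0
    for head in heads:
        stack = [head]
        while stack:
            commit = stack.pop()
            if commit in seen:
                continue
            seen.add(commit)
            visits += 1
            stack.extend(parents[commit])
    return visits
```

With make_repo(100, 10), the naive walk visits 1110 commits and the shared walk visits 110, which is the same gap (branches times commits versus roughly commits) described above, at a smaller scale.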

A workaround is to set "Track Only" to one branch, discover, fix "Track Only", then start the daemons. This discovers one branch first, then the rest can hit the cache.

Event Timeline

epriestley raised the priority of this task from to Normal.
epriestley updated the task description.
epriestley added a project: Diffusion.
epriestley added a subscriber: epriestley.

Here's the workaround in detail:

  • Stop the daemons with phd stop.
  • Diffusion > Repository > Edit Repository > Branches. Set "Track Only" to "master", or some other main branch which contains most of the commits in the repository.
  • Run bin/repository discover X, where "X" is the callsign of the repository. You can add --trace and --verbose for more details.
  • When that exits (it should take 5-10 minutes for 30K commits), go clear "Track Only". (Optionally, you can skip this and the next couple of steps if you don't care about tracking all your branches.)
  • Run bin/repository discover X again. It should finish quickly this time (a few minutes), since it's just discovering new commits not on "master".
  • Restart the daemons, and it should be clear sailing after that.
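Why the two-pass workaround helps can be sketched with a toy model in Python. This is an illustration under assumptions, not the real discovery code: it assumes commits only become "known" (cached) once a discovery run finishes, and that each head is otherwise walked independently. The names discover_run and workaround_demo are hypothetical.

```python
def discover_run(parents, heads, known):
    """One discovery run. Walks stop at commits known from earlier runs;
    newly found commits are only added to `known` when the run ends."""
    visits = 0
    found = set()
    for head in heads:
        seen = set()
        stack = [head]
        while stack:
            commit = stack.pop()
            if commit in seen or commit in known:
                continue
            seen.add(commit)
            visits += 1
            stack.extend(parents[commit])
        found |= seen
    known |= found
    return visits


def workaround_demo(trunk_len=100, n_branches=10):
    """Compare a single run over all heads with master-first, then all heads."""
    parents = {"c0": []}
    for i in range(1, trunk_len):
        parents[f"c{i}"] = [f"c{i - 1}"]
    master = f"c{trunk_len - 1}"
    heads = [master]
    for b in range(n_branches):
        parents[f"b{b}"] = [master]
        heads.append(f"b{b}")

    single_run = discover_run(parents, heads, set())

    known = set()
    two_step = discover_run(parents, [master], known)  # "Track Only" = master
    two_step += discover_run(parents, heads, known)    # "Track Only" cleared
    return single_run, two_step
```

In this model workaround_demo() returns (1110, 110): the second pass only touches the commits that are not already on the main branch.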

This is likely easy for us to fix in the upstream, but we probably won't get to it for a couple of days.

chang888 claimed this task.

Works! Thanks so much for your help!

Great, glad to hear it! (I'm going to keep this open until we can fix the root issue so the workaround isn't necessary.)

I can confirm this issue. I have about 3500 commits and 170 branches, and was running out of memory on a 1GB VM.

I used a slightly different workaround, which I'm guessing accomplishes about the same thing, but which you can do entirely by pointing and grunting at your browser.

  • When creating the repository, say "don't create it yet -- edit some settings".
  • Set to track only master.
  • Then activate it.
  • Wait for the import to finish.
  • Set to track all branches.

That's roughly equivalent, but will incorrectly fire side effects for commits not on master (normally, they are suppressed in importing repositories). In most cases, that's probably a no-op, but if you were importing a repository full of commits with messages like Fixes T123, all your tasks would get closed, and if someone had written a Herald rule to be notified of every commit they'd get a barrage of email. The more convoluted workaround just avoids these cases.
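The difference can be pictured with a small Python sketch. This is purely illustrative and assumes a simplified rule: a commit message containing "Fixes T123" closes that task unless the repository is still importing. The function tasks_to_close and its importing flag are hypothetical, not the actual Phabricator API.

```python
import re


def tasks_to_close(message, importing):
    """Return task names a commit message would close; nothing while importing."""
    if importing:
        # Side effects are suppressed for importing repositories.
        return []
    return ["T" + number for number in re.findall(r"\bFixes T(\d+)", message)]
```

With the browser-only workaround, the repository has already finished importing by the time the non-master commits are discovered, so they take the importing=False path and their side effects fire.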