Page MenuHomePhabricator

`git branch --contains X` has cost in the realm of O(commits * branches)
Open, LowPublic

Description

In one of the repos I'm working on now we have the setting

"base": "git:branch-unique(origin/master), git:merge-base(origin/master), arc:prompt"

in our .arcconfig. I've discovered the hard way that git:branch-unique is unbearably slow - it runs one command like git -c column.ui=never -c color.ui=never branch --contains 26d83a1aa5cb2ce166ede5572736bd37b per each diff on my current branch when I arc diff, and each of these commands takes 2.2-2.5 seconds for me, which adds up quickly - on a branch with 7 commits, I recently saw it take about 16 seconds on this phase of arc diff (the whole commmand took just over 60 seconds - the other big problem we have is slow linters). It might be worth mentioning that this repo has ~200 branches, most of which are irrelevant pretty much all the time, but might be part of the reason this git branch command is so slow.

We are likely to change our default to get:merge-base instead of git:branch-unique, partly due to the performance problem - and partly because git:branch-unique has some pretty big rough edges for the workflows it intends to enable.

Event Timeline

dgoldstein updated the task description. (Show Details)
dgoldstein added projects: Arcanist, Restricted Project.
dgoldstein added a subscriber: dgoldstein.

I can't reproduce this. Here's what I did:

My working copy has 256 branches:

$ git branch | wc -l
     256

I ran the command you report as slow on several commits -- HEAD, one a few commits ago, one a significant number of commits ago:

$ time git -c column.ui=never -c color.ui=never branch --contains 8d7c4ba620990a9720b79638e4e22003a2377f5e
  ee34
* ee35

real	0m0.092s
user	0m0.081s
sys	0m0.010s
$ time git -c column.ui=never -c color.ui=never branch --contains b82863d972a512da2cee1fc3cada5edb7d7efc3e
  ee31
  ee34
* ee35
  master
  phameskin1
  xdoc1

real	0m0.097s
user	0m0.086s
sys	0m0.010s
$ time git -c column.ui=never -c color.ui=never branch --contains 4c43704
  alloclease
  arcpatch_2
  card1
  circle1
  circle2
  circle3
  circle4hack
  customx
  daemonoutofdate
  ee31
  ee34
* ee35
  eengine7
  ffmerge
  huri1
  intwhen
  ldocs1
  ldocs2
  master
  nextstep1
  passphrase1
  phame1
  phameskin1
  router1
  showland1
  spinall
  stable
  uemail
  willedit2
  workmerge2
  workmerge3
  workmerge4
  xdoc1

real	0m0.090s
user	0m0.078s
sys	0m0.010s

Each command only took about 100ms to execute.

this repros reliably for me - 100% of the time.

Some stats:
repo has 115 branches, majority (>100) are remote
repo has 175415 commits (git rev-list --all --count)
total repo size is about 1.8GB

it's very possible that this is just a "very big repo" problem.

Here's the first part of a trace of an arc diff for me (includes most of
the pre-commit-message work):

$ arc diff --trace
libphutil loaded from '/usr/local/arcanist/libphutil/src'.
arcanist loaded from '/usr/local/arcanist/arcanist/src'.
Config: Reading user configuration file "/home/dgoldstein/.arcrc"...
Config: Did not find system configuration at "/etc/arcconfig".
Working Copy: Reading .arcconfig from "/srv/server/.arcconfig".
Working Copy: Path "/srv/server/vmsetup/python/lib/dbx/vmsetup" is part of
`git` working copy "/srv/server".
Working Copy: Project root is at "/srv/server".
Config: Reading local configuration file "/srv/server/.git/arc/config"...
Loading phutil library from '/srv/server/.arc_lib'...
>>> [0] <conduit> conduit.connect() <bytes = 457>
>>> [1] <http> https://tails.corp.dropbox.com/api/conduit.connect
<<< [1] <http> 447,553 us
<<< [0] <conduit> 449,465 us
>>> [2] <exec> $ git diff --no-ext-diff --no-textconv --raw 'HEAD' --
>>> [3] <exec> $ git ls-files --others --exclude-standard
<<< [2] <exec> 409,012 us
<<< [3] <exec> 369,168 us
>>> [4] <exec> $ git diff-files --name-only
<<< [4] <exec> 134,475 us
>>> [5] <event> diff.didCollectChanges <listeners = 0>
<<< [5] <event> 233 us
>>> [6] <exec> $ git rev-parse --verify HEAD^
<<< [6] <exec> 6,083 us
>>> [7] <exec> $ git merge-base 'origin/master' HEAD
<<< [7] <exec> 25,864 us
>>> [8] <exec> $ git log --format=%H
'1439c215ce3a4f639c906c3b629bbc831ef46f6f'..HEAD --
<<< [8] <exec> 24,923 us
>>> [9] <exec> $ git for-each-ref --format='%(refname)' refs/heads
<<< [9] <exec> 11,481 us
>>> [10] <exec> $ git symbolic-ref --quiet HEAD
<<< [10] <exec> 5,693 us
>>> [11] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '42ae6151ebf14ee5f007d6d9b29d9cba069dfca9'
<<< [11] <exec> 2,108,824 us
>>> [12] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains 'be4fb532ef8ea09f10011fc29f479d8250439ad1'
<<< [12] <exec> 2,040,153 us
>>> [13] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains 'face4389a6046e7127978ec568f189286e330833'
<<< [13] <exec> 2,066,330 us
>>> [14] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains 'fb5c994b87f9d6c6c4973684ea251e8c85f40f94'
<<< [14] <exec> 2,100,448 us
>>> [15] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '6435d85a6a1f160422e5e37ca4a5791b8f8f2e71'
<<< [15] <exec> 2,209,376 us
>>> [16] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains 'fe270bfe18b0451880e059490de99e2732d05c8a'
<<< [16] <exec> 2,147,650 us
>>> [17] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '09fdfda41eae76eea67be25d551fc9c0f3615e42'
<<< [17] <exec> 2,214,983 us
>>> [18] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '78b0dd74ea359557701ddaeeb556be6458c24435'
<<< [18] <exec> 2,209,153 us
>>> [19] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '39eeb54ca6becebb9e9895fb23dec510da169544'
<<< [19] <exec> 2,160,932 us
>>> [20] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '4e5f875f8b108cb08803f5a13de9ff379eb7703c'
<<< [20] <exec> 2,217,674 us
>>> [21] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '009276d9982f266a204eed7a8c9bf0b9c1d78a90'
<<< [21] <exec> 2,190,696 us
>>> [22] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '1eb050dff95372999c17b06d6c4caa63ce3f1659'
<<< [22] <exec> 2,087,702 us
>>> [23] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains 'd7b001544a6cca17f0d72632d13a74c2b928b7eb'
<<< [23] <exec> 2,093,991 us
>>> [24] <exec> $ git -c column.ui=never -c color.ui=never branch
--contains '1439c215ce3a4f639c906c3b629bbc831ef46f6f'
<<< [24] <exec> 2,300,096 us
>>> [25] <exec> $ git rev-parse 'HEAD'
<<< [25] <exec> 16,204 us
>>> [26] <exec> $ git log --first-parent --format=medium
'1439c215ce3a4f639c906c3b629bbc831ef46f6f'..'42ae6151ebf14ee5f007d6d9b29d9cba069dfca9'
<<< [26] <exec> 12,037 us

If you run those commands on their own, they're also slow? Even just this?

git branch --contains 1eb050dff95372999c17b06d6c4caa63ce3f1659

If git takes 2-2.5 seconds to execute this in a repository with only 100 branches and 200K commits, that seems like a performance problem in git itself.

We could conceivably use this approach instead:

  • Run git branch --contains <head> once to get branches the current commit is on.
  • Run git log -n1000 HEAD.
  • Run git log -n1000 --all --not <list of all the branches HEAD is on>.
  • Choose the first commit in the first list which is also in the second list. If this fails, fall back to the slow method.

However, that's potentially worse if only have 1-2 commits on your branch, and still costs us one --contains.

Is git log -n1000 --format=%H --all --not <some feature branch name> fast in your repository?

It's also possible that we can run git log --all --format='%H %P' and git for-each-ref and implement --contains ourselves in less than 2 seconds in PHP, although this is hugely ridiculous and clearly a problem with Git if true.

(I think that --not construct is wrong, and we'd just have to enumerate every branch explicitly.)

$ time git branch --contains 1eb050dff95372999c17b06d6c4caa63ce3f1659
* langpack_as_js_module

real    0m2.183s
user    0m2.063s
sys     0m0.114s

so yeah, git is slow here. But my main problem isn't that... but that we
call one git command per commit, so the slowness adds up really fast.

git log, on the other hand, appears to be relatively fast:

$ time (git log -n1000 HEAD | wc -l )
17385

real    0m0.100s
user    0m0.025s
sys     0m0.094s

pretty big rough edges for the workflows it intends to enable

What are these, specifically? Is this entire task moot?


I believe we can implement a fast git branch --contains like this:

For contains(X), essentially run flood-fill on git log:

  • If we haven't yet, parse git for-each-ref to find all branch heads, a map of HEADS = <hash, branch>.
  • If we haven't yet, open a subprocess to stream git log --all --format='%H %P' with LinesOfALargeExecFuture.
  • Start with an empty BRANCHES = <hash, branches> map and an empty CHILDREN = <hash, children> map.
  • If we haven't seen X yet, process log lines until we see X:
    • For each line, parse the hash (H) and parents (P).
    • For each parent, store H in CHILDREN.
    • Find all children of H by looking in CHILDREN. Then find all branches those children appear on by looking up each child in BRANCHES.
    • Find new branches H appears on by looking H up in HEADS.
    • Store the union of children-branches and head-branches in BRANCHES.
    • If we just processed X, return.
    • If we consumed the entire stream, fail.

The maximum runtime is O(N + B), N = number of commits, B = number of branches.

The amortized runtime for multiple lookups is O(1).

Memory use is a concern. If we have 512 32-byte branch names and 1M commits and each commit appears on every branch, naively building this cache will require >16GB of memory. However, most of this data is duplicate lists of branch names and PHP should just share it without any special work.

The worst case for this is when X is a child of the first commit in the repository and on no other branches, and we make only one --contains call. In this case, we must do about as much work as git does, but do the work in PHP.

We could deal with the degenerate case and hard-cap memory use by bailing out and falling back to the slow algorithm after examining some fixed number of commits.

We already use a somewhat-similar approach in Phabricator for processing git log and hg log (PhabricatorGitGraphStream, PhabricatorMercurialGraphStream). This was originally developed for Mercurial because Mercurial performance has huge per-command overhead, but was successful enough to bring to Git later.

Before moving forward, we should do work to unify PhabricatorRepository and the various RepositoryAPI classes so that this code can move out of phabricator/ and be shared between the client and server. A possible approach is:

  • Introduce a new LowLevelWorkingCopy class in libphutil/, maybe under a better name?
  • Implement the getLocalCommandFuture(), etc., commands there.
  • Have RepositoryAPI and PhabricatorRepository implement execution and futures via composition with one of these new objects.
  • Move the GraphStream classes (and maybe other LowLevel query classes) up, either to Arcanist or all the way to libphutil.

Working to make repository interaction more consistent would be helpful in other areas, as well. We currently have some level of code duplication and unnecessary diversity across repository interactions in Phabricator and Arcanist. The Conduit split (see, e.g., T4209) has helped in moving this toward a more reasonably well-separated state, but it could still be much cleaner than it is.

epriestley renamed this task from git:branch-unique() is absurdly slow to `git branch --contains X` has cost in the realm of O(commits * branches).Dec 4 2015, 2:03 PM

pretty big rough edges for the workflows it intends to enable

What are these, specifically? Is this entire task moot?

Maybe "intends" is assuming too much about what you meant :) What we're thinking of is that git:branch-unique lets you chain branches and code reviews, like this:

git checkout -b feature
...
git commit
arc diff
git checkout -b child-feature
...
git commit
arc diff

Using git:branch-unique, the second "arc diff" will properly diff against "feature".

This falls apart when, in response to feedback from the first code review, the user goes back and continue working on the parent branch:

git checkout feature
...
git commit
arc diff

Now this "arc diff" will create a new code review, using as its base the first commit shared with "child-feature".

That's the workflow I use, but I --amend "feature" when updating it.

If the graph looks like this:

        /-- B  <-- child-feature
... -- A
        \-- C  <-- feature

...where A is feature, B is child-feature, and C is the update to feature, I don't think there's any context-free rule we can use to unambiguously decide whether "A" is part of "feature" or "child-feature" or neither when arc diff is run at C. Particularly, if you use merge-base instead, both diffs will incorrectly include A, and the original diff on child-feature will also include A.

T3875 also discusses an arc cascade, which would potentially rebase any "leaves" like "child-feature" on top of C.

There are possible not-context-free solutions where we examine what you've already done, but they'd make the behavior of arc diff even more complex, and I think we currently have very little product design space to introduce complexity -- users are already find arc diff befuddling far more often than I want them too (see T6072).

I think we're going to work around this by changing our base config value to git:upstream, git:merge-base(origin/master) and decreeing that if you want to do this sort of dependent feature thing, you have to explicitly set child-feature's upstream to feature. arc feature does that automatically, so we might just encourage people to start using that.

That makes sense, and should work well if you're able to convince developers to use arc feature (or the underlying branch flags, or I think branch.autosetupmerge in Git config?) consistently. You might also consider leaving git:branch-unique(...) in the ruleset after arc:upstream -- it should behave well in more situations than git:merge-base(...) will, I think, and it might be preferable if the penalty for not using --track is "things work, but are slow" instead of "things don't work".

I'm moderately confident we can follow the plan outlined above (or some similar plan) to implement an application-level version of git branch --contains that's faster than the one git implements. However, this is difficult to prioritize:

  • it needs infrastructure reorganization and is likely a large amount of work (at least a day or two);
  • it has the greatest impact in a somewhat-unusual case (deep local commit stacks in huge repositories with many branches);
  • it is really git's fault if we can beat it in PHP;
  • we should ideally also attempt to upstream a patch against git itself if we can develop a better approach than the one it currently uses, which will take more time; and
  • it sounds like this is at least mostly moot anyway, so improving this might directly benefit zero installs.

I found this thread (originally from the Git kernel list, I think?) discussing a similar problem (git tag --contains slowness) from 2010. It has a patch, but the patch uses a commit-time heuristic, introduces new "clockskew" configuration, and seems flimsy to me, at least as a naive outsider with little knowledge of Git internals (it looks like this heuristic is already in use in git name-rev, though, or was in 2010).

http://git.661346.n2.nabble.com/Why-is-quot-git-tag-contains-quot-so-slow-td5241514.html

eadler added a project: Restricted Project.Apr 10 2016, 6:20 AM
eadler added a subscriber: eadler.
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Apr 17 2016, 6:24 PM

We've run into this same problem, but in the phabricator daemons. They also, apparently, run commands like:

git branch --verbose --no-abbrev --contains <sha> --

As a side note: we resolved this problem by deleting a bunch of branches. But there won't actually be a speedup for phabricator until it does a pull that deletes the old branches, via git fetch --all --prune. Do you know offhand if the daemons already use --prune when pulling, and if not if they can be made to do so?

P2143 has some graph reachability code that I don't actually need yet, but may by the time I get here.