
Add ability to select ref mask used to fetch refs for git repos
Closed, Wontfix · Public

Description

We have a use case where one of our repositories is huge (multiple GB) with tens of thousands of remote branches. Doing a full git fetch origin %s --prune +refs/heads/*:refs/heads/* takes an unreasonable amount of time, and puts very high load on the servers.
We'd like to be able to specify the mask to be just refs/heads/master:refs/heads/master or some other glob.

This somewhat relates to the feature to 'import only' specific branches, but is focused on not fetching unused branches rather than merely not importing them into the database.

We'd be okay with having this fully supplant the existing 'import only' option (though I can't speak for other use cases).
At the moment we have forked Phabricator, but this is not a great long-term solution.
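As a concrete sketch of what we're after (all paths and branch names below are invented for illustration, not from our actual setup), the difference is between mirroring every head and mirroring a single named ref:

```shell
#!/bin/sh
# Sketch only: fetch a single ref into a bare mirror instead of all
# heads. The throwaway origin and branch names are hypothetical.
set -e
tmp=$(mktemp -d)

# A tiny origin with master plus two extra branches.
git init -q "$tmp/origin"
cd "$tmp/origin"
git checkout -q -b master
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m root
git branch feature-1
git branch feature-2

# A bare mirror that fetches only master, not '+refs/heads/*'.
git init -q --bare "$tmp/mirror.git"
cd "$tmp/mirror.git"
git fetch -q "$tmp/origin" 'refs/heads/master:refs/heads/master'

# Only the one requested ref exists locally.
git for-each-ref refs/heads | wc -l
```

On our real repository the narrowed refspec avoids transferring and comparing tens of thousands of refs on every poll.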

Event Timeline

eadler updated the task description.
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. Feb 17 2016, 1:33 AM
eadler renamed this task from "Add ability toSelect ref mask used to fetch refs for git repos" to "Add ability to select ref mask used to fetch refs for git repos". Feb 17 2016, 6:08 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board. Feb 18 2016, 6:32 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.

Before pursuing this, I'd like to develop a better understanding of the root problem. Particularly:

Doing a full git fetch origin %s --prune +refs/heads/*:refs/heads/* takes an unreasonable amount of time, and puts very high load on the servers.

Conceptually, it seems like this operation should not need to be expensive. I think git should only need to generate about 10K hashes, send them over the wire, then compare them to a similar list on the server (or vice-versa, I don't actually know which side of the connection gets to do the comparison). I wouldn't expect this operation to be exceptionally slow or resource-hungry. The heaviest bit should be the ~1MB blob-of-refs transfer, which shouldn't cost much within a local network.

Do you have data about how slow/resource intensive git fetch is? I can generate a synthetic repository with 10K refs, of course, but it would be good to at least have some baseline numbers of the performance you're experiencing so I can make sure my test case is similar.
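For reference, a synthetic repository like that can be generated cheaply by pointing many refs at a single commit via git update-ref --stdin; this is just a throwaway sketch (names are arbitrary), not necessarily representative of your repository's shape:

```shell
#!/bin/sh
# Sketch: build a repository with 10,000 branches, all pointing at one
# commit, as a baseline for timing fetches. All names are arbitrary.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/10kref"
cd "$tmp/10kref"
git checkout -q -b master
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m root
tip=$(git rev-parse HEAD)

# Batch ref creation through one git process instead of 10K invocations.
i=0
while [ "$i" -lt 10000 ]; do
  echo "create refs/heads/branch-$i $tip"
  i=$((i + 1))
done | git update-ref --stdin

git for-each-ref refs/heads | wc -l
```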

If there's no technical reason git has to be slow, we may just be able to fix git.

Of course, I'm not very familiar with the git internals, and there may be real technical difficulty here that just isn't apparent to me. But I'd like to develop a working understanding of why this operation must be inherently slow first. "Has anything changed?" is obviously a computationally trivial question to ask in theory, so it seems like it should be possible to make it cheap in practice.


The other part of the root problem is: why do you have 10K refs? Can you just archive 9,800 of them to avoid this problem, particularly if you only care about master?

For example, I could imagine that perhaps for several years, hundreds of engineers were encouraged to push whatever they wanted to branches like epriestley-tmp1-stuff-patch-inprogress, and that process was stopped a couple of years ago since it was getting out of control, but you never cleaned up the branches. In a scenario like this, maybe you could just move all those refs to some read-only archive and forget about them. But maybe these refs are a legitimate part of the permanent structure and history of the repository?


Assuming we can't do anything about git and you can't do anything about the refs, do you want to track a different set of branches than you want to import, or is "Track Only" already effectively equivalent to this proposed "Ref Masks" feature, other than the fact that it doesn't affect which refs we fetch?

If we figure out what to fetch by looking at "Track Only", we get into some trouble with regex(...). Can you express your entire ref mask without regular expressions in the cases you care about? That would let us use a rule like "only fetch listed refs, unless regex() is used -- in that case, fetch everything".

If you do need globs, we could perhaps add glob(...) pretty easily.
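At the fetch level, a glob(...) rule would presumably translate into glob-style refspecs, which git already supports. A sketch with an invented branch layout (release/*, feature-x are examples, not from this task):

```shell
#!/bin/sh
# Sketch: a glob refspec fetches only the matching subset of heads.
# The branch names here are invented for illustration.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/origin"
cd "$tmp/origin"
git checkout -q -b master
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m root
git branch release/1
git branch release/2
git branch feature-x

# Mirror only refs matching the glob; master and feature-x are skipped.
git init -q --bare "$tmp/mirror.git"
cd "$tmp/mirror.git"
git fetch -q "$tmp/origin" '+refs/heads/release/*:refs/heads/release/*'

# Exactly the two release branches were fetched.
git for-each-ref refs/heads | wc -l
```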


There's also some overlap with T6878. I'd like to start fetching more refs, including tags, so we can import commits which are reachable from tags but not from branch heads. Does this make things more complicated, or is the number of non-branch refs in your repositories modest?


(This is also straightforward and safe to patch locally in PhabricatorRepositoryPullEngine as a stopgap for a longer-term "fix git" or "delete branches, but 6 months from now" plan, if we end up on one of those approaches.)

You spoke to the "Track Only" stuff a little bit in the description -- for complete clarity, we have two separate options that are only sometimes available:

  • "Import Only" only works in Subversion, and takes a path within the repository like path/to/something/. It's for importing a subdirectory of some monolithic SVN repository.
  • "Track Only" only works in Git (and, uh, Mercurial? I think?), and is the ref-based (well, branch-based, currently) version of this, for importing some subset of refs.

But I think "Track Only" can reasonably specify this, except that if you use any regex() rules we would need to fetch everything, so we should see if there's a different way we can express those if you need them.
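Mechanically, expanding a regex-free "Track Only" list into explicit fetch refspecs is straightforward; this is only a sketch of the translation (the branch list is hypothetical), not actual Phabricator code:

```shell
#!/bin/sh
# Sketch: expand a plain (regex-free) "Track Only" list into explicit
# fetch refspecs. The branch names are hypothetical.
track_only="master stable"

refspecs=""
for b in $track_only; do
  refspecs="$refspecs +refs/heads/$b:refs/heads/$b"
done

# The command that would replace the fetch-everything default.
echo "git fetch origin --prune$refspecs"
```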

If there's no technical reason git has to be slow, we may just be able to fix git.

Commit discovery requires taking the graph intersection, where every ref is a node in the graph. This is implemented by taking the complete set of server refs and the complete set of local refs, then finding out where they differ and which objects are missing. We have had a team dedicated to this for a year, and they have been trying to fix the issues we encountered. This has resulted in substantially forking git's transfer mechanism in ways that are not entirely supported by upstream git.

A quote from a team member:

because the kind of state synchronization that git does is mental
it non-linear w/ the number of refs
it's not exponential
but it's bad

To be very concrete this task is about obviating the need for this patch - with the possibility down the line of increasing the set of branches we care about:

commit 87c55535716df6c1f53a93da2dc9904a2ea3e544
Author: Eitan Adler <eax@twitter.com>
Date:   Thu Feb 4 10:42:20 2016 -0800

    For SOURCE only fetch the 'master' branch. This is hard coded, but we should ask upstream for the ability to set this refspec
    
    Reviewers: scode
    
    Differential Revision: https://phabricator.twitter.biz/D102

diff --git a/src/applications/repository/engine/PhabricatorRepositoryPullEngine.php b/src/applications/repository/engine/PhabricatorRepositoryPullEngine.php
index 30ab234..77295bb 100644
--- a/src/applications/repository/engine/PhabricatorRepositoryPullEngine.php
+++ b/src/applications/repository/engine/PhabricatorRepositoryPullEngine.php
@@ -312,9 +312,16 @@ private function executeGitUpdate() {
       // This is a local command, but needs credentials.
       if ($repository->isWorkingCopyBare()) {
         // For bare working copies, we need this magic incantation.
-        $future = $repository->getRemoteCommandFuture(
-          'fetch origin %s --prune',
-          '+refs/heads/*:refs/heads/*');
+        // XXX: twitter workaround for lack of refspec repo property
+        if ($repository->getCallsign() == "SOURCE")  {
+          $future = $repository->getRemoteCommandFuture(
+            'fetch origin %s',
+            'refs/heads/master:refs/heads/master');
+        } else {
+          $future = $repository->getRemoteCommandFuture(
+            'fetch origin %s --prune',
+            '+refs/heads/*:refs/heads/*');
+        }
       } else {
         $future = $repository->getRemoteCommandFuture(
           'fetch --all --prune');

We don't rely on regex() for import only on the repository in question. We do use it for other repos, but we aren't concerned about them due to their relatively small size.

The other part of the root problem is: why do you have 10K refs? Can you just archive 9,800 of them to avoid this problem, particularly if you only care about master?

There are technical and social reasons for this that I don't feel comfortable fully explaining in a public forum. Needless to say, if this were feasible, we wouldn't have invested multiple engineers' salaries into fixing git.

I also can't immediately reproduce the root issue. Here's what I did:

First, I made a repository with 10,000 branches.

epriestley@orbital ~/dev/scratch/10krefbare.git $ git ls-remote | wc -l
From ssh://local@localvault.phacility.com/diffusion/TESTY/10krefbare
   10075

Then I compared the git fetch time for all heads, vs the git fetch time for master only. Here's all heads:

epriestley@orbital ~/dev/scratch/10krefbare.git $ time git fetch origin '+refs/heads/*:refs/heads/*' --prune

real	0m1.260s
user	0m0.309s
sys	0m0.059s

Here's just master:

epriestley@orbital ~/dev/scratch/10krefbare.git $ time git fetch origin '+refs/heads/master:refs/heads/master' --prune

real	0m1.112s
user	0m0.178s
sys	0m0.028s

It doesn't look like there's a substantive difference. How can I change this setup to reproduce the issue you've observed, and particularly to create a situation where fetching a subset of heads has meaningfully better performance than fetching all heads?

Hi. I work with @eadler. I didn't do the initial analysis, but I have some familiarity with the issue.

Just making a bunch of branches isn't going to repro the issue. The problem we saw was (a) on a medium-sized repo -- at the time, about 100k files in a relatively deep hierarchy, (b) with branches pointing to different places -- in particular, with some pointing to branches off deep history and some off more-recent history. Deep, in this case, is tens or hundreds of thousands of commits. I don't know whether (a) was a requirement, or whether just (b) would be sufficient to repro. I do think it's necessary that the branches point not just *to* deep history, but to *branches off of deep history*. In addition, when the initial work was done, I think we had on the order of 60k refs. We have fewer now, and I haven't actually done the work to see where the inflection point is.

Also, it might "help" if the client also has a large subset of these refs -- say the client is caught up to the state as-of 5000 commits ago.

Sorry for the vagueness here -- as I said, I didn't do the analysis, and I don't have a test case at hand. But I can tell you that it was 100% reproducible at the time.
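One way to approximate the "mostly caught up" client state in a test setup is to mirror the repository and then rewind the local heads a few commits before timing the fetch. A self-contained sketch (depths and paths are invented):

```shell
#!/bin/sh
# Sketch: put a mirror into a "slightly behind" state, then fetch.
# The depth of 5 and the repo paths are invented for illustration.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/origin"
cd "$tmp/origin"
git checkout -q -b master
for i in 1 2 3 4 5 6 7 8 9 10; do
  git -c user.email=a@b -c user.name=a commit -q --allow-empty -m "c$i"
done

git clone -q --bare "$tmp/origin" "$tmp/mirror.git"
cd "$tmp/mirror.git"

# Rewind the local head so the next fetch has real work to do.
git update-ref refs/heads/master refs/heads/master~5
git fetch -q "$tmp/origin" '+refs/heads/*:refs/heads/*'

# The local tip is caught back up with the origin tip.
[ "$(git rev-parse refs/heads/master)" = \
  "$(git -C "$tmp/origin" rev-parse master)" ] && echo in-sync
```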

The expectation with the git fetch in question is that it will run very frequently on active repositories, so we should have all but the last few minutes (or even seconds) of data in most cases, and almost all the refs should be identical to the remote refs. The fetch schedule is here, but for enabled repositories we'll never wait more than 6 hours between fetches:

https://secure.phabricator.com/book/phabricator/article/diffusion_updates/

This is why I'm surprised by the behavior -- it seems like git should be comparing two lists that are 99.9% identical, dropping most of it on the floor immediately, and then needing to only process the actual updated refs.

In my test repository I stacked all 10K commits on top of each other, so the deepest branch was 10K commits deep, then 9999 commits deep, etc. It's possible this triggers some kind of pathologically good behavior. I can try making the history deeper and more varied.

epriestley closed this task as Wontfix. (Edited) Feb 23 2016, 11:33 PM
epriestley claimed this task.

From elsewhere, it sounds like there isn't driving interest in pursuing this.

For the sake of completeness, I ran this for a while:

<?php

require_once '/Users/epriestley/dev/libphutil/scripts/__init_script__.php';

$original_depth = 9999;
$original_branch = 'branch-9999';

$depth = $original_depth;
$branch = $original_branch;

while (true) {
  // Pick an ancestor of the current head at random.
  $branch_point = $branch.'~'.mt_rand(0, $depth);

  // Make a new random branch.
  $branch = 'branch-'.Filesystem::readRandomCharacters(12);
  execx('git checkout -b %s %s', $branch, $branch_point);

  // Make some commits.
  $depth = mt_rand(1, 100);
  for ($ii = 0; $ii < $depth; $ii++) {
    $file = Filesystem::readRandomCharacters(8);
    Filesystem::writeFile($file, $file);
    execx('git add %s', $file);
    execx('git commit -am %s', $file);
  }

  // Occasionally, jump back to the original branch.
  if (mt_rand(1, 100) == 1) {
    $depth = $original_depth;
    $branch = $original_branch;
  }
}

This generated an additional 15,000 commits on about 400 branches, but had no discernible impact on fetch cost:

epriestley@orbital ~/dev/scratch/10krefbare.git $ time git fetch origin '+refs/heads/master:refs/heads/master' --prune

real	0m1.128s
user	0m0.184s
sys	0m0.037s
epriestley@orbital ~/dev/scratch/10krefbare.git $ time git fetch origin '+refs/heads/*:refs/heads/*' --prune

real	0m1.312s
user	0m0.336s
sys	0m0.077s
epriestley@orbital ~/dev/scratch/10krefbare.git $ git log --oneline --all | wc -l
   26155
epriestley@orbital ~/dev/scratch/10krefbare.git $ git for-each-ref | wc -l
   10392

Obviously, this repository is still not very large, but it's no longer pathologically shaped and still doesn't seem to be triggering any unusually expensive / nonlinear behavior.