Page MenuHomePhabricator

Reduce repository update cost for repositories with large numbers of refs
Open, NormalPublic

Description

See PHI2043.

Currently, the cost to update repositories with large numbers of refs (30K+) is unreasonably high. A trivial test repository with 65K refs takes 24s to update locally.

75% of this cost is three calls to git for-each-ref .... These can be cached; one is for an issue from T5839 that may no longer be relevant.

For hosted repositories, we should be able to bail out of this process instantly in all cases where no updates have occurred.

For observed repositories, we should be able to bail out after confirming that the output of git for-each-ref has not changed.

Event Timeline

epriestley created this task.

Has this repository changed?

For observed repositories, we can ls-remote the remote and local and compare them, as we currently do. Right now, refs are compared to a local ls-remote. This isn't technically necessary, but is complex to remove entirely because we'd also need to pattern-match refspecs, and isn't relevant to the root problem in production. It's also very localized.

For hosted repositories, we can test if our version clock is ahead of the last update.

In the case of observed repositories, we should write a durable "needs update" flag before continuing, to avoid this case:

  • see the remote is ahead;
  • fetch the remote;
  • start the update;
  • hit an error.

This would leave us in an out-of-sync state that won't converge until the remote updates again, since our local is now in the same state as the remote.

This flag could just be a version event. Phabricator currently half-supports "versioning" observed repositories, but does this in a very partial way. Perhaps a better general approach would be:

  • examine the remote state and compare it to the last observed remote state;
  • if they differ, write a push log event ("observed remote change") and write the new observed state;
  • decide to update if unprocessed push log events exist.

At the end of the day, this gives us an option to bail out immediately in many cases, when there is no actual work to be done.

One call to for-each-ref comes from resolving a list of commit hashes, to test if they still exist. This is accomplished with DiffusionLowLevelResolveRefsQuery, which calls git for-each-ref first and falls back to git cat-file --batch-check.

This is "correct", in the sense that if you have a branch named aaaa and also happen to have a unique commit that starts with aaaa, we'd like the symbol aaaa to pick the branch, not the ref. Git does this for symbols that are not exactly 40 characters in length, e.g., with these refs:

db9191f9a8d5356e5c056f9f22e5bed2fd8eb791 # a commit
5b8b5f2141485415999eb9dd129dbb84d98c5eae # a commit
5b8b5f2141485415999eb9dd129dbb84d98c5ea # a branch with a silly name

...git show 5b8b5f2141485415999eb9dd129dbb84d98c5ea (the branch name) will show the branch:

$ git show 5b8b5f2141485415999eb9dd129dbb84d98c5e              
commit 5b8b5f2141485415999eb9dd129dbb84d98c5eae (origin/master, origin/HEAD, master)

With a branch with the same name as a commit hash, the commit "wins" with a warning:

$ git show 5b8b5f2141485415999eb9dd129dbb84d98c5eae
warning: refname '5b8b5f2141485415999eb9dd129dbb84d98c5eae' is ambiguous.
Git normally never creates a ref that ends with 40 hex characters
because it will be ignored when you just specify 40-hex. These refs
may be created by mistake. For example,

  git switch -c $br $(git rev-parse ...)

where "$br" is somehow empty and a 40-hex ref is created. Please
examine these refs and maybe delete them. Turn this message off by
running "git config advice.objectNameWarning false"
commit 5b8b5f2141485415999eb9dd129dbb84d98c5eae (origin/master, origin/HEAD, master)

This part of the warning seems overstated:

Git normally never creates a ref that ends with 40 hex characters...

...because the process for creating this branch is git checkout -b <40 character hex string>, which seems pretty "normal" to me and doesn't emit a warning.

In any case, this particular pathway never wants to resolve branches and is always starting with commit hashes, so we can always skip for-each-ref. We could do this by explicitly asking for it to be skipped, but I think it's also likely reasonable to implement Git's rule for consistency with Git behavior: a 40-character hex symbol is never resolved as a branch name, even if that branch exists:

$ git checkout -b aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Switched to a new branch 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

$ git show aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
warning: refname 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' is ambiguous.
Git normally never creates a ref that ends with 40 hex characters
because it will be ignored when you just specify 40-hex. These refs
may be created by mistake. For example,

  git switch -c $br $(git rev-parse ...)

where "$br" is somehow empty and a 40-hex ref is created. Please
examine these refs and maybe delete them. Turn this message off by
running "git config advice.objectNameWarning false"
fatal: bad object aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa