
When pushing changes to a staging area, also push the nearest ancestor present in the remote
Closed, Resolved (Public)

Description

See discussion in T8238. Currently, when we push changes to a staging area, git may do much more work than it needs to if no ancestor of HEAD is present in the remote.

This work does need to happen once (to push the actual history) but we can end up doing it much more often than we need to.

Although this tends to normalize over time, we can likely avoid this by finding the nearest merge-base of HEAD and any origin ref (or, slightly more simply, just using the base commit) and pushing it to some /base/ ref. This should generally reduce the amount of work required after the first push.
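As a rough illustration of "nearest ancestor present in the remote", here is a minimal sketch over a toy commit graph. It is not how git or arc computes a merge-base; the function name, the `parents` mapping, and the `published` set are all hypothetical stand-ins for the real history and the remote's refs.

```python
from collections import deque


def nearest_published_ancestor(parents, head, published):
    """Return the closest commit reachable from `head` that is in `published`.

    parents:   dict mapping a commit id to a list of its parent ids (toy data)
    head:      commit id to start from (HEAD in the discussion above)
    published: set of commit ids the remote is assumed to already have
    Returns None when no ancestor of `head` is published.
    """
    seen = {head}
    queue = deque([head])
    while queue:
        commit = queue.popleft()
        if commit in published:
            # Breadth-first order means this is the fewest hops from head.
            return commit
        for parent in parents.get(commit, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


# Toy linear history: a <- b <- c <- d, where only a and b are on the remote.
history = {"d": ["c"], "c": ["b"], "b": ["a"], "a": []}
print(nearest_published_ancestor(history, "d", {"a", "b"}))  # prints "b"
```

In this toy case the commit to push to the /base/ ref would be `b`, so only `c` and `d` are new to the remote afterwards.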

Event Timeline

epriestley created this object with edit policy "All Users".
epriestley moved this task from Backlog to v3 on the Harbormaster board.
epriestley edited projects, added Harbormaster (v3); removed Harbormaster.

Would this eliminate most of the benefit from cron based "sync refs from the main repo" glue?

Yes, this should eliminate the need to sync/cron/do anything.

From what I see, this doesn't actually normalize over time, if the staging repo is used only for staging: none of the /diff/ refs will ever be a parent of master from the real repo, because of the rebase during arc-land (I've written this at some point in Manual Staging Area Caveat).

I figured it wasn't worth trying to add a /base ref because we'd have automatic staging soon-ish.

Yeah, that claim assumes someone will arc diff some published commit sooner or later (i.e., run arc diff while on master with no local changes). While this wouldn't happen in a normal workflow or on purpose, I suspect it happens more-often-than-never.

I've marked D15424 as fixing this, although I don't have a detailed understanding of exactly what goes on in the git protocol and which heuristics it is using. In particular, I haven't been able to trick it into doing too much work locally (maybe it also looks at parents?).

So this probably fixes things, but let me know if you're still seeing issues after deploying it. The expected behavior is:

  • The first arc diff after configuring a staging area for a repository will transfer a lot of data: the entire repository history legitimately needs to be transferred.
  • The first arc diff made with a client that includes the changes in D15424 should push two refs: the actual changes, and the base commit (almost always some published commit on master or whatever remote branch you're working against).
  • After the first two-ref diff, git should be able to transfer only a small amount of data on future pushes.

Let me know if you're seeing behavior which differs from that.
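For anyone trying to picture the two-ref push described in the list above, here is a minimal sketch that only builds the `git push` argument list. The helper name and the `refs/tags/phabricator/...` ref layout are assumptions for illustration; this is not the code from D15424.

```python
def staging_push_command(staging_uri, diff_id, head_commit, base_commit):
    """Build the argv for a single `git push` that sends both refs to staging.

    The ref names below are an assumed layout, not confirmed by this task.
    """
    refspecs = [
        # The base commit: usually something already published on master.
        f"{base_commit}:refs/tags/phabricator/base/{diff_id}",
        # The actual changes under review.
        f"{head_commit}:refs/tags/phabricator/diff/{diff_id}",
    ]
    return ["git", "push", staging_uri, *refspecs]


# Example with made-up values; pass the result to subprocess.run to execute it.
command = staging_push_command(
    "ssh://git@example.com/staging.git", 123, "abc123", "def456"
)
print(" ".join(command))
```

Once a base ref like this exists in staging, later pushes from any client that has that base commit should only need to send the commits on top of it.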

I have a 500 MB repo somewhere, and the problem reproduces very easily for me, so I'll test this.

tl;dr: looks good!

Reproducing the problem involves simulating a new user: deleting the reflog and running git gc --aggressive; git prune -v between updates, so that no tag in staging matches anything in the local DB (or just re-cloning the origin).
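A minimal sketch of that reset step, in case someone else wants to reproduce it. The `git reflog expire --expire=now --all` line is my assumption for "delete the reflog"; the other two commands are the ones named above. The helper names are hypothetical, and this is destructive, so only run it in a throwaway clone.

```python
import subprocess


def simulate_new_user_commands():
    """Return the git commands that make a clone forget its old diffs."""
    return [
        # Assumed way to drop reflog entries so old diffs become unreachable.
        ["git", "reflog", "expire", "--expire=now", "--all"],
        ["git", "gc", "--aggressive"],
        ["git", "prune", "-v"],
    ]


def run_between_updates(repo_path):
    """Run the reset commands inside `repo_path`, stopping on the first failure."""
    for command in simulate_new_user_commands():
        subprocess.run(command, cwd=repo_path, check=True)
```

Calling `run_between_updates("/path/to/throwaway/clone")` before each arc diff should approximate a fresh clone.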

That does mean that the old behavior would likely only hurt each user once, because they have their own old diffs in their reflogs.

When pushing the base (assuming it's a sane base), new users will still be ahead of that, and will only need to push small changes.

On a side note: git is always faster than I expect it to be.

Aha! That makes sense, thanks for digging into it!