Page MenuHomePhabricator

Formally support side-band change handoff in external repositories
Open, NormalPublic

Description

See T8090 for general discussion of change handoff.

For hosted Git repositories, T8092 seems like a promising way forward for change handoff.

For imported repositories, we can not virtualize refs. It's still desirable to support git-based change handoff if possible, for the same reasons that it's desirable for hosted repositories.

We can probably accept some rough edges here and support handoff for imported repositories, while sharing most of the handoff code for hosted repositories. Specifically:

  • We add configuration settings to Diffusion to let users specify a side-band repository where diffs are pushed by arc diff.
    • For now, options are: don't push to a side-band repository (default), use this repository's remote (maybe leave this out of v0 since user management is a bit of a pain for git/GitHub), use some other repository.
    • In the future, this setting changes slightly to support better support virtual refs in hosted repositories.
  • We emit this data in repository.query.
  • In arc, we check for this data and perform pushes to tags in the specified repository if things are configured, after creating a diff.
  • At some point, after pushing, we update diff metadata. Maybe for now we don't bother and the build harness just figures out that it didn't work because the tag isn't there (it needs to handle this case anyway).

We see how bad this is (does pushing tags to the same repository create a big user-facing mess?) and then go from there.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
epriestley raised the priority of this task from to Normal.May 18 2015, 4:29 PM
epriestley updated the task description. (Show Details)

I have this working, but pushing to tags in the same repo is pretty leaky. The docs say this of git fetch:

By default, any tag that points into the histories being fetched is also fetched; the effect is to fetch tags that point at branches that you are interested in.

I think I don't fully understand what that means, or it means something other than what it says, or some combination of the two. The behavior I observe is more like "By default, git fetches all the tags". Similarly, it seems like "By default, git clone clones all the tags."

You can disable this by setting this option in .git/config:

[remote "origin"]
  tagopt = --no-tags

However, this affects the general behavior of git, so it may not be appropriate. We could consider setting this automatically, but I hesitate to do so because it has side effects (affects behavior in general) and we don't have an opportunity to run it early enough (e.g., we can't set it any earlier than the first time the user runs arc, by which time they've already cloned the repository).

This is still generally fine and nothing here represents a blocker for this feature, it's just not as clean as I might have hoped.

It looks like we can push to some secret place:

$ git push <remote> refs/secret/spaghetti

This creates a "branch" which isn't really a branch and doesn't show up in git tag, git branch, or git remote show <remote>, and doesn't get fetched by git fetch, git pull or git clone.

However, actually fetching it is a bit of a mess, and seems to require:

$ git fetch <remote> refs/secret/spaghetti
$ git checkout FETCH_HEAD

That is, refs/secret/spaghetti doesn't seem to have a local symbolic representation. You can get the hash with:

$ git ls-remote <remote>

...which is a little less sketchy, although also a little more involved. So this is a possible avenue to explore eventually, as a sort of middle ground between tags and virtual refs. For now, I'm going to stick with tags since they're easier to understand, but this seems worth keeping in mind.

If I'm reading this right, with the tag approach & staging != main-repo, unless someone has specified things manually in .git/config they (well arc) can do git push to arbitrary places but since the staging repo isn't a configured remote none of {fetch,pull,clone} will know anything about it. I think that is a satisfactory level of cleanliness for both the main & local repos.

Yep, that's right. My plan for the upstream to move forward in the short term (until we build virtual refs) is:

  • Create a new rSTAGING repository.
  • Point all of rP, rARC and rPHU at it, so everything just stages into that repo.
  • See if that hits any major issues / dealbreakers.

This has its fair share of grimy hackiness to it, but it's pretty clean from the user perspective (basically zero leakiness) and build server perspective (just use rSTAGING and check out tags to do builds).

I think the only real issues are:

  • Authentication may have some UX hiccups, since arc diff now sometimes implies git push to a nonobvious/surprising place. Should be fine in the "everyone works here" case, but maybe a little muddier in the "open source contributor" case. Specifically, if pushing to staging fails because a user doesn't have credentials, I suspect it may not currently be obvious exactly what's wrong, why the push is failing, or how to resolve it.
  • Far down the line, a separate staging area may leave us slightly more poorly positioned to do automatic merges/rebases/etc.
  • From a build server perspective, a separate staging area adds some complexity if you actually want to run tests on some derivative of the change, like "the change rebased on master" instead of "the change, as written". This seems unusual to me, but I've seen interest from at least one source.

refs/secret/spaghetti

We were discussing GH magic branches lately, and it turns out you can do this:

git fetch origin refs/pull/33/head:pr33

and it will create a local branch named pr33 directly. I'm not sure if this feature is documented anywhere (I only tested this on git ~2.4). It's explicitly documented, and has been for many years.

And this trick should also be usable:

git config --add remote.origin.fetch +refs/secret/*:refs/remotes/origin/secret/*

which seems to make refs for anything under refs/secret/

I configured this for our repos; only hiccup so far is that there's no way to disable "Commit Hook" Herald rules right now so the first pushes (which pushed the entire repo history) took a little while. Normal "arc diff" staging pushes only take about a second.

We don't do a perfect job with the UI (see rSTAGING) either right now, but so far this seems like it's good enough to move forward with.

From D13041, it looks like the first time git does a push to a totally separate staging area it pushes a lot of extra data, probably because none of the heads in the remote are ancestors of the head being pushed, so it can't detect that their ancestries are 99.9% identical.

I don't immediately have a clean way to mitigate this automatically. I think it could be mitigated by pushing master into the staging area when you set things up and then maybe pushing it again every couple months, but it's not immediately obvious to me how to get arc to do that for you in a clean/automatic/general way.

We could really push two refs: phabricator/diff/12345 and phabricator/base/12345. To phabricator/base/12345, we push the first parent of the last thing that pops out of git log HEAD --not --remotes -- (that is, the first ancestor of the diff which is also present in any remote)?

It would be OK if this rule was wrong some of the time, just as long as it's not wrong all of the time.

(Not sure if my comment in T8090 should actually be here).

"You need to cron some stuff syncing master or it will sometimes be kind of slow" is acceptable from my point of view. Things I can solve with code is preferable to 'please debug my snowflake workstation where I did strange where I did strange things with arc and git'.

Doing the push to staging server side was infeasible because phabricator doesn't have a non-bare checkout and/or there were $LARGE_NUMBER of other infra things in the way, right?

@hach-que, this is what you were asking about in IRC.

@epriestley, does repository mirroring delete branches in the remote target? If not, this could be worked around by just setting the staging repository as a mirror of the original repository (so Phabricator would automatically push commits to the staging repository when master is updated).

It deletes branches in the remote, and does not necessarily have (or otherwise currently need) permission to push to the staging repository.

the staging area doesn't support working with both ssh and http now. We're using ssh for external connection and http when we're at work (using proxy).

a single repo link doesn't really work here, because either we have to use ssh and http (and use a key and Git password at a same time), or use ssh and using proxy won't work

That is an issue we've also encountered; hopefully it will be fixed when the staging area of repositories are included inline on a repository instead of being separate (i.e. they use the same URL, but staging is just a set of tags / branches that commits are pushed to).

eadler added a project: Restricted Project.Jan 8 2016, 10:36 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jan 8 2016, 10:40 PM

If anyone else set up a cron job or similar to keep a staging area with a fresh master, you should be able to throw it away after users upgrade past T10509. That task has some details.

eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Mar 9 2016, 10:11 PM

I've been using this functionality for some days to use the CircleCI integration. Mostly the user experience is fine. It's mostly possible to configure git log to ignore these tags, like:

git log --exclude='refs/tags/phabricator/*' --all

and of course in many cases it's trivial to filter them out with grep -v or whatever, even if you aren't a git wizard.

However where it gets annoying is with git log --simplify-by-decoration. This is really handy for example to see what releases got merged into where. For example here's recent history of the linux kernel:

$ git log --exclude='refs/tags/phabricator/*' --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s ' --abbrev-commit --simplify-by-decoration
* c05c2ec - (HEAD -> master, origin/master, origin/HEAD) Merge branch 'parisc-4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 
* f55532a - (tag: v4.6-rc1) Linux 4.6-rc1 
*   1a46712 - Merge tag 'gpio-v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio 
|\  
| *   0bae2f1 - Merge branch 'ib-mfd-regulator-gpio-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd into devel 
| |\  
| * \   e5451c8 - Merge remote-tracking branch 'linusw-gpio/for-next' into devm_gpiochip 
| |\ \  
| * | | a101ad9 - Share upstreaming patches 
|  / /  
* | | b562e44 - (tag: v4.5) Linux 4.5 
* | | f6cede5 - (tag: v4.5-rc7) Linux 4.5-rc7 
* | | fc77dbd - (tag: v4.5-rc6) Linux 4.5-rc6 
* | | 81f70ba - (tag: v4.5-rc5) Linux 4.5-rc5 
* | | 18558ca - (tag: v4.5-rc4) Linux 4.5-rc4 
* | | 388f7b1 - (tag: v4.5-rc3) Linux 4.5-rc3 
| |/  
|/|   
* | 36f90b0 - (tag: v4.5-rc2) Linux 4.5-rc2 
|/  
* 92e963f - (tag: v4.5-rc1) Linux 4.5-rc1 

It's pretty plain to see here what got merged where without getting lost in the minutia of all the commits that happened on the branch. This is really handy for seeing what happened to bugs, what versions contain them, and what versions have been fixed when using DaggyFixes, for example.

Now compare that with the same from project where we've been using Phabricator's staging areas:

$ git log --exclude='refs/tags/phabricator/*' --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s ' --abbrev-commit --simplify-by-decoration
* fa4f9a5 - (HEAD -> master, tag: phabricator/base/997, origin/master, origin/fork) Send pass/fail metrics to hostedgraphite 
* c98687e - (tag: phabricator/base/991, tag: phabricator/base/990, tag: phabricator/base/989, tag: phabricator/base/987) Download PlaceLocal artifacts from master 
* efca0e5 - (tag: phabricator/base/988, tag: phabricator/base/986, tag: phabricator/base/984, tag: phabricator/base/981, tag: phabricator/base/980, tag: phabricator/base/978) Fix deployment of "fork" branch 
* 4a87911 - (tag: phabricator/base/974) Remove development cruft specific to my machine 
* 59509a3 - (tag: phabricator/base/962) Notify only on "master" branch 
* d7b3f72 - (tag: phabricator/base/959, tag: phabricator/base/958) Execute arbitrary commands in the test container 
* 8da8f0a - (tag: phabricator/base/957, tag: phabricator/base/956) Make cleanup issue during test not fail CI 
*   960278f - Add 'selenium/' from commit 'aacaa8f0e1c376743ed4a1a58b03bbc2bfb40394' 
|\  
| * 9d92915 - Initial selenium test framework along with a sample test 
* ae00805 - working docker-compose up 

While all the phabricator/diff/* tags aren't included because they aren't ancestors of HEAD, I still have to see all the phabricator/base/* tags. This largely nullifies the usefulness of --simplify-by-decoration because commits that would normally be simplified away aren't because there are tags referencing them. Without the phabricator tags, all the commits between 8da8f0a and c98687e would be simplified away.

There's some [magic with git for-each-ref](http://stackoverflow.com/questions/21057786/show-only-certain-tags-when-using-gitk) that might get around this, but I'm not sure that's solving this problem, and I haven't been able to get it to work. Alternately I can delete all the phabricator tags from my repo before looking at the history, but that's not really ideal.

The other option is using a staging repo that is distinct from the canonical one. The issue here is that CircleCI's configuration is scoped to a repository, so I can't share one staging repo for all projects. I could create a separate staging repo for each project, but that would double our GitHub hosting costs, and require all the CircleCI configuration to be done twice. Also not really ideal.

I'm thinking for my case, there's not really a lot of value in keeping the tags around after the diff has been closed. Maybe that's something to consider?

I'm thinking for my case, there's not really a lot of value in keeping the tags around after the diff has been closed. Maybe that's something to consider?

In this specific case (staging same as origin) we could just not push the base tag (it has no value in this case). We can probably even figure that out automatically without needing configuration. That would be a cleaner fix, I think?

We can delete tags too, but I expect that <100% of diffs will land so this probably only cleans things up a bit rather than completely. Getting rid of 90% of that clutter would obviously help a lot, but I think we can get rid of 100% of it by just declining to push base when origin/staging match (or when staging is a tracked remote of any flavor, I guess?).

When using a repo as its own staging, I kinda hoped that we won't be using refs/tags/ but refs/phabricator/ or something, so when cloning, the default would be "do not clone all staging things".

In this specific case (staging same as origin) we could just not push the base tag (it has no value in this case). We can probably even figure that out automatically without needing configuration. That would be a cleaner fix, I think?

I like this solution a lot.

cburroughs moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jul 4 2016, 9:17 PM

We see how bad this is (does pushing tags to the same repository create a big user-facing mess?) and then go from there.

Just wanted to report that at least for us this is starting to become an issue.

We tried using repo as its own staging for few months now and some repos already accumulated 50k+ phabricator/* tags. These are accumulating on users working copies and users are not happy about it. In some cases these tags are holding lots of obsolete data around that was never (and will never be) merged in but can't be garbage collected.

Are there any plans to add some means for cleaning up tags of closed diffs?

I think cleaning up closed tags probably isn't a particularly great long-term solution since some fraction of the revisions likely just get left open forever -- so you'd have 500 tags instead of 50000, but I assume that still causes about as many problems, and you'll be back to 50000 in just 100 more years anyway.

In most cases there is no technical reason we can't push to hidden refs instead, and the choice to use tags earlier (T8238#116395) was largely just to make this behavior visible and easier to understand. In practice, some build systems can not (or can not easily) operate on refs, so I'd expect tags to probably remain at least an option and possibly the default, but I think there's otherwise no reason we can't use refs (and our general support for refs is now in a fairly good place after T6878).

The plan for hosted repositories is still to use virtual refs (T8092), which is prototyped (D15954) but needs significant additional work before moving toward production.

We could also make arc land delete these tags/refs, but I'm slightly hesitant about this because I think it may be confusing, and isn't a real solution to the problem (it won't get close to 100% of them).

You could also have a script remove these tags relatively safely. The ID in the ref is a Diff ID, and diffusion.querydiffs can let you get information about the corresponding objects so you can nuke the ref if the diff is obsolete, old, the attached revision is closed, etc.

I think cleaning up closed tags probably isn't a particularly great long-term solution since some fraction of the revisions likely just get left open forever -- so you'd have 500 tags instead of 50000, but I assume that still causes about as many problems, and you'll be back to 50000 in just 100 more years anyway.

But me like 100 year option. I can I haz it plyz? :-)

The answer at least would be to educate people to close obsolete diffs (assuming 100 years old diffs are obsolete of course).

In most cases there is no technical reason we can't push to hidden refs instead, and the choice to use tags earlier (T8238#116395) was largely just to make this behavior visible and easier to understand. In practice, some build systems can not (or can not easily) operate on refs, so I'd expect tags to probably remain at least an option and possibly the default, but I think there's otherwise no reason we can't use refs (and our general support for refs is now in a fairly good place after T6878).

I don't have strong opinion on visible vs hidden really.

One thing that might be worth nothing against tags is that there are developers who do git push --tags and if they have all these phabricator/* tags accumulated in their working copy they recreate all of them on remote. If someone takes the time to cleanup these tags and next day git pull will brings everything back, that someone might not be very happy.

And in my understanding git tags at least in theory are not meant to be deleted (or otherwise modified).

On the other hand I'm not really a fan of hidden dangling things that accumulate at this rate. But if there was some kind of GC process I guess it could a fairly good compromise.

The plan for hosted repositories is still to use virtual refs (T8092), which is prototyped (D15954) but needs significant additional work before moving toward production.

Perfectly understandable. Also we are using external repositories so I guess that's even further in future.

We could also make arc land delete these tags/refs, but I'm slightly hesitant about this because I think it may be confusing, and isn't a real solution to the problem (it won't get close to 100% of them).

I've thought about that but at least with tags the deletion of diff tags on arc land wouldn't be very useful. Developer would get his/her diff tags cleaned up (deleted locally and on remote) but since git pull does not delete local tags that were deleted on remote everyone else would still have these tags in their local working copy. So basically developer on arc land would clean up his own diff tags but would accumulate tags created by everyone else.

Unless arc land would trigger some kind of "global" GC process that would consider all phabricator/* tags instead of only tags of diff being landed.

You could also have a script remove these tags relatively safely. The ID in the ref is a Diff ID, and diffusion.querydiffs can let you get information about the corresponding objects so you can nuke the ref if the diff is obsolete, old, the attached revision is closed, etc.

Thats what I was thinking about. Actually I was thinking about adding this to our Arcanist fork but decided to check with you first if there are plans to do something like this upstream.

I'm a little confused about how these tags are appearing so widely in developer working copies in the first place. Is git clone fetching them by default? Do you have unusual default options configured? Do users regularly explicitly fetch all tags present in the remote?

That is, my expectation is that git pull does not "bring everything back".

Maybe my assumptions about default behaviors are off, or the more-recent base tags (but not necessarily the diff tags) are getting fetched because they're often ancestors of master?

We use a staging repository (rSTAGING) for all this stuff so I'm not actually sure what the practical behavior of the tags-in-the-same-repo version of this is.

(The major goal of using refs instead of tags wouldn't really be to hide them per se, just to make them extremely difficult to fetch by accident.)

I'm a little confused about how these tags are appearing so widely in developer working copies in the first place. Is git clone fetching them by default? Do you have unusual default options configured? Do users regularly explicitly fetch all tags present in the remote?

That is, my expectation is that git pull does not "bring everything back".

Maybe my assumptions about default behaviors are off, or the more-recent base tags (but not necessarily the diff tags) are getting fetched because they're often ancestors of master?

We use a staging repository (rSTAGING) for all this stuff so I'm not actually sure what the practical behavior of the tags-in-the-same-repo version of this is.

  • git clone fetches all tags from remote by default.
  • Not entirely sure how much git pull fetches but I can confirm that there are way more base then diff tags locally. However there are significant number of the diff tags are not all related to the arc diffs that the developer initiated.

So I guess my "bring everything back" was not entirely correct, it should have been "bring almost everything back".

(The major goal of using refs instead of tags wouldn't really be to hide them per se, just to make them extremely difficult to fetch by accident.)

Just saying that they could also be more difficult to discover for someone not aware of this Phabricator feature in case they need to investigate something directly or indirectly related to them. Of course all this is entirely hypothetical so not sure if its something to worry about.

I'm a little confused about how these tags are appearing so widely in developer working copies in the first place. Is git clone fetching them by default? Do you have unusual default options configured? Do users regularly explicitly fetch all tags present in the remote?

That is, my expectation is that git pull does not "bring everything back".

I don't think git pull will fetch tags unless they are specifically requested, but git fetch will. I run get fetch all the time, probably more than git pull. I know I can be specific on what refs to fetch to avoid that, but out of habit I'm lazy and just fetch the whole remote.

Some GHC contributors have had some trouble with the fact that there is no choice in staging area URL. Unfortunately, some users appear to be behind corporate firewalls which block anything but http and https. We use SSH for our staging repository, which is of course problematic for these contributors. There are a few comments above from Nov 2015 which suggest that we aren't the only ones who have this issue.

Have they asked network operations to allow outbound traffic on port 22? It seems like organizations which employ software developers should reasonably permit this traffic. Particularly, security-minded organizations should reasonably prefer asymmetric keypair authentication over HTTP password authentication in the general case.

Offhand, I can imagine no legitimate technical reason for a software organization to forbid outbound traffic on port 22. It's possible that one exists, but this represents no barrier for attackers that I'm aware of and doesn't appear to serve any other goals that an organization employing software developers might reasonably have.

In cases where a problem is caused by organizational policies which exist for "organizational" reasons (rather than solid, objective technical reasons) we are extremely hesitant to bear the burden of questionable organizational policy. Organizations that choose to implement arbitrary restrictions or make bizarre technology choices should generally bear the burden of those decisions.

For example, a large social network once chose to locate several Phabricator hosts 3,000 miles apart with minutes of NFS-based replication delay, then asked us to make various special accommodations to support this (see T1969).

Organizations are in a particularly strong position to create bizarre problems with huge technical costs. Broadly, we can not bear the cost of solving these "organizational" problems in the general case: we have enough legitimately difficult technical problems to solve without organizations making it harder. And we can't have an upstream that is layers and layers of organization-specific hacks glommed on top of one another: this isn't sustainable or maintainable in the long term (see also T8227).

This isn't to say that we won't ever support multiple staging URIs, but I think this motivation (roughly, "bad network administration/policy") is an exceptionally weak one. If you organization's network is not well-administered and does not provide you with the capabilities you need to do your job, you should try to fix the policy, not ask us to help you work around it.

(If the policy isn't fixable, you can personally bear the cost of this policy by tunneling the traffic over 80 or 443 locally.)

I agree that there is no sensible technical reason for the broken network policies that prompted my comment and understand your reasons for preferring to punt on a technical solution. Thanks for your thoughtful remarks.

At the moment the work-around we will go with is for me to manually land and re-upload Diffs coming from affected contributors, ensuring they get pushed to the staging area. Thankfully there are few enough of these cases that this shouldn't be too painful.

We have been using Phabricator for a couple of months now and have configured the staging repository to be the same as the Diffusion repository. As mentioned by others we are now facing issues in identifying our release tags due to the immense clutter of the tags. With the virtual ref solution (T8092) still some way off.
Is it possible for starters to update the Diffusion UI to ignore the diff and base tags, so that at minimum the UI clutter is reduced?

@dreadlord2203: Why don't you use a different repository for the staging area?

@avivey I am exploring that solution as well, I was just wondering if hiding the diff and base tags is a doable solution wanted by the community by large.
We have multiple hosted repositories inter-dependent on each other, and our setup is still in its nascent stage. So I would prefer to not manage duplicate repositories until we finalize on our long term setup.

@dreadlord2203 indeed, hiding those tags would be useful. I found one problem with using a separate repository for change handoff, and that is access control. If you maintain several repositories with different access controls, then you would have to duplicate each of them and maintain a second set of access controls for the corresponding staging repositories. This is a lot of extra work and it would be easy to make a mistake and leak the staging repo to people who shouldn't be able to see it.

Yeah, the long-term plans are "automatic staging areas", which are essentially some kind of hidden refs (Probably using T8092 though, not just hiding stuff).

A possible mid-term solution is to use something like using refs which aren't "tags" (Like github's /refs/pulls/xxx); I think we've discussed it didn't like it because T8092 is desired for other things too.

@avivey Implementation of T8092 is eagerly awaited, I am merely proposing a stop-gap solution of hiding the diff and base tags listed on the Diffusion UI for a repo when the same repo is being used as a staging area.
This would greatly reduce the UI clutter and allow teams to use tags as an effective version tracking solution.

Not sure if this is the correct ticket for it, but I got here via D13020.

Yesterday I had to recognize that creating a diff via web does not push to the staging area (I see why), but I was wondering either this could be implemented somehow?

A possible mid-term solution is to use something like using refs which aren't "tags" (Like github's /refs/pulls/xxx); I think we've discussed it didn't like it because T8092 is desired for other things too.

This would be great when using Diffusion as a mirror rather than primary. I understand it might not fit the ideal Phabricator primary flow as seems to be outlined in T8092, but it would be quite valuable for those of us using Phabricator for review but still using another hosting service to avoid git fetch bringing loads of irrelevant tags down.