Support GitHub-like forking of repositories
Open, Needs Triage, Public

Description

For context, even though our organization is fairly large, and we have stringent policies around what sorts of activities people are permitted to do with code bound for production, we purposefully aim for as much freedom and flexibility as possible "off master".

Reproducing a situation where this is unfavorable is not difficult. Clone a repository, rebase starting from some ancient hash, then publish that branch: everyone who committed to that branch gets one email for every commit they made (unless they've already adjusted their email settings), and it's extremely unlikely that they care about this or would find the email welcome. For a non-contrived example of how this happens in the wild, here's the latest response to "why were you doing that":

A test branch was created for iterative testing and development, specifically for deploying onto hardware in the field, with the end goal that the resulting branch would be used to create a diff. A separate, recent commit on master, which was needed for further testing, required master to be rebased onto the feature branch and pushed to the origin to be pulled down and deployed to the hardware. Because of this, history was rewritten, and a subsequent push would have notified all committers whose commits were included in the new history.

So, tl;dr: people don't always care a whole lot about managing diffs when in the field.

It's not exactly a frequent occurrence, but I find myself having a hard time explaining "why not" when it does happen, other than "because phabricator is going to generate a lot of email that we don't want".
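The mechanism behind the email flood can be sketched with a toy model. This is only an illustration under stated assumptions: `commit_id`, `build_chain`, and `emails_on_push` are hypothetical helpers, the hash is a simplified stand-in for a real Git commit id (the only property it shares is that it depends on the parent), and "one email per unseen commit per author" is an assumed simplification of the mail behavior described above, not Phabricator's actual logic.

```python
import hashlib
from collections import Counter

def commit_id(parent, author, message):
    """Toy commit hash: it depends on the parent, as a real Git commit id does."""
    return hashlib.sha1(f"{parent}|{author}|{message}".encode()).hexdigest()

def build_chain(base, commits):
    """Return [(id, author), ...] for commits applied in order on top of base."""
    chain, parent = [], base
    for author, message in commits:
        parent = commit_id(parent, author, message)
        chain.append((parent, author))
    return chain

def emails_on_push(pushed, already_seen):
    """Count one email per author for every pushed commit the server has not seen."""
    return Counter(author for cid, author in pushed if cid not in already_seen)

history = [("alice", "fix parser"), ("bob", "add cache"), ("alice", "tune cache")]
master = build_chain("root", history)
seen = {cid for cid, _ in master}

# Publishing a branch that points at commits the server already has: nothing is new.
no_rewrite = emails_on_push(master, seen)

# Replaying the same commits on a different base changes every id,
# so every replayed commit looks new and its author is notified again.
rebased = build_chain("some-old-local-commit", history)
after_rewrite = emails_on_push(rebased, seen)
```

In this model the authors and messages are unchanged, yet every replayed commit counts as new because its id changed along with its parent.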

Event Timeline

In this case, does the branch contain reviewed changes, or just draft changes by multiple different authors?

I'd normally expect merging or cherry-picking to be used to integrate upstream features with published code, just because rewriting published code is a Bad Thing (I think this is pretty widely accepted?). Beyond generating email, it can break or confuse users with changes in their working copies, break or disrupt links or references to the old commits, etc. In the general case, force-pushing can even destroy code. I'm sort of surprised force pushing is permitted given the stringent audit/log concerns I believe you have or will have in the future.

Is this something like a branch full of purely temporary code which is being pushed to a remote primarily as a convenient method for transferring it between hosts? Basically, "source code ftp"?

Two possible solutions might be:

  • Exclude branches that start with tmp- from being tracked (Diffusion can do this with a clever regexp in "Track Branches").
    • Somehow convince everyone to use tmp- for all their temporary stuff.
    • Or just stop creation of new branches with other names (Herald can do this).
  • We let users "fork" repositories via T8092 and create private branches?
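As a rough sketch of the first option, a negative lookahead is one way to express "every branch except those starting with tmp-". The pattern and the `is_tracked` helper below are hypothetical, and the snippet uses Python's `re` module only to show the matching behavior; whether the same pattern works unchanged in "Track Branches" is an assumption to verify against your install.

```python
import re

# Hypothetical pattern: match any branch name that does NOT start with "tmp-".
TRACKED = re.compile(r"^(?!tmp-)")

def is_tracked(branch):
    """Return True if the branch would be tracked under the pattern above."""
    return TRACKED.match(branch) is not None

branches = ["master", "tmp-field-test", "release-1.2", "my-tmp-work"]
tracked = [b for b in branches if is_tracked(b)]
```

Only the prefix matters here: `my-tmp-work` is still tracked, which is why the naming convention has to be followed exactly.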

It's not rewriting an existing published branch (we don't permit that); it's just publishing a branch that you've accumulated a bunch of things on as you've been iterating for some period of time. You're not done with it by any stretch of the imagination, and it's been sitting on your machine too long for you to not push it somewhere.

There are certainly other viable options here that get rid of the problem which mostly come down to workspace hygiene, but like I said, we try to not impose any rules on how you have your branches set up etc.

I think that "forking" is the right answer here.

I think I'm still missing something here -- are users doing this?

  1. Have some local commits based on something old.
  2. Rebase master on top of the local branch?
  3. Push the whole thing?
  4. Everyone who touched master between the branch point and HEAD gets email?

That can't be right because step (2) is insane?

Anyway, assuming this is a "freedom and flexibility" issue over a "perform only reasonable operations" issue, we're definitely open to implementing forking, but this is probably in the realm of prioritization with a nontrivial set of infrastructure and product blockers.

You could also conceivably sort of accomplish the same thing ("push it somewhere") by just creating a free-for-all repository called "everyone's fork of everything". I think the workflow on this wouldn't be too much worse from the client side than the workflow on "real" forks (e.g., you still need to add a second origin, and keep track of which one you're pushing to), and it would require zero upstream changes.

Are there reasons that's a nonstarter? Two I can think of are:

  • You have granular repository visibility policies, and everyone could see everything in the "forks of everything" repo?
  • Purely as a user experience issue, users hate the idea of adding a "remote" named "storage" but love the idea of adding a "fork" named "epriestley/repo"?

As a totally crazy idea, based on the virtualizing refs thing: Allow some sort of "private branch" without forking?
That way users can "push for backup", but don't need to add a remote or keep their fork up-to-date with master.

epriestley renamed this task from "Provide a way for users to publish rebased or otherwise "rewritten" branches without generating a ton of email" to "Support GitHub-like forking of repositories". Mar 30 2016, 9:22 PM
epriestley added a project: Diffusion.

We can do "private branches", but you can already do something like this:

  • In Diffusion, don't track branches named personal-*.
  • When a user creates a branch named personal-*, ignore it in Herald.
  • When a user creates any other branch, reject the push in Herald with a message like "if you want to push personal branches, make sure they start with 'personal-'."
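The three rules above amount to a small decision procedure, sketched here in Python. This is not Herald's rule format or API: `RefChange`, `rejection_message`, and `is_tracked` are hypothetical names used only to make the logic concrete, and leaving updates to existing branches alone is an assumption, since the rules only talk about branch creation.

```python
from typing import NamedTuple, Optional

PERSONAL_PREFIX = "personal-"  # naming convention from the rules above

class RefChange(NamedTuple):
    """Hypothetical description of one branch change contained in a push."""
    branch: str
    is_creation: bool

def rejection_message(change: RefChange) -> Optional[str]:
    """Return None to accept the change, or the message to reject it with."""
    if not change.is_creation:
        return None  # assumption: updates to existing branches are left alone
    if change.branch.startswith(PERSONAL_PREFIX):
        return None  # personal branches are accepted and otherwise ignored
    return ('If you want to push personal branches, make sure they start '
            'with "personal-".')

def is_tracked(branch: str) -> bool:
    """Personal branches are excluded from tracking, so they generate no mail."""
    return not branch.startswith(PERSONAL_PREFIX)
```

For example, `rejection_message(RefChange("personal-wip", True))` returns `None`, while creating a branch named `bugfix` returns the rejection text.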

I think users probably don't want real private branches particularly often (e.g., they want other users to be able to pull their branches, or want to be able to deploy those branches to live production servers because they live on the edge). Maybe there are some use cases, but I'd guess most users doing this are coming from a GitHub-ish sort of "push = save document" mindset, instead of a "push = publish forever" mindset.

To actually do forking, the pathway is roughly:

  • T4369: I want to fix the streaming issues with HTTP first. We need to do this before we can start intercepting the wire protocol, since we'll have different HTTP vs SSH views of the repository otherwise. This is well-understood and relatively straightforward.
  • T4245: Finish the URI stuff for callsign-free repositories. We'll need to do this before we can reasonably start exposing virtual repositories.
  • T8093: I believe proxying the protocol (vs using a Git other than git) is viable, but don't have a prototype yet. There's a lot of unexplored territory here still.
  • T5000: We may not really want/need to build this when we get here, but it's probably a very small distance out of the way. Almost all of the work blocking T5000 also directly blocks forking.
  • T8092: This is basically a soft/implicit fork in a hidden namespace, and a second logical step that builds something useful/testable while moving us closer to forking.
  • Design and build user-facing forking: at this point the technical stuff should be mostly settled, and we'll mostly have product decisions left.

The total amount of work involved here is difficult to estimate. HTTP streaming and URI stuff are pretty straightforward; maybe a day or two total. The proxying is probably half a day of research to estimate, then who knows how long to build. Diffs-by-pushing is maybe half a day beyond that. Internal forking is maybe a day? Then probably like a week to build something usable with external forking, since it'll be so product/UI heavy? Maybe it's not quite that bad.

T10366 probably also has to happen at some point before the UI stuff happens too much.

If there's interest in prioritizing this directly or via T5000 I'm comfortable committing to an 80-hour estimate to support these use cases:

  • Users get GitHub-like forks which they can push changes to and pull changes from.
    • This supports the "git push = save changes" workflow.
    • This supports the "git push = collaborate" workflow (share unreviewed changes among multiple users).
  • Users get a new workflow to map branches in a fork to revisions. This would almost certainly be substantially similar to the GitHub pull request workflow.

That estimate likely skews fairly high, but this is a big chunk of work with a lot of remaining high-variance unknowns that I don't think I can really lock down until we build it out.

eadler added a project: Restricted Project. May 21 2016, 3:50 PM

In T11535, @samking says forking "seems like a fairly heavyweight solution," and that users don't want their own GH-style fork as much as they want to create branches "somewhere".

From what I've seen, when given the choice, people actually prefer to add their working branches to the "official" repo:

  • People don't consider "branch pollution" to be an issue; having 1000 remote branches seems fine to them (bonus points if the branches are all called PROJECT-1234, 'cause why not).
  • Adding a remote is not something users are comfortable doing on their own.
  • When reviewing code, users want to fetch the changes locally to use local compare tools; in GH, that's more complicated when forking. That's two GH-specific issues (bad reviewing tool + why is it harder!?), but it contributes to the "heavyweight" feeling.

GH forking works well in open source, where it lowers costs for new contributors, but I suspect that in Enterprise, where you'd normally just allow all engineers to push code directly, "everybody" just prefers to push to a samking/feature branch on the "real" repo instead.

We do already support that with some Herald pre-commit rules, but I think that better UI around that would make more people happy than full-fledged Forking.

(I wrote this a long time ago but didn't submit. Don't know why)

I expect our forking to be much lighter-weight than GitHub's (ideally, I'd like it to work something like git push whatever/branch creates a fork called "whatever"); forking is just a familiar metaphor.

To start with, I suppose we could also step back and try to make Phabricator work better with pushing thousands of user branches into the remote until that model inevitably collapses. It is so clearly a bad and unsustainable model to me that I'm amazed how prevalent it is, but it seems that I am approximately the only engineer in the world who doesn't think "git push origin bugfixes3-epriestley-tmp-lul" means "save changes".

I think we can fix most of this by:

  • Splitting Herald commit rules into "commit" and "ref change" rules.
    • This is only materially difficult because we must generate ref changes synthetically for observed repositories.
    • Then, navigating the mess in T11953#202657 somehow.
  • Defaulting repositories to only autoclose their default branch?
  • UI stuff (typeaheads) to make browsing thousands of branches better.
  • Adding support for this garbage into arc land (T11535).
  • Probably rename "autoclose" to "permanent branches"?
  • A bunch of mangling around mail rules, maybe?

To me, this (and leaving loose, non-version-controlled, non-ignored files around in working copies) feel like practices that are so toxic that we should be actively discouraging them, but I think it's clear that I'm not going to win this one. Maybe we can split the forking features into a closed-source, enterprise only extension called "I Told You So: Tools for Reclaiming Overgrown Remotes" that costs a billion trillion dollars.

I think most of these changes are independently good, at least, but supporting this practice is largely handing users rope so they can shoot themselves in the feet.

@epriestley Not that it will likely make you feel any better about it, but some of us just really want our Guns-Made-From-Rope™ to be powered by Phabricator instead of other much less awesome things. 😄