
Improve support for guiding users through repository rewrites
Open, Normal, Public

Description

Occasionally, published repository history is rewritten. This is a painful process, but worth doing in some situations -- for example, if large binaries have been committed and the repository has become too large to work with. This is prevalent enough that GitHub has a page dedicated to it.

Or if you have accidentally published a real dumb commit with a bogus timestamp which makes every graph of your cherished open source project draw its X axis starting on January 1, 2000 and look stupid, as with rPbb45f5eff5. ARRRGH.

Screen Shot 2016-08-23 at 6.02.30 PM.png (252×480 px, 28 KB)

I think the ship has sailed for me to undo that, but I'm absolutely nuking it if we ever have a legitimate reason to rewrite history.

If commits are rewritten, Phabricator will now recognize that the old commits have been deleted (after T9028), but edges linking to them will stay in place and users won't have a reasonable path to find the new hash. We can't completely recover from this situation (for example, we've already sent a bunch of email with the old hashes, and it is almost certainly not reasonable to try to rewrite old comments mentioning them), but we can give users more guidance to find their way to where they want to go.

In particular, we can provide a way for users to populate a map from old hashes to new hashes, and check that map when a user loads the page for a deleted commit. (We may be able to go slightly further than this, too.) This should generally give users a better pathway forward in a rewritten repository, e.g., from revision to old commit to new commit. Not perfect, but a whole lot better than a dead end.
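As a rough illustration of the lookup described above, here is a minimal Python sketch. It is not Phabricator code: the in-memory dictionary, the function name `resolve_rewritten_commit`, and the choice to follow chains of rewrites (old to new to newer) are all assumptions made for the example.

```python
from typing import Dict, Optional

# Hypothetical in-memory map from an old commit hash to its replacement.
RewriteMap = Dict[str, str]


def resolve_rewritten_commit(rewrites: RewriteMap, old_hash: str) -> Optional[str]:
    """Return the newest known replacement for old_hash, or None if unmapped.

    Following chains (a hash that was rewritten more than once) is an
    assumption of this sketch, not something the task specifies.
    """
    seen = {old_hash}
    current = old_hash
    while current in rewrites:
        current = rewrites[current]
        if current in seen:
            # A cycle means the map is inconsistent; give up rather than loop.
            return None
        seen.add(current)
    return current if current != old_hash else None
```

When the page for a deleted commit is loaded, a non-None result could be used to show a "this commit was rewritten" pointer to the new hash; a None result would fall back to the existing deleted-commit page.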

Event Timeline

We have an existing repository_badcommit table which is used to mark commits as "bad", where "bad" approximately means "the repository as a whole works, but the VCS tools cannot or should not read this particular commit for whatever reason". This is rare, but can happen when, e.g., third-party tools are used to write into repositories.

I originally implemented this at Facebook because some commits there (resulting, as I recall, from merging multiple SVN repositories, then bridging SVN to Git, all with third-party tools) just didn't work. I don't recall the specific nature of the failure, but it was easier to write these commits off than try to work around it (they were a handful of old commits from around the time the repositories were merged, years in the past).

Someone at Facebook also deleted a root directory by accident in SVN, which affected 78M paths or something like that. The next change reverted it. Although I believe I did import those commits, the Diffusion UI at the time choked when trying to display them so I think I just marked them bad and moved on. Diffusion is more capable of handling this situation today, but probably still doesn't handle it with any particular grace or elegance.

There's no tooling to write to this table, and I believe it has only come up once or twice since 2011.

I'm planning to merge the idea of "bad commit" and "rewritten commit" into a new "hint" table, something like this:

<repositoryPHID, oldCommit, newCommit?, hintType>

...where hintType is one of "bad" or "rewrite" today. This should let us reduce how weird and hacky the badcommit table is, and I can provide some unified tooling around accessing this table to make it more reasonable for users to update and for us to test.
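To make the proposed tuple a little more concrete, here is a hedged Python sketch using SQLite. The table name `repository_hint`, the column types, the one-hint-per-commit primary key, and the helper names `write_hint` and `find_hint` are assumptions for illustration; only the four fields and the two hint types come from the proposal above.

```python
import sqlite3

# The two hint types named in the proposal.
VALID_HINT_TYPES = ("bad", "rewrite")


def create_hint_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one hint per (repository, old commit) pair.
    conn.execute(
        """
        CREATE TABLE repository_hint (
            repositoryPHID TEXT NOT NULL,
            oldCommit      TEXT NOT NULL,
            newCommit      TEXT,           -- NULL for "bad" hints
            hintType       TEXT NOT NULL,
            PRIMARY KEY (repositoryPHID, oldCommit)
        )
        """
    )


def write_hint(conn, repository_phid, old_commit, hint_type, new_commit=None):
    """Insert or replace a hint after basic consistency checks."""
    if hint_type not in VALID_HINT_TYPES:
        raise ValueError("unknown hint type: %r" % (hint_type,))
    if hint_type == "rewrite" and new_commit is None:
        raise ValueError("a rewrite hint needs a new commit")
    if hint_type == "bad" and new_commit is not None:
        raise ValueError("a bad hint must not name a new commit")
    conn.execute(
        "INSERT OR REPLACE INTO repository_hint "
        "(repositoryPHID, oldCommit, newCommit, hintType) VALUES (?, ?, ?, ?)",
        (repository_phid, old_commit, new_commit, hint_type),
    )


def find_hint(conn, repository_phid, old_commit):
    """Return (hintType, newCommit) for a commit, or None if there is no hint."""
    return conn.execute(
        "SELECT hintType, newCommit FROM repository_hint "
        "WHERE repositoryPHID = ? AND oldCommit = ?",
        (repository_phid, old_commit),
    ).fetchone()
```

The optional `newCommit` column is what lets "bad" and "rewrite" share one table: a bad commit simply has nowhere to point.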

It's sort of odd that this information isn't just in the commit table (e.g., a "bad" flag on the commit), but I can come up with a few somewhat-plausible arguments for keeping it separate:

  • Write semantics for the commit table are complicated because of writes by the daemons, and keeping this separate sidesteps that issue.
  • Although I plan to provide tooling, this is a pretty advanced feature and installs may reasonably want to just write to the table directly rather than go through whatever bin/repository hint API I put on top of it, particularly if they're writing very large numbers of commit hints.
  • Putting this in a separate table allows us to query for rewrites when looking at the new commit, so we could show "This commit was rewritten from XYZ." Although I don't currently plan to do this (it would have a small performance cost and I don't think it's actually useful?), it's something we might want to do in the future.

I don't think these arguments mandate a separate table, but I think we're on firm enough ground to comfortably keep the table separate.

So my plan is:

  • Introduce this new table.
  • Migrate badcommit to it, and drop badcommit.
  • Provide bin/repository hint for writing new hints.
  • Support a rewrite hint type.
  • Document the existence of repository hints.
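The first two steps of the plan could look roughly like the following Python/SQLite sketch. Everything here is illustrative: the new table name `repository_hint` and the legacy column names (`repositoryPHID`, `commitIdentifier`) are assumptions, since the task does not describe the layout of repository_badcommit.

```python
import sqlite3


def migrate_bad_commits(conn: sqlite3.Connection) -> int:
    """Copy legacy bad-commit rows into the hint table, then drop the old table.

    Returns the number of legacy rows found. Column names on the legacy
    table are hypothetical.
    """
    with conn:  # commit on success, roll back if anything raises
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS repository_hint (
                repositoryPHID TEXT NOT NULL,
                oldCommit      TEXT NOT NULL,
                newCommit      TEXT,
                hintType       TEXT NOT NULL,
                PRIMARY KEY (repositoryPHID, oldCommit)
            )
            """
        )
        (legacy_count,) = conn.execute(
            "SELECT COUNT(*) FROM repository_badcommit"
        ).fetchone()
        # Every legacy row becomes a "bad" hint with no replacement commit.
        conn.execute(
            "INSERT OR IGNORE INTO repository_hint "
            "(repositoryPHID, oldCommit, newCommit, hintType) "
            "SELECT repositoryPHID, commitIdentifier, NULL, 'bad' "
            "FROM repository_badcommit"
        )
        conn.execute("DROP TABLE repository_badcommit")
    return legacy_count
```

Rewrite hints would then be added to the same table by whatever tooling ends up behind bin/repository hint, so nothing else needs to know the legacy table ever existed.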

I also have a bit of a vague plan for dealing with the handle/PHID side of this, but want to research that in slightly more detail before writing up anything concrete.