Support for Git's `.mailmap` author mapping
Open, Wishlist, Public

Description

Git already provides a standard way of fixing up misspelled committer/author entries in Git commits. Instead of having to redundantly register all misspelled aliases with Phabricator manually, it'd be terrific if Phabricator could be optionally configured to consult a Git repo's latest .mailmap mapping to normalize committer/authors.

See also http://git-scm.com/docs/git-shortlog#_mapping_authors for details about the .mailmap semantics

Event Timeline

hvr raised the priority of this task to Wishlist.
hvr updated the task description.
hvr added a project: People.
hvr added subscribers: hvr, thoughtpolice.

It seems difficult to identify what the "latest" .mailmap file is. For example, we may be importing a commit which is an ancestor of an arbitrarily large number of branch heads, each with different and conflicting .mailmap files.

We could do this at display time to be more consistent with git (i.e., you see the data adjusted by the .mailmap which is at the HEAD of the current branch), but it would impose a performance cost, and not work when viewing a commit in isolation (not on a branch).

This seems most valuable in helping Phabricator figure out which user account a commit is associated with, but that happens in isolation without the context of a branch.

We could mitigate the performance cost by caching the .mailmap at each branch head, then applying it at display time, but this is complex.

In cases like API access, we presumably should emit the unadulterated raw data, so if we do use caching we can't just "improve" the data we're storing, but need to store both the existing and the new data.
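To make the caching idea concrete, here is a rough Python sketch of caching one parsed mapping per branch head and applying it only at display time, so the raw author data stays available for API consumers. Everything here is hypothetical (`MailmapCache`, `DisplayedAuthor`, `author_for_display`, and the `load` callback are made-up names, not Phabricator code), and the mapping is simplified to "commit email to canonical identity".

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Identity = Tuple[str, str]          # (name, email)
Mailmap = Dict[str, Identity]       # simplified: lowercased commit email -> canonical identity


@dataclass(frozen=True)
class DisplayedAuthor:
    raw: Identity       # exactly what the commit records; what an API would emit
    display: Identity   # what a UI would show after applying the mapping


class MailmapCache:
    """Caches one parsed mapping per branch head commit (hypothetical design)."""

    def __init__(self, load: Callable[[str], Mailmap]):
        self._load = load                       # assumed to read and parse .mailmap at a commit
        self._by_head: Dict[str, Mailmap] = {}
        self.loads = 0                          # counts cache misses, for illustration

    def get(self, head_commit: str) -> Mailmap:
        if head_commit not in self._by_head:
            self._by_head[head_commit] = self._load(head_commit)
            self.loads += 1
        return self._by_head[head_commit]


def author_for_display(cache: MailmapCache, head_commit: str, raw: Identity) -> DisplayedAuthor:
    """Apply the cached mapping at display time without touching the stored raw data."""
    mailmap = cache.get(head_commit)
    display = mailmap.get(raw[1].lower(), raw)
    return DisplayedAuthor(raw=raw, display=display)
```

Keying the cache by the head commit rather than the branch name means a pushed branch naturally gets a fresh entry, at the cost of never evicting old ones in this sketch.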

At first blush, the format seems a little ambiguous, so writing a parser probably requires porting whatever Git does. (If we're lucky, it uses C rather than Perl.)
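For a sense of scale, a minimal Python sketch of a parser for the four entry forms described in the Git documentation linked above might look like the following. The names `parse_mailmap` and `resolve` are hypothetical, case-insensitive matching is an assumption based on my reading of the docs, and this is not a port of Git's actual parser, so it will not cover any undocumented behaviour.

```python
import re
from typing import Dict, Optional, Tuple

# One or two "Name <email>" groups per line; either name may be empty.
_ENTRY = re.compile(r'^\s*([^<]*?)\s*<([^>]*)>\s*(?:([^<]*?)\s*<([^>]*)>)?')

Key = Tuple[Optional[str], str]                 # (commit name or None, commit email), lowercased
Value = Tuple[Optional[str], Optional[str]]     # (proper name or None, proper email or None)


def parse_mailmap(text: str) -> Dict[Key, Value]:
    """Parse .mailmap text into a lookup table (sketch, documented forms only)."""
    mapping: Dict[Key, Value] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue                            # blank line or comment
        match = _ENTRY.match(line)
        if not match:
            continue                            # silently skip lines we cannot read
        name1, email1, name2, email2 = match.groups()
        if email2 is None:
            # "Proper Name <commit@email>": fix the name for this commit email.
            mapping[(None, email1.lower())] = (name1 or None, None)
        else:
            # "[Proper Name] <proper@email> [Commit Name] <commit@email>"
            key = (name2.lower() or None, email2.lower())
            mapping[key] = (name1 or None, email1 or None)
    return mapping


def resolve(mapping: Dict[Key, Value], name: str, email: str) -> Tuple[str, str]:
    """Return the canonical (name, email); name+email entries win over email-only ones."""
    for key in ((name.lower(), email.lower()), (None, email.lower())):
        if key in mapping:
            new_name, new_email = mapping[key]
            return (new_name or name, new_email or email)
    return (name, email)
```

A later entry with the same key overwrites an earlier one in this sketch; whether that matches Git in every case is something a real port would need to check.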

I've also never heard of .mailmap before and never seen one in use.

Overall, this seems solidly wishlisty, with a fairly unimpressive effort-to-value ratio.

The parser isn't that bad, but I did discover a Git feature which appears to be undocumented and used only by the kernel:

http://git.kaarsemaker.net/git/commit/7595e2ee6ef9b35ebc8dc45543723e1d89765ce3/

For the record, here's an example of what we have to cope with in our code-base: https://phabricator.haskell.org/diffusion/GHC/browse/master/.mailmap

As for "latest", Phabricator could simply be configured to treat a single branch (usually master) as holding the canonical latest .mailmap. I'm mostly thinking of the use case where you reparse commits explicitly: you usually edit the .mailmap after the fact to fix up some mistake long after you pushed the commit, once you notice that git shortlog and other tools don't properly associate that one commit.
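As a rough illustration of the "single canonical branch" idea, Git itself can be asked to resolve an identity against the .mailmap stored on a given branch, using the `mailmap.blob` setting together with `git check-mailmap`. The Python wrapper below is only a sketch: the function names and the `repo_path` and `branch` parameters are hypothetical, and nothing here reflects how Phabricator actually invokes Git.

```python
import subprocess
from typing import List


def check_mailmap_argv(branch: str, name: str, email: str) -> List[str]:
    """Build a git command that resolves one identity against <branch>:.mailmap."""
    return [
        "git",
        "-c", f"mailmap.blob={branch}:.mailmap",   # read the mapping from a blob on that branch
        "check-mailmap",
        f"{name} <{email}>",
    ]


def canonical_identity(repo_path: str, branch: str, name: str, email: str) -> str:
    """Run the command in repo_path and return Git's 'Name <email>' answer."""
    result = subprocess.run(
        check_mailmap_argv(branch, name, email),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

For example, `canonical_identity("/path/to/repo", "master", "jd", "jd@laptop.local")` would return whatever the master branch's .mailmap says about that identity. One caveat, as far as I can tell: in a checkout that has a .mailmap in its working tree, Git consults that file as well, so this is cleanest against a bare repository.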

Related to this issue, take a look at author mapping in Crucible: https://confluence.atlassian.com/display/FISHEYE/Changing+your+user+profile#Changingyouruserprofile-AuthorMapping

IMHO there should be support for more flexible author mapping. This is not specific to Git.

In an ideal world, Phabricator and the version management system would use the same user database. In practice, this will at times not be the case.

@ostraaten, see T1731 for general mapping of VCS usernames. This is a much more attractive feature than supporting .mailmap, specifically, given the very limited use of .mailmap.

Just as a data point, the Wikimedia community is a user of git's .mailmap support for our repositories as it gives repository owners/committers a way to clean things up unilaterally (instead of pushing the work on the individual authors who may or may not be contactable anymore). I was reminded of this after seeing a recent cleanup .mailmap commit in our operations/puppet repository: https://gerrit.wikimedia.org/r/#/c/282405/