Currently, when we discover and import commits, we identify the author and committer statically at import time by doing a lookup of the strings against known usernames and email addresses. This works well for established installs, but not as well for new installs. Some of the issues include:
- if you import an existing repository and later register users, users won't be retroactively associated with commits;
- the underlying algorithm usually works fairly well, but is somewhat opaque;
- in some cases (T1731) there are historic commits with dissimilar usernames;
- retroactively fixing things requires running scripts that aren't available to normal users or to administrators on Phacility instances, increasing our administrative burden.
Broadly, retroactive updates under the current system potentially require updating an authorPHID on millions of objects, so editing can never be lightweight.
It would be better to define a new type of object that provides an indirection layer between an author/committer string and a user account. A single update could then retroactively remap evan <email@example.com> to user account @epriestley once it becomes clear that mistakes were made with a goofy string.
This is also probably a clean solution to T1731, by letting users "claim" a string from Diffusion, rather than list all their alternate names somewhere in settings.
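To make the indirection concrete, here is a minimal sketch in Python rather than in the actual codebase. The names `Identity`, `IdentityIndex`, `claim`, and `commits_by` are hypothetical and only illustrate the idea: commits keep pointing at the raw string, and only the identity record points at a user.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Identity:
    """One raw author/committer string (hypothetical model, not a real schema)."""
    name: str
    user: Optional[str] = None  # assigned user account, or None if unclaimed


class IdentityIndex:
    def __init__(self) -> None:
        self.identities: Dict[str, Identity] = {}
        self.commit_authors: Dict[str, str] = {}  # commit hash -> raw string

    def import_commit(self, commit: str, author_string: str) -> None:
        # Import only records the raw string; no user lookup happens here.
        self.identities.setdefault(author_string, Identity(author_string))
        self.commit_authors[commit] = author_string

    def claim(self, author_string: str, user: str) -> None:
        # One update on the identity; no commit records are rewritten.
        self.identities[author_string].user = user

    def author_of(self, commit: str) -> Optional[str]:
        return self.identities[self.commit_authors[commit]].user

    def commits_by(self, user: str) -> List[str]:
        return sorted(
            commit
            for commit, raw in self.commit_authors.items()
            if self.identities[raw].user == user
        )


index = IdentityIndex()
index.import_commit("a1", "evan <email@example.com>")
index.import_commit("b2", "evan <email@example.com>")
index.import_commit("c3", "jsmith")

before = index.commits_by("epriestley")  # user registered after import
index.claim("evan <email@example.com>", "epriestley")
after = index.commits_by("epriestley")
```

In this sketch, `before` is empty and `after` contains both commits authored under the claimed string, even though the commits themselves were never modified. The `jsmith` commit stays unassociated until someone claims that string.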
Some open questions:
- Does some author string X ever identify different users in different repositories?
- Does some author string X ever identify different users in the same repository?
I suspect the answer to both of these questions is "yes", at least in some sense. For example, defaults like firstname.lastname@example.org probably identify many different users within a repository.
Implementation is much simpler if the answer is "no", or if we can pretend the answer is "no". In the email@example.com case, it's probably fine to leave these commits without an author association. I'm not sure if we have cases where two different users committed to two different repositories as jsmith. This could reasonably occur in SVN. Maybe we just cross this bridge when we come to it, since any instances where this is a problem are already problems today.
I also don't see a clean way to accommodate these in the design even if we know the answer is "yes". In particular, the possibility of repository-level or commit-level overrides makes the JOINs we'd have to do in queries like "commits by author X" very complex. If these cases exist but are exceptionally rare, we could provide CLI tools to do a similar retroactive rewrite operation.
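For comparison, if we pretend the answer is "no" and each string maps to at most one user globally, "commits by author X" needs only a single JOIN. The sketch below uses Python's `sqlite3` with a made-up schema; the table and column names (`identity`, `commits`, `author_identity`) and the helper `commits_by_user` are assumptions for illustration, not the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per raw author/committer string.
    CREATE TABLE identity (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL,
        user TEXT              -- NULL while the string is unclaimed
    );
    CREATE TABLE commits (
        hash TEXT PRIMARY KEY,
        repository TEXT NOT NULL,
        author_identity INTEGER NOT NULL REFERENCES identity(id)
    );
""")
conn.executemany(
    "INSERT INTO identity (id, name, user) VALUES (?, ?, ?)",
    [
        (1, "evan <email@example.com>", "epriestley"),
        (2, "jsmith", None),
    ],
)
conn.executemany(
    "INSERT INTO commits (hash, repository, author_identity) VALUES (?, ?, ?)",
    [
        ("a1", "repoA", 1),
        ("b2", "repoB", 1),
        ("c3", "repoA", 2),
    ],
)


def commits_by_user(conn, user):
    """Return commit hashes whose author identity is assigned to `user`."""
    rows = conn.execute(
        """
        SELECT c.hash
        FROM commits c
        JOIN identity i ON i.id = c.author_identity
        WHERE i.user = ?
        ORDER BY c.hash
        """,
        (user,),
    ).fetchall()
    return [row[0] for row in rows]
```

Repository-level or commit-level overrides would each add another table to consult before falling back to the global assignment, which is where the query complexity comes from.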
When a User is destroyed, it would be nice to synchronize the RepositoryIdentity table. See some discussion in D20224.
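One possible shape for that synchronization is to clear every assignment that points at the destroyed user, so the strings become claimable again. This is only a sketch under that assumption, and it may not match whatever D20224 settles on. The function name `release_identities` and the dictionary representation are hypothetical.

```python
from typing import Dict, Optional


def release_identities(
    assignments: Dict[str, Optional[str]], destroyed_user: str
) -> int:
    """Clear assignments pointing at `destroyed_user`; return how many changed."""
    released = 0
    for name, user in assignments.items():
        if user == destroyed_user:
            # Only the value changes, so mutating during iteration is safe.
            assignments[name] = None
            released += 1
    return released


assignments = {
    "evan <email@example.com>": "epriestley",
    "jsmith": "jsmith",
    "evan": "epriestley",
}
released = release_identities(assignments, "epriestley")
```

Here `released` counts the two strings that were assigned to the destroyed user, and the `jsmith` assignment is left untouched.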