Page MenuHomePhabricator

Put an indirection layer between author/committer strings and user accounts
Open, LowPublic

Description

Currently, when we discover and import commits, we identify the author and committer statically at import time by doing a lookup of the strings against known usernames and email addresses. This works well for established installs, but not as well for new installs. Some of the issues include:

  • if you import an existing repository and later register users, users won't be retroactively associated with commits;
  • the underlying algorithm usually works fairly well, but is somewhat opaque;
  • in some cases (T1731) there are historic commits with dissimilar usernames;
  • retroactively fixing things requires running scripts that aren't available to normal users or to administrators on Phacility instances, increasing our administrative burden.

Broadly, retroactive updates under the current system potentially require updating an authorPHID on millions of objects, so editing can never be lightweight.

It would be better to define a new type of object that provides an indirection layer between an author/committer string and a user account, so we can do one update to remap evan <ejp@dogewow.com> to user account @epriestley retroactively after it becomes clear that some mistakes got made with a goofy string.

This is also probably a clean solution to T1731, by letting users "claim" a string from Diffusion, rather than list all their alternate names somewhere in settings.

Some open questions:

  • Does some author string X ever identify different users in different repositories?
  • Does some author string X ever identify different users in the same repository?

I suspect the answer to both of these questions is "yes", at least in some sense. For example, defaults like root@localhost.com probably identify many different users within a repository.

Implementation is much simpler if the answer is "no", or we can pretend the answer is "no". In the root@localhost.com case, it's probably fine to leave these commits without an author association. I'm not sure if we have cases where two users committed to two different repositories as jsmith. This could reasonably occur in SVN. Maybe we just cross this bridge when we come to it, since any instances where this is a problem are already problems today.

I also don't see a clean way to accommodate these in the design even if we know the answer is "yes". In particular, the possibility of repository-level or commit-level overrides make the JOINs we'd have to do in queries like "commits by author X" very complex. If these cases exist but are exceptionally rare, we could provide CLI tools to do a similar retroactive rewrite operation.


When a User is destroyed, it would be nice to synchronize the RepositoryIdentity table. See some discussion in D20224.

Event Timeline

epriestley moved this task from Backlog to v3 on the Diffusion board.Jan 27 2017, 5:13 PM
epriestley edited projects, added Diffusion (v3); removed Diffusion.
20after4 added a subscriber: 20after4.

More or less ran into first issue with Q444.

hskiba added a subscriber: hskiba.Feb 2 2017, 5:29 AM
epriestley moved this task from Backlog to Future on the Customer Impact board.
timor added a subscriber: timor.May 24 2017, 1:57 PM
ioeric added a subscriber: ioeric.Jun 19 2017, 8:26 AM
urzds added a subscriber: urzds.Jul 12 2017, 11:16 AM

See PHI594, where a real user had the same human name as an external user from another project.

To actually build this:

  • Create a new PhabricatorRepositoryIdentity or similar object.
  • It should work like PhabricatorRepositoryRefCursor to accommodate non-UTF8 identities (i.e. have raw, hash and encoding columns) and arbitrarily long identities. If you need test cases, the strategy in T11537#192019 seems very likely to let you build commits with arbitrarily long/silly/garbage author and committer identities.
  • We store the raw identity string that comes out of the VCS, e.g., Blah Blah <blah@blah.com>. Each sequence of bytes is a unique identity. If you very cleverly encode two different display strings in the same sequence of bytes using Shift-JIS in one repository and ISO-ROFLOL in another repository, too bad, Phabricator doesn't care, they're the same identity.
  • We assume an identity always identifies the same user in every repository. You can't have joe be two different users in two different repositories.
  • We assume an identity always identifies the same user in a given repository. You can't have joe be one user on some branches and a different user on other branches, or on old commits vs new commits, etc.
  • These objects should have real PHIDs.
  • Since we're going to let users muck around with them (claim identities, etc) they should also probably have proper transaction support so we can render a log of who messed with stuff and broke everything.
  • In PhabricatorReposiotryCommitMessageParserWorker->updateCommitData(), start creating Identity objects for $author and $committer if they do not exist yet.
  • Use bin/repository reparse --message ... to test this. It should populate a bunch of stuff into the database.

That should work, just not do anything useful yet.

  • Add authorIdentityPHID and committerIdentityPHID to PhabricatorRepositoryCommit.
  • Start setting those in updateCommitData().
  • At some point we're going to have to do some painful migrations here but probably not for a bit.

Before we migrate, we can do this:

  • Add a UI in Diffusion to list/manage identies (standard search/list/view/edit transaction stuff), just with no way to create a new identity.
  • I think we probably add three columns to Identity, like automaticGuessedUserPHID, manuallySetUserPHID, and currentEffectiveUserPHID.

These new columns work like this:

  • When we create an identity in MessageParserWorker, we use DiffusionResolveUserQuery to use the current rules (or maybe just the "email address" rule, without the "username" or "real name" rules) to guess which Phabricator user this identity resolves to.
  • When a user adds a new email address, we re-resolve any identities which contain it. I think it's fine to just do this with LIKE ..., although ideally we'd actually do the re-resolution in the daemons (just queue a task for them, "re-resolve everything containing this email address"). The goal here is to automatically associate existing commits with users who newly sign up. We might need a little fiddling here with uppercase/lowercase, etc.
  • There's probably some kind of script to update all the guesses again explicitly.
  • When users use the edit UI to "claim" or "release" an identity, they update the manuallySetUserPHID.
  • After the guessed or manual PHID is updated, the currentEffectiveUserPHID is set to the manual PHID (if it exists), or the guessed PHID (if it exists), or null (if neither exist). This is the column we'll actually JOIN/WHERE on, etc.
  • There should be a way to set the manual PHID to "this identity does not correspond to any Phabricator user" and force the effective identity to null, even if we guessed that it does correspond to a user.

To make this do more stuff:

  • In the UI, we can start doing stuff like if ($commit->getAuthorIdentityPHID()) { ... } and marking the older getAuthorPHID() stuff as obsolete.

Then:

  • Big scary migration. I don't really see a way around this.

Then we make all the rest of the UI and query stuff use the new Identity stuff, get rid of the old stuff, and hopefully we're in good shape?

From T13152 -- none of this is too important, just didn't want to lose it when I close that:

A sub-issue here is that rebuild-identities (and, likewise, CommitMessageParserWorker) uses getAuthorName() from the CommitData.
This may not be the original string: in particular, it has been converted to utf8 and truncated to 255 characters.
For some data on secure, it is incomplete (probably because of different behavior in older versions of Phabricator), e.g. jack instead of jack <jack@jacksdomain.com>.
It may also be mangled somewhat by DiffusionCommitRef, which splits the author apart rather than preserving it completely faithfully.
Under the hood, DiffusionCommitRef is (possibly) transported in JSON which can not transport non-utf8 values. Fixing this is probably out of scope until wire encodings in T5955, though.

These hosts ultimately timed out on the initial rebuild-identities: db010, db024, db025.

The latter two are somewhat special; the former might have just stalled in a mundane way. I'm going to make a couple of tweaks to the script to make re-running it faster (basically, skip writes if they'd have no effect) and finish those three up manually.

I deployed D19484 and db010, db024 and db025 finished up with no issues.

just wanted to give you some reference data (your mileage may vary), I have 2 Phabricator instances with multiple git repos

one has ~300,000 commits ( a mixture of read-only (mirrors) and hosted repos), the other ~200,000 (all hosted)

in both cases I ran rebuild-identities after doing the Week25 upgrade

time ./bin/repository rebuild-identities --all

it took

21 minutes on the one with 300,000 commits and
16 minutes on the one with 200,000

Can't help but think that this was most likely limited by the speed of the terminal output.

Hope that helps others

amckinley added a comment.EditedJun 23 2018, 7:39 PM

just wanted to give you some reference data (your mileage may vary), I have 2 Phabricator instances with multiple git repos
one has ~300,000 commits ( a mixture of read-only (mirrors) and hosted repos), the other ~200,000 (all hosted)
in both cases I ran rebuild-identities after doing the Week25 upgrade
time ./bin/repository rebuild-identities --all
it took
21 minutes on the one with 300,000 commits and
16 minutes on the one with 200,000
Can't help but think that this was most likely limited by the speed of the terminal output.
Hope that helps others

Thanks for the data point. @epriestley I’m going to add a —quiet flag for that script; diff coming soon.

The "activity" migration went to production just now. I expect all production instances just bailed out, since I ran rebuild-identities some time ago, but I'll verify that:

$ phage remote query --pools db -- --instance-statuses up --query 'SELECT * FROM <INSTANCE>_config.config_manualactivity'

This caught a handful of instances where commits had been missed and the activity had activated.

I spot-checked these and found:

  • One unreachable commit never had its message parse before it was marked as unreachable (this is consistent with pushing a branch by accident, then deleting it), so it didn't get author info. rebuild-identities now assigned it an identity corresponding to "" (the empty string). This presumably is a behavioral change. This is possibly a questionable behavior, but the state is such an edge case that I'm just going to mark it done for now.
  • Handful of the same on the next instance.
  • One on the next instance.
  • Two on the next instance.
  • The next instance had about 800 of these.

I did rebuild-identities on every instance, then config done identities to mark the activity complete. I also found one instance with some imported/old data that had a reindex activity scheduled. Now, no instances report waiting activities.

So, current notes here for thinking about handling null in the future:

  • A commit may legitimately have a null value for authorIdentityPHID between the time it is discovered and the time the message is parsed. This is normally a few seconds, but could be arbitrarily long.
  • When commits become unreachable before their messages parse (which is routine if you push a large branch by accident, then immediately delete it) the message parse phase will not activate and the commit will normally never get an authorIdentityPHID.
  • The rebuild-identities script will currently generate an empty string Identity for these commits and assign them the PHID for that empty identity, so commits in this state may or may not have an authorIdentityPHID. We might want to make this behavior more consistent as we look at how we'll handle null and how we'll move forward into the era of Identities.

I did rebuild-identities on every instance

Er, not every instance, but the the handful of instances (I think 7 total) which had the "identities" activity end up in queue after the migration ran.