Put an indirection layer between author/committer strings and user accounts
Closed, Resolved · Public

Authored By: epriestley, Jan 27 2017, 5:06 PM

Description

Currently, when we discover and import commits, we identify the author and committer statically at import time by doing a lookup of the strings against known usernames and email addresses. This works well for established installs, but not as well for new installs. Some of the issues include:

if you import an existing repository and later register users, users won't be retroactively associated with commits;
the underlying algorithm usually works fairly well, but is somewhat opaque;
in some cases (T1731) there are historic commits with dissimilar usernames;
retroactively fixing things requires running scripts that aren't available to normal users or to administrators on Phacility instances, increasing our administrative burden.

Broadly, retroactive updates under the current system potentially require updating an authorPHID on millions of objects, so editing can never be lightweight.

It would be better to define a new type of object that provides an indirection layer between an author/committer string and a user account, so we can do one update to remap evan <ejp@dogewow.com> to user account @epriestley retroactively after it becomes clear that some mistakes got made with a goofy string.
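As a rough illustration of why the indirection helps, here is a minimal sketch in Python (not Phabricator's PHP). The `Identity` and `Commit` classes and the `commits_by_user` helper are hypothetical names invented for this example. Commits point at an identity object rather than at a user, so remapping a string is a single write no matter how many commits carry it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Identity:
    # Raw author/committer string as it came out of the VCS.
    raw: str
    # User account this string currently maps to, if any.
    user: Optional[str] = None


@dataclass
class Commit:
    identifier: str
    # Commits reference an identity, never a user directly.
    author_identity: Identity


def commits_by_user(commits: List[Commit], user: str) -> List[str]:
    """Return identifiers of commits whose author identity maps to `user`."""
    return [c.identifier for c in commits if c.author_identity.user == user]


goofy = Identity(raw="evan <ejp@dogewow.com>")
commits = [Commit(f"commit-{i}", goofy) for i in range(3)]

before = commits_by_user(commits, "epriestley")
goofy.user = "epriestley"  # one update; no per-commit rewrite needed
after = commits_by_user(commits, "epriestley")
```

In a real schema the object reference would be a PHID column on the commit row, but the cost model is the same: the edit touches one identity row rather than every commit.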

This is also probably a clean solution to T1731, by letting users "claim" a string from Diffusion, rather than list all their alternate names somewhere in settings.

Some open questions:

Does some author string X ever identify different users in different repositories?
Does some author string X ever identify different users in the same repository?

I suspect the answer to both of these questions is "yes", at least in some sense. For example, defaults like root@localhost.com probably identify many different users within a repository.

Implementation is much simpler if the answer is "no", or we can pretend the answer is "no". In the root@localhost.com case, it's probably fine to leave these commits without an author association. I'm not sure if we have cases where two users committed to two different repositories as jsmith. This could reasonably occur in SVN. Maybe we just cross this bridge when we come to it, since any instances where this is a problem are already problems today.

I also don't see a clean way to accommodate these in the design even if we know the answer is "yes". In particular, the possibility of repository-level or commit-level overrides makes the JOINs we'd have to do in queries like "commits by author X" very complex. If these cases exist but are exceptionally rare, we could provide CLI tools to do a similar retroactive rewrite operation.

When a User is destroyed, it would be nice to synchronize the RepositoryIdentity table. See some discussion in D20224.

Revisions and Commits

rP Phabricator
	Closed		D19492 Start changing DiffusionCommitController to use identities
	Abandoned		D19506 Add --quiet flag to rebuild-identities
		D20881	rP97bed3508579 Show repository information (and use repository identities) in commit hovercards
		D20418	rP5b6d6c4fb313 Use repository identities, not denormalized strings, to identify authors for…
		D19580	rP5c4c593af325 Update DiffusionLastModifiedController to use identities
		D19497	rPa6951a0a5aa0 Add migration to encourage rebuilding repository identities
		D19491	rP05f333dfba0a Attach identities to commits and users to identities
		D19484	rP38557b96c27e (stable) Make re-running `rebuild-identities` a bit faster and add a little…
		D19484	rP1459fb303769 Make re-running `rebuild-identities` a bit faster and add a little progress…
		D19443	rPfe5fde591026 Assign RepositoryIdentity objects to commits
		D19429	rPf191a66490b1 Add controllers/search/edit engine functionality to RepositoryIdentity
		D19423	rPcd84e53c4413 Begin building out RepositoryIdentity indirection layer

Related Objects

Status	Assigned	Task
Resolved	amckinley	T12164 Put an indirection layer between author/committer strings and user accounts
Open	epriestley	T13444 Provide a standalone script entry point for resolving a repository identity
Resolved	epriestley	T13457 "phabricator_repository.repository_commit" table has poor keys for naive iteration

Event Timeline

epriestley created this task.Jan 27 2017, 5:06 PM

Herald added subscribers: chad, eadler. · View Herald TranscriptJan 27 2017, 5:06 PM

epriestley mentioned this in T1731: Allow users to set their VCS names.Jan 27 2017, 5:11 PM

epriestley merged a task: T1731: Allow users to set their VCS names.

epriestley added subscribers: klimek, FauxFaux, guvenatbakan and 17 others.

epriestley moved this task from Backlog to v3 on the Diffusion board.Jan 27 2017, 5:13 PM

epriestley edited projects, added Diffusion (v3); removed Diffusion.

20after4 awarded a token.Jan 28 2017, 1:11 AM

20after4 added a subscriber: 20after4.

More or less ran into first issue with Q444.

hskiba added a subscriber: hskiba.Feb 2 2017, 5:29 AM

epriestley moved this task from Backlog to Wow! Features! on the Diffusion (v3) board.Feb 2 2017, 3:53 PM

epriestley mentioned this in T12525: amckinley's Onboarding.Apr 9 2017, 1:46 PM

epriestley edited projects, added Customer Impact; removed Phacility.Apr 12 2017, 3:57 PM

epriestley moved this task from Backlog to Future on the Customer Impact board.

epriestley mentioned this in T12658: Make it more clear what renaming a user does and doesn't affect.May 1 2017, 12:33 PM

timor added a subscriber: timor.May 24 2017, 1:57 PM

epriestley mentioned this in T12251: Add author information when creating a build in Buildkite.Jun 5 2017, 3:45 PM

ioeric added a subscriber: ioeric.Jun 19 2017, 8:26 AM

urzds added a subscriber: urzds.Jul 12 2017, 11:16 AM

jsixface added a subscriber: jsixface.Jul 21 2017, 4:42 PM

epriestley mentioned this in T13075: Plans: Diffusion authors, Herald ref rules, parsing, performance.Feb 14 2018, 1:33 AM

asherkin added a subscriber: asherkin.Feb 14 2018, 10:10 AM

See PHI594, where a real user had the same human name as an external user from another project.

To actually build this:

Create a new PhabricatorRepositoryIdentity or similar object.
It should work like PhabricatorRepositoryRefCursor to accommodate non-UTF8 identities (i.e. have raw, hash and encoding columns) and arbitrarily long identities. If you need test cases, the strategy in T11537#192019 seems very likely to let you build commits with arbitrarily long/silly/garbage author and committer identities.
We store the raw identity string that comes out of the VCS, e.g., Blah Blah <blah@blah.com>. Each sequence of bytes is a unique identity. If you very cleverly encode two different display strings in the same sequence of bytes using Shift-JIS in one repository and ISO-ROFLOL in another repository, too bad, Phabricator doesn't care, they're the same identity.
We assume an identity always identifies the same user in every repository. You can't have joe be two different users in two different repositories.
We assume an identity always identifies the same user in a given repository. You can't have joe be one user on some branches and a different user on other branches, or on old commits vs new commits, etc.
These objects should have real PHIDs.
Since we're going to let users muck around with them (claim identities, etc) they should also probably have proper transaction support so we can render a log of who messed with stuff and broke everything.
In PhabricatorRepositoryCommitMessageParserWorker->updateCommitData(), start creating Identity objects for $author and $committer if they do not exist yet.
Use bin/repository reparse --message ... to test this. It should populate a bunch of stuff into the database.

That should work, just not do anything useful yet.
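The raw/hash/encoding layout described above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the `IdentityRow` fields, the choice of SHA-256, and the in-memory dict standing in for a table are not taken from PhabricatorRepositoryRefCursor. The point is only that a fixed-width digest of the raw bytes gives a usable unique key even when the identity is very long or is not valid UTF-8.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class IdentityRow:
    name_raw: bytes               # exact bytes from the VCS, any length
    name_hash: str                # fixed-width lookup key derived from the bytes
    name_encoding: Optional[str]  # "utf8" if the bytes decode cleanly, else None


def make_row(raw: bytes) -> IdentityRow:
    # SHA-256 is an assumption; any stable digest of the raw bytes would do.
    digest = hashlib.sha256(raw).hexdigest()
    try:
        raw.decode("utf-8")
        encoding: Optional[str] = "utf8"
    except UnicodeDecodeError:
        encoding = None
    return IdentityRow(raw, digest, encoding)


def get_or_create(table: Dict[str, IdentityRow], raw: bytes) -> IdentityRow:
    """Return the existing row for these bytes, creating it if needed."""
    row = make_row(raw)
    return table.setdefault(row.name_hash, row)
```

Because the key is computed from bytes alone, two repositories that produce the same byte sequence share one identity regardless of what encoding each intended, which matches the "too bad, they're the same identity" rule.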

Add authorIdentityPHID and committerIdentityPHID to PhabricatorRepositoryCommit.
Start setting those in updateCommitData().
At some point we're going to have to do some painful migrations here but probably not for a bit.

Before we migrate, we can do this:

Add a UI in Diffusion to list/manage identities (standard search/list/view/edit transaction stuff), just with no way to create a new identity.
I think we probably add three columns to Identity, like automaticGuessedUserPHID, manuallySetUserPHID, and currentEffectiveUserPHID.

These new columns work like this:

When we create an identity in MessageParserWorker, we use DiffusionResolveUserQuery to use the current rules (or maybe just the "email address" rule, without the "username" or "real name" rules) to guess which Phabricator user this identity resolves to.
When a user adds a new email address, we re-resolve any identities which contain it. I think it's fine to just do this with LIKE ..., although ideally we'd actually do the re-resolution in the daemons (just queue a task for them, "re-resolve everything containing this email address"). The goal here is to automatically associate existing commits with users who newly sign up. We might need a little fiddling here with uppercase/lowercase, etc.
There's probably some kind of script to update all the guesses again explicitly.
When users use the edit UI to "claim" or "release" an identity, they update the manuallySetUserPHID.
After the guessed or manual PHID is updated, the currentEffectiveUserPHID is set to the manual PHID (if it exists), or the guessed PHID (if it exists), or null (if neither exists). This is the column we'll actually JOIN/WHERE on, etc.
There should be a way to set the manual PHID to "this identity does not correspond to any Phabricator user" and force the effective identity to null, even if we guessed that it does correspond to a user.
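The precedence rules in the list above can be sketched as a small Python function. The `NO_USER` sentinel is a hypothetical stand-in for "this identity does not correspond to any Phabricator user"; how that state would actually be stored is not specified here.

```python
from typing import Optional

# Hypothetical sentinel: a manual choice of "not any Phabricator user".
NO_USER = "no-user"


def effective_user(guessed: Optional[str], manual: Optional[str]) -> Optional[str]:
    """Compute the effective user PHID from the guessed and manual values."""
    if manual == NO_USER:
        # An explicit "nobody" overrides even a confident guess.
        return None
    if manual is not None:
        # A manual claim always wins over the automatic guess.
        return manual
    # Fall back to the guess, which may itself be None.
    return guessed
```

Storing the computed result in its own column, rather than evaluating this at query time, is what keeps "commits by author X" a simple equality match.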

To make this do more stuff:

In the UI, we can start doing stuff like if ($commit->getAuthorIdentityPHID()) { ... } and marking the older getAuthorPHID() stuff as obsolete.

Then:

Big scary migration. I don't really see a way around this.

Then we make all the rest of the UI and query stuff use the new Identity stuff, get rid of the old stuff, and hopefully we're in good shape?

epriestley assigned this task to amckinley.Apr 30 2018, 10:15 PM

amckinley added a revision: D19423: Begin building out RepositoryIdentity indirection layer.May 1 2018, 8:36 PM

amckinley added a revision: D19429: Add controllers/search/edit engine functionality to RepositoryIdentity.May 5 2018, 3:23 AM

amckinley added a revision: D19443: Assign RepositoryIdentity objects to commits.May 10 2018, 1:45 AM

amckinley added a commit: rPcd84e53c4413: Begin building out RepositoryIdentity indirection layer.May 31 2018, 2:01 PM

amckinley added a commit: rPf191a66490b1: Add controllers/search/edit engine functionality to RepositoryIdentity.May 31 2018, 2:03 PM

amckinley added a commit: rPfe5fde591026: Assign RepositoryIdentity objects to commits.May 31 2018, 2:28 PM

From T13152 -- none of this is too important, just didn't want to lose it when I close that:

A sub-issue here is that rebuild-identities (and, likewise, CommitMessageParserWorker) uses getAuthorName() from the CommitData.

This may not be the original string: in particular, it has been converted to utf8 and truncated to 255 characters.

For some data on secure, it is incomplete (probably because of different behavior in older versions of Phabricator), e.g. jack instead of jack <jack@jacksdomain.com>.

It may also be mangled somewhat by DiffusionCommitRef, which splits the author apart rather than preserving it completely faithfully.

Under the hood, DiffusionCommitRef is (possibly) transported in JSON, which cannot transport non-utf8 values. Fixing this is probably out of scope until wire encodings in T5955, though.

These hosts ultimately timed out on the initial rebuild-identities: db010, db024, db025.

The latter two are somewhat special; the former might have just stalled in a mundane way. I'm going to make a couple of tweaks to the script to make re-running it faster (basically, skip writes if they'd have no effect) and finish those three up manually.

epriestley mentioned this in T13151: Plans: 2018 Week 23 - Week 30 Bonus Content.Jun 12 2018, 1:56 PM

epriestley added a revision: D19484: Make re-running `rebuild-identities` a bit faster and add a little progress information.Jun 12 2018, 2:08 PM

epriestley added a commit: rP1459fb303769: Make re-running `rebuild-identities` a bit faster and add a little progress….Jun 12 2018, 8:19 PM

epriestley added a commit: rP38557b96c27e: (stable) Make re-running `rebuild-identities` a bit faster and add a little….

I deployed D19484 and db010, db024 and db025 finished up with no issues.

amckinley added a revision: D19491: Attach identities to commits and users to identities.Jun 12 2018, 9:10 PM

amckinley added a commit: rP05f333dfba0a: Attach identities to commits and users to identities.Jun 18 2018, 10:31 PM

epriestley mentioned this in 2018 Week 25 (Very Late June).Jun 23 2018, 11:16 AM

just wanted to give you some reference data (your mileage may vary), I have 2 Phabricator instances with multiple git repos

one has ~300,000 commits ( a mixture of read-only (mirrors) and hosted repos), the other ~200,000 (all hosted)

in both cases I ran rebuild-identities after doing the Week25 upgrade

time ./bin/repository rebuild-identities --all

it took

21 minutes on the one with 300,000 commits and
16 minutes on the one with 200,000

Can't help but think that this was most likely limited by the speed of the terminal output.

Hope that helps others

In T12164#239595, @mydeveloperday wrote:

just wanted to give you some reference data (your mileage may vary), I have 2 Phabricator instances with multiple git repos

one has ~300,000 commits ( a mixture of read-only (mirrors) and hosted repos), the other ~200,000 (all hosted)

in both cases I ran rebuild-identities after doing the Week25 upgrade

time ./bin/repository rebuild-identities --all

it took

21 minutes on the one with 300,000 commits and
16 minutes on the one with 200,000

Can't help but think that this was most likely limited by the speed of the terminal output.

Hope that helps others

Thanks for the data point. @epriestley I’m going to add a --quiet flag for that script; diff coming soon.

amckinley added a revision: D19506: Add --quiet flag to rebuild-identities.Jun 23 2018, 7:46 PM

amckinley added a revision: D19497: Add migration to encourage rebuilding repository identities.Aug 9 2018, 7:32 PM

amckinley added a commit: rPa6951a0a5aa0: Add migration to encourage rebuilding repository identities.Aug 10 2018, 8:47 PM

The "activity" migration went to production just now. I expect all production instances just bailed out, since I ran rebuild-identities some time ago, but I'll verify that:

$ phage remote query --pools db -- --instance-statuses up --query 'SELECT * FROM <INSTANCE>_config.config_manualactivity'

This caught a handful of instances where commits had been missed and the activity had activated.

I spot-checked these and found:

One unreachable commit never had its message parsed before it was marked as unreachable (this is consistent with pushing a branch by accident, then deleting it), so it didn't get author info. rebuild-identities now assigned it an identity corresponding to "" (the empty string). This presumably is a behavioral change. This is possibly a questionable behavior, but the state is such an edge case that I'm just going to mark it done for now.
Handful of the same on the next instance.
One on the next instance.
Two on the next instance.
The next instance had about 800 of these.

I did rebuild-identities on every instance, then config done identities to mark the activity complete. I also found one instance with some imported/old data that had a reindex activity scheduled. Now, no instances report waiting activities.

So, current notes here for thinking about handling null in the future:

A commit may legitimately have a null value for authorIdentityPHID between the time it is discovered and the time the message is parsed. This is normally a few seconds, but could be arbitrarily long.
When commits become unreachable before their messages parse (which is routine if you push a large branch by accident, then immediately delete it) the message parse phase will not activate and the commit will normally never get an authorIdentityPHID.
The rebuild-identities script will currently generate an empty string Identity for these commits and assign them the PHID for that empty identity, so commits in this state may or may not have an authorIdentityPHID. We might want to make this behavior more consistent as we look at how we'll handle null and how we'll move forward into the era of Identities.

I did rebuild-identities on every instance

Er, not every instance, but the handful of instances (I think 7 total) which had the "identities" activity end up in queue after the migration ran.

amckinley added a revision: D19492: Start changing DiffusionCommitController to use identities.Aug 13 2018, 10:36 PM

amckinley added a revision: D19580: Update DiffusionLastModifiedController to use identities.Aug 13 2018, 11:42 PM

amckinley added a commit: rP5c4c593af325: Update DiffusionLastModifiedController to use identities.Aug 17 2018, 7:24 PM

epriestley mentioned this in D20129: When building audit queries, prefilter possible "authorPHID" values.Feb 7 2019, 7:54 PM

epriestley mentioned this in rP509fbb6c20e2: When building audit queries, prefilter possible "authorPHID" values.Feb 7 2019, 11:37 PM

epriestley mentioned this in D20224: Remove "Effective User" attachment from Repository Identities.Feb 28 2019, 5:28 PM

epriestley updated the task description. (Show Details)

epriestley added a revision: D20418: Use repository identities, not denormalized strings, to identify authors for "Revision closed by commit X" stories.Apr 14 2019, 5:51 PM

epriestley added a commit: rP5b6d6c4fb313: Use repository identities, not denormalized strings, to identify authors for….Apr 17 2019, 7:24 PM

epriestley mentioned this in T13439: Include repository information on commit hovercards.Oct 31 2019, 3:21 PM

epriestley added a revision: D20881: Show repository information (and use repository identities) in commit hovercards.Oct 31 2019, 3:58 PM

epriestley added a commit: rP97bed3508579: Show repository information (and use repository identities) in commit hovercards.Oct 31 2019, 4:58 PM

epriestley added a subtask: T13444: Provide a standalone script entry point for resolving a repository identity.Nov 4 2019, 8:24 PM

epriestley mentioned this in T13444: Provide a standalone script entry point for resolving a repository identity.Nov 14 2019, 1:19 AM

I think this doesn't have anything actionable left, see T13444 for some followups. This feature probably isn't 100% perfect quite yet, but I think remaining work is just cleanup.

Sorry if this is a stupid question, but now that this is done, how do I actually map between Phabricator users and VCS user strings, in an existing Diffusion repo?

In the general case:

Go to Diffusion → Identities → Browse Identities (left menu).
Find the VCS user string you want to map (perhaps by searching with "Identity Contains" if you have a large number of identities).
Click it, then Edit Identity, type the Phabricator user it should map to, then Save Changes.

If you have a commit by that user handy:

Click the "Author: Identity <identity@identy.com>" link.
As above, Edit Identity and select the user to map to.

Perfect; thanks!

Is there a way to find which object(s) an (unmapped) identity was discovered on? After a rebuild-identities, I have an empty string identity and a couple asdf-style garbage ones. I'd like to find the source of them and either fix them there or see the context to understand the appropriate identity mapping.

Is there a way to find which object(s) an (unmapped) identity was discovered on?

Sort of. There's a value in the database:

mysql> SELECT authorPHID FROM repository_identity WHERE identityNameRaw = 'EC2 Default User <ec2-user@ip-10-170-207-10.us-west-1.compute.internal>';
+--------------------------------+
| authorPHID                     |
+--------------------------------+
| PHID-CMIT-st4quw227w3dakcilr6x |
+--------------------------------+
1 row in set (0.01 sec)

However:

this value currently is not exposed in the UI; and
this value may not be populated for older identities.

You can also find commits by author identity:

mysql> SELECT phid FROM repository_identity WHERE identityNameRaw = 'EC2 Default User <ec2-user@ip-10-170-207-10.us-west-1.compute.internal>';
+--------------------------------+
| phid                           |
+--------------------------------+
| PHID-RIDT-64vn2gtoyccdok4o66kk |
+--------------------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM repository_commit WHERE authorIdentityPHID = 'PHID-RIDT-64vn2gtoyccdok4o66kk'\G
*************************** 1. row ***************************
                   id: 244
         repositoryID: 1
                 phid: PHID-CMIT-st4quw227w3dakcilr6x
     commitIdentifier: 31d2790075fa60a682cd626036cd0faac976db3e
                epoch: 1300222313
           authorPHID: NULL
          auditStatus: none
              summary: durf durf sql
         importStatus: 15
   authorIdentityPHID: PHID-RIDT-64vn2gtoyccdok4o66kk
committerIdentityPHID: PHID-RIDT-64vn2gtoyccdok4o66kk
1 row in set (0.01 sec)

However:

this won't find Committer identities, and they're currently stored in a JSON blob so there's no efficient way to query them.

I imagine exposing the authorPHID value in the UI and supporting "search commits by identity" and "browse commits using this identity" eventually, but I'm not sure when I'll be back in this area of the codebase.

That was sufficient to get what I needed, thanks so much.

Committer identities ... they're currently stored in a JSON blob so there's no efficient way to query them.

...

committerIdentityPHID: PHID-RIDT-64vn2gtoyccdok4o66kk

(I think I remembered this wrong, clearly not in a blob.)
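Since committerIdentityPHID is an ordinary column, the two-step lookup above can probably be collapsed into one JOIN that covers both author and committer. The following is a sketch only: it uses Python's sqlite3 with an in-memory database as a stand-in for MySQL, keeps just the columns shown in the output above, and the second identity and the "aaaa"/"bbbb" commits are made-up test data.

```python
import sqlite3

EC2 = ("EC2 Default User "
       "<ec2-user@ip-10-170-207-10.us-west-1.compute.internal>")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repository_identity (phid TEXT, identityNameRaw TEXT);
CREATE TABLE repository_commit (
  phid TEXT, commitIdentifier TEXT,
  authorIdentityPHID TEXT, committerIdentityPHID TEXT);
""")
conn.executemany(
    "INSERT INTO repository_identity VALUES (?, ?)",
    [("PHID-RIDT-64vn2gtoyccdok4o66kk", EC2),
     ("PHID-RIDT-hypothetical", "joe <joe@example.com>")])  # made-up row
conn.executemany(
    "INSERT INTO repository_commit VALUES (?, ?, ?, ?)",
    [("PHID-CMIT-st4quw227w3dakcilr6x",
      "31d2790075fa60a682cd626036cd0faac976db3e",
      "PHID-RIDT-64vn2gtoyccdok4o66kk", "PHID-RIDT-64vn2gtoyccdok4o66kk"),
     # Made-up commits: authored by joe, one committed by the EC2 identity.
     ("PHID-CMIT-hypothetical-1", "aaaa",
      "PHID-RIDT-hypothetical", "PHID-RIDT-64vn2gtoyccdok4o66kk"),
     ("PHID-CMIT-hypothetical-2", "bbbb",
      "PHID-RIDT-hypothetical", "PHID-RIDT-hypothetical")])


def commits_for_identity(db: sqlite3.Connection, raw: str) -> list:
    """Commit identifiers where `raw` is the author OR committer identity."""
    rows = db.execute("""
        SELECT c.commitIdentifier
        FROM repository_commit c
        JOIN repository_identity i
          ON i.phid IN (c.authorIdentityPHID, c.committerIdentityPHID)
        WHERE i.identityNameRaw = ?
        ORDER BY c.commitIdentifier
    """, (raw,)).fetchall()
    return [r[0] for r in rows]


ec2_commits = commits_for_identity(conn, EC2)
```

On a large install the real tables would need suitable keys on the identity columns for this to be efficient; the sketch does not address indexing.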
