Page MenuHomePhabricator

Add workflow to create repository identities

Authored by amckinley on May 11 2018, 1:46 AM.



Depends on D19443. Creates a workflow for populating the new identity table by iterating over commits, either one repo at a time or all at once. Locally caches identities to avoid fetching them inf times. An actual migration that invokes this workflow will come in another revision that won't land until at least next week.

Performance is ~2k commits in 4.9s on my local machine.

Test Plan

Ran locally a few times with a few different states of the repository_identity table.

Diff Detail

rP Phabricator
Lint Not Applicable
Tests Not Applicable

Event Timeline

Owners added a subscriber: Restricted Owners Package.May 11 2018, 1:46 AM
jcox added inline comments.
6 ↗(On Diff #46514)

This seems to load every commit into memory, is that correct? That is going to be impossible for us (I think we have ~2 million commits). IIRC, you can use a LiskMigrationIterator to get around this.

13 ↗(On Diff #46514)

We have several commits that were never successfully imported. Any idea how this will behave when importStatus=0, meaning the author (and the rest of the commitData) is blank?

LiskMigrationIterator improves the peak memory usage but won't let us do the needCommitData() part cheaply, so I suspect the migration would become dramatically more expensive if we only did that swap. In theory, we could write some kind of QueryIterator which takes a CursorPagedPolicyAwareQuery and pages through it. This seems like a generally reasonable sort of thing to write, but I don't think we have one right now.

(I'll go through this in more detail, but I don't think we're aiming for this release on it no matter what, so you've got a minimum of a week to hyperventilate about deploying this. 🐱)

so you've got a minimum of a week to hyperventilate about deploying this.

thank god :D

giphy-downsized.gif (225×300 px, 149 KB)

6 ↗(On Diff #46514)

Yep, and what @epriestley said is right. I'd use LiskMigrationIterator but we need the commit data loaded as well. I'm also going to test this on some mega-huge repos (Linux kernel, GNOME, etc) to see if I can break anything.

13 ↗(On Diff #46514)

I'll look into that. (And thank you for reminding me that RepositoryIdentity shouldn't accept the empty string (we already reject null).

I think this is in pretty good shape, but we should probably do something like this:

  • put it in bin/repository rebuild-identities <repository | commit | --all> or similar;
  • ship that for at least a week with changelog guidance so installs can do an online rebuild at their leisure if they want;
  • when we do the migration, basically just have it run bin/repository rebuild-identities --all.

This will also put us in a better position to change how DiffusionResolveUserQuery works later (e.g., drop the "Real Name" rule) and let installs rebuild the guesses easily.

I'll see what I can do about providing a QueryIterator, I think that's probably the best way forward for dealing with the commit data + iterator memory usage issue and that it shouldn't be very involved. Putting this in a repeatable bin/repository workflow will also let us debug/test that more flexibly.

I think D19450 will give us an Iterator which can take a Query and iterate over all results with constant memory usage.

switch to using PhabricatorQueryIterator; remove progress bar because we don't have access to the number of query results any more

when we do the migration, basically just have it run bin/repository rebuild-identities --all

I rewrote the migration as PhabricatorRepositoryManagementRebuildIdentitiesWorkflow. What's the most straightforward way to invoke that via a migration? Something like this:

execx('bin/repository rebuild-identities');

Or this:

$workflow = new PhabricatorRepositoryManagementRebuildIdentitiesWorkflow();
$args = getFakedOutArgs();

Or put the code in a class that doesn't extend Workflow and have both the migration and the workflow instantiate it?

I think we can't just execx() / phutil_passthru() because it won't inherit some of the bin/storage state -- flags like bin/storage --namespace xyz, particularly.

So I'd try the PhabricatorRepositoryManagementRebuildIdentitiesWorkflow approach, I think.

Remove migration, replace with workflow.

No idea where this is coming from since I just pulled both phabricator and libphutil.

Update arcanist/ to clear this I think, that's a new exception I just added to arcanist/ to deal with the "600 MB of video patch".

phabricator/ currently depends on arcanist/ for some repository/diff parsing stuff. This is a little weird (it "feels" like Phabricator and Arcanist should each only depend on libphutil) and should maybe change eventually, but that's the way the cookie crumbles for now.


Probably commit? This takes rXYZabcd (a "commit"), not D123 (a "revision"), right?

This revision is now accepted and ready to land.May 15 2018, 3:59 PM

Ah, good call. I just noticed that "repository" was clearly wrong, and on line 26 the argument itself is named "revision".

amckinley retitled this revision from Big migration to attach identities to commit objects to Add workflow to create repository identities.May 15 2018, 4:55 PM
amckinley edited the summary of this revision. (Show Details)
amckinley edited the summary of this revision. (Show Details)

Fix lint error, change help text.

amckinley added inline comments.

Weird that the linter didn't notice the extra space here.

This revision was automatically updated to reflect the committed changes.