
Importing a very large repository does not complete correctly
Closed, Resolved (Public)

Description

This is a fairly nasty use case: we have a large Mercurial repository (32,000 commits) that was converted from another VCS. The import appears to have missed some commits.

The Edit Repository page displays:

Repository Active
Found Binary hg	/usr/bin/hg
Pull Daemon Running	
Task Daemon Running	
Storage Directory OK	/home/phabricator/repos
Working Copy OK	/home/phabricator/repos/ML/
Updates OK	Last updated Wed, Nov 6, 11:10 AM.
Importing	100.0% Complete

There are no daemon tasks queued.

Diffusion commit history:

Browse	rML7ccbbc74c84b	Contents Modified	Tue, Nov 5	6:49 PM	user	message
Browse	rML49b2b934988b	Contents Modified	Tue, Nov 5	6:43 PM	user	message
Browse	rML901fb9868d42	Contents Modified	Tue, Nov 5	6:42 PM	user	message
Browse	rMLd31829c6e3a4	Contents Modified	Tue, Nov 5	6:06 PM	user	message
Browse	rML8fed9a3f7fdf	Contents Modified	Tue, Nov 5	4:50 PM	user	message
Browse	rMLb86f00ee9ff5	Contents Modified	Tue, Nov 5	2:35 PM	user	message
Browse	rML193472dda92b	Contents Modified	Tue, Nov 5	2:06 PM	user	message
Browse	rML711cf652f2d6	Contents Modified	Tue, Nov 5	2:00 PM	user	message
Browse	rMLd93ef5f9efbd	Importing…				 
Browse	rML56ed1d89d7ec	Importing…				 
Browse	rML21d7751001df	Importing…

A sizeable number of commits are still in the 'Importing' state (more than 1% of the total).

Is there anything I can do to force a reparse of the missed commits? Or any more information I can provide for debugging?

Event Timeline

kbrownlees raised the priority of this task from to Needs Triage.
kbrownlees updated the task description.
kbrownlees added a subscriber: kbrownlees.

Let me write a script to identify exactly what's missing, and we can start there. The scripts/repository/reparse.php script can re-run steps, but there isn't a script right now to figure out what needs to be rerun in a straightforward manner.

(We added the progress indicator last week, and while it's a lot better than the "maybe your commits are importing???" we had before, we still have some stuff like this to improve.)

D7514 adds an administrative command you can use to forcefully mark the repository as imported. This will cause the UI to show "Imported", and activate Herald/Feed/Harbormaster/etc. The downside is that if we're able to fix those 1% of commits in the future, they'll also activate all the publishing, so users with broad Herald rules might get uninteresting email, etc. If you're comfortable with that, you can mark the repository imported after D7514 lands with:

phabricator/ $ ./bin/repository mark-imported ML

Running this command won't impact our ability to fix the actual commits. This command is reversible later (with --mark-not-imported), although side effects generated by an imported repository (like sending email for new commits) are generally not feasible to reverse.

Next, I'll add a command to list the un-imported commits and their status explicitly so we can get a better picture of what's not importing and why.

D7515 adds an administrative command to list commits which aren't imported. Once it lands, you can run this with:

phabricator/ $ ./bin/repository importing ML

That should give us an idea of how much stuff failed, and which phases are missing. The output should look something like this:

phabricator/ $ ./bin/repository importing P
rP82a061b485050a8b57283ecf91805cb10ad59900 Message, Change, Owners, Herald
rP7a97a71e200290db6c81dce8dbe8053c1ad68058 Message, Change, Owners, Herald
rP45f38c549b65c4c8ca3d731725a5abe3b7905152 Message, Owners, Herald
rP3147a6ca5709b2db62ad913113c747ede185c327 Message, Change, Owners, Herald
rP1ee455c441a109418cc01e85f9bd538b86c4d674 Message, Change, Herald

The text on the right shows which of the four import phases the commit is missing.
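To get a quick picture of how much failed and in which phases, the listing can be tallied per phase. The following is a minimal sketch, assuming the output keeps the "commit name, then comma-separated phases" shape shown in the example above; the function name and the regular expression are illustrative and not part of Phabricator.

```python
import re
from collections import Counter

# Sample text copied from the example output above (the shell prompt line is included on purpose).
SAMPLE = """\
phabricator/ $ ./bin/repository importing P
rP82a061b485050a8b57283ecf91805cb10ad59900 Message, Change, Owners, Herald
rP7a97a71e200290db6c81dce8dbe8053c1ad68058 Message, Change, Owners, Herald
rP45f38c549b65c4c8ca3d731725a5abe3b7905152 Message, Owners, Herald
rP3147a6ca5709b2db62ad913113c747ede185c327 Message, Change, Owners, Herald
rP1ee455c441a109418cc01e85f9bd538b86c4d674 Message, Change, Herald
"""

# Assumed line shape: "r<CALLSIGN><40 hex chars> Phase, Phase, ..."
LINE = re.compile(r"^(r[A-Z]+[0-9a-f]{40})\s+(.+)$")


def tally_missing_phases(output):
    """Count how many listed commits are missing each import phase."""
    counts = Counter()
    for raw in output.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip prompt lines, blank lines, and anything unrecognized
        for phase in match.group(2).split(","):
            counts[phase.strip()] += 1
    return counts


if __name__ == "__main__":
    for phase, count in tally_missing_phases(SAMPLE).most_common():
        print(f"{phase}: {count}")
```

If one phase dominates the tally, that points at a specific parse step rather than at discovery.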

To start with, can you just show me the output of bin/repository importing ML (once the command is available)? From there, I can walk you through how to best proceed.

Without seeing the data, my guesses about resolutions here are:

  1. rPd1649d1 fixed a bug which could cause this in some cases, so that might have been the issue. If it is, fixing it is straightforward.
  2. Could be a general upgrade/version issue since this stuff is very new -- e.g., if a daemon from before the upgrade managed to stick around afterward, that might cause this.
  3. The "converted from another VCS system" part might have exposed some bugs in our parser, so these commits may not be making it through a parse step. If this is the case, we can reparse the commits to get detailed errors and then fix the parser.

But getting the actual data should give us more of an idea about what's going on.

Thank you for the quick response and work!

It does appear the error occurred earlier than those scripts will show. A search of my db for importStatus!=15 (in phabricator_commit) yields only one result, and a search for commitIdentifier='d93ef5f9efbd997584c62d366b1ddbf81c749252' (the full ID of the first still-importing commit I pasted) shows no results. I'm not entirely sure where the information that populates the Diffusion history page is coming from?
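For readers wondering why the query compares against 15: with four import phases, a value of 15 is what you get when four one-bit flags are all set (1 + 2 + 4 + 8). The sketch below decodes such a value. The thread only confirms that there are four phases and that 15 means fully imported; the assignment of individual bits to phase names here is an assumption for illustration.

```python
# Assumed bit assignment (hypothetical): only the total of 15 is confirmed by the thread.
IMPORT_PHASES = {
    "Message": 1,
    "Change": 2,
    "Owners": 4,
    "Herald": 8,
}

# All four flags set: 1 | 2 | 4 | 8 == 15
IMPORTED_ALL = 0
for _bit in IMPORT_PHASES.values():
    IMPORTED_ALL |= _bit


def missing_phases(import_status):
    """Return the names of the phases whose flag is not set in import_status."""
    return [name for name, bit in IMPORT_PHASES.items() if not import_status & bit]


def is_fully_imported(import_status):
    """True when every phase flag is set, i.e. the row would not match importStatus != 15."""
    return import_status & IMPORTED_ALL == IMPORTED_ALL
```

Under these assumptions, a row with importStatus 13 would be missing only the phase mapped to bit 2.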

Oh, interesting. Diffusion is basically running hg log to generate the commit list for the web UI, but the daemons should be running an identical command to get the commits into the pipeline in the first place. Can you try this and see if anything jumps out at you as suspicious?

phabricator/ $ ./bin/repository discover ML --trace

(This is the command the daemons run to find commits in the repository.)

I'll look over the code and see if I can spot anything -- maybe there's some difference in the log commands we're running.

phabricator@lal-phabricator:~/phabricator$ ./bin/repository discover ML --trace
>>> [2] <connect> phabricator_repository
<<< [2] <connect> 1,087 us
>>> [3] <query> SELECT * FROM `repository` r  WHERE (r.callsign IN ('ML')) ORDER BY r.id DESC 
<<< [3] <query> 206 us
Discovering 'ML'...
>>> [4] <exec> $ (cd '/home/phabricator/repos/ML/' && HGPLAIN=1 hg --debug branches)
<<< [4] <exec> 155,312 us
>>> [5] <query> SELECT * FROM `repository_commit` WHERE repositoryID = 15 AND commitIdentifier = 'ae4359086352cd514d9be80ce59830f69ee3941b' 
<<< [5] <query> 734 us
>>> [6] <query> SELECT * FROM `repository_commit` WHERE repositoryID = 15 AND commitIdentifier = '81603c4c6139a0499cc41a16b0222f8156745798' 
<<< [6] <query> 371 us
>>> [7] <query> SELECT * FROM `repository_commit` WHERE repositoryID = 15 AND commitIdentifier = '0761b2dfbe24c8ab3bb0f964a0d147f6ce5b5b5f' 
<<< [7] <query> 501 us
>>> [8] <query> SELECT * FROM `repository_commit` WHERE repositoryID = 15 AND commitIdentifier = '62680cd8cd244456138cba95730ab4e3438845ac' 
<<< [8] <query> 380 us
>>> [9] <query> SELECT * FROM `repository_commit` WHERE repositoryID = 15 AND commitIdentifier = 'e5426ec9b111bfb56800a7c8023fe5dbf90c67b5' 
<<< [9] <query> 370 us
>>> [10] <query> SELECT * FROM `repository_commit` WHERE repositoryID = 15 AND importStatus != 15
        LIMIT 1
<<< [10] <query> 223 us
Done.
phabricator@lal-phabricator:~/phabricator$

Do I need to drop something to force it to reparse a branch?

Are those the right branch heads, and the missing commits are behind them? We should never insert a commit into the database before inserting all of its ancestors, so something is definitely awry.
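The "ancestors before descendants" invariant can be checked mechanically if you have the order in which commits were inserted and each commit's parents. This is a minimal sketch with hypothetical inputs (a list of commit identifiers in insertion order and a dict mapping each commit to its parents); it is not code from Phabricator.

```python
def first_order_violation(insert_order, parents):
    """Find the first commit inserted before one of its parents.

    insert_order: commit identifiers in the order they were inserted.
    parents: dict mapping a commit identifier to a list of its parent identifiers.
    Returns a (commit, parent) pair for the first violation, or None if the
    order respects the invariant. A parent that was never inserted at all
    also counts as a violation.
    """
    seen = set()
    for commit in insert_order:
        for parent in parents.get(commit, ()):
            if parent not in seen:
                return (commit, parent)
        seen.add(commit)
    return None
```

For example, with parents `{"b": ["a"], "c": ["b"]}`, the order `["a", "c", "b"]` is reported as `("c", "b")`, because "c" arrives before its parent "b".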

There's no way to force a branch reparse, except by nuking the whole repo:

phabricator/ $ ./bin/repository delete

...and then recreating it. That couldn't hurt, but I'm worried we'll end up right back where we are, since the things I can think of which could cause this if the system is behaving well are, like, "MySQL configured to randomly forget data".

The missing commits only appear to be on default, which has a head of ae4359086352cd514d9be80ce59830f69ee3941b. Is there any debugging output I can leave enabled while it is parsing that would help us catch the error? It took ~4 hours last time, so I will leave it running overnight.

You can test if things are OK more quickly like this, if it's OK to stop the daemons for a bit:

  1. Stop the daemons.
  2. Delete and recreate the repository.
  3. Run bin/repository pull ML (this will clone a working copy).
  4. Run bin/repository discover ML.
  5. Verify the database has the right number of rows (one per commit) in phabricator_repository.repository_commit, for the new repository.
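Step 5 can go beyond comparing counts: comparing the two sets of identifiers shows exactly which commits are absent. The sketch below assumes you have already collected both lists yourself, for example from `hg log --template '{node}\n'` in the working copy and from a query selecting commitIdentifier from repository_commit for the repository's repositoryID. Those collection steps and the function name are assumptions, not something prescribed in this thread.

```python
def compare_commit_sets(repo_nodes, db_nodes):
    """Compare commit identifiers from the working copy against database rows.

    repo_nodes: iterable of full commit hashes reported by the VCS.
    db_nodes: iterable of commitIdentifier values found in the database.
    Returns a dict with the hashes missing from the database and any
    database rows that have no matching commit in the working copy.
    """
    repo = set(repo_nodes)
    db = set(db_nodes)
    return {
        "missing_from_db": sorted(repo - db),
        "unexpected_in_db": sorted(db - repo),
    }
```

An empty "missing_from_db" list together with matching counts is the result you want before restarting the daemons.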

This should take more like 10-20 minutes than 4 hours, probably. If things look good, you can restart the daemons and let them do their thing.

However, I think I found a bug in the importer which could cause this. I'll get that fixed in a second, so maybe wait for that?

epriestley edited this Maniphest Task.

Quick update here:

  • The Mercurial discovery daemon has a bug where it doesn't insert discovered commits in topological order. This is fixed by D7518. This is likely to have caused the issue you're seeing.
  • We do actually have a "force branch rediscovery" flag, --repair, that I wrote a while ago and promptly forgot about.
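To illustrate what inserting commits in parents-before-children (topological) order means, here is a minimal sketch using Kahn's algorithm over a hypothetical commit-to-parents mapping. It is not the implementation in D7518, only an example of the ordering property the discovery step needs.

```python
from collections import deque


def parents_first(parents):
    """Order commits so that every commit appears after all of its parents.

    parents: dict mapping a commit identifier to a list of its parent identifiers.
    Raises ValueError if the graph contains a cycle (which a real commit graph should not).
    """
    nodes = set(parents)
    for parent_list in parents.values():
        nodes.update(parent_list)

    children = {node: [] for node in nodes}
    pending = {node: 0 for node in nodes}  # number of parents not yet emitted
    for commit, parent_list in parents.items():
        for parent in parent_list:
            children[parent].append(commit)
            pending[commit] += 1

    # Start from root commits; sorting only makes the output deterministic.
    ready = deque(sorted(node for node in nodes if pending[node] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in sorted(children[node]):
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(nodes):
        raise ValueError("commit graph contains a cycle")
    return order
```

Inserting rows in the returned order means a commit is never written before any of its ancestors, even when a merge commit has parents on different branches.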

So the slightly more efficient plan is:

  • Wait for D7518, which should get reviewed later today.
  • Without deleting the repository, run: bin/repository discover ML --repair
  • That should take more like 10 minutes than 4 hours.
  • We should be in good shape after that, although the daemons will need to do a bit more work to finish the import.

(If you already deleted the repository, that's fine, and we can stick with the original plan.)

Okay, this stuff is in HEAD now. After upgrading, try this and let me know how it goes?

phabricator/ $ ./bin/repository discover ML --repair

(You can also use the --trace and --verbose flags for more detailed output.)

Thank you again for the quick responses and work! Unfortunately, I had already deleted the repository. The import is running again now, so we will see how it goes. A basic discover on my desk appeared to be substantially better than before the latest upgrades, so it looks promising.

Awesome, let us know how it goes.

Success!

Three tasks got stuck while processing; going in and manually freeing the leases seemed to work (not sure if that was a bad idea?).

Thank you again for all your work.

Kieran

Three tasks got stuck while processing; going in and manually freeing the leases seemed to work (not sure if that was a bad idea?).

That's perfectly fine. (They should have automatically reprocessed eventually otherwise, but freeing the leases speeds up the process.)

Thanks for the detailed report and your work figuring out the root issue; it was very helpful in narrowing it down. Let us know if you run into anything else.