Page MenuHomePhabricator

Modified script to commit smaller batches of symbols to the database.
ClosedPublic

Authored by wotte on Nov 17 2013, 11:52 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 11, 5:25 AM
Unknown Object (File)
Tue, Dec 31, 11:13 AM
Unknown Object (File)
Dec 23 2024, 6:01 AM
Unknown Object (File)
Dec 23 2024, 6:01 AM
Unknown Object (File)
Dec 23 2024, 6:01 AM
Unknown Object (File)
Dec 23 2024, 5:46 AM
Unknown Object (File)
Dec 9 2024, 4:36 PM
Unknown Object (File)
Nov 27 2024, 8:17 AM

Details

Summary

Modified the import script so it will only try to load a configurable
number of symbols at a time to avoid exhausting memory for large project
imports.

I haven't written a line of PHP in more than a decade, so please forgive
any stylistic or technical errors.

Test Plan

Ran the script on symbol table generated from linux kernel.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

This seems broadly reasonable. One issue is that we still read all of stdin in one chunk; it would be nice to stream it instead. We don't currently have a nice abstraction for this, though (we have LinesOfALargeFile, but not LinesOfALargeStream or a similar class we could pass stdin to). This would be easy enough to do manually, too, but if you aren't hitting issues we could just wait until the next time this gets touched.

This could also be cleaned up very slightly with PhutilChunkedIterator (saving the final catch-all call to commit_symbols() at the end), and possibly further motivates introducing some AutochunkSQLQuery sort of thing (see D7513).

Finally, it would be nice to do this atomically, but I'm not sure if throwing a transaction around the whole thing will create performance issues at large scale or not offhand.

All considered, I think this clearly moves things forward. Thanks!

epriestley added a subscriber: wrotte.

Closed by commit rP7d43e5911066 (authored by @wrotte, committed by @epriestley).

This seems broadly reasonable. One issue is that we still read all of stdin in one chunk; it would be nice to stream it instead. We don't currently have a nice abstraction for this, though (we have LinesOfALargeFile, but not LinesOfALargeStream or a similar class we could pass stdin to). This would be easy enough to do manually, too, but if you aren't hitting issues we could just wait until the next time this gets touched.

Yeah, the size of the stream isn't really that big of a deal; we're only talking 140 megs or so, which isn't a big deal. The process now appears to max out at about 550 megs or so

This could also be cleaned up very slightly with PhutilChunkedIterator (saving the final catch-all call to commit_symbols() at the end), and possibly further motivates introducing some AutochunkSQLQuery sort of thing (see D7513).

Reasonable suggestion. I'll leave that to someone else unless I I'm itching for a learning exercise.

Finally, it would be nice to do this atomically, but I'm not sure if throwing a transaction around the whole thing will create performance issues at large scale or not offhand.

The whole update process take about 5 minutes on my testing VM, so I'd imagine that wrapping that in a transaction would be not such a great idea (I know next to nothing about interacting efficiently with databases, however - that stuff usually gets farmed off onto one of my grad students). Most of that execution time appears to be talking to the database. If doing many smaller commits rather than fewer large commits would be faster/more efficient, this is an embarrassingly parallel problem that could be easily split off into several threads.

All considered, I think this clearly moves things forward. Thanks!

scripts/symbols/import_project_symbols.php
187–205

I might be a little out, but isn't this just equivalent to:

commit_symbols($symbols, $project, $no_purge);

?

Which would mean this diff doesn't actually add any chunking?

scripts/symbols/import_project_symbols.php
187–205

It's a little hard to read because of where the added/removed lines ended up, but the construction is:

$symbols = array();
foreach ($lines as $line) {
  $symbols[] = $symbol;
  if (count($symbols) > $limit) {
    commit($symbols);
    $symbols = array();
  }
}

commit($symbols);
scripts/symbols/import_project_symbols.php
187–205

Oh, I see that now; My bad :)