Ref T6656 performance optimization for edge queries with getDestinationPHIDs
Details

- Reviewers: epriestley
- Group Reviewers: Blessed Reviewers
- Maniphest Tasks: T6656: Maniphest Reports break when you have 75k tasks
- Commits: rPd923f68aad2a: performance optimization for edge queries with getDestinationPHIDs (Restricted Diffusion Commit)

Test Plan: accessed reports and several pages containing tasks with edges (projects...)

Diff Detail

- Repository: rP Phabricator
- Lint: Not Applicable
- Unit Tests: Not Applicable
Event Timeline
Ah! Really nice catch!
On the actual patch here, do we lose much performance if we simplify a little bit? Specifically, I'd like to avoid having two loops which implement the same filters. A proposed simplification is to retain a single loop, but do this instead:
```
+ $set = $this->resultSet;
+ if ($src_phids) {
+   $set = array_select_keys($set, $src_phids);
+ }

- foreach ($this->resultSet as $src => $edges_by_type) {
+ foreach ($set as $src => $edges_by_type) {
```
Then do something similar for $types. I think that would let us get rid of the top-level if/else and sort-of-duplicated code, but should give us the bulk of the performance benefit?
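To make that concrete, here's a hedged sketch of the full single-loop shape this suggests. It assumes (from the patch context) that $this->resultSet maps source PHIDs to edge types to destination-keyed edge rows; the exact shape may differ:

```
$result_phids = array();

$set = $this->resultSet;
if ($src_phids) {
  // Filter to the requested sources up front instead of testing
  // inside the loop.
  $set = array_select_keys($set, $src_phids);
}

foreach ($set as $src => $edges_by_type) {
  if ($types) {
    // Same trick for the type filter.
    $edges_by_type = array_select_keys($edges_by_type, $types);
  }
  foreach ($edges_by_type as $type => $edges) {
    // Assumed shape: $edges is keyed by destination PHID.
    foreach ($edges as $dst => $edge) {
      $result_phids[$dst] = true;
    }
  }
}

return array_keys($result_phids);
```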
One inline comment about array_merge(). Here's a simple example of the array_merge() issue:
```
<?php

$N = 10000;

$t_start = microtime(true);
$result = array();
for ($n = 0; $n < $N; $n++) {
  $result[] = $n;
}
printf(
  "Append %d elements with operator[] in %dms.\n",
  count($result),
  1000 * (microtime(true) - $t_start));

$t_start = microtime(true);
$result = array();
for ($n = 0; $n < $N; $n++) {
  $result = array_merge($result, array($n));
}
printf(
  "Append %d elements with array_merge() in %dms.\n",
  count($result),
  1000 * (microtime(true) - $t_start));
```
On my system, the array_merge() cost jumps from 38ms to 6889ms when N jumps from 1000 to 10000 (181x slower with an input 10x larger). The cost of using [] to append with N=10000 is 1ms.
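For intuition on why this is roughly quadratic: merging a one-element array into an accumulator of size n copies about n elements, so building the whole result this way copies roughly 1 + 2 + ... + N = N(N + 1)/2 ≈ N^2/2 elements in total. Going from N = 1,000 to N = 10,000 therefore multiplies the work by about 100x, which matches the order of magnitude of the observed 181x slowdown.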
The numbers here probably aren't big enough to start hitting the bad parts of this behavior, but you could imagine that a task could reasonably have 10K edges (e.g., 10,000 subscribers).
My general approach to avoiding this is to never array_merge() in a loop. You can array_mergev() (convenience method provided by libphutil) after the loop, or iterate + operator[] in a loop, or use array + array (all of which have costs on the order of O(N) in the total number of elements, but none of which is a drop-in replacement for array_merge, per se).
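For illustration, hedged sketches of those three patterns ($inputs here is a hypothetical list of lists):

```
// (1) Collect sub-lists in the loop, merge once afterward with
// libphutil's array_mergev():
$list_of_lists = array();
foreach ($inputs as $list) {
  $list_of_lists[] = $list;
}
$merged = array_mergev($list_of_lists);

// (2) Append element-by-element with operator[]:
$merged = array();
foreach ($inputs as $list) {
  foreach ($list as $item) {
    $merged[] = $item;
  }
}

// (3) Key-preserving union with "+": unlike array_merge(), "+"
// keeps the first value seen for each key and does not renumber
// integer keys.
$merged = array();
foreach ($inputs as $list) {
  $merged += $list;
}
```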
src/infrastructure/edges/query/PhabricatorEdgeQuery.php

Lines 281–282: This can produce very bad performance behavior -- the cost of array_merge() is approximately the size of all the arguments combined, so when you repeatedly merge a small addition into a large existing array, the overall runtime approaches O(N^2). Some discussion here: https://secure.phabricator.com/book/phabflavor/article/php_pitfalls/#array-merge-in-incredibl
I guess the pathological array_merge() case is really 10,000 types of edges, which seems very unlikely to ever occur in the wild, but is still probably best avoided.
libphutil sure has some very nice utility functions :D
Performance is about the same.
src/infrastructure/edges/query/PhabricatorEdgeQuery.php

Line 280: Oh, sorry -- array_mergev() in a loop is just as bad as array_merge(). It just simplifies this:

```
$list_of_lists = array();
foreach ($input as $list) {
  $list_of_lists[] = $list;
}
$list_of_items = array_mergev($list_of_lists);
```

Otherwise the last line would need to be:

```
$list_of_items = call_user_func_array('array_merge', $list_of_lists);
```

...which is way more cumbersome. But it's equivalent, and just calls array_merge() internally. Just keep this like it was, I think?
You could technically do this to avoid the innermost loop and also avoid the runtime explosion:
```
foreach ($edges_by_type as $edges) {
  $result_phids += $edges;
}
```
...but I suspect the runtime of that construction is not meaningfully distinguishable from the loop version (I'd expect its runtime to differ only by some fairly small constant), and the loop version reads a little more cleanly to me (it's more obvious that we're discarding the values and don't plan to use them later).
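For reference, the loop version under discussion looks roughly like this (assuming, as above, that $edges is keyed by destination PHID and that only the keys of $result_phids matter afterward):

```
foreach ($edges_by_type as $edges) {
  foreach ($edges as $dst => $edge) {
    // Keep only the destination PHID; the edge data is discarded.
    $result_phids[$dst] = true;
  }
}
```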
Ah, OK, I see.
I'll just change it to a loop. I guess that also makes it safer if someone reuses that code.
And the difference is minimal.