performance optimization for edge queries with getDestinationPHIDs
Closed, Public

Authored by fabe on Dec 2 2014, 12:57 PM.
Tags
None
Subscribers
Tokens
"Mountain of Wealth" token, awarded by epriestley.

Details

Summary

Ref T6656. Performance optimization for edge queries with getDestinationPHIDs.

Test Plan

Accessed reports and several pages containing tasks with edges (projects, ...).

Diff Detail

Repository
rP Phabricator
Branch
reportperf
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 3160
Build 3166: [Placeholder Plan] Wait for 30 Seconds

Event Timeline

fabe retitled this revision to performance optimization for edge queries with getDestinationPHIDs.
fabe updated this object.
fabe edited the test plan for this revision. (Show Details)
epriestley added a reviewer: epriestley.

Ah! Really nice catch!

On the actual patch here, do we lose much performance if we simplify a little bit? Specifically, I'd like to avoid having two loops which implement the same filters. A proposed simplification is to retain a single loop, but do this instead:

+ $set = $this->resultSet;
+ if ($src_phids) {
+   $set = array_select_keys($set, $src_phids);
+ }
- foreach ($this->resultSet as $src => $edges_by_type) {
+ foreach ($set as $src => $edges_by_type) {

Then do something similar for $types. I think that would let us get rid of the top-level if/else and the sort-of-duplicated code, but still give us the bulk of the performance benefit?
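
For illustration, here's a rough sketch of how that single-loop shape might look, assuming $this->resultSet is laid out as src => type => dst => edge and that $src_phids/$types are the optional filters (this is a sketch of the suggestion, not the committed code):

$set = $this->resultSet;
if ($src_phids) {
  $set = array_select_keys($set, $src_phids);
}

$result_phids = array();
foreach ($set as $src => $edges_by_type) {
  if ($types) {
    // Same trick for the type filter: select the wanted keys up front.
    $edges_by_type = array_select_keys($edges_by_type, $types);
  }
  foreach ($edges_by_type as $type => $edges) {
    foreach ($edges as $dst => $edge) {
      // Key by destination PHID so duplicates collapse for free.
      $result_phids[$dst] = true;
    }
  }
}

return array_keys($result_phids);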


One inline about array_merge(). Here's a simple example for the array_merge() issue:

<?php

$N = 10000;

$t_start = microtime(true);

// Benchmark 1: append N elements with the [] operator.
$result = array();
for ($n = 0; $n < $N; $n++) {
  $result[] = $n;
}

printf(
  "Append %d elements with operator[] in %dms.\n",
  count($result),
  1000 * (microtime(true) - $t_start));


$t_start = microtime(true);

// Benchmark 2: append N elements by repeatedly calling array_merge().
$result = array();
for ($n = 0; $n < $N; $n++) {
  $result = array_merge($result, array($n));
}

printf(
  "Append %d elements with array_merge() in %dms.\n",
  count($result),
  1000 * (microtime(true) - $t_start));

On my system, the array_merge() cost jumps from 38ms to 6889ms when N jumps from 1000 to 10000 (181x slower with an input 10x larger). The cost of using [] to append with N=10000 is 1ms.

The numbers here probably aren't big enough to start hitting the bad parts of this behavior, but you could imagine that a task could reasonably have 10K edges (e.g., 10,000 subscribers).

My general approach to avoiding this is to never array_merge() in a loop. You can array_mergev() (convenience method provided by libphutil) after the loop, or iterate + operator[] in a loop, or use array + array (all of which have costs on the order of O(N) in the total number of elements, but none of which is a drop-in replacement for array_merge, per se).
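
Roughly, those three patterns look like this (a sketch only; $lists here is a hypothetical list of arrays to combine, not a variable from the patch):

// 1) Collect in the loop, then call array_mergev() (libphutil) once afterwards.
$all = array();
foreach ($lists as $list) {
  $all[] = $list;
}
$flat = array_mergev($all);

// 2) Append element-by-element with operator[].
$flat = array();
foreach ($lists as $list) {
  foreach ($list as $item) {
    $flat[] = $item;
  }
}

// 3) Union with the + operator. Note that + merges by key (it does not
// renumber), so it only fits when the keys are meaningful and distinct,
// e.g. PHIDs used as array keys.
$flat = array();
foreach ($lists as $list) {
  $flat += $list;
}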

src/infrastructure/edges/query/PhabricatorEdgeQuery.php
281–282

This can produce very bad performance behavior -- the cost of array_merge() is approximately the size of all its arguments combined, so when you repeatedly merge a small addition into a large existing array, each merge recopies everything accumulated so far and the overall runtime approaches O(N^2).

Some discussion here:

https://secure.phabricator.com/book/phabflavor/article/php_pitfalls/#array-merge-in-incredibl

This revision now requires changes to proceed. Dec 2 2014, 1:16 PM

I guess the pathological array_merge() case is really 10,000 types of edges, which seems very unlikely to ever occur in the wild, but is still probably best avoided.

fabe edited edge metadata.
  • simplify code

libphutil sure has some very nice utility functions :D

Performance is about the same.

src/infrastructure/edges/query/PhabricatorEdgeQuery.php
276

Oh, sorry -- array_mergev() in a loop is just as bad as array_merge(); it just simplifies this:

$list_of_lists = array();
foreach ($input as $list) {
  $list_of_lists[] = $list;
}
$list_of_items = array_mergev($list_of_lists);

Otherwise the last line would need to be:

$list_of_items = call_user_func_array('array_merge', $list_of_lists);

...which is way more cumbersome. But it's equivalent, and just calls array_merge() internally.

Just keep this like it was, I think?

You could technically do this to avoid the innermost loop and also avoid the runtime explosion:

foreach ($edges_by_type as $edges) {
  $result_phids += $edges;
}

...but I suspect the runtime of that construction is not meaningfully distinguishable from the loop version (I'd expect the runtimes to differ only by some fairly small constant factor), and the loop version reads a little more cleanly to me (it's more obvious that we're discarding the values and don't plan to use them later).
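
For comparison, the "loop version" referred to here presumably has roughly this shape (a sketch of the idea, not the exact committed code):

// Iterate the edges and key the result by destination PHID; the values are
// discarded, which is what makes the intent obvious.
foreach ($edges_by_type as $type => $edges) {
  foreach ($edges as $dst => $edge) {
    $result_phids[$dst] = true;
  }
}

Both constructions end up with the same set of keys in $result_phids; they differ only in whether the edge data is carried along as values.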

Ah, OK, I see.
I'll just change it to a loop. I guess that also makes it safe if someone reuses that code.
And the difference is minimal.

fabe edited edge metadata.
  • do not use array_merge
epriestley edited edge metadata.

Cool, looks good to me. Thanks!

This revision is now accepted and ready to land. Dec 2 2014, 1:51 PM
This revision was automatically updated to reflect the committed changes.