Ref T6656 performance optimization for edge queries with getDestinationPHIDs
Details

- Reviewers: epriestley
- Group Reviewers: Blessed Reviewers
- Maniphest Tasks: T6656: Maniphest Reports break when you have 75k tasks
- Commits: rPd923f68aad2a: performance optimization for edge queries with getDestinationPHIDs (Restricted Diffusion Commit)

Test Plan: accessed reports and several pages containing tasks with edges (projects...)

Diff Detail

- Repository: rP Phabricator
- Lint: Not Applicable
- Unit Tests: Not Applicable
Event Timeline
Ah! Really nice catch!
On the actual patch here, do we lose much performance if we simplify a little bit? Specifically, I'd like to avoid having two loops which implement the same filters. A proposed simplification is to retain a single loop, but do this instead:
```
+ $set = $this->resultSet;
+ if ($src_phids) {
+   $set = array_select_keys($set, $src_phids);
+ }

- foreach ($this->resultSet as $src => $edges_by_type) {
+ foreach ($set as $src => $edges_by_type) {
```
Then do something similar for $types. I think that would let us get rid of the top-level if/else and sort-of-duplicated code, but should give us the bulk of the performance benefit?
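To make that concrete, here's a hedged sketch of the full single-loop shape this suggests. It assumes (from the patch context) that $this->resultSet maps source PHIDs to edge types to destination-keyed edge rows; the exact shape may differ:

```
$result_phids = array();

$set = $this->resultSet;
if ($src_phids) {
  // Filter to the requested sources up front instead of testing
  // inside the loop.
  $set = array_select_keys($set, $src_phids);
}

foreach ($set as $src => $edges_by_type) {
  if ($types) {
    // Same trick for the type filter.
    $edges_by_type = array_select_keys($edges_by_type, $types);
  }
  foreach ($edges_by_type as $type => $edges) {
    // Assumed shape: $edges is keyed by destination PHID.
    foreach ($edges as $dst => $edge) {
      $result_phids[$dst] = true;
    }
  }
}

return array_keys($result_phids);
```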
One inline comment about array_merge(). Here's a simple example of the array_merge() issue:
```
<?php

$N = 10000;

$t_start = microtime(true);
$result = array();
for ($n = 0; $n < $N; $n++) {
  $result[] = $n;
}
printf(
  "Append %d elements with operator[] in %dms.\n",
  count($result),
  1000 * (microtime(true) - $t_start));

$t_start = microtime(true);
$result = array();
for ($n = 0; $n < $N; $n++) {
  $result = array_merge($result, array($n));
}
printf(
  "Append %d elements with array_merge() in %dms.\n",
  count($result),
  1000 * (microtime(true) - $t_start));
```
On my system, the array_merge() cost jumps from 38ms to 6889ms when N jumps from 1000 to 10000 (181x slower with an input 10x larger). The cost of using [] to append with N=10000 is 1ms.
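For intuition on why this is roughly quadratic: merging a one-element array into an accumulator of size n copies about n elements, so building the whole result this way copies roughly 1 + 2 + ... + N = N(N + 1)/2 ≈ N^2/2 elements in total. Going from N = 1,000 to N = 10,000 therefore multiplies the work by about 100x, which matches the order of magnitude of the observed 181x slowdown.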
The numbers here probably aren't big enough to start hitting the bad parts of this behavior, but you could imagine that a task could reasonably have 10K edges (e.g., 10,000 subscribers).
My general approach to avoiding this is to never array_merge() in a loop. You can array_mergev() (convenience method provided by libphutil) after the loop, or iterate + operator[] in a loop, or use array + array (all of which have costs on the order of O(N) in the total number of elements, but none of which is a drop-in replacement for array_merge, per se).
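For illustration, hedged sketches of those three patterns ($inputs here is a hypothetical list of lists):

```
// (1) Collect sub-lists in the loop, merge once afterward with
// libphutil's array_mergev():
$list_of_lists = array();
foreach ($inputs as $list) {
  $list_of_lists[] = $list;
}
$merged = array_mergev($list_of_lists);

// (2) Append element-by-element with operator[]:
$merged = array();
foreach ($inputs as $list) {
  foreach ($list as $item) {
    $merged[] = $item;
  }
}

// (3) Key-preserving union with "+": unlike array_merge(), "+"
// keeps the first value seen for each key and does not renumber
// integer keys.
$merged = array();
foreach ($inputs as $list) {
  $merged += $list;
}
```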
src/infrastructure/edges/query/PhabricatorEdgeQuery.php

Lines 281–282: This can produce very bad performance behavior -- the cost of array_merge() is approximately the size of all the arguments combined, so when you repeatedly merge a small addition into a large existing array, the overall runtime approaches O(N^2). Some discussion here: https://secure.phabricator.com/book/phabflavor/article/php_pitfalls/#array-merge-in-incredibl
I guess the pathological array_merge() case is really 10,000 types of edges, which seems very unlikely to ever occur in the wild, but is still probably best avoided.
libphutil sure has some very nice utility functions :D
Performance is about the same.
src/infrastructure/edges/query/PhabricatorEdgeQuery.php

Line 280: Oh, sorry -- array_mergev() in a loop is just as bad as array_merge(). It just simplifies this:

```
$list_of_lists = array();
foreach ($input as $list) {
  $list_of_lists[] = $list;
}
$list_of_items = array_mergev($list_of_lists);
```

Otherwise the last line would need to be:

```
$list_of_items = call_user_func_array('array_merge', $list_of_lists);
```

...which is way more cumbersome. But it's equivalent, and just calls array_merge() internally. Just keep this like it was, I think?
You could technically do this to avoid the innermost loop and also avoid the runtime explosion:
```
foreach ($edges_by_type as $edges) {
  $result_phids += $edges;
}
```
...but I suspect the runtime of that construction is not meaningfully distinguishable from the loop version (I'd expect its runtime to differ only by some fairly small constant), and the loop version reads a little more cleanly to me (it's more obvious that we're discarding the values and don't plan to use them later).
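For reference, the loop version under discussion looks roughly like this (assuming, as above, that $edges is keyed by destination PHID and that only the keys of $result_phids matter afterward):

```
foreach ($edges_by_type as $edges) {
  foreach ($edges as $dst => $edge) {
    // Keep only the destination PHID; the edge data is discarded.
    $result_phids[$dst] = true;
  }
}
```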
Ah, OK, I see.
I'll just change it to a loop. I guess that also makes it safer if someone reuses that code.
And the difference is minimal.