Improve the performance of large remarkup documents with many complex rules
ClosedPublic
Actions

Authored by epriestley on May 15 2019, 4:20 PM.

Details

Reviewers

amckinley

Commits

rPHUc175590464bd: Improve the performance of large remarkup documents with many complex rules

Summary

See PHI1114. An install encountered a multi-megabyte document with approximately 11,000 replacement tokens (complex remarkup rules which store text and evaluate at display time) that required 17s to render.

Onsite investigation narrowed this down to a large amount of time spent in restore(), here.

Before this change, a document like this must call str_replace() on the full document for each token, so roughly O(size of the document * number of tokens) bytes are being shuffled around.

We can improve this dramatically by:

incrementally expanding tokens, so most operations are on one token instead of the entire document (and the total document size has a much smaller performance effect); and
replacing tokens in a single pass with preg_match() + append + implode() instead of running str_replace() in a loop.

Test Plan

On this document:

echo str_repeat("T300 T301 T302 T303 T304 T305 T306 T307 T308 T309 T310\n", 1024).str_repeat("qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq\n", 1024 * 512);

..saw local time in restore() drop from ~3,300ms to ~10ms with no apparent behavioral changes.

Ran all unit tests, browsed around locally, loaded the page in the web UI.

Diff Detail

Repository

rPHU libphutil

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

epriestley created this revision.May 15 2019, 4:20 PM

Harbormaster completed remote builds in B22802: Diff 48929.May 15 2019, 4:21 PM

epriestley requested review of this revision.May 15 2019, 4:21 PM

epriestley edited the summary of this revision. (Show Details)May 15 2019, 4:28 PM

Part of the reason this works is that str_replace(), even in list mode, is basically running a loop over the string. That is:

str_replace(array('A', 'B'), array('X', 'Y'), $corpus);

...has behavior equivalent to:

foreach (array('A' => 'X', 'B' => 'Y') as $search => $replace) {
  $corpus = str_replace($search, $replace, $corpus);
}

...and also currently has performance equivalent to that loop (i.e., cost is roughly O(size of the string * number of search terms)).

It can't easily be smarter because this behavior matters when the "search" terms and "replacement" terms are substrings of one another. For example:

$ php -r "echo str_replace(array('meow', 'woof', 'abracadabra'), array('abra', 'cadabra', 'Magic'), 'meowwoof');"
Magic

If "meow" and "woof" were replaced in single pass, the result would be "abracadabra", not "Magic". Since "str_replace()" has multi-pass behavior, it it very likely to have a multi-pass implementation.

We also rely on this behavior, because tokens may include other tokens. If str_replace() did everything in a single pass, we couldn't use it in the original algorithm.

(In this revised algorithm, we could use a single-pass str_replace() instead of the "$parts"/"$pos" loop, which is basically implementing a single-pass str_replace().)

The actual implementation of str_replace() in PHP as of 7.1.19 is approximately the loop above:

ext/standard/string.c

	/* If search is an array */
	if (Z_TYPE_P(search) == IS_ARRAY) {
		/* Duplicate subject string for repeated replacement */
		ZVAL_STR_COPY(result, subject_str);

    ...

		/* For each entry in the search array, get the entry */
		ZEND_HASH_FOREACH_VAL(Z_ARRVAL_P(search), search_entry) {
			/* Make sure we're dealing with strings. */
			zend_string *search_str = zval_get_string(search_entry);

      ...

      if (ZSTR_LEN(search_str) > 1) {
        ...
        
				tmp_result = php_str_to_str_i_ex(Z_STR_P(result), ZSTR_VAL(lc_subject_str),
						search_str, replace_value, replace_len, &replace_count);
            
        ...
			}

			ZVAL_STR(result, tmp_result);

		} ZEND_HASH_FOREACH_END();
	}

We can somewhat abusively demonstrate the O(size of the string * number of search terms) behavior like this:

<?php

$operations = 512;

$inputs = array(
  '1KB' => str_repeat('A', 1024),
  '1MB' => str_repeat('A', 1024 * 1024),
  '2MB' => str_repeat('A', 1024 * 1024 * 2),
);

printf("| Replacements |");
foreach ($inputs as $key => $input) {
  printf(" % 10s |", $key);
}
printf("\n");

while (true) {
  $search = array_fill(0, $operations, 'X');
  $replace = array_fill(0, $operations, 'Y');

  $times = array();
  foreach ($inputs as $label => $input) {
    $t_start = microtime(true);
    str_replace($search, $replace, $input);
    $t_end = microtime(true);

    $times[$label] = ($t_end - $t_start);
  }

  printf("| % 11dx |", $operations);
  foreach ($times as $time) {
    printf(" % 8dms |", (int)(1000 * $time));
  }
  printf("\n");

  $operations *= 2;
}

When run:

$ php -f replace.php 
| Replacements |        1KB |        1MB |        2MB |
|         512x |        0ms |       11ms |       20ms |
|        1024x |        0ms |       19ms |       36ms |
|        2048x |        0ms |       36ms |       73ms |
|        4096x |        0ms |       72ms |      141ms |
|        8192x |        0ms |      154ms |      287ms |
|       16384x |        0ms |      285ms |      580ms |
|       32768x |        0ms |      619ms |     1291ms |
|       65536x |        1ms |     1214ms |     2466ms |
|      131072x |        3ms |     2456ms |     4798ms |
|      262144x |        5ms |     4961ms |     9769ms |
|      524288x |       10ms |     9725ms |    19658ms |

The more times you replace "X" with "Y" in the string "AAA...", the longer it takes. (And this is a function of the corpus size, not overhead in dealing with the larger search/replace lists.)

We could imagine that a smarter str_replace() might detect that the 500,000th replacement of "X" in "AAA..." is not likely to have an new effects and can be skipped, or that replacing any text with "Y" can never affect matches of "X", but these would add a huge amount of complexity to the implementation and some overhead in the common case (a small number of replacements in a short string). It also seems reasonable to assume that most callers aren't pathological like this.

I've been staring at this long enough that I can plausibly claim to understand how it works. Put this on the pile of "terrible interview questions that I'm sure someone somewhere is getting asked".

src/markup/engine/remarkup/PhutilRemarkupBlockStorage.php
8–19	Ok, this makes sense now that I understand that this is replacing tokens with `MAGIC_BYTE.$index_where_we_stored_the_token.'Z'`, but this docblock was very confusing because `D12` is a weird example to use when the replacement text becomes `\11Z`. Maybe make the starting remarkup `__D99999__` or something with a much farther Levenshtein distance from `\11Z`?
32	So just to make sure I've got this, when we're done processing this snippet, `$map` will look this, right? array( '\11Z' => '<a href="http://...">...</a>,' '\12Z' => '<em>\11Z</em>', );
34	Oh, that's why you chose italics as the modifier. ¯\_(ツ)_/¯
48	TIL that you can apparently rely on PHP "incrementing" an uninitialized variable by deciding that you intended to make it an integer. Why not just define `private $index = 0`?
68–71	This is more like "However, we know that all the children of a given token were extracted previously, and therefore must appear in our map before the token itself does", right?
83	Are there any risks to using `'/'.self::MAGIC_BYTE.'\d+Z/'` instead of the slightly stricter `'/'.self::MAGIC_BYTE.'[1-9]\d+Z/'`?
103–104	"have a corresponding token". And yeah, seems like we should throw here, because it probably means there's something subtly wrong with this new implementation?

This revision is now accepted and ready to land.May 16 2019, 5:49 PM

epriestley added inline comments.May 16 2019, 6:06 PM

src/markup/engine/remarkup/PhutilRemarkupBlockStorage.php
32	Yep, this is right.
68–71	Right.
83	Both can match `<0x01>999999Z` in a block with fewer than 999999 tokens, so we have to handle the case where we match a magic sequence with no corresponding token unless we make the regexp extremely strict (e.g., dynamically build a regexp which matches only values between 1-119 or whatever). It "should" be impossible for invalid sequences to ever appear in the corpus because we tokenize any existing `<0x01>` bytes at the very beginning (sort of like adding a backslash to any existing backslashes in the input). So if you write a document with `<0x01>0123Z`, we'll never try to expand token "0123". Instead, we'll tokenize `<0x01>` and produce `<0x01>1Z0123Z`. Then, the very last token expansion will put the original `<0x01>` back, but the `preg_match()` will never see the original `<0x01>`.
103–104	Yeah, I'll make this throw. It "should definitely" be impossible so it's probably good for us to know about it if that isn't actuallyt rue.

Choose less-misleading constants and notations for the examples.
Rewrite the example documentation to more clearly follow the new execution flow.
Clarify the "child tokens appear first" explanation.
Make encountering an invalid token ("01234" or "99999") throw. It should be impossible for invalid tokens to appear anywhere.
Make encountering an out-of-order token (for example, token 30 while we're processing token 20) throw. It should be impossible for out-of-order tokens to appear anywhere.

Harbormaster completed remote builds in B22810: Diff 48937.May 16 2019, 6:11 PM

I'll hold this until after the release cut anyway since errors here are unusually dangerous (they can lead to XSS very easily), but I think it would also be hard to get this wrong without breaking a lot of unit tests and UI behavior in fairly obvious ways.