Page MenuHomePhabricator

Improve the performance of large remarkup documents with many complex rules
ClosedPublic

Authored by epriestley on May 15 2019, 4:20 PM.
Tags
None
Referenced Files
F14493412: D20522.diff
Fri, Jan 3, 2:20 AM
Unknown Object (File)
Fri, Dec 20, 6:58 PM
Unknown Object (File)
Thu, Dec 19, 12:35 AM
Unknown Object (File)
Fri, Dec 13, 6:08 AM
Unknown Object (File)
Nov 22 2024, 3:24 AM
Unknown Object (File)
Nov 18 2024, 11:03 PM
Unknown Object (File)
Oct 22 2024, 11:22 AM
Unknown Object (File)
Oct 14 2024, 3:24 PM
Subscribers
None

Details

Summary

See PHI1114. An install encountered a multi-megabyte document with approximately 11,000 replacement tokens (complex remarkup rules which store text and evaluate at display time) that required 17s to render.

Onsite investigation narrowed this down to a large amount of time spent in restore(), here.

Before this change, a document like this must call str_replace() on the full document for each token, so roughly O(size of the document * number of tokens) bytes are being shuffled around.

We can improve this dramatically by:

  • incrementally expanding tokens, so most operations are on one token instead of the entire document (and the total document size has a much smaller performance effect); and
  • replacing tokens in a single pass with preg_match() + append + implode() instead of running str_replace() in a loop.
Test Plan

On this document:

echo str_repeat("T300 T301 T302 T303 T304 T305 T306 T307 T308 T309 T310\n", 1024).str_repeat("qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq\n", 1024 * 512);

..saw local time in restore() drop from ~3,300ms to ~10ms with no apparent behavioral changes.

Ran all unit tests, browsed around locally, loaded the page in the web UI.

Diff Detail

Repository
rPHU libphutil
Branch
token1
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 22802
Build 31279: Run Core Tests
Build 31278: arc lint + arc unit

Event Timeline

Part of the reason this works is that str_replace(), even in list mode, is basically running a loop over the string. That is:

str_replace(array('A', 'B'), array('X', 'Y'), $corpus);

...has behavior equivalent to:

foreach (array('A' => 'X', 'B' => 'Y') as $search => $replace) {
  $corpus = str_replace($search, $replace, $corpus);
}

...and also currently has performance equivalent to that loop (i.e., cost is roughly O(size of the string * number of search terms)).

It can't easily be smarter because this behavior matters when the "search" terms and "replacement" terms are substrings of one another. For example:

$ php -r "echo str_replace(array('meow', 'woof', 'abracadabra'), array('abra', 'cadabra', 'Magic'), 'meowwoof');"
Magic

If "meow" and "woof" were replaced in single pass, the result would be "abracadabra", not "Magic". Since "str_replace()" has multi-pass behavior, it it very likely to have a multi-pass implementation.

We also rely on this behavior, because tokens may include other tokens. If str_replace() did everything in a single pass, we couldn't use it in the original algorithm.

(In this revised algorithm, we could use a single-pass str_replace() instead of the "$parts"/"$pos" loop, which is basically implementing a single-pass str_replace().)

The actual implementation of str_replace() in PHP as of 7.1.19 is approximately the loop above:

ext/standard/string.c
	/* If search is an array */
	if (Z_TYPE_P(search) == IS_ARRAY) {
		/* Duplicate subject string for repeated replacement */
		ZVAL_STR_COPY(result, subject_str);

    ...

		/* For each entry in the search array, get the entry */
		ZEND_HASH_FOREACH_VAL(Z_ARRVAL_P(search), search_entry) {
			/* Make sure we're dealing with strings. */
			zend_string *search_str = zval_get_string(search_entry);

      ...

      if (ZSTR_LEN(search_str) > 1) {
        ...
        
				tmp_result = php_str_to_str_i_ex(Z_STR_P(result), ZSTR_VAL(lc_subject_str),
						search_str, replace_value, replace_len, &replace_count);
            
        ...
			}

			ZVAL_STR(result, tmp_result);

		} ZEND_HASH_FOREACH_END();
	}

We can somewhat abusively demonstrate the O(size of the string * number of search terms) behavior like this:

<?php

$operations = 512;

$inputs = array(
  '1KB' => str_repeat('A', 1024),
  '1MB' => str_repeat('A', 1024 * 1024),
  '2MB' => str_repeat('A', 1024 * 1024 * 2),
);

printf("| Replacements |");
foreach ($inputs as $key => $input) {
  printf(" % 10s |", $key);
}
printf("\n");

while (true) {
  $search = array_fill(0, $operations, 'X');
  $replace = array_fill(0, $operations, 'Y');

  $times = array();
  foreach ($inputs as $label => $input) {
    $t_start = microtime(true);
    str_replace($search, $replace, $input);
    $t_end = microtime(true);

    $times[$label] = ($t_end - $t_start);
  }

  printf("| % 11dx |", $operations);
  foreach ($times as $time) {
    printf(" % 8dms |", (int)(1000 * $time));
  }
  printf("\n");

  $operations *= 2;
}

When run:

$ php -f replace.php 
| Replacements |        1KB |        1MB |        2MB |
|         512x |        0ms |       11ms |       20ms |
|        1024x |        0ms |       19ms |       36ms |
|        2048x |        0ms |       36ms |       73ms |
|        4096x |        0ms |       72ms |      141ms |
|        8192x |        0ms |      154ms |      287ms |
|       16384x |        0ms |      285ms |      580ms |
|       32768x |        0ms |      619ms |     1291ms |
|       65536x |        1ms |     1214ms |     2466ms |
|      131072x |        3ms |     2456ms |     4798ms |
|      262144x |        5ms |     4961ms |     9769ms |
|      524288x |       10ms |     9725ms |    19658ms |

The more times you replace "X" with "Y" in the string "AAA...", the longer it takes. (And this is a function of the corpus size, not overhead in dealing with the larger search/replace lists.)

We could imagine that a smarter str_replace() might detect that the 500,000th replacement of "X" in "AAA..." is not likely to have an new effects and can be skipped, or that replacing any text with "Y" can never affect matches of "X", but these would add a huge amount of complexity to the implementation and some overhead in the common case (a small number of replacements in a short string). It also seems reasonable to assume that most callers aren't pathological like this.

I've been staring at this long enough that I can plausibly claim to understand how it works. Put this on the pile of "terrible interview questions that I'm sure someone somewhere is getting asked".

src/markup/engine/remarkup/PhutilRemarkupBlockStorage.php
8–16

Ok, this makes sense now that I understand that this is replacing tokens with MAGIC_BYTE.$index_where_we_stored_the_token.'Z', but this docblock was very confusing because D12 is a weird example to use when the replacement text becomes \11Z.

Maybe make the starting remarkup __D99999__ or something with a much farther Levenshtein distance from \11Z?

25

So just to make sure I've got this, when we're done processing this snippet, $map will look this, right?

array(
  '\11Z' => '<a href="http://...">...</a>,'
  '\12Z' => '<em>\11Z</em>',
);
27

Oh, that's why you chose italics as the modifier. ¯\_(ツ)_/¯

41

TIL that you can apparently rely on PHP "incrementing" an uninitialized variable by deciding that you intended to make it an integer. Why not just define private $index = 0?

61–64

This is more like "However, we know that all the children of a given token were extracted previously, and therefore must appear in our map before the token itself does", right?

76

Are there any risks to using '/'.self::MAGIC_BYTE.'\d+Z/' instead of the slightly stricter '/'.self::MAGIC_BYTE.'[1-9]\d+Z/'?

96–97

"have a corresponding token". And yeah, seems like we should throw here, because it probably means there's something subtly wrong with this new implementation?

This revision is now accepted and ready to land.May 16 2019, 5:49 PM
src/markup/engine/remarkup/PhutilRemarkupBlockStorage.php
25

Yep, this is right.

61–64

Right.

76

Both can match <0x01>999999Z in a block with fewer than 999999 tokens, so we have to handle the case where we match a magic sequence with no corresponding token unless we make the regexp extremely strict (e.g., dynamically build a regexp which matches only values between 1-119 or whatever).

It "should" be impossible for invalid sequences to ever appear in the corpus because we tokenize any existing <0x01> bytes at the very beginning (sort of like adding a backslash to any existing backslashes in the input).

So if you write a document with <0x01>0123Z, we'll never try to expand token "0123". Instead, we'll tokenize <0x01> and produce <0x01>1Z0123Z. Then, the very last token expansion will put the original <0x01> back, but the preg_match() will never see the original <0x01>.

96–97

Yeah, I'll make this throw. It "should definitely" be impossible so it's probably good for us to know about it if that isn't actuallyt rue.

  • Choose less-misleading constants and notations for the examples.
  • Rewrite the example documentation to more clearly follow the new execution flow.
  • Clarify the "child tokens appear first" explanation.
  • Make encountering an invalid token ("01234" or "99999") throw. It should be impossible for invalid tokens to appear anywhere.
  • Make encountering an out-of-order token (for example, token 30 while we're processing token 20) throw. It should be impossible for out-of-order tokens to appear anywhere.

I'll hold this until after the release cut anyway since errors here are unusually dangerous (they can lead to XSS very easily), but I think it would also be hard to get this wrong without breaking a lot of unit tests and UI behavior in fairly obvious ways.

src/markup/engine/remarkup/PhutilRemarkupBlockStorage.php
41

Fixed this, too.