Differential D20522 Diff 48951 src/markup/engine/remarkup/PhutilRemarkupBlockStorage.php

Changeset View

Standalone View

src/markup/engine/remarkup/PhutilRemarkupBlockStorage.php

	<?php			<?php

	/**			/**
	* Remarkup prevents several classes of text-processing problems by replacing			* Remarkup prevents several classes of text-processing problems by replacing
	* tokens in the text as they are marked up. For example, if you write something			* tokens in the text as they are marked up. For example, if you write something
	* like this:			* like this:
	*			*
	* //D12//			* //D12//
	*			*
	* It is processed in several stages. First the "D12" matches and is replaced:			* It is processed in several stages. First the "D12" matches and is replaced
				* with a token, in the form of "<0x01><ID number><literal "Z">". The first
				* byte, "<0x01>" is a single byte with value 1 that marks a token. If this is
				* token ID "444", the text may now look like this:
	*			*
	* //\11Z//			* //<0x01>444Z//
	*			*
	* Now the italics match and are replaced:			* Now the italics match and are replaced, using the next token ID:
	*			*
	* \12Z			* <0x01>445Z
				amckinleyUnsubmitted Not Done Inline Actions Ok, this makes sense now that I understand that this is replacing tokens with `MAGIC_BYTE.$index_where_we_stored_the_token.'Z'`, but this docblock was very confusing because `D12` is a weird example to use when the replacement text becomes `\11Z`. Maybe make the starting remarkup `__D99999__` or something with a much farther Levenshtein distance from `\11Z`? amckinley: Ok, this makes sense now that I understand that this is replacing tokens with `MAGIC_BYTE.
	*			*
	* When processing completes, all the tokens are replaced again in reverse			* When processing completes, all the tokens are replaced with their final
	* order:			* equivalents. For example, token 444 is evaluated to:
	*			*
	* <em>\11Z</em>			* <a href="http://...">...</a>
	*			*
	* Then:			* Then token 445 is evaluated:
				*
				* <em><0x01>444Z</em>
				*
				* ...and all tokens it contains are replaced:
	*			*
	* <em><a href="http://...">...</a></em>			* <em><a href="http://...">...</a></em>
				amckinleyUnsubmitted Not Done Inline Actions So just to make sure I've got this, when we're done processing this snippet, `$map` will look this, right? array( '\11Z' => '<a href="http://...">...</a>,' '\12Z' => '<em>\11Z</em>', ); amckinley: So just to make sure I've got this, when we're done processing this snippet, `$map` will look…
				epriestleyAuthorUnsubmitted Done Inline Actions Yep, this is right. epriestley: Yep, this is right.
	*			*
	* If we didn't do this, the italics rule could match the "//" in "http://",			* If we didn't do this, the italics rule could match the "//" in "http://",
				amckinleyUnsubmitted Not Done Inline Actions Oh, that's why you chose italics as the modifier. ¯\_(ツ)_/¯ amckinley: Oh, that's why you chose italics as the modifier. ¯\_(ツ)_/¯
	* or any other number of processing mistakes could occur, some of which create			* or any other number of processing mistakes could occur, some of which create
	* security risks.			* security risks.
	*			*
	* This class generates keys, and stores the map of keys to replacement text.			* This class generates keys, and stores the map of keys to replacement text.
	*/			*/
	final class PhutilRemarkupBlockStorage extends Phobject {			final class PhutilRemarkupBlockStorage extends Phobject {

	const MAGIC_BYTE = "\1";			const MAGIC_BYTE = "\1";

	private $map = array();			private $map = array();
	private $index;			private $index = 0;

	public function store($text) {			public function store($text) {
	$key = self::MAGIC_BYTE.(++$this->index).'Z';			$key = self::MAGIC_BYTE.(++$this->index).'Z';
				amckinleyUnsubmitted Not Done Inline Actions TIL that you can apparently rely on PHP "incrementing" an uninitialized variable by deciding that you intended to make it an integer. Why not just define `private $index = 0`? amckinley: TIL that you can apparently rely on PHP "incrementing" an uninitialized variable by deciding…
				epriestleyAuthorUnsubmitted Done Inline Actions Fixed this, too. epriestley: Fixed this, too.
	$this->map[$key] = $text;			$this->map[$key] = $text;
	return $key;			return $key;
	}			}

	public function restore($corpus, $text_mode = false) {			public function restore($corpus, $text_mode = false) {
	if ($this->map) {			$map = $this->map;
	if ($text_mode) {
	$corpus = str_replace(			if (!$text_mode) {
	array_reverse(array_keys($this->map)),			foreach ($map as $key => $content) {
	array_reverse($this->map),			$map[$key] = phutil_escape_html($content);
	$corpus);			}
				$corpus = phutil_escape_html($corpus);
				}

				// NOTE: Tokens may contain other tokens: for example, a table may have
				// links inside it. So we can't do a single simple find/replace, because
				// we need to find and replace child tokens inside the content of parent
				// tokens.

				// However, we know that rules which have child tokens must always store
				// all their child tokens first, before they store their parent token: you
				// have to pass the "store(text)" API a block of text with tokens already
				// in it, so you must have created child tokens already.
				amckinleyUnsubmitted Not Done Inline Actions This is more like "However, we know that all the children of a given token were extracted previously, and therefore must appear in our map before the token itself does", right? amckinley: This is more like "However, we know that all the children of a given token were extracted…
				epriestleyAuthorUnsubmitted Done Inline Actions Right. epriestley: Right.

				// Thus, all child tokens will appear in the list before parent tokens, so
				// if we start at the beginning of the list and replace all the tokens we
				// find in each piece of content, we'll end up expanding all subtokens
				// correctly.

				$map[] = $corpus;
				$seen = array();
				foreach ($map as $key => $content) {
				$seen[$key] = true;

				// If the content contains no token magic, we don't need to replace
				amckinleyUnsubmitted Not Done Inline Actions Are there any risks to using `'/'.self::MAGIC_BYTE.'\d+Z/'` instead of the slightly stricter `'/'.self::MAGIC_BYTE.'[1-9]\d+Z/'`? amckinley: Are there any risks to using `'/'.self::MAGIC_BYTE.'\d+Z/'` instead of the slightly stricter…
				epriestleyAuthorUnsubmitted Done Inline Actions Both can match `<0x01>999999Z` in a block with fewer than 999999 tokens, so we have to handle the case where we match a magic sequence with no corresponding token unless we make the regexp extremely strict (e.g., dynamically build a regexp which matches only values between 1-119 or whatever). It "should" be impossible for invalid sequences to ever appear in the corpus because we tokenize any existing `<0x01>` bytes at the very beginning (sort of like adding a backslash to any existing backslashes in the input). So if you write a document with `<0x01>0123Z`, we'll never try to expand token "0123". Instead, we'll tokenize `<0x01>` and produce `<0x01>1Z0123Z`. Then, the very last token expansion will put the original `<0x01>` back, but the `preg_match()` will never see the original `<0x01>`. epriestley: Both can match `<0x01>999999Z` in a block with fewer than 999999 tokens, so we have to handle…
				// anything.
				if (strpos($content, self::MAGIC_BYTE) === false) {
				continue;
				}

				$matches = null;
				preg_match_all(
				'/'.self::MAGIC_BYTE.'\d+Z/',
				$content,
				$matches,
				PREG_OFFSET_CAPTURE);

				$matches = $matches[0];

				// See PHI1114. We're replacing all the matches in one pass because this
				// is significantly faster than doing "substr_replace()" in a loop if the
				// corpus is large and we have a large number of matches.

				// Build a list of string pieces in "$parts" by interleaving the
				// plain strings between each token and the replacement token text, then
				// implode the whole thing when we're done.
				amckinleyUnsubmitted Not Done Inline Actions "have a corresponding token". And yeah, seems like we should throw here, because it probably means there's something subtly wrong with this new implementation? amckinley: "have a corresponding token". And yeah, seems like we should throw here, because it probably…
				epriestleyAuthorUnsubmitted Done Inline Actions Yeah, I'll make this throw. It "should definitely" be impossible so it's probably good for us to know about it if that isn't actuallyt rue. epriestley: Yeah, I'll make this throw. It "should definitely" be impossible so it's probably good for us…

				$parts = array();
				$pos = 0;
				foreach ($matches as $next) {
				$subkey = $next[0];

				// If we've matched a token pattern but don't actually have any
				// corresponding token, just skip this match. This should not be
				// possible, and should perhaps be an error.
				if (!isset($seen[$subkey])) {
				if (!isset($map[$subkey])) {
				throw new Exception(
				pht(
				'Matched token key "%s" while processing remarkup block, but '.
				'this token does not exist in the token map.',
				$subkey));
	} else {			} else {
	$corpus = phutil_safe_html(str_replace(			throw new Exception(
	array_reverse(array_keys($this->map)),			pht(
	array_map('phutil_escape_html', array_reverse($this->map)),			'Matched token key "%s" while processing remarkup block, but '.
	phutil_escape_html($corpus)));			'this token appears later in the list than the key being '.
				'processed ("%s").',
				$subkey,
				$key));
				}
				}

				$subpos = $next[1];

				// If there were any non-token bytes since the last token, add them.
				if ($subpos > $pos) {
				$parts[] = substr($content, $pos, $subpos - $pos);
				}

				// Add the token replacement text.
				$parts[] = $map[$subkey];

				// Move the non-token cursor forward over the token.
				$pos = $subpos + strlen($subkey);
				}

				// Add any leftover non-token bytes after the last token.
				$parts[] = substr($content, $pos);

				$content = implode('', $parts);

				$map[$key] = $content;
	}			}
				$corpus = last($map);

				if (!$text_mode) {
				$corpus = phutil_safe_html($corpus);
	}			}

	return $corpus;			return $corpus;
	}			}

	public function overwrite($key, $new_text) {			public function overwrite($key, $new_text) {
	$this->map[$key] = $new_text;			$this->map[$key] = $new_text;
	return $this;			return $this;
	}			}

	Show All 10 Lines