Changeset View
Standalone View
src/markup/engine/remarkup/PhutilRemarkupBlockStorage.php
<?php | <?php | ||||
/** | /** | ||||
* Remarkup prevents several classes of text-processing problems by replacing | * Remarkup prevents several classes of text-processing problems by replacing | ||||
* tokens in the text as they are marked up. For example, if you write something | * tokens in the text as they are marked up. For example, if you write something | ||||
* like this: | * like this: | ||||
* | * | ||||
* //D12// | * //D12// | ||||
* | * | ||||
* It is processed in several stages. First the "D12" matches and is replaced: | * It is processed in several stages. First the "D12" matches and is replaced | ||||
* with a token, in the form of "<0x01><ID number><literal "Z">". The first | |||||
* byte, "<0x01>" is a single byte with value 1 that marks a token. If this is | |||||
* token ID "444", the text may now look like this: | |||||
* | * | ||||
* //\11Z// | * //<0x01>444Z// | ||||
* | * | ||||
* Now the italics match and are replaced: | * Now the italics match and are replaced, using the next token ID: | ||||
* | * | ||||
* \12Z | * <0x01>445Z | ||||
amckinley: Ok, this makes sense now that I understand that this is replacing tokens with `MAGIC_BYTE. | |||||
* | * | ||||
* When processing completes, all the tokens are replaced again in reverse | * When processing completes, all the tokens are replaced with their final | ||||
* order: | * equivalents. For example, token 444 is evaluated to: | ||||
* | * | ||||
* <em>\11Z</em> | * <a href="http://...">...</a> | ||||
* | * | ||||
* Then: | * Then token 445 is evaluated: | ||||
* | |||||
* <em><0x01>444Z</em> | |||||
* | |||||
* ...and all tokens it contains are replaced: | |||||
* | * | ||||
* <em><a href="http://...">...</a></em> | * <em><a href="http://...">...</a></em> | ||||
Not Done Inline ActionsSo just to make sure I've got this, when we're done processing this snippet, $map will look this, right? array( '\11Z' => '<a href="http://...">...</a>,' '\12Z' => '<em>\11Z</em>', ); amckinley: So just to make sure I've got this, when we're done processing this snippet, `$map` will look… | |||||
Done Inline ActionsYep, this is right. epriestley: Yep, this is right. | |||||
* | * | ||||
* If we didn't do this, the italics rule could match the "//" in "http://", | * If we didn't do this, the italics rule could match the "//" in "http://", | ||||
Not Done Inline ActionsOh, that's why you chose italics as the modifier. ¯\_(ツ)_/¯ amckinley: Oh, that's why you chose italics as the modifier. ¯\_(ツ)_/¯ | |||||
* or any other number of processing mistakes could occur, some of which create | * or any other number of processing mistakes could occur, some of which create | ||||
* security risks. | * security risks. | ||||
* | * | ||||
* This class generates keys, and stores the map of keys to replacement text. | * This class generates keys, and stores the map of keys to replacement text. | ||||
*/ | */ | ||||
final class PhutilRemarkupBlockStorage extends Phobject { | final class PhutilRemarkupBlockStorage extends Phobject { | ||||
const MAGIC_BYTE = "\1"; | const MAGIC_BYTE = "\1"; | ||||
private $map = array(); | private $map = array(); | ||||
private $index; | private $index = 0; | ||||
public function store($text) { | public function store($text) { | ||||
$key = self::MAGIC_BYTE.(++$this->index).'Z'; | $key = self::MAGIC_BYTE.(++$this->index).'Z'; | ||||
Not Done Inline ActionsTIL that you can apparently rely on PHP "incrementing" an uninitialized variable by deciding that you intended to make it an integer. Why not just define private $index = 0? amckinley: TIL that you can apparently rely on PHP "incrementing" an uninitialized variable by deciding… | |||||
Done Inline ActionsFixed this, too. epriestley: Fixed this, too. | |||||
$this->map[$key] = $text; | $this->map[$key] = $text; | ||||
return $key; | return $key; | ||||
} | } | ||||
public function restore($corpus, $text_mode = false) { | public function restore($corpus, $text_mode = false) { | ||||
if ($this->map) { | $map = $this->map; | ||||
if ($text_mode) { | |||||
$corpus = str_replace( | if (!$text_mode) { | ||||
array_reverse(array_keys($this->map)), | foreach ($map as $key => $content) { | ||||
array_reverse($this->map), | $map[$key] = phutil_escape_html($content); | ||||
$corpus); | } | ||||
$corpus = phutil_escape_html($corpus); | |||||
} | |||||
// NOTE: Tokens may contain other tokens: for example, a table may have | |||||
// links inside it. So we can't do a single simple find/replace, because | |||||
// we need to find and replace child tokens inside the content of parent | |||||
// tokens. | |||||
// However, we know that rules which have child tokens must always store | |||||
// all their child tokens first, before they store their parent token: you | |||||
// have to pass the "store(text)" API a block of text with tokens already | |||||
// in it, so you must have created child tokens already. | |||||
Not Done Inline ActionsThis is more like "However, we know that all the children of a given token were extracted previously, and therefore must appear in our map before the token itself does", right? amckinley: This is more like "However, we know that all the children of a given token were extracted… | |||||
Done Inline ActionsRight. epriestley: Right. | |||||
// Thus, all child tokens will appear in the list before parent tokens, so | |||||
// if we start at the beginning of the list and replace all the tokens we | |||||
// find in each piece of content, we'll end up expanding all subtokens | |||||
// correctly. | |||||
$map[] = $corpus; | |||||
$seen = array(); | |||||
foreach ($map as $key => $content) { | |||||
$seen[$key] = true; | |||||
// If the content contains no token magic, we don't need to replace | |||||
Not Done Inline ActionsAre there any risks to using '/'.self::MAGIC_BYTE.'\d+Z/' instead of the slightly stricter '/'.self::MAGIC_BYTE.'[1-9]\d+Z/'? amckinley: Are there any risks to using `'/'.self::MAGIC_BYTE.'\d+Z/'` instead of the slightly stricter… | |||||
Done Inline ActionsBoth can match <0x01>999999Z in a block with fewer than 999999 tokens, so we have to handle the case where we match a magic sequence with no corresponding token unless we make the regexp extremely strict (e.g., dynamically build a regexp which matches only values between 1-119 or whatever). It "should" be impossible for invalid sequences to ever appear in the corpus because we tokenize any existing <0x01> bytes at the very beginning (sort of like adding a backslash to any existing backslashes in the input). So if you write a document with <0x01>0123Z, we'll never try to expand token "0123". Instead, we'll tokenize <0x01> and produce <0x01>1Z0123Z. Then, the very last token expansion will put the original <0x01> back, but the preg_match() will never see the original <0x01>. epriestley: Both can match `<0x01>999999Z` in a block with fewer than 999999 tokens, so we have to handle… | |||||
// anything. | |||||
if (strpos($content, self::MAGIC_BYTE) === false) { | |||||
continue; | |||||
} | |||||
$matches = null; | |||||
preg_match_all( | |||||
'/'.self::MAGIC_BYTE.'\d+Z/', | |||||
$content, | |||||
$matches, | |||||
PREG_OFFSET_CAPTURE); | |||||
$matches = $matches[0]; | |||||
// See PHI1114. We're replacing all the matches in one pass because this | |||||
// is significantly faster than doing "substr_replace()" in a loop if the | |||||
// corpus is large and we have a large number of matches. | |||||
// Build a list of string pieces in "$parts" by interleaving the | |||||
// plain strings between each token and the replacement token text, then | |||||
// implode the whole thing when we're done. | |||||
Not Done Inline Actions"have a corresponding token". And yeah, seems like we should throw here, because it probably means there's something subtly wrong with this new implementation? amckinley: "have a corresponding token". And yeah, seems like we should throw here, because it probably… | |||||
Done Inline ActionsYeah, I'll make this throw. It "should definitely" be impossible so it's probably good for us to know about it if that isn't actuallyt rue. epriestley: Yeah, I'll make this throw. It "should definitely" be impossible so it's probably good for us… | |||||
$parts = array(); | |||||
$pos = 0; | |||||
foreach ($matches as $next) { | |||||
$subkey = $next[0]; | |||||
// If we've matched a token pattern but don't actually have any | |||||
// corresponding token, just skip this match. This should not be | |||||
// possible, and should perhaps be an error. | |||||
if (!isset($seen[$subkey])) { | |||||
if (!isset($map[$subkey])) { | |||||
throw new Exception( | |||||
pht( | |||||
'Matched token key "%s" while processing remarkup block, but '. | |||||
'this token does not exist in the token map.', | |||||
$subkey)); | |||||
} else { | } else { | ||||
$corpus = phutil_safe_html(str_replace( | throw new Exception( | ||||
array_reverse(array_keys($this->map)), | pht( | ||||
array_map('phutil_escape_html', array_reverse($this->map)), | 'Matched token key "%s" while processing remarkup block, but '. | ||||
phutil_escape_html($corpus))); | 'this token appears later in the list than the key being '. | ||||
'processed ("%s").', | |||||
$subkey, | |||||
$key)); | |||||
} | |||||
} | |||||
$subpos = $next[1]; | |||||
// If there were any non-token bytes since the last token, add them. | |||||
if ($subpos > $pos) { | |||||
$parts[] = substr($content, $pos, $subpos - $pos); | |||||
} | |||||
// Add the token replacement text. | |||||
$parts[] = $map[$subkey]; | |||||
// Move the non-token cursor forward over the token. | |||||
$pos = $subpos + strlen($subkey); | |||||
} | |||||
// Add any leftover non-token bytes after the last token. | |||||
$parts[] = substr($content, $pos); | |||||
$content = implode('', $parts); | |||||
$map[$key] = $content; | |||||
} | } | ||||
$corpus = last($map); | |||||
if (!$text_mode) { | |||||
$corpus = phutil_safe_html($corpus); | |||||
} | } | ||||
return $corpus; | return $corpus; | ||||
} | } | ||||
public function overwrite($key, $new_text) { | public function overwrite($key, $new_text) { | ||||
$this->map[$key] = $new_text; | $this->map[$key] = $new_text; | ||||
return $this; | return $this; | ||||
} | } | ||||
Show All 10 Lines |
Ok, this makes sense now that I understand that this is replacing tokens with MAGIC_BYTE.$index_where_we_stored_the_token.'Z', but this docblock was very confusing because D12 is a weird example to use when the replacement text becomes \11Z.
Maybe make the starting remarkup __D99999__ or something with a much farther Levenshtein distance from \11Z?