Page MenuHomePhabricator

Remarkup hyperlink pattern may backtrack explosively
Closed, ResolvedPublic

Description

I ran across this in the context of T13586. PhutilRemarkupHyperlinkRule uses a (slightly more complex) version of this regular expression to find links:

(\w{3,}://[^\s]+)

When matched against an input like:

AAAAAA...

...this expression executes very slowly:

$ cat example.php       
<?php

$pattern = '(\w{3,}://)';
$corpus = str_repeat('A', 1024 * 512);

$result = preg_match($pattern, $corpus);

var_dump($result);

$ time php -f example.php 
int(0)
php -f example.php  64.72s user 0.74s system 99% cpu 1:05.48 total

Here, it took 64s to match a 512KB input. This may be specific to particular PHP and/or PCRE versions, since I'd naively expect it to have arisen earlier.

I think the issue is: since this regex has no concrete/anchoring prefix, the PCRE engine may backtrack explosively trying to match \w{3,}. Replacing it with \w+ doesn't fix the problem.

We'd like this pattern to anchor on ://. We can't use a lookbehind to match the protocol portion because the lookbehind pattern is not fixed-length.

Event Timeline

epriestley created this task.

We also can't pass $flags to preg_replace_callback(...) because it wasn't supported until PHP 7.4. This means preg_replace_callback() can't operate on a match-offset basis.

Perhaps \w{3,}+ resolves the issue? (+ being a possessive quantifier here that prevents backtracking)

At least on my system, no: \w+, \w{3,}, and \w{3,}+ all take 60s+ to (fail to) match AAAAAAAA... (with total length 512KB).

My use of "backtrack" might not really be precise, here. I'd guess it's trying to match "AAA:", "AAAA:", "AAAAA:", etc., at position 0. All 512K possible matches fail. Then it tries to match them at position 1. All 511,999 possible matches fail, and so on.

A possible fix is to use \w{3,64} (imposing a reasonable restriction that "the maximum length of a valid URI protocol is 64 characters"), but I already wrote a much much more complicated fix so I'm probably going to land that unless I can't get it to pass tests.

This passes tests, but it sure is a big mess compared to a 3-character change to just add a maximum length, and discretion is the better part of avoiding a sunk cost fallacy or something along those lines:

// See T13608. We identify the protocol after matching a link to give
// this pattern a concrete sequence to search for. If we match a
// variable-length protocol here with a pattern like "\w{3,}://", the regex
// engine may take a very long time to match against long strings.

// We also can't use "preg_replace_callback()" with "PREG_OFFSET_CAPTURE"
// because the function does not support flags until PHP 7.4.

$pattern = '(://[^\s'.PhutilRemarkupBlockStorage::MAGIC_BYTE.']+)';

// For each match, look backward from the match position to extend the
// matching sequence so it includes the protocol specification before
// the "://" in the URI, if one exists.

// Minimum required length for a valid protocol.
$min_protocol_length = 3;
// Maximum length for a valid protocol.
$max_protocol_length = 64;

// Characters which can never appear in a protocol, and stop the search.
static $nonprotocol = array(
  ' ' => true,
  "\n" => true,
  "\t" => true,
  ':' => true,
  '/' => true,
);

$offset = 0;
while (true) {
  $match = phutil_preg_match($pattern, $text, PREG_OFFSET_CAPTURE, $offset);
  if (!$match) {
    break;
  }

  $match_text = $match[0][0];
  $match_offset = $match[0][1];

  // Only search backward to the beginning of the string or the maximum
  // protocol length, whichever is shorter.
  $max_distance = min($match_offset, $max_protocol_length);

  for ($ii = 1; $ii <= $max_distance; $ii++) {
    $c = $text[$match_offset - $ii];
    if (isset($nonprotocol[$c])) {
      $ii--;
      break;
    }
  }

  $offset = $match_offset + strlen($match_text);

  // If we found a sequence of valid protocol characters between the
  // minimum and maximum lengths, try to mark it up as a hyperlink.
  if (($ii >= $min_protocol_length) && ($ii <= $max_distance)) {
    $old_offset = $match_offset - $ii;

    $old_text = substr($text, $old_offset, $ii).$match_text;
    $new_text = $this->markupHyperlinkUngreedy($old_text);

    if ($new_text !== $old_text) {
      $text = substr_replace(
        $text,
        $new_text,
        $old_offset,
        strlen($old_text));
      $offset += (strlen($new_text) - strlen($old_text));
    }
  }
}

This is at least approximately resolved by D21562. Hopefully that'll stick if I don't kick it anymore. I may or may not land D21561, depending on whether I ultimately use it in T13586.

Perhaps too little too late: changing to ((?<!\w)\w{3,}://) also resolves the issue for me. Limiting the protocol to 32 characters seems sane either way.

Ah! I like that fix, and it also works for me locally. I think the protocol-length-limit is reasonable to retain on the basis of general sanity, but I'll add the negative lookbehind to further encourage PCRE to run in something resembling O(N) time.