The JSONLintLexer class uses a series of regular expression rules to verify that
lexed JSON tokens are formatted correctly. The string check attempts to express
that strings are well-formed by matching:
- Start of token
- A double-quote
- A repetition of zero or more of:
- A two-character escape sequence (i.e. \n)
- A six-character unicode codepoint (i.e. \u1234)
- Any number of printable characters that are not \ or ".
- A closing quote
Key to this regex is that the three alternatives that make up the core of the
string are non-overlapping sets. The PCRE engine which backs PHP, however, does
not note this, and assumes that it must thus mark each run though this
alternation as a possible point to backtrack to. To do so, internally it uses a
stack frame. Long strings can thus cause the process to run out of stack space;
this can be replicated via:
preg_match('{^(?:x|y)*}', str_repeat("x", 10000));
PCRE does not consider this type of stack exhaustion a bug -- see
https://bugs.exim.org/show_bug.cgi?id=979
Specifically, long JSON strings with many escape codes (each of which
necessitates a stack frame) can trigger this stack exhaustion in the PCRE
engine, and thus segfault the PHP process. This failure mode does not trigger
on PHP 7.0 and up, which default to using PCRE's JIT mode.
Force the PCRE engine to realize that the alternation cases when parsing the
JSON string are non-overlapping, and that backtracking across them is never an
option, by using the non-backtracking (?>x|y) capture form, along with the
non-backtracking repetition symbols ++ and *+. These allow PCRE to use a
tail-recursive form, thus preventing stack exhaustion.