Page MenuHomePhabricator

Prevent backtracking on long JSON strings with escape codes
ClosedPublic

Authored by alexmv on Feb 16 2017, 2:15 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Dec 16, 5:12 PM
Unknown Object (File)
Fri, Dec 13, 10:53 AM
Unknown Object (File)
Thu, Dec 12, 1:25 PM
Unknown Object (File)
Dec 3 2024, 11:42 PM
Unknown Object (File)
Dec 3 2024, 11:42 PM
Unknown Object (File)
Dec 3 2024, 11:42 PM
Unknown Object (File)
Dec 3 2024, 11:21 PM
Unknown Object (File)
Dec 3 2024, 8:42 PM
Subscribers
Tokens
"Evil Spooky Haunted Tree" token, awarded by epriestley.

Details

Summary

The JSONLintLexer class uses a series of regular expression rules to verify that
lexed JSON tokens are formatted correctly. The string check attempts to express
that strings are well-formed by matching:

  • Start of token
  • A double-quote
  • A repetition of zero or more of:
    • A two-character escape sequence (i.e. \n)
    • A six-character unicode codepoint (i.e. \u1234)
    • Any number of printable characters that are not \ or ".
  • A closing quote

Key to this regex is that the three alternatives that make up the core of the
string are non-overlapping sets. The PCRE engine which backs PHP, however, does
not note this, and assumes that it must thus mark each run though this
alternation as a possible point to backtrack to. To do so, internally it uses a
stack frame. Long strings can thus cause the process to run out of stack space;
this can be replicated via:

preg_match('{^(?:x|y)*}', str_repeat("x", 10000));

PCRE does not consider this type of stack exhaustion a bug -- see
https://bugs.exim.org/show_bug.cgi?id=979

Specifically, long JSON strings with many escape codes (each of which
necessitates a stack frame) can trigger this stack exhaustion in the PCRE
engine, and thus segfault the PHP process. This failure mode does not trigger
on PHP 7.0 and up, which default to using PCRE's JIT mode.

Force the PCRE engine to realize that the alternation cases when parsing the
JSON string are non-overlapping, and that backtracking across them is never an
option, by using the non-backtracking (?>x|y) capture form, along with the
non-backtracking repetition symbols ++ and *+. These allow PCRE to use a
tail-recursive form, thus preventing stack exhaustion.

Test Plan

In a repository configured to lint JSON files:

perl -le 'print q|{"k":"|.(q|a\\n|x10000).q|"}|' > test.json
arc lint test.json

...no longer segfaults.

Diff Detail

Repository
rPHU libphutil
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

This could also probably go upstream (https://github.com/Seldaek/jsonlint) if you're feeling ambitious.

externals/jsonlint/src/Seld/JsonLint/Lexer.php
23
[^\0-\x09\x0a-\x1f\\\\"]

Am I reading this wrong, or is \0-\x09\x0a-\x1f a complicated way to write \x00-\x1f? Not necessarily worth fixing.

This revision is now accepted and ready to land.Feb 16 2017, 2:30 AM
alexmv edited the test plan for this revision. (Show Details)

Non-backtracking forms of the two repetition characters.

Yup, will try to get this upstreamed.

externals/jsonlint/src/Seld/JsonLint/Lexer.php
23
This revision was automatically updated to reflect the committed changes.