
Sanitize UTF8 more aggressively to satisfy json_encode()
ClosedPublic

Authored by epriestley on Aug 24 2016, 3:53 PM.
Tags
None
Subscribers
None

Details

Summary

Fixes T11525. Currently, there are some strings such that:

json_encode(phutil_utf8ize($string));

...fails. I encountered this with DarkConsole trying to JSON encode queries that inserted encrypted file data into the MySQL blob store, so basically random data.

There appear to be two cases we aren't handling well:

  • Overlong representations: A character that fits in fewer bytes can be written, invalidly, using more bytes. We previously allowed these -- sometimes -- but json_encode() does not. Instead, reject them. We already rejected overlong 2-byte sequences.
  • Surrogate characters: The range U+D800 through U+DFFF is reserved for surrogates in UTF-16, and json_encode() rejects these code points when they appear in UTF-8. Just reject these ourselves, too.
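
The two cases can be illustrated at the byte level. This is a minimal sketch, not the code from this diff: sketch_classify_3byte() is a hypothetical helper that decodes a single 3-byte sequence and reports whether it is overlong or a surrogate. The sample byte strings are chosen purely for illustration.

```php
<?php

// Hypothetical helper for illustration only; not part of libphutil.
// Decodes one 3-byte UTF-8 sequence and reports why it would be rejected.
function sketch_classify_3byte($bytes) {
  if (strlen($bytes) !== 3) {
    return 'malformed';
  }

  $b0 = ord($bytes[0]);
  $b1 = ord($bytes[1]);
  $b2 = ord($bytes[2]);

  // A 3-byte sequence has the bit layout 1110xxxx 10xxxxxx 10xxxxxx.
  if (($b0 & 0xF0) !== 0xE0 ||
      ($b1 & 0xC0) !== 0x80 ||
      ($b2 & 0xC0) !== 0x80) {
    return 'malformed';
  }

  $codepoint = (($b0 & 0x0F) << 12) | (($b1 & 0x3F) << 6) | ($b2 & 0x3F);

  // Anything below U+0800 fits in one or two bytes, so this form is overlong.
  if ($codepoint < 0x800) {
    return 'overlong';
  }

  // U+D800 through U+DFFF are reserved for UTF-16 surrogate pairs.
  if ($codepoint >= 0xD800 && $codepoint <= 0xDFFF) {
    return 'surrogate';
  }

  return 'ok';
}

// Sample inputs (assumed for illustration).
$samples = array(
  'overlong "/"' => "\xE0\x80\xAF",
  'surrogate U+D800' => "\xED\xA0\x80",
  'U+30C4' => "\xE3\x83\x84",
);

foreach ($samples as $label => $bytes) {
  printf("%s: %s\n", $label, sketch_classify_3byte($bytes));
}
```

Per the summary above, the first two samples are the kind of input json_encode() refuses. The same decode-then-range-check idea extends to 4-byte sequences, where anything below U+10000 would be overlong.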

Test Plan

Wrote a bunch of test cases to cover this stuff, all of which now pass.

Fuzzed json_encode(phutil_utf8ize($string)) on random strings in a loop. Before these changes it would fail after a handful of attempts, in less than a second. After these changes, I ran it for several minutes and didn't see any failures.
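
A fuzz loop of this kind might look roughly like the sketch below. It is not the script used for this revision. Because phutil_utf8ize() lives in libphutil, the sketch substitutes a hypothetical stand-in, sketch_utf8ize(), which replaces every byte outside a well-formed UTF-8 sequence with U+FFFD. The iteration count and string lengths are arbitrary assumptions, and random_bytes() requires PHP 7 or newer.

```php
<?php

// Stand-in sanitizer for illustration only. This is NOT phutil_utf8ize().
// It keeps well-formed UTF-8 sequences (byte ranges per the Unicode
// standard, which exclude overlong forms and surrogates) and replaces
// every other byte with U+FFFD.
function sketch_utf8ize($string) {
  $well_formed =
    '[\x00-\x7F]'.
    '|[\xC2-\xDF][\x80-\xBF]'.
    '|\xE0[\xA0-\xBF][\x80-\xBF]'.
    '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'.
    '|\xED[\x80-\x9F][\x80-\xBF]'.
    '|\xF0[\x90-\xBF][\x80-\xBF]{2}'.
    '|[\xF1-\xF3][\x80-\xBF]{3}'.
    '|\xF4[\x80-\x8F][\x80-\xBF]{2}';

  return preg_replace_callback(
    '/('.$well_formed.')|./s',
    function ($matches) {
      // Group 1 is only present when a well-formed sequence matched.
      if (isset($matches[1]) && $matches[1] !== '') {
        return $matches[1];
      }
      return "\xEF\xBF\xBD";
    },
    $string);
}

// Encode sanitized random strings and count how often json_encode() fails.
function sketch_fuzz($iterations) {
  $failures = 0;
  for ($i = 0; $i < $iterations; $i++) {
    $string = random_bytes(mt_rand(1, 64));
    if (json_encode(sketch_utf8ize($string)) === false) {
      $failures++;
    }
  }
  return $failures;
}

printf("%d failures\n", sketch_fuzz(10000));
```

Swapping sketch_utf8ize() for the real phutil_utf8ize() would exercise the actual change. A nonzero failure count means some sanitized string still fails to encode.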

Diff Detail

Repository
rPHU libphutil
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision to Sanitize UTF8 more aggressively to satisfy json_encode().
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
chad edited edge metadata.
chad added inline comments.
src/utils/utf8.php
446–461

¯\_(ツ)_/¯

This revision is now accepted and ready to land. Aug 24 2016, 4:12 PM
This revision was automatically updated to reflect the committed changes.
src/utils/utf8.php
446–461

look how nicely formatted it is

it must be right!