Page MenuHomePhabricator

[Wilds] Fix phutil_is_utf8_slowly() to reject reserved UTF16 surrogate character ranges
ClosedPublic

Authored by epriestley on Oct 2 2018, 5:51 PM.

Details

Summary

Ref T13209. See T11525. We want to reject certain 3-byte characters as "invalid" unicode, primarily because json_decode() does not accept them.

We currently reject them correctly if we go down the fast path in phutil_is_utf8() via mb_check_encoding(), but incorrectly accept them if we go down the slow path.

Add test coverage that the slow path has the same behavior as the fast path, and then make the slow path reject these byte sequences.

Test Plan
  • Added failing tests.
  • Made them pass on OSX and Windows 10.

Diff Detail

Repository
rARC Arcanist
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

epriestley requested review of this revision.Oct 2 2018, 5:51 PM
epriestley created this revision.
epriestley updated this revision to Diff 47133.Oct 2 2018, 5:51 PM
  • With context.
epriestley marked an inline comment as done.Oct 2 2018, 5:52 PM
epriestley added inline comments.
src/utils/utf8.php
46–60

The new logic is just a version of this existing logic for nuking these sequences, which is where all the magic numbers are coming from.

amckinley accepted this revision.Oct 2 2018, 7:00 PM

noidea

src/utils/utf8.php
120

Does this function exist because it will always be available, unlike the other mechanisms? Or is this used for stuff that requires constant-time execution like password hash comparisons?

This revision is now accepted and ready to land.Oct 2 2018, 7:00 PM
epriestley marked an inline comment as done.Oct 2 2018, 7:03 PM
epriestley added inline comments.
src/utils/utf8.php
97–100

It's an "always available" thing mostly -- if we have the mb (multibyte) extension we'll use that instead in most cases.

I think there's also some callsite where we use the $only_bmp mode of phutil_is_utf8_slowly() even if we have mb, but that might have been removed when we upgraded the database from utf8 columns to utf8mb4 columns -- before, we couldn't put non-BMP unicode into the database even though we could process it in PHP.