Page MenuHomePhabricator

[Wilds] Fix phutil_is_utf8_slowly() to reject reserved UTF16 surrogate character ranges
ClosedPublic

Authored by epriestley on Oct 2 2018, 5:51 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Nov 22, 4:12 PM
Unknown Object (File)
Fri, Nov 22, 10:46 AM
Unknown Object (File)
Thu, Nov 21, 8:43 AM
Unknown Object (File)
Sun, Nov 17, 4:46 AM
Unknown Object (File)
Oct 23 2024, 2:41 AM
Unknown Object (File)
Oct 19 2024, 12:28 PM
Unknown Object (File)
Oct 19 2024, 12:28 PM
Unknown Object (File)
Oct 17 2024, 3:44 AM
Subscribers
None

Details

Summary

Ref T13209. See T11525. We want to reject certain 3-byte characters as "invalid" unicode, primarily because json_decode() does not accept them.

We currently reject them correctly if we go down the fast path in phutil_is_utf8() via mb_check_encoding(), but incorrectly accept them if we go down the slow path.

Add test coverage that the slow path has the same behavior as the fast path, and then make the slow path reject these byte sequences.

Test Plan
  • Added failing tests.
  • Made them pass on OSX and Windows 10.

Diff Detail

Repository
rARC Arcanist
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley created this revision.
epriestley added inline comments.
src/utils/utf8.php
46–60

The new logic is just a version of this existing logic for nuking these sequences, which is where all the magic numbers are coming from.

noidea

src/utils/utf8.php
120

Does this function exist because it will always be available, unlike the other mechanisms? Or is this used for stuff that requires constant-time execution like password hash comparisons?

This revision is now accepted and ready to land.Oct 2 2018, 7:00 PM
epriestley added inline comments.
src/utils/utf8.php
97–100

It's an "always available" thing mostly -- if we have the mb (multibyte) extension we'll use that instead in most cases.

I think there's also some callsite where we use the $only_bmp mode of phutil_is_utf8_slowly() even if we have mb, but that might have been removed when we upgraded the database from utf8 columns to utf8mb4 columns -- before, we couldn't put non-BMP unicode into the database even though we could process it in PHP.