[Wilds] Fix phutil_is_utf8_slowly() to reject reserved UTF16 surrogate character ranges
ClosedPublic
Actions

Authored by epriestley on Oct 2 2018, 5:51 PM.

Details

Reviewers

amckinley

Maniphest Tasks

T13209: How To Properly Escape Commands on Windows (A Dark Tragedy)

Commits

rARCb1921850452c: [Wilds] Fix phutil_is_utf8_slowly() to reject reserved UTF16 surrogate…

Summary

Ref T13209. See T11525. We want to reject certain 3-byte characters as "invalid" unicode, primarily because json_decode() does not accept them.

We currently reject them correctly if we go down the fast path in phutil_is_utf8() via mb_check_encoding(), but incorrectly accept them if we go down the slow path.

Add test coverage that the slow path has the same behavior as the fast path, and then make the slow path reject these byte sequences.

Test Plan

Added failing tests.
Made them pass on OSX and Windows 10.

Diff Detail

Repository

rARC Arcanist

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

epriestley requested review of this revision.Oct 2 2018, 5:51 PM

epriestley created this revision.

Harbormaster failed remote builds in B20975: Diff 47132!Oct 2 2018, 5:51 PM

With context.

epriestley marked an inline comment as done.Oct 2 2018, 5:52 PM

epriestley added inline comments.

src/utils/utf8.php
46–60	The new logic is just a version of this existing logic for nuking these sequences, which is where all the magic numbers are coming from.

epriestley mentioned this in D19725: [Wilds] Sanitize UTF8 output in `tsprintf(...)` under Windows.Oct 2 2018, 5:57 PM

noidea

src/utils/utf8.php
120	Does this function exist because it will always be available, unlike the other mechanisms? Or is this used for stuff that requires constant-time execution like password hash comparisons?

This revision is now accepted and ready to land.Oct 2 2018, 7:00 PM

epriestley marked an inline comment as done.Oct 2 2018, 7:03 PM

epriestley added inline comments.

src/utils/utf8.php
97–100	It's an "always available" thing mostly -- if we have the `mb` (multibyte) extension we'll use that instead in most cases. I think there's also some callsite where we use the `$only_bmp` mode of `phutil_is_utf8_slowly()` even if we have `mb`, but that might have been removed when we upgraded the database from `utf8` columns to `utf8mb4` columns -- before, we couldn't put non-BMP unicode into the database even though we could process it in PHP.