
Use unicode mode when tokenizing strings like user realnames
ClosedPublic

Authored by epriestley on Nov 8 2015, 1:49 PM.
Tags
None

Details

Summary

Fixes T9732. We currently tokenize strings (like user realnames) in the default non-Unicode (byte-oriented) mode, in which patterns like \s can match individual bytes inside a multibyte UTF-8 character and split it incorrectly.

Use the /u modifier to get Unicode-aware tokenization instead.
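
As a minimal sketch of what Unicode-aware tokenization looks like, the snippet below splits a name on whitespace with the /u modifier. The function name tokenize_name() and the pattern are assumptions made for illustration; they are not taken from the Phabricator source.

```php
<?php
// Hypothetical helper for illustration; not Phabricator's actual tokenizer.
function tokenize_name(string $name): array {
  // The /u modifier makes PCRE treat the pattern and subject as UTF-8, so
  // multibyte characters are matched as whole code points, not as bytes.
  $tokens = preg_split('/\s+/u', $name, -1, PREG_SPLIT_NO_EMPTY);

  // preg_split() returns false on failure, for example on invalid UTF-8.
  return ($tokens === false) ? array() : $tokens;
}

// "\xE5\xBF\xA0" is the UTF-8 encoding of a single character, "忠".
$tokens = tokenize_name("\xE5\xBF\xA0 Smith");
var_dump(count($tokens)); // int(2)
```

With /u the three bytes of the first character stay together in one token, regardless of the active locale.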

Test Plan

The behavior of "\s" depends upon environmental settings like LC_ALL.

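With LC_ALL set to "C", the byte \xA0 is not considered a whitespace character.
With LC_ALL set to "en_US", it is. The test string "\xE5\xBF\xA0" is the UTF-8 encoding of "忠", so a match on its final byte splits the character in two: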

$ php -r 'setlocale(LC_ALL, "C"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
1
$ php -r 'setlocale(LC_ALL, "en_US"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
2

To reproduce the original issue, I added an explicit:

setlocale(LC_ALL, "en_US");

...call before the preg_split() call. This caused "忠" to be improperly split.

I then added "/u", and observed proper tokenization.
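
The comparison above can be sketched as a small standalone script that runs the same split with and without /u under each locale. This is an illustration only, not part of the committed change; the variable names are hypothetical, and the byte-mode result depends on which locales are installed on the machine, so no output is shown for it.

```php
<?php
// "\xE5\xBF\xA0" is the UTF-8 encoding of "忠"; its last byte is \xA0.
$subject = "\xE5\xBF\xA0";
$unicode_counts = array();

foreach (array('C', 'en_US') as $locale) {
  // setlocale() returns false when the requested locale is not installed.
  if (setlocale(LC_ALL, $locale) === false) {
    echo "{$locale}: locale not available, skipping\n";
    continue;
  }

  // Byte mode: whether \xA0 counts as whitespace depends on the locale.
  $byte_mode = count(preg_split('/\s/', $subject));

  // Unicode mode: the subject is decoded as one code point (U+5FE0).
  $unicode_mode = count(preg_split('/\s/u', $subject));

  $unicode_counts[$locale] = $unicode_mode;
  echo "{$locale}: without /u = {$byte_mode}, with /u = {$unicode_mode}\n";
}
```

In every locale the script manages to set, the /u count should be 1, since the subject is a single non-whitespace character.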

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision to Use unicode mode when tokenizing strings like user realnames.
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
chad edited edge metadata.
This revision is now accepted and ready to land. Nov 8 2015, 2:58 PM
This revision was automatically updated to reflect the committed changes.