
Use unicode mode when tokenizing strings like user realnames
ClosedPublic

Authored by epriestley on Nov 8 2015, 1:49 PM.
Tags
None
Referenced Files
F13137263: D14441.diff
Thu, May 2, 6:04 PM

Details

Summary

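Fixes T9732. We currently tokenize strings (like user realnames) in the default non-unicode mode, where patterns like \s are matched against individual bytes. Depending on the locale, a byte inside a multi-byte UTF-8 character can be treated as whitespace, which splits the character.

Use the /u modifier to get unicode-aware tokenization instead.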

Test Plan

The behavior of "\s" depends upon environmental settings like LC_ALL.

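With LC_ALL set to "C", the byte \xA0 is not considered a whitespace character.
With LC_ALL set to "en_US", it is. Because \xA0 is also the final byte of the UTF-8 encoding of "忠" (\xE5\xBF\xA0), the number of pieces returned by preg_split() changes: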

$ php -r 'setlocale(LC_ALL, "C"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
1
$ php -r 'setlocale(LC_ALL, "en_US"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
2

To reproduce the original issue, I added an explicit:

setlocale(LC_ALL, "en_US");

...call before the preg_split() call. This caused "忠" to be improperly split.

I then added "/u", and observed proper tokenization.
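
As a rough illustration of the fixed behavior, here is a minimal sketch of unicode-aware tokenization. The helper name tokenize_name() is hypothetical and is not the function changed in this revision; the fallback to an empty array on invalid UTF-8 is also an assumption made for the example.

```php
<?php

// Hypothetical helper for illustration only; not the code from this diff.
function tokenize_name($name) {
  // With /u, the subject is decoded as UTF-8 before matching, so the
  // trailing \xA0 byte of a multi-byte character is never tested against
  // \s on its own, regardless of the current locale.
  $tokens = preg_split('/\s+/u', $name, -1, PREG_SPLIT_NO_EMPTY);

  // With /u, preg_split() returns false if the subject is not valid UTF-8.
  // Returning an empty list here is an assumption made for this sketch.
  if ($tokens === false) {
    return array();
  }

  return $tokens;
}

// "忠" is encoded as the bytes E5 BF A0 in UTF-8.
$tokens = tokenize_name("\xE5\xBF\xA0 Smith");
echo count($tokens)."\n"; // prints 2
```

Without the /u modifier, the same input could produce a different token count depending on LC_ALL, as shown in the commands above.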

Diff Detail

Repository
rP Phabricator
Branch
split1
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 8703
Build 10103: Run Core Tests
Build 10102: arc lint + arc unit

Event Timeline

epriestley retitled this revision to Use unicode mode when tokenizing strings like user realnames.
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
chad edited edge metadata.
This revision is now accepted and ready to land. Nov 8 2015, 2:58 PM
This revision was automatically updated to reflect the committed changes.