Fixes T9732. We currently tokenize strings (like user realnames) in the default non-unicode mode, which can cause patterns like \s to work incorrectly.
Use /u to use unicode-aware tokenization instead.
Differential D14441
Use unicode mode when tokenizing strings like user realnames
Authored by epriestley on Nov 8 2015, 1:49 PM.
Details
The behavior of "\s" depends on environmental settings like LC_ALL. With LC_ALL set to "C", the byte \xA0 is not considered a whitespace character, so the string is not split. With LC_ALL set to "en_US", that byte is treated as whitespace, so "\xE5\xBF\xA0" (the UTF-8 encoding of "忠") is split at its final byte:

  $ php -r 'setlocale(LC_ALL, "C"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
  1
  $ php -r 'setlocale(LC_ALL, "en_US"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
  2

To reproduce the original issue, I added an explicit setlocale(LC_ALL, "en_US"); call before the preg_split() call. This caused "忠" to be improperly split. I then added "/u" and observed proper tokenization.
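As a rough illustration of the "/u" change, here is a minimal sketch of Unicode-aware tokenization. The helper name example_tokenize() is hypothetical and is not the function modified in this revision. The fallback to an empty array on failure is also an assumption made for the example, not something the revision states.

```php
<?php

// Hypothetical helper for illustration only; not the code changed in D14441.
function example_tokenize($value) {
  // With /u, PCRE reads the pattern and subject as UTF-8, so \s is tested
  // against whole characters instead of individual bytes.
  $tokens = preg_split('/\s+/u', $value, -1, PREG_SPLIT_NO_EMPTY);

  // preg_split() returns false on failure, for example when the subject is
  // not valid UTF-8 under /u. Treating that as "no tokens" is an assumption.
  if ($tokens === false) {
    return array();
  }

  return $tokens;
}

// "\xE5\xBF\xA0" is the UTF-8 encoding of "忠".
$tokens = example_tokenize("\xE5\xBF\xA0 Example");
echo count($tokens)."\n"; // prints 2
```

Because the subject is decoded as UTF-8 before matching, the trailing \xA0 byte of "忠" is never considered on its own, so the character stays in one token.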