Differential D14441

Use unicode mode when tokenizing strings like user realnames
Closed, Public

Authored by epriestley on Nov 8 2015, 1:49 PM.

Details

Reviewers

chad

Maniphest Tasks

T9732: Can't register new account when "Real Name" contains some special unicode character

Commits

Restricted Diffusion Commit
rP152ddf57092e: Use unicode mode when tokenizing strings like user realnames

Summary

Fixes T9732. We currently tokenize strings (like user realnames) in the default non-unicode mode, which can cause patterns like \s to work incorrectly.

Use /u to use unicode-aware tokenization instead.

Test Plan

The behavior of "\s" depends upon environmental settings like LC_ALL.

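With LC_ALL set to "C", the byte \xA0 is not considered a whitespace character.
With LC_ALL set to "en_US", it is. The UTF-8 encoding of "忠" is \xE5\xBF\xA0, so in non-unicode mode its final byte can be matched by \s: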

$ php -r 'setlocale(LC_ALL, "C"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
1
$ php -r 'setlocale(LC_ALL, "en_US"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
2

To reproduce the original issue, I added an explicit:

setlocale(LC_ALL, "en_US");

...call before the preg_split() call. This caused "忠" to be improperly split.

I then added "/u", and observed proper tokenization.
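As an illustration of the fix described above, here is a minimal sketch of unicode-aware tokenization. The helper name tokenize_name() and its exact pattern are hypothetical, not the code changed in this revision; the sketch only assumes that the tokenizer splits on whitespace with preg_split().

```php
<?php

// Hypothetical helper for illustration; not the actual code in
// PhabricatorTypeaheadDatasource.php.
function tokenize_name(string $name): array {
  // The "u" modifier makes PCRE treat the pattern and the subject as UTF-8,
  // so \s is tested against whole characters instead of individual bytes.
  $tokens = preg_split('/\s+/u', $name, -1, PREG_SPLIT_NO_EMPTY);

  // With "u", preg_split() returns false if the subject is not valid UTF-8.
  if ($tokens === false) {
    return array();
  }

  return $tokens;
}

// "\xE5\xBF\xA0" is the UTF-8 encoding of "忠". It stays in one piece, so
// the name splits into two tokens: "忠" and "Smith".
$tokens = tokenize_name("\xE5\xBF\xA0 Smith");
echo count($tokens) . "\n";
```

Because the subject is decoded as UTF-8 before matching, the result should not depend on LC_ALL. The trade-off is that callers need to handle the false return value for input that is not valid UTF-8.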

Diff Detail

Repository

rP Phabricator

Branch

split1

Lint

Lint Passed

Unit

Tests Passed

Build Status

Buildable 8703
Build 10103: Run Core Tests
Build 10102: arc lint + arc unit

Event Timeline

epriestley updated this revision to Diff 34894. (Nov 8 2015, 1:49 PM)

epriestley retitled this revision to Use unicode mode when tokenizing strings like user realnames.

epriestley updated this object.

epriestley edited the test plan for this revision.

epriestley added a reviewer: chad.

epriestley added a task: T9732: Can't register new account when "Real Name" contains some special unicode character.

Perfect!

Thanks, @epriestley @chad.

chad accepted this revision. (Nov 8 2015, 2:58 PM)

chad edited edge metadata.

This revision is now accepted and ready to land. (Nov 8 2015, 2:58 PM)

Closed by commit rP152ddf57092e: Use unicode mode when tokenizing strings like user realnames (authored by epriestley, committed by epriestley). (Nov 8 2015, 3:03 PM)

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

Path: src/applications/typeahead/datasource/PhabricatorTypeaheadDatasource.php
Size: 2 lines

Diff 34894

src/applications/typeahead/datasource/PhabricatorTypeaheadDatasource.php