Can't register new account when "Real Name" contains some special unicode character
Closed, Resolved (Public)

Assigned To

Authored By

	qiu8310
	Nov 8 2015, 5:36 AM

Description

[Screenshot: phab-error-1.min.png (728×978 px, 39 KB)]

I traced through the source code and found that UserName and RealName are tokenized; the tokenizing code is in applications/typeahead/datasource/PhabricatorTypeaheadDatasource.php$110.

The problem is that the preg regexp "/\s+/" splits the single Unicode character "忠" into two pieces.

I created a gist explaining why the Unicode character "忠" gets split.

Is there a PHP setting that stops "/\s/" from matching code points in the range 128-255? If not, I think "\s" should be replaced with "[\t\n\f\r ]".
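
A minimal sketch of the proposed replacement, written in Python rather than PHP so that the result does not depend on the server's locale. It splits UTF-8 bytes on the explicit class [\t\n\f\r ] only, so no byte of a multi-byte character can act as a separator. The names ASCII_WS and tokenize and the sample name are hypothetical; this is not Phabricator's tokenizer.

```python
import re
from typing import List

# Explicit ASCII whitespace class from the report: tab, LF, FF, CR, space.
# Bytes >= 0x80 can never match, whatever the locale is.
ASCII_WS = re.compile(rb"[\t\n\f\r ]+")


def tokenize(raw: bytes) -> List[bytes]:
    """Split UTF-8 bytes on ASCII whitespace only, dropping empty tokens."""
    return [token for token in ASCII_WS.split(raw) if token]


# Hypothetical real name mixing ASCII words and the character from the report.
tokens = tokenize("Qiu 忠\tZhong".encode("utf-8"))
decoded = [token.decode("utf-8") for token in tokens]
print(decoded)
```

Every token still decodes as valid UTF-8, which is the property the locale-dependent split breaks.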

Revisions and Commits

rP Phabricator
	D14441	rP152ddf57092e Use unicode mode when tokenizing strings like user realnames

Related Objects

Mentioned Here: T7339: Raise a setup warning when the "en_US.UTF-8" locale is unavailable

Event Timeline

qiu8310 created this task. Nov 8 2015, 5:36 AM

qiu8310 updated the task description.

qiu8310 added a subscriber: qiu8310.

I can't reproduce this: https://secure.phabricator.com/p/foo_bar/

I registered an account on secure.phabricator.com and that also worked, so I can't reproduce it there.

I think it is a PHP configuration problem; see the split result on my computer.

As the PHP manual says:

The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32).
However, if locale-specific matching is happening, characters with code points in
the range 128-255 may also be considered as whitespace characters,
for instance, NBSP (A0).

In PHP:

"忠" === "\xE5\xBF\xA0"

Because "\xA0" is treated as whitespace, "忠" gets split.
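
The byte-level reasoning above can be checked with a short Python sketch. It is an illustration only: Python's re module is not PHP's PCRE, and the pattern LOCALE_WS is a hypothetical stand-in for a locale-aware "\s" that also treats byte 0xA0 as whitespace.

```python
import re

name = "忠"
raw = name.encode("utf-8")

# The character is three bytes in UTF-8, and the last one is 0xA0.
assert raw == b"\xe5\xbf\xa0"

# Assumption: a byte-mode whitespace class that, like a locale-aware \s,
# also accepts 0xA0 (NBSP in Latin-1 style locales).
LOCALE_WS = re.compile(rb"[\s\xa0]")
byte_parts = LOCALE_WS.split(raw)

# Matching on decoded text instead: U+5FE0 is not whitespace, so no split.
text_parts = re.split(r"\s", name)

print(len(byte_parts))  # 2: the trailing 0xA0 byte acted as a separator
print(len(text_parts))  # 1: the character stays whole
```

The first piece of the byte-level split, b"\xe5\xbf", is no longer valid UTF-8, which is why a name containing this character fails downstream.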

We need full reproduction steps. If it's a config issue, we need to know what so we can detect it and have you correct it. https://secure.phabricator.com/book/phabcontrib/article/bug_reports/

Unfortunately, I haven't been able to reproduce any issue on this server or on my local dev machines.

Maybe you can try this to check whether it is a PHP configuration issue.

Run this code on your server:

php -r 'echo count(preg_split("/\s/", "忠")) . "\n";'

// If the result is 1, your setup does not reproduce the issue.
// If the result is 2, your setup reproduces it.

Back to the PHP manual:

The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32).
However, if locale-specific matching is happening, characters with code points in
the range 128-255 may also be considered as whitespace characters,
for instance, NBSP (A0).

If you can't reproduce the issue, that probably means "locale-specific matching is not happening" on your machine.

On my machine, the command above prints 2, which means "locale-specific matching is happening" here.

But I don't know how to disable this so-called "locale-specific matching".

I found something similar here.

You still haven't told us anything we can use to reproduce the issue. Some things might be useful like:

PHP Version
Server OS
Phabricator Version
Server Locale

Sorry

PHP Version:

PHP 5.5.27 (cli) (built: Jul 23 2015 00:21:59)
Copyright (c) 1997-2015 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2015 Zend Technologies

Server OS

OS X Yosemite v10.10.5

Phabricator Version (result of Config -> Versions; all sources are up to date and the git branch is master)

Current Versions	
Phabricator Version Unknown 
Arcanist Version Unknown 
libphutil Version Unknown

Server Locale

zh_CN.UTF-8

Did you install php via homebrew? If so can you update to 5.6 or later?

No, I use xampp.

I tried it on Ubuntu 14 and CentOS 6.5; both work fine.

mora@mora:~$ php -v
PHP 5.5.9-1ubuntu4 (cli) (built: Apr  9 2014 17:11:57)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies

mora@mora:~$ php -r 'echo count(preg_split("/\s/u", "忠")) . "\n";'
1

mora@mora:~$ echo $LANG
en_US.UTF-8

[mora@ceph ~]$ php -v
PHP 5.3.3 (cli) (built: Jul  9 2015 17:39:00)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies

[mora@ceph ~]$ php -r 'echo count(preg_split("/\s/", "忠")) . "\n";'
1

[mora@ceph ~]$ echo $LANG
en_US.UTF-8

This might be T7339; you can try some of the advice there and see whether it helps.

epriestley claimed this task. Nov 8 2015, 1:32 PM

epriestley triaged this task as Normal priority.

epriestley added a revision: D14441: Use unicode mode when tokenizing strings like user realnames. Nov 8 2015, 1:49 PM

epriestley closed this task as Resolved by committing rP152ddf57092e: Use unicode mode when tokenizing strings like user realnames. Nov 8 2015, 3:03 PM

epriestley added a commit: rP152ddf57092e: Use unicode mode when tokenizing strings like user realnames.
