
Support non-Latin scripts in usernames
Closed, Wontfix (Public)

Description

Currently Phabricator only supports Latin characters in usernames. Wikimedia has thousands of users whose usernames use non-Latin scripts. It would be nice to give them the chance to keep the same usernames in Wikimedia Phabricator.

See MediaWiki/Wikimedia restrictions for usernames.

Event Timeline

qgil created this task. Aug 29 2014, 8:51 PM
qgil updated the task description. (Show Details)
qgil raised the priority of this task from to Needs Triage.
qgil added projects: People, Wikimedia.
qgil moved this task from Backlog to Potential blockers on the Wikimedia board.
qgil moved this task from Potential blockers to Important on the Wikimedia board.
qgil added a subscriber: qgil.

We are unlikely to support this:

  • Many Phabricator interfaces (on the web, but particularly in the CLI) require typing usernames, but most users cannot type usernames like "☃" or "睅喊嵅" on their keyboards. I'm not sure how MediaWiki deals with this. Does it just not require as much arbitrary selection of users? These are even harder to type on a phone, generally, if the script isn't one you use regularly.
  • Some characters are problematic in URLs, although this is surmountable with blacklisting.
  • Some other Latin characters are likely to be things we want to use in the future as syntax, similar to how we recently added #projects. This would have been more difficult if # was permitted in usernames.
  • We do not have the infrastructure to canonicalize codepoint sequences, so b o <combining umlaut> b and b ö b would not be considered the same username, even though they are visually and semantically identical. This is likely surmountable, but lots of work. (However, I'm not sure if there's an agreed-upon or standard way to canonicalize combining marks.)
  • There are a bunch of ways users can pick unusable names, like <zero width space> and <right-to-left mark>. This is surmountable, but requires blacklisting a bunch of characters.
  • There are a bunch of ways users can pick disruptive names, like X̀́̂̃̄̅̆̇̈̉̊̋̌̍̎̏. This can be partially mitigated by limiting the number of permissible successive combining characters, requiring usernames to begin with a non-combining character, etc.
  • There are a bunch of ways users can pick misleading names, like epriest <capital greek letter Iota> ey. This allows users to impersonate administrators, etc. This is partially surmountable with a huge set of normalization rules like those used by the MediaWiki "Antispoof" extension.
  • We would potentially need to rewrite some parser rules to deal with users putting, e.g., <full width full stop> in their usernames and no longer matching \w+ or \b in regular expressions.
  • Some of the JS parser stuff might have issues, too.
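
To make the "unusable" and "disruptive" cases above concrete, here is a rough Python sketch of the kind of checks they imply. It is not Phabricator code: the function name, the problem labels, and the limit of two successive combining marks are all assumptions chosen for illustration, and it does not attempt the lookalike (spoofing) problem at all.

```python
import unicodedata

# Hypothetical limit for illustration; not a Phabricator setting.
MAX_COMBINING_RUN = 2

# Unicode general categories for format, control, and separator characters.
# Zero width space and right-to-left mark are both in "Cf".
INVISIBLE_CATEGORIES = {"Cf", "Cc", "Zs", "Zl", "Zp"}


def username_problems(name: str) -> list:
    """Return a list of reasons a candidate username looks unusable."""
    if not name:
        return ["empty"]

    problems = []

    # Invisible or whitespace-like characters make a name impossible to read.
    if any(unicodedata.category(ch) in INVISIBLE_CATEGORIES for ch in name):
        problems.append("invisible-or-space")

    # A name should start with a base character, not a combining mark.
    if unicodedata.category(name[0]).startswith("M"):
        problems.append("leading-combining-mark")

    # Long runs of combining marks stack up and spill over nearby text.
    run = 0
    for ch in name:
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > MAX_COMBINING_RUN:
                problems.append("combining-run")
                break
        else:
            run = 0

    return problems
```

A name such as "bo\u0308b" (b, o, combining diaeresis, b) passes these checks, which is why canonicalization remains a separate problem.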

It looks like Wikipedia, in addition to having tons of code in MW to automatically block these cases, also combines this with manual enforcement and blocks users like http://en.wikipedia.org/wiki/User:☃ and http://en.wikipedia.org/wiki/User:☭ (although http://en.wikipedia.org/wiki/User:✈ is a working redirect and http://en.wikipedia.org/wiki/User:☢ is a real account). Obviously some manual enforcement is required, but allowing non-Latin usernames opens up a class of issues which are otherwise prevented technically.

Overall, this is a huge amount of work and complexity to support a feature that I don't see a lot of value in: we haven't seen other requests for this, there are plenty of global services (e.g., Twitter, Facebook) with restrictions similar to the restrictions we place, our audience is fairly technical and more likely to be familiar with Latin characters than average international consumers, and these usernames will always be more difficult for most users to deal with. Even for WMF, don't thousands of users represent a fairly tiny fraction of the userbase? And presumably not all of them need accounts. Given the complexity of the feature, I could easily see this costing us something on the order of an hour of time per affected user.

If WMF wants to allow these names, you can change PhabricatorUser::validateUsername() locally. If the usernames are required to pass checks in MediaWiki, that may mitigate some of the most extreme issues. However, it will still be difficult to assign a review to "☢" and mentioning that user with "@☢" won't work. Beyond this, I don't think anything too catastrophic will happen, but I'd still generally discourage this.
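
As an illustration of why a mention like "@☢" would fail even after validation is relaxed, here is a small Python sketch. The pattern is an assumption standing in for an ASCII-only \w+ rule of the kind mentioned earlier in this comment; it is not Phabricator's actual mention parser, and `find_mentions` is a made-up name.

```python
import re

# Assumed stand-in for an ASCII-only mention rule; not the real parser.
MENTION = re.compile(r"@(\w+)", re.ASCII)


def find_mentions(text: str) -> list:
    """Return the usernames that an ASCII-only \\w+ rule would pick up."""
    return MENTION.findall(text)


# The symbol-only name is skipped, and a name containing a full-width
# full stop (U+FF0E) is cut off at that character.
print(find_mentions("cc @qgil and @☢"))
print(find_mentions("ping @bob\uff0esmith"))
```

Under this assumed rule the first call finds only "qgil" and the second finds only "bob", so every rule of this kind would need to be revisited.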

Note that real names are permitted to contain anything. One approach you could take is to use the existing username if it's permitted by Phabricator, and copy it over as the real name if not, letting the user select a new Latin username. This is imperfect, but comparatively straightforward.
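
A minimal Python sketch of that fallback, for illustration only. The allowed-username pattern is an assumption (ASCII letters, digits, period, underscore, hyphen), not the actual rule in PhabricatorUser::validateUsername(), and `ImportedAccount` and `import_account` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed approximation of a Latin-only username rule; the real check
# lives in PhabricatorUser::validateUsername() and may differ.
ALLOWED_USERNAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


@dataclass
class ImportedAccount:
    username: Optional[str]  # None means the user must pick a new one
    real_name: str


def import_account(wiki_username: str) -> ImportedAccount:
    """Keep the wiki username if it is allowed, else move it to real name."""
    if ALLOWED_USERNAME.fullmatch(wiki_username):
        return ImportedAccount(username=wiki_username, real_name="")
    # Not representable as a username: preserve it as the real name and
    # leave the username unset so the user is prompted to choose one.
    return ImportedAccount(username=None, real_name=wiki_username)
```

With this sketch, "qgil" is kept as the username, while "睅喊嵅" ends up as the real name with the username left for the user to choose.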

qgil added a comment. Aug 30 2014, 8:56 PM

Alright, thank you for the explanation. I'm proposing Declined at Wikimedia (aka Wontfix) after understanding your reasoning.

  • However, I'm not sure if there's an agreed-upon or standard way to canonicalize combining marks.

Yeah, there is: http://unicode.org/reports/tr15/
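
For example, the normalization forms defined in that report are available in Python's standard `unicodedata` module. This sketch shows NFC treating the two spellings of "böb" from the earlier comment as the same string; `canonical_username` is a hypothetical helper name, and choosing NFC over the other forms is an assumption.

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Normalize to NFC so canonically equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", name)


decomposed = "bo\u0308b"   # b, o, combining diaeresis, b
precomposed = "b\u00f6b"   # b, o-with-diaeresis, b

# Different codepoint sequences, same canonical form.
print(decomposed == precomposed)
print(canonical_username(decomposed) == canonical_username(precomposed))
```

The first comparison is False and the second is True. Normalization only handles canonically equivalent sequences; lookalikes such as Greek capital Iota versus Latin "I" remain distinct and need separate anti-spoofing rules.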

chad closed this task as Wontfix. Sep 2 2014, 3:49 PM
chad claimed this task.

Thanks for all the information! Seems safe to close as wontfix, but reopen if needed.