Page MenuHomePhabricator

Typeahead uses locale-specific comparison
Open, Needs TriagePublic

Description

The dropdown for project names uses locale-specific comparison when sorting the projects, which can lead to weird results (projects never showing up and being unselectable). For example, on Wikimedia Phabricator, using Opera 12, with partially Polish, partially English system and browser locale, I can't select the "VisualEditor" project.

That's because projects "VisualEditor-ContentEditable", "VisualEditor-DataModel", "VisualEditor-EditingTools", "VisualEditor-Initialisation" (and a few others like this) exist, because the dropdown is limited to five items, and because, for some reason, "visualeditor VisualEditor".localeCompare("visualeditor-contenteditable VisualEditor-ContentEditable") yields 1 (even thought "visualeditor VisualEditor" < "visualeditor-contenteditable VisualEditor-ContentEditable" is true).

Using localeCompare seems incorrect to me; the strings are not in my locale (whatever that locale might be), they are in English. It would probably be more correct to just compare with < and >.

Revisions and Commits

Event Timeline

matmarex raised the priority of this task from to Needs Triage.
matmarex updated the task description. (Show Details)
matmarex added a subscriber: matmarex.

I debugged this in browser console, so no idea where this code comes from in the source, but lines that needs changing look like:

			var default_comparator = function(u, v) {
					var key_u = u.sort || u.name;
					var key_v = v.sort || v.name;
					return key_u.localeCompare(key_v);
				};

…and probably also:

JX.install('TypeaheadNormalizer', {
	statics: {
		normalize: function(str) {
			return ('' + str).toLocaleLowerCase().replace(/[\.,\/#!$%\^&\*;:{}=_`~()]/g, '').replace(/-/g, ' ').replace(/ +/g, ' ').replace(/^\s*|\s*$/g, '');
		}
	}
});

I tried submitting a patch and it seems to have worked, so I guess I could "claim" this…?

Assuming locale compares work in a reasonable way, it seems like they should be more correct.

For example, as an English speaker, I expect strings to sort from "a -> z". Even if I'm viewing Australian strings on an Australian install, it's still better for me as a user to see an "a -> z" sort than an Australian "z -> a" sort of the strings, because it's one I'm more familiar with.

I do get a better result for localeCompare() than for < with Ł, in English: localeCompare() correctly puts it between "L" and "M". This is where I'd expect it to appear as an English speaker, and presumably a Polish speaker searching on an English-language install would still prefer a colleague named "Stanisław" to sort between "Staniskaw" and "Stanismaw", not after "Staniszaw".

The specific case above with the hyphen seems surprising to me, unless there are reasonable rules which give hyphens an unusual collation in Polish (it's possible that there are; here's a post claiming that correct collation in Danish puts "-" before " ": http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML007/0349.html -- In Polish, would you expect "black and white" or "black-and-blue" to sort first?).

It looks more like we're incorrectly assuming that " " collates before "-" in all locales, when this isn't true, at least in Danish (and it looks like it may not be true in Polish). I think we could remove this assumption by splitting the strings on spaces and comparing them piece-by-piece instead of using locale comparison over the entire string, potentially, although I'm not sure if this would resolve the issue since I haven't touched the code in a while. This would let us retain locale ordering for other letters.

There may also be some issues with tokenization being applied unevenly to the two halves of the string.

For the second part of the patch, do you have an example of a concrete bad behavior caused by toLocaleLowerCase()?

I edited the above to remove an erroneous discussion of sorting "ë" -- I just misread the output.

Assuming locale compares work in a reasonable way, it seems like they should be more correct.

For example, as an English speaker, I expect strings to sort from "a -> z". Even if I'm viewing Australian strings on an Australian install, it's still better for me as a user to see an "a -> z" sort than an Australian "z -> a" sort of the strings, because it's one I'm more familiar with.

I do get a better result for localeCompare() than for < with Ł, in English: localeCompare() correctly puts it between "L" and "M". This is where I'd expect it to appear as an English speaker, and presumably a Polish speaker searching on an English-language install would still prefer a colleague named "Stanisław" to sort between "Staniskaw" and "Stanismaw", not after "Staniszaw".

That is true, but not all languages are this convenient. For example, in Estonian, the alphabet order is …R, S, Z, T…, which would definitely be confusing to speakers of most other languages. Another fancy one is Hungarian, where the digraph 'Ny' sorts after 'Nz' but before 'O' (and they have a few more of these). And one more: in some languages Ä sorts after A, but in some after Z.

Unicode defines an algorithm for reasonably correct collation order or multi-language strings, but I see no way to make localeCompare use it. You could specify the locale to sort with ('a'.localeCompare('b', 'en'), for example), but support for that is spotty.

If you really want as-good-as-possibly locale-insensitive collation and are willing to put the effort into it, the only sensible way is to use ICU server-side to generate sortkeys (using the 'uca-default' collation, or 'uca-en' or other language-specific one if Phabricator assumes/knows that the strings are in some given language), then send them to client, where they can be compared using just < / >. (ICU has reliable PHP bindings, MediaWiki uses it for category element collation.)

The specific case above with the hyphen seems surprising to me, unless there are reasonable rules which give hyphens an unusual collation in Polish (it's possible that there are; here's a post claiming that correct collation in Danish puts "-" before " ": http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML007/0349.html -- In Polish, would you expect "black and white" or "black-and-blue" to sort first?).

I would expect space to compare less than hyphen, so "black and white" < "black-and-blue". It compares greater, so I guess that's a browser bug.

It looks more like we're incorrectly assuming that " " collates before "-" in all locales, when this isn't true, at least in Danish (and it looks like it may not be true in Polish). I think we could remove this assumption by splitting the strings on spaces and comparing them piece-by-piece instead of using locale comparison over the entire string, potentially, although I'm not sure if this would resolve the issue since I haven't touched the code in a while. This would let us retain locale ordering for other letters.

That would probably help in my case, but I still think it'd be better to just kill off locale ordering. I could implement it.

For the second part of the patch, do you have an example of a concrete bad behavior caused by toLocaleLowerCase()?

No. The only language I know where it would make a difference is Turkish, which distinguishes dotted and dotless lowercase and uppercase I/i (I i İ ı), and alas I do not speak Turkish. .toLowerCase() already handles non-ASCII characters. It just seemed consistent.

chad renamed this task from The dropdown for project names uses locale-specific comparison when sorting the projects, which can lead to weird results (projects never showing up and being unselectable) to Typeahead uses locale-specific comparison, i.e. projects never showing up and being unselectable.Nov 24 2014, 7:04 PM
chad edited projects, added PHUI; removed Phabricator.

If you really want as-good-as-possibly locale-insensitive collation

This isn't what I want -- I want an English user to see:

Apple
Radish
Turnip
Zebra

...and an Estonian user to see:

Apple
Radish
Zebra
Turnip

...and an Australian user to see:

Zebra
Turnip
Radish
Apple

I want the strings to collate as the user expects strings in their language to collate, even if the strings are not actually written in their language. This will produce the best result if the strings are written in their language, and a reasonable (consistent with expectations) result if the strings are written in some other language or a mixture of languages.

Ideally, we would collate with the same locale ruleset that the user has set on the server, so the instance used cohesive collation rules. There are various technical barriers to doing this, but the assumption browser locale settings will mirror user locale preferences seems reasonable.

I don't know if that assumption (that browser locale settings are reliable) actually is reasonable. T1888 documents a case where toLocaleDateString() did something a user found surprising/undesirable.

We could drop this assumption by binding server-side locale settings to a client-side locale, passing it down, and then passing it to localeCompare() and toLocaleLowerCase(), as in your example above. Since the only supported locale on the server side today (at least, in the upstream) is en_US, that would have the same effect as your proposed change. However, once we added pl_PL in the future we'd have the same issue unless we fix the " " vs "-" collation assumption.

I want the strings to collate as the user expects strings in their language to collate, even if the strings are not actually written in their language. This will produce the best result if the strings are written in their language, and a reasonable (consistent with expectations) result if the strings are written in some other language or a mixture of languages.

I don't think this would be the correct behavior. In general, text should probably be collated using the language it's written in. (I don't know any of the languages where this matters, so my opinion is probably worth little. We should probably find someone who does and ask about their expectations.)

For mixed-language data, surely collating in the user's locale is preferable to collating in the default locale for the install? An example of this type of data is user "Real Names", which show up when trying to typeahead users, but contain letters from different languages with some frequency. This specific type of data is present to some degree on most installs, but there's no single valid language for it, since user human names are written in a variety of languages.

Even if an install has only English-language data, I'd assume a user who is choosing to use the UI in another locale would prefer to use their locale's collation rules (since they probably aren't as comfortable with English). Maybe this isn't true, of course. It's also reasonable that they might have their browser set to some other collation but really want to use English for everything, but we can deal with that by propagating the server-side selection to the client and making sure the UI is consistently presented in the chosen locale.

(See T3025 for related discussion which might be relevant if we want to detect locales for new users; see D10602 for discussion of our actual ability to do locale-aware collation on the server at present.)

chad renamed this task from Typeahead uses locale-specific comparison, i.e. projects never showing up and being unselectable to Typeahead uses locale-specific comparison.Nov 24 2014, 9:16 PM

Please unassign me from this task (why can't I do it myself?). My patch was rejected.