Use binary collations for most text
ClosedPublic
Actions

Authored by epriestley on Sep 28 2014, 7:54 PM.

Details

Reviewers

btrahan

Maniphest Tasks

T1191: Comments are truncated at first non-base-plane character

Commits

Restricted Diffusion Commit
rP1dfa94e57127: Use binary collations for most text

Summary

Ref T1191. For most text columns, we either don't care if "a" and "A" are the same, or we expect them to be different (for example: keys, domains, secrets, etc). Default text columns to the _bin collation so they are compared by strict character value. This is safer in cases where we aren't sure.

For some text columns, we allow the user to sort by the column in the UI (like Maniphest task titles) or we do care that "A" and "a" are the same (for example: project names). Introduce a new class of virtual data types, the "sort..." types, to cover these columns. These are like the "text..." types but use sorting collations which treat "A" and "a" the same.

Test Plan

Made an effort to identify all columns where the UI relies on database collation.
Ran bin/storage adjust and cleared all warnings.

Diff Detail

Repository

rP Phabricator

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

epriestley updated this revision to Diff 25439.Sep 28 2014, 7:54 PM

epriestley retitled this revision from to Use binary collations for most text.

epriestley updated this object.

epriestley edited the test plan for this revision. (Show Details)

epriestley added a reviewer: btrahan.

epriestley added a task: T1191: Comments are truncated at first non-base-plane character.

Herald added a subscriber: epriestley. · View Herald TranscriptSep 28 2014, 7:54 PM

Harbormaster completed remote builds in B2692: Diff 25439.Sep 28 2014, 7:55 PM

epriestley mentioned this in T1191: Comments are truncated at first non-base-plane character.Sep 29 2014, 3:47 PM

Remember that comparing two strings ignoring differences in case (or accents, etc.) is a fairly locale-specific operation. The default Unicode collation algorithm tries to be as agnostic as it can of locale rules, but it falls down spectacularly for a few languages.

For example, in T1191, you mentioned that you don't case if "Diffusion" and "DIFFUSION" collide. In English, that's OK, but in Turkish, "i" and "I" are totally different letters. ("i" capitalizes to 'İ' in Turkish, and 'I' lower-cases to 'ı'. They are not related letters any more than 'i' and 'j' are in English.)

In addition, the Unicode collation algorithm doesn't deal with differences in accents or special cases like German ß mapping to SS in upper-case, or position-sensitive letters like Greek sigma σ, which is ς at the end of a word.

Because of this, it's usually a good idea to apply collation rules only when you know the locale of the viewer. If you want the database to cache locale-specific collation rules for performance, you can use functionality like ICU's collation keys, which are available to PHP via the Collator class. (Just be sure that the keys are discarded between PHP or ICU version upgrades.)

http://php.net/manual/en/class.collator.php

Yeah -- we'll probably pursue that eventually, but I think the cost-to-benefit isn't worthwhile today. In particular, users can't even select Turkish or German locales yet, we've never received user reports about collation issues, occasional incorrect collisions / sub-optimal sorts have reasonable workarounds (rename projects/documents; just scroll through the list), and it won't get much harder to fix over time if we put it off for now.

(In contrast, the meat of T1191 has no workarounds, affects an increasing number of users, and gets harder to fix over time a lot faster.)

Sure—as long as the original data is present and we can rebuild collation indices later, it probably doesn't matter too much. (I'm not sure how the migration would work, but hopefully it's not lossy.)

Yeah, there's no data loss here. To migrate, we'd:

Create some new <objectPHID, field, locale, collationKey> table in databases which need it.
We join that table to express ORDER BY, using the viewer's locale to select which locale we examine.
When objects get their search index updated, publish into the applicable tables (we already publish similar data for other reasons).
Historic data can be rebuilt online by the daemons with bin/search index --all --background.
Convert the sort (_unicode_ci) columns to text (_bin) columns, basically undoing this diff.
This would fix all collation issues and make sorting locale-dependent, but allow "Diffusion" and "DIFFUSION" again without collision. We can either accept that, or find some acceptable normalization function which collides only "bad" duplicates, but this has to be locale-independent.

Makes sense. Could just do that now, of course..

Oh, and:

This would fix all collation issues and make sorting locale-dependent, but allow "Diffusion" and "DIFFUSION" again without collision.

Well, they would have the same sort keys, so you could use that to collapse them if you'd like.

Makes sense. Could just do that now, of course..

Sure, but it would be a bunch of work and not actually make anything better for anyone since it's not possible to select any locale except English.

Well, they would have the same sort keys, so you could use that to collapse them if you'd like.

We'd have to pick a locale for the keys, though. We could say "project names must be unique when collated in en_US", but I don't know enough about locale issues to know how reasonable that is. If we choose "project names must be unique when collated in utf8mb4_unicode_ci" instead, which is probably about as good as en_US, we can just skip undoing the column collations.

chad awarded a token.Sep 29 2014, 11:54 PM