Page MenuHomePhabricator

Use binary collations for most text
ClosedPublic

Authored by epriestley on Sep 28 2014, 7:54 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Dec 8, 7:21 PM
Unknown Object (File)
Thu, Dec 5, 11:22 AM
Unknown Object (File)
Mon, Dec 2, 1:02 PM
Unknown Object (File)
Sat, Nov 23, 3:35 PM
Unknown Object (File)
Nov 8 2024, 9:02 PM
Unknown Object (File)
Nov 8 2024, 6:11 PM
Unknown Object (File)
Nov 1 2024, 10:53 AM
Unknown Object (File)
Oct 24 2024, 1:43 PM
Subscribers
Tokens
"Manufacturing Defect?" token, awarded by chad.

Details

Summary

Ref T1191. For most text columns, we either don't care if "a" and "A" are the same, or we expect them to be different (for example: keys, domains, secrets, etc). Default text columns to the _bin collation so they are compared by strict character value. This is safer in cases where we aren't sure.

For some text columns, we allow the user to sort by the column in the UI (like Maniphest task titles) or we do care that "A" and "a" are the same (for example: project names). Introduce a new class of virtual data types, the "sort..." types, to cover these columns. These are like the "text..." types but use sorting collations which treat "A" and "a" the same.

Test Plan
  • Made an effort to identify all columns where the UI relies on database collation.
  • Ran bin/storage adjust and cleared all warnings.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision from to Use binary collations for most text.
epriestley updated this object.
epriestley edited the test plan for this revision. (Show Details)
epriestley added a reviewer: btrahan.

Remember that comparing two strings ignoring differences in case (or accents, etc.) is a fairly locale-specific operation. The default Unicode collation algorithm tries to be as agnostic as it can of locale rules, but it falls down spectacularly for a few languages.

For example, in T1191, you mentioned that you don't case if "Diffusion" and "DIFFUSION" collide. In English, that's OK, but in Turkish, "i" and "I" are totally different letters. ("i" capitalizes to 'İ' in Turkish, and 'I' lower-cases to 'ı'. They are not related letters any more than 'i' and 'j' are in English.)

In addition, the Unicode collation algorithm doesn't deal with differences in accents or special cases like German ß mapping to SS in upper-case, or position-sensitive letters like Greek sigma σ, which is ς at the end of a word.

Because of this, it's usually a good idea to apply collation rules only when you know the locale of the viewer. If you want the database to cache locale-specific collation rules for performance, you can use functionality like ICU's collation keys, which are available to PHP via the Collator class. (Just be sure that the keys are discarded between PHP or ICU version upgrades.)

http://php.net/manual/en/class.collator.php

Yeah -- we'll probably pursue that eventually, but I think the cost-to-benefit isn't worthwhile today. In particular, users can't even select Turkish or German locales yet, we've never received user reports about collation issues, occasional incorrect collisions / sub-optimal sorts have reasonable workarounds (rename projects/documents; just scroll through the list), and it won't get much harder to fix over time if we put it off for now.

(In contrast, the meat of T1191 has no workarounds, affects an increasing number of users, and gets harder to fix over time a lot faster.)

Sure—as long as the original data is present and we can rebuild collation indices later, it probably doesn't matter too much. (I'm not sure how the migration would work, but hopefully it's not lossy.)

Yeah, there's no data loss here. To migrate, we'd:

  • Create some new <objectPHID, field, locale, collationKey> table in databases which need it.
  • We join that table to express ORDER BY, using the viewer's locale to select which locale we examine.
  • When objects get their search index updated, publish into the applicable tables (we already publish similar data for other reasons).
  • Historic data can be rebuilt online by the daemons with bin/search index --all --background.
  • Convert the sort (_unicode_ci) columns to text (_bin) columns, basically undoing this diff.
  • This would fix all collation issues and make sorting locale-dependent, but allow "Diffusion" and "DIFFUSION" again without collision. We can either accept that, or find some acceptable normalization function which collides only "bad" duplicates, but this has to be locale-independent.

Makes sense. Could just do that now, of course..

Oh, and:

This would fix all collation issues and make sorting locale-dependent, but allow "Diffusion" and "DIFFUSION" again without collision.

Well, they would have the same sort keys, so you could use that to collapse them if you'd like.

Makes sense. Could just do that now, of course..

Sure, but it would be a bunch of work and not actually make anything better for anyone since it's not possible to select any locale except English.

Well, they would have the same sort keys, so you could use that to collapse them if you'd like.

We'd have to pick a locale for the keys, though. We could say "project names must be unique when collated in en_US", but I don't know enough about locale issues to know how reasonable that is. If we choose "project names must be unique when collated in utf8mb4_unicode_ci" instead, which is probably about as good as en_US, we can just skip undoing the column collations.

btrahan edited edge metadata.

Bu iyi bir iş

This revision is now accepted and ready to land.Sep 30 2014, 5:07 PM
This revision was automatically updated to reflect the committed changes.