Ngram search for "é" has slicing and collation issues with multibyte characters and multicharacter glyphs
Open, Low, Public

Description

See https://discourse.phabricator-community.org/t/conduit-file-search-doesnt-accept-name-with-accent/3621.

Filenames with "é" are not sliced properly for insertion into the ngram index. The slicing is currently byte-oriented.

At least two different byte sequences can represent "é":

  • U+00E9 LATIN SMALL LETTER E WITH ACUTE
  • e + U+0301 COMBINING ACUTE ACCENT

For search, at least in the ngram index, these sequences are not being collated properly. When searching, these sequences should be considered the same.
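
For illustration (this isn't Phabricator code), a tiny PHP snippet showing that the two representations render identically but are distinct byte sequences:

  <?php
  // PHP 7+ "\u{...}" escapes. Both strings render as "é".
  $precomposed = "\u{00E9}";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE (2 bytes in UTF-8)
  $decomposed  = "e\u{0301}";  // "e" + U+0301 COMBINING ACUTE ACCENT (3 bytes in UTF-8)

  var_dump($precomposed === $decomposed);              // bool(false)
  var_dump(strlen($precomposed), strlen($decomposed)); // int(2), int(3)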

Part of the issue is that the ngram index is a utf8mb4 CHAR(3), which means it can hold a maximum of 3 characters (i.e., a maximum of 12 bytes). When a character like "é" is represented with combining characters and we have a token like "détail", we cannot fit the ngram for "dét" into the ngrams table because it is four characters long (d + e + combining accent + t).

We can slice it character-by-character instead, producing two ngrams (d + e + combining accent) and (e + combining accent + t).

This is the simplest fix for the byte-oriented slicing problem, but the collation issue may resist fixing.
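
Roughly, character-oriented slicing might look like this (a sketch only; the helper name is made up and this isn't the actual implementation). preg_split('//u', ...) splits on UTF-8 code point boundaries, so a combining accent counts as one character of a trigram:

  <?php
  // Split a token into code points, then emit every run of 3 characters.
  // Each trigram is at most 3 characters, so it fits a utf8mb4 CHAR(3) column.
  function sliceTrigrams(string $token): array {
    $characters = preg_split('//u', $token, -1, PREG_SPLIT_NO_EMPTY);
    $ngrams = array();
    for ($ii = 0; $ii + 3 <= count($characters); $ii++) {
      $ngrams[] = implode('', array_slice($characters, $ii, 3));
    }
    return array_values(array_unique($ngrams));
  }

  // For "détail" written with a combining accent (d, e, U+0301, t, a, i, l),
  // the output includes (d + e + combining accent) and (e + combining accent + t),
  // as described above.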

A particular collation issue is that when I submit a multicharacter é in Safari on macOS, the glyph has been combined into a single character (U+00E9) by the time the request reaches the server. We then search for U+00E9, which certainly won't match the two-character é stored in the ngram table.

MySQL can't reasonably collate this away even if it has the technical capability to. Some things we could do:

  • When indexing and searching, strip all combining characters, so we index "e" and search for "e".
  • When indexing and searching, normalize combining character sequences with multiple representations to some canonical representation.

These approaches aren't very different, since we need a lookup table either way: either to take 1-character é to e, or to take 2-character é to 1-character é. In Latin-script languages, I believe normalizing away accents is desirable (e.g., a search for "jalapeno" should find "jalapeño"). There is probably a big existing table of these somewhere.
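
Both outcomes can be sketched with PHP's intl Normalizer (assuming ext-intl is available; a lookup-table approach like the one discussed below could produce the same results for these cases):

  <?php
  // Canonical normalization: compose the 2-character é into U+00E9 (NFC), so
  // either input form indexes and searches as the same string.
  $canonical = Normalizer::normalize("e\u{0301}", Normalizer::FORM_C);
  // $canonical === "\u{00E9}"

  // Accent stripping: decompose (NFD), then drop combining marks (\p{Mn}), so
  // a search for "jalapeno" finds "jalapeño".
  $decomposed = Normalizer::normalize("jalape\u{00F1}o", Normalizer::FORM_D);
  $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
  // $stripped === "jalapeno"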

Doing all normalization with a table seems desirable, since some combining characters are already minimal and work without normalization, e.g. FLAG + all the country codes. A table avoids any peril with weird off-label cases like this, and we could make a case-by-case decision for each sequence (e.g., normalizing all THUMBS UP + SKIN TONE is probably desirable, but FLAG + COUNTRY CODE is not).
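
One possible shape for such a table, purely as a sketch (these entries and the helper name are illustrative, not a real list):

  <?php
  // Each entry is an explicit, reviewed decision rather than a blanket rule.
  function getNgramNormalizationMap(): array {
    return array(
      // Drop skin tone modifiers, so THUMBS UP + SKIN TONE collapses to THUMBS UP.
      "\u{1F3FB}" => '',
      "\u{1F3FC}" => '',
      "\u{1F3FD}" => '',
      "\u{1F3FE}" => '',
      "\u{1F3FF}" => '',
      // Fold accented Latin characters, in either representation, to the bare letter.
      "\u{00E9}" => 'e',
      "e\u{0301}" => 'e',
      // FLAG + COUNTRY CODE sequences are deliberately absent: folding the
      // regional indicators away would merge different flags.
    );
  }

  // Applying the table is a plain string replacement over a token before it
  // is sliced into ngrams.
  $normalized = strtr("de\u{0301}tail", getNgramNormalizationMap()); // "detail"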


Reproduction (Slicing)

  • Name a file xéx, where é uses the multibyte sequence to represent the character.
  • Run bin/search index on it to get a "binary data inserted" exception.

Reproduction (Searching)

  • Search for é as a name constraint using the Conduit web console for file.search (in Safari, on macOS; é is the multibyte version).
  • No hits. The é appears to have been normalized into U+00E9 somewhere, and this isn't being collated against the decomposed é in the ngram index, and can't reasonably be collated by MySQL alone given how the ngram table works.

Event Timeline

epriestley triaged this task as Low priority. Mon, Mar 9, 5:16 PM
epriestley created this task.

This appears to be the Unicode normalization chart:

https://www.unicode.org/charts/normalization/

PHP has a builtin (?) class for this, although I'm slightly suspicious about availability given the PECL intl >= 1.0.0 part:

https://www.php.net/manual/en/class.normalizer.php
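
Given the availability concern, any use would presumably need a guard along these lines (a sketch; the fallback is a placeholder, not a proposal):

  <?php
  // "Normalizer" comes from ext-intl and may not be installed everywhere.
  function tryNormalize(string $value): string {
    if (!class_exists('Normalizer')) {
      return $value;
    }
    $result = Normalizer::normalize($value, Normalizer::FORM_C);
    return ($result === false) ? $value : $result;
  }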

This chart covers "é", but does not cover normalizing 👍🏻 into 👍, and I can't find anything like "all [some descriptor including the skin tone modifiers] characters are special modifiers and should always be discarded in normalization", although maybe this is an implicit rule of ZWJ. Being able to search for 👍 is obviously less important than searching for é working, but changes to this table likely require an index rebuild to take effect, so it may be worthwhile trying harder to get things right.

For now, I'm going to change the ngram slicing to be character-oriented. This should never be worse than the current behavior, and moves us closer to effective normalization.

We also have two separate pieces of ngram extraction code:

  • PhabricatorFerretEngine->getNgramsFromString() is already character-oriented (and has large-corpus optimizations), although it does no character normalization.
  • PhabricatorSearchNgrams->getNgramsFromString() is not character-oriented.

So the cleaner fix is to make PhabricatorSearchNgrams use the same code as PhabricatorFerretEngine does. This isn't completely straightforward.