
Improve search index normalization of "é" and other characters with variants or multiple representations
Open, Wishlist, Public

Description

See https://discourse.phabricator-community.org/t/conduit-file-search-doesnt-accept-name-with-accent/3621.

Filenames with "é" are not sliced properly for insertion into the ngram index. The slicing is currently byte-oriented.

At least two different byte sequences can represent "é":

  • U+00E9 LATIN SMALL LETTER E WITH ACUTE
  • e + U+0301 COMBINING ACUTE ACCENT

For search, at least in the ngram index, these sequences are not being collated properly. When searching, these sequences should be considered the same.

Part of the issue is that the ngram index is a utf8mb4 CHAR(3), which means it can hold a maximum of 3 characters (i.e., a maximum of 12 bytes). When a character like "é" is represented with combining characters and we have a token like "détail", we cannot fit the ngram for "dét" into the ngrams table, because that ngram is four characters long (d + e + combining accent + t).

We can slice it character-by-character instead, producing two ngrams (d + e + combining accent) and (e + combining accent + t).
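
As a rough sketch of character-oriented slicing (illustrative only, not the actual Phabricator code), the token is split on UTF-8 character boundaries first, so each trigram holds at most three characters and fits the CHAR(3) column:

// Illustrative sketch of character-oriented trigram slicing, not the
// actual Phabricator implementation.
function sliceTrigrams($token) {
  // Split on UTF-8 character boundaries rather than bytes. A combining
  // accent counts as its own character, so "détail" written with
  // "e" + U+0301 splits into: d, e, U+0301, t, a, i, l.
  $characters = preg_split('//u', $token, -1, PREG_SPLIT_NO_EMPTY);

  $ngrams = array();
  for ($ii = 0; $ii + 3 <= count($characters); $ii++) {
    // At most 3 characters (at most 12 bytes in utf8mb4), so each ngram
    // fits in the utf8mb4 CHAR(3) column.
    $ngrams[] = implode('', array_slice($characters, $ii, 3));
  }

  return $ngrams;
}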

This is the simplest fix for the byte-oriented slicing, but the collation issue is more resistant to fixing.

A particular collation issue is that when I submit a multicharacter é in Safari on macOS, the glyph has been combined into a single character by the time the request reaches the server. We then search for U+00E9, which certainly won't match both halves of the multicharacter é in the ngram table.

MySQL can't reasonably be expected to collate this away, even if it technically has the capability to. Some things we could do:

  • When indexing and searching, strip all combining characters, so we index "e" and search for "e".
  • When indexing and searching, normalize combining character sequences with multiple representations to some canonical representation.

These approaches aren't very different, since we need a lookup table either way: either to take the 1-character é to e, or to take the 2-character é to the 1-character é. In Latin languages, I believe normalizing away accents is desirable (e.g., a search for "jalapeno" should find "jalapeño"). There is probably a big existing table of these somewhere.
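
In code, the two options differ only in what the map's values are; a minimal sketch (the coverage here is illustrative, a real table would be far larger):

// Option 1: strip accents, so both spellings of "é" index as plain "e".
$strip_map = array(
  "\xC3\xA9"  => 'e',   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
  "e\xCC\x81" => 'e',   // "e" + U+0301 COMBINING ACUTE ACCENT
);

// Option 2: pick one canonical spelling, so the combining form indexes
// as the single-codepoint "é".
$canonical_map = array(
  "e\xCC\x81" => "\xC3\xA9",
);

// Either way, the same map is applied when indexing and when searching,
// e.g. with strtr($string, $map); under Option 1, a query for "jalapeno"
// then matches an indexed "jalapeño".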

Doing all normalization with a table seems desirable, since some combining characters are already minimal and work without normalization, e.g. FLAG + all the country codes. A table avoids any peril with weird off-label cases like this, and we could make a case-by-case decision for each sequence (e.g., normalizing all THUMBS UP + SKIN TONE is probably desirable, but FLAG + COUNTRY CODE is not).


Reproduction (Slicing)

  • Name a file xéx, where é uses the multibyte sequence to represent the character.
  • bin/search index it to get a "binary data inserted" exception.

Reproduction (Searching)

  • Search for é as a name constraint using the Conduit web console for file.search.
  • (In Safari, on macOS; é is the multibyte version.)
  • No hits. é in the ngram index appears to have been normalized into U+00E9 somewhere and this isn't being collated, and can't reasonably be collated by MySQL alone given how the ngram table works.

Event Timeline

epriestley created this task.

This appears to be the unicode normalization chart:

https://www.unicode.org/charts/normalization/

PHP has a builtin (?) class for this, although I'm slightly suspicious about availability given the PECL intl >= 1.0.0 part:

https://www.php.net/manual/en/class.normalizer.php
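
If intl is available, usage is simple; this (assuming the extension is loaded) maps both spellings of "é" to the same byte sequence:

// Requires the "intl" extension.
$single    = "\xC3\xA9";   // U+00E9
$combining = "e\xCC\x81";  // "e" + U+0301

var_dump($single === $combining);
// bool(false)

$a = Normalizer::normalize($single, Normalizer::FORM_C);
$b = Normalizer::normalize($combining, Normalizer::FORM_C);
var_dump($a === $b);
// bool(true)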

This chart covers "é", but does not cover normalizing 👍🏻 into 👍, and I can't find anything like "all [some descriptor including the skin tone modifiers] characters are special modifiers and should always be discarded in normalization", although maybe this is an implicit rule of ZWJ. Being able to search for 👍 is obviously less important than search for é working, but changes to this table likely require an index rebuild to take effect, so it may be worthwhile trying harder to get things right.
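
As an illustration of the kind of case-by-case decision a table allows (not code that exists anywhere), the skin tone modifiers occupy a small fixed range and could simply be dropped, while flags, which are pairs of REGIONAL INDICATOR characters, would be left alone:

// Illustrative sketch: drop EMOJI MODIFIER FITZPATRICK TYPE-1-2 through
// TYPE-6 (U+1F3FB..U+1F3FF) so 👍🏻 indexes like 👍. Regional indicator
// symbols (U+1F1E6..U+1F1FF), which pair up to form flags, are untouched.
function stripSkinToneModifiers($string) {
  return preg_replace('/[\x{1F3FB}-\x{1F3FF}]/u', '', $string);
}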

For now, I'm going to change the ngram slicing to be character-oriented. This should never be worse than the current behavior, and moves us closer to effective normalization.

We also have two separate pieces of ngram extraction code:

  • PhabricatorFerretEngine->getNgramsFromString() is already character-oriented (and has large-corpus optimizations), although it does not do character normalization.
  • PhabricatorSearchNgrams->getNgramsFromString() is not character-oriented.

So the cleaner fix is to make PhabricatorSearchNgrams use the same code as PhabricatorFerretEngine does. This isn't completely straightforward.

Normalizer requires intl, which I'm hesitant to add a dependency on.

WordPress has a giant table in a function named remove_accents().

Wikimedia has this fairly nice looking implementation of the giant table:

https://github.com/wikimedia/utfnormal/blob/master/src/UtfNormalData.inc

Doing this correctly with real normalization rules gets us from LATIN E + COMBINING ACCENT to LATIN E WITH ACCENT (or vice versa), but we really want to index PLAIN OLD LATIN LOWERCASE LETTER e anyway (and I'm fairly sure the rule to get there is not "denormalize, then drop everything after the first character", even though that rule would normally work for Latin).

I think I'm just going to hard-code a small latin1 lookup table -- which includes é -- as a proof of concept for now, and can go back and do this properly once use cases for Hangul or Arabic or whatever else arise.
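
A proof of concept along those lines might look like the sketch below (function name and table coverage are made up for illustration); the same function would run over tokens before ngram extraction and over query input before searching:

// Hypothetical proof-of-concept: fold common Latin-1 accented forms, in
// both their single-codepoint and combining spellings, down to plain
// ASCII before slicing ngrams and before building search queries.
function normalizeLatinForNgrams($string) {
  static $map = array(
    "\xC3\xA9"  => 'e',  // é (U+00E9)
    "e\xCC\x81" => 'e',  // e + U+0301 COMBINING ACUTE ACCENT
    "\xC3\xA8"  => 'e',  // è (U+00E8)
    "\xC3\xA0"  => 'a',  // à (U+00E0)
    "\xC3\xB1"  => 'n',  // ñ (U+00F1)
    "n\xCC\x83" => 'n',  // n + U+0303 COMBINING TILDE
    // ...the rest of the Latin-1 accented range would follow.
  );
  return strtr($string, $map);
}

// With this applied on both sides, "détail" in either representation
// produces the same trigrams ("det", "eta", "tai", "ail"), and a search
// for "detail" hits all of them.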

Getting through the ngram index alone isn't good enough, because LIKE operators against utf8mb4_unicode_ci treat combining accents as separate characters:

Here's a LIKE match using % to match LATIN SMALL e + COMBINING ACUTE:

mysql> SELECT name FROM file WHERE id = 21470 AND name LIKE '%D%TAIL%';
+----------------------+
| name                 |
+----------------------+
| Détail article.jpg  |
+----------------------+
1 row in set (0.00 sec)

This also works if __ is used to match the two characters separately:

mysql> SELECT name FROM file WHERE id = 21470 AND name LIKE '%D__TAIL%';
+----------------------+
| name                 |
+----------------------+
| Détail article.jpg  |
+----------------------+
1 row in set (0.00 sec)

This fails if e, E, _, or LATIN SMALL e WITH ACUTE ACCENT is used:

mysql> SELECT name FROM file WHERE id = 21470 AND name LIKE '%DeTAIL%';
Empty set (0.00 sec)

mysql> SELECT name FROM file WHERE id = 21470 AND name LIKE '%DETAIL%';
Empty set (0.00 sec)

mysql> SELECT name FROM file WHERE id = 21470 AND name LIKE '%D_TAIL%';
Empty set (0.00 sec)

mysql> SELECT name FROM file WHERE id = 21470 AND name LIKE '%DéTAIL%';
Empty set (0.00 sec)

Using E_ works:

mysql> SELECT name FROM file WHERE id = 21470 AND name LIKE '%DE_TAIL%';
+----------------------+
| name                 |
+----------------------+
| Détail article.jpg  |
+----------------------+
1 row in set (0.00 sec)

So it seems we can't LIKE against a column we haven't previously normalized. Any indexes which currently rely on this need to be rewritten to index a normalized version.
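
Concretely, that would mean something like the following hypothetical sketch (the column and helper names are made up): store a normalized shadow of the name at write time and run the LIKE against that, passing query input through the same function.

// Hypothetical: "nameNormalized" is a new column populated at write time
// with normalizeLatinForNgrams($name), alongside the original "name".

// At query time, normalize the needle with the same function and escape
// LIKE metacharacters before binding it.
$needle  = normalizeLatinForNgrams($user_input);
$pattern = '%'.addcslashes($needle, '%_\\').'%';

// ... WHERE nameNormalized LIKE <$pattern, bound as a parameter>

// Since both the stored value and the needle go through the same
// normalization, "DeTAIL", "DETAIL" and "DéTAIL" all match
// "Détail article.jpg" (given a case-insensitive collation), regardless
// of which byte sequence the "é" originally used.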

For now, I'm just going to fix the slicing issues and give up on the collation/normalization issues. At time of writing, I only have evidence that they impact one user searching for one file, so it feels premature to use this to motivate an overhaul.

epriestley renamed this task from Ngram search for "é" has slicing and collation issues with multibyte characters and multicharacter glyphs to Improve search index normalization of "é" and other characters with variants or multiple representations. Apr 17 2020, 1:05 PM
epriestley lowered the priority of this task from Low to Wishlist.