Filenames with "é" are not sliced properly for insertion into the ngram index. The slicing is currently byte-oriented.
At least two different byte sequences can represent "é":
- U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed)
- e + U+0301 COMBINING ACUTE ACCENT (decomposed)
For search, at least in the ngram index, these sequences are not being collated properly; when searching, they should be considered the same.
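For illustration, a quick Python sketch (not from the codebase) showing that the two sequences differ at the byte level but are related by Unicode normalization:

```
import unicodedata

precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" + U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)            # False: different codepoints
print(precomposed.encode("utf-8"))          # b'\xc3\xa9' (2 bytes)
print(decomposed.encode("utf-8"))           # b'e\xcc\x81' (3 bytes)

# NFC composes the pair into the single codepoint; NFD goes the other way.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```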
Part of the issue is that the ngram index column is a utf8mb4 CHAR(3), which means it can hold a maximum of 3 characters (i.e., a maximum of 12 bytes). When a character like "é" is represented with combining characters and we have a token like "détail", we cannot fit the ngram for "dét" into the ngrams table, because this sequence is four characters long (d + e + combining accent + t).
We can slice it character-by-character instead, producing two ngrams (d + e + combining accent) and (e + combining accent + t).
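A minimal sketch of that codepoint-oriented slicing in Python (the function name is illustrative; this is not the actual indexer code):

```
def ngrams_by_codepoint(token: str, n: int = 3) -> list[str]:
    # Python strings index by codepoint, so every slice is valid UTF-8
    # and holds at most n codepoints, the unit that utf8mb4 CHAR(3) counts.
    return [token[i:i + n] for i in range(len(token) - n + 1)]

# "dét" with a decomposed é is four codepoints: d, e, U+0301, t.
# This yields the two ngrams described above:
#   (d, e, U+0301) and (e, U+0301, t)
print(ngrams_by_codepoint("de\u0301t"))
```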
This is the simplest fix for the byte-oriented slicing, but it may make the collation problem harder to fix.
A particular collation issue is that when I submit a multicharacter é in Safari on macOS, the glyph has been combined into a single character by the time the request reaches the server. We then search for U+00E9, which certainly won't match both halves of the multicharacter é in the ngram table.
MySQL can't reasonably collate this away even if it has the technical capability to. Some things we could do:
- When indexing and searching, strip all combining characters, so we index "e" and search for "e".
- When indexing and searching, normalize combining character sequences with multiple representations to some canonical representation.
These approaches aren't very different, since we need a lookup table either way: either to take 1-character é to e, or to take 2-character é to 1-character é. In Latin languages, I believe normalizing away accents is desirable (e.g., a search for "jalapeno" should find "jalapeño"). There is probably a big existing table of these somewhere.
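The Unicode Character Database's decomposition tables are essentially that existing table, and they cover both directions. A sketch of the strip-accents approach using Python's stdlib (the function name is illustrative):

```
import unicodedata

def strip_accents(text: str) -> str:
    # NFD splits precomposed characters into base + combining marks,
    # then we drop the combining marks. Both forms of é become "e".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("\u00e9"))         # 'e' (precomposed é)
print(strip_accents("e\u0301"))        # 'e' (decomposed é)
print(strip_accents("jalape\u00f1o"))  # 'jalapeno'
```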
Doing all normalization with a table seems desirable, since some combining sequences are already minimal and work without normalization, e.g. FLAG + all the country codes. A table avoids any peril with weird off-label cases like these, and we could make a case-by-case decision for each sequence (e.g., normalizing away all THUMBS UP + SKIN TONE is probably desirable, but FLAG + COUNTRY CODE is not).
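A table-driven sketch of what those case-by-case decisions might look like (every name and policy here is a hypothetical illustration, not an existing implementation):

```
import unicodedata

# Hypothetical policy table: skin-tone modifiers (U+1F3FB..U+1F3FF) get
# normalized away; regional indicators (flags) are deliberately left alone.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}

def normalize_for_index(text: str) -> str:
    # NFC folds the two-codepoint é (e + U+0301) into U+00E9.
    composed = unicodedata.normalize("NFC", text)
    # Dropping skin tones makes THUMBS UP + SKIN TONE index like plain
    # THUMBS UP; FLAG + COUNTRY CODE passes through untouched.
    return "".join(ch for ch in composed if ch not in SKIN_TONE_MODIFIERS)

print(normalize_for_index("e\u0301") == "\u00e9")                    # True
print(normalize_for_index("\U0001F44D\U0001F3FD") == "\U0001F44D")   # True
```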
Reproduction (Slicing)
- Name a file xéx, where é uses the decomposed two-codepoint sequence (e + U+0301) to represent the character.
- Run bin/search index on it to get a "binary data inserted" exception.
Reproduction (Searching)
- Search for é as a name constraint using the Conduit web console for file.search.
- (In Safari, on macOS; é is the decomposed, multi-codepoint version.)
- No hits. The submitted é appears to have been normalized into U+00E9 somewhere by the time it reaches the ngram index query; this isn't being collated against the decomposed sequence, and can't reasonably be collated by MySQL alone given how the ngram table works.