Explain to users how fulltext queries are parsed and executed
Closed, Resolved (Public)

Description

Stopwords, short tokens, and non-Latin characters are currently ignored without explanation.

When you search for "the qx あああああ dog", we should provide a hint that "the" was ignored as a stopword, "qx" was ignored as a short token, and "あああああ" was ignored as non-latin (does this actually work with InnoDB fulltext?).
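As a rough illustration of the kind of hint described above, the following sketch classifies each token in a query and records why it would be ignored. It is not Phabricator's implementation: the stopword set, the minimum token length of 3, the `is_latin` check, and the function names are all hypothetical, and real values would need to come from the search engine's actual configuration.

```python
import unicodedata

# Hypothetical values for illustration only; a real deployment would read
# the stopword list and minimum token length from the fulltext engine.
STOPWORDS = {"the", "a", "an", "and", "of", "to"}
MIN_TOKEN_LENGTH = 3


def is_latin(token):
    """Return True if every alphabetic character in the token is a Latin letter."""
    for ch in token:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return False
    return True


def explain_query(query):
    """Split a query on whitespace and return (token, reason) pairs.

    The reason is None for tokens that would actually be searched.
    """
    results = []
    for token in query.split():
        if not is_latin(token):
            reason = "non-latin"
        elif token.lower() in STOPWORDS:
            reason = "stopword"
        elif len(token) < MIN_TOKEN_LENGTH:
            reason = "short token"
        else:
            reason = None
        results.append((token, reason))
    return results


if __name__ == "__main__":
    for token, reason in explain_query("the qx あああああ dog"):
        if reason is None:
            print(f'"{token}" was searched')
        else:
            print(f'"{token}" was ignored ({reason})')
```

A search UI could render the non-`None` reasons as the hint text next to the results, so the user can see which parts of the query had no effect.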

Ideally, we should provide an option to perform substring search instead.

Event Timeline

I can find this task by searching for "あああああ" so the CJK stuff may not really be much of an issue anymore? Maybe just an issue with Chinese because the index doesn't know how to tokenize words?

Elasticsearch has much better support for non-Latin language analysis. https://www.elastic.co/guide/en/elasticsearch/guide/current/icu-tokenizer.html discusses its ability to properly tokenize Thai, Chinese and Japanese text.