
When search terms contain CJK characters, default to substring mode
Closed, Resolved, Public

Description

See PHI76. Languages written in Latin script separate words with spaces. Chinese, Japanese, and Korean (CJK) generally do not.

The default "term search" mode for the Ferret engine uses spaces as word boundaries. This works well for English and other Latin-script languages, where users do not expect the query "cat fee" to match "reallocates coffee". It does not work well for CJK languages, where users do expect the query "猫费" to match "分配猫费咖啡".
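The difference between the two modes can be illustrated with a deliberately simplified model. This is only a sketch: it assumes term search means whole-word matching on whitespace-separated words, which is not necessarily how Ferret implements it, and the function names `term_match` and `substring_match` are hypothetical.

```python
def term_match(query: str, document: str) -> bool:
    """Simplified term search: every query term must equal a whole word."""
    words = set(document.lower().split())
    return all(term in words for term in query.lower().split())


def substring_match(query: str, document: str) -> bool:
    """Simplified substring search: every query term may appear anywhere."""
    text = document.lower()
    return all(term in text for term in query.lower().split())


# English: the term mode avoids the unwanted match.
print(term_match("cat fee", "reallocates coffee"))       # False
print(substring_match("cat fee", "reallocates coffee"))  # True

# CJK: with no spaces in the document, only substring mode finds the term.
print(term_match("猫费", "分配猫费咖啡"))       # False
print(substring_match("猫费", "分配猫费咖啡"))  # True
```

In this model the whole CJK sentence is a single "word", so a term search can only succeed when the query equals the entire sentence.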

As an initial approximation, we can:

  • Detect if each term contains CJK characters.
  • Imply the ~ prefix if it does, putting the term in substring mode.

This can almost certainly be refined, but at least one Chinese user reports success with it (or, at a minimum, better results than with term search) in PHI76.
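A minimal sketch of the two steps above, written in Python for illustration rather than as the actual change. The character ranges are an assumption (a rough selection of Hiragana, Katakana, Han, and Hangul blocks, not an exhaustive definition of CJK), the query is assumed to be split on whitespace, and the names `contains_cjk` and `imply_substring_mode` are hypothetical.

```python
import re

# Assumption: a rough, non-exhaustive set of CJK code point ranges.
CJK_PATTERN = re.compile(
    "["
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul Syllables
    "\uf900-\ufaff"  # CJK Compatibility Ideographs
    "]"
)


def contains_cjk(term: str) -> bool:
    """Return True if the term has at least one character in the ranges above."""
    return CJK_PATTERN.search(term) is not None


def imply_substring_mode(query: str) -> str:
    """Prefix each CJK term with "~" unless it already starts with "~"."""
    terms = []
    for term in query.split():
        if contains_cjk(term) and not term.startswith("~"):
            term = "~" + term
        terms.append(term)
    return " ".join(terms)


print(imply_substring_mode("cat 猫费"))  # cat ~猫费
```

Terms without CJK characters pass through unchanged, so mixed queries keep term search for their Latin-script parts. The sketch ignores quoting and any other query operators.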

Additionally, substring terms are not highlighted in results. Substring search with no highlight:

Screen Shot 2017-09-21 at 5.03.02 AM.png (725×492 px, 55 KB)

Term search with highlight:

Screen Shot 2017-09-21 at 5.03.47 AM.png (604×489 px, 42 KB)

It would be helpful for all languages to highlight substrings. This is somewhat adjacent to T8646.
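One possible shape for substring highlighting, as a rough sketch only: it works on plain text, wraps matches in bracket markers chosen here purely for illustration, and does not reflect how result rendering is actually done. The function name `highlight_substrings` and its parameters are hypothetical.

```python
import re


def highlight_substrings(text, terms, open_mark="[", close_mark="]"):
    """Wrap every case-insensitive occurrence of each term in markers."""
    # Drop a leading "~" so substring-mode terms can be passed as typed.
    cleaned = [t.lstrip("~") for t in terms if t.lstrip("~")]
    if not cleaned:
        return text
    # Try longer terms first so they win over shorter overlapping ones.
    cleaned.sort(key=len, reverse=True)
    pattern = re.compile(
        "|".join(re.escape(t) for t in cleaned), re.IGNORECASE
    )
    return pattern.sub(lambda m: open_mark + m.group(0) + close_mark, text)


print(highlight_substrings("分配猫费咖啡", ["~猫费"]))  # 分配[猫费]咖啡
```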