When search terms contain CJK characters, default to substring mode
Closed, Resolved, Public

Assigned To

None

Authored By

	epriestley
	Sep 21 2017, 12:05 PM

Description

See PHI76. Languages written in Latin script separate words with spaces. Chinese, Japanese and Korean (CJK) do not.

The default "term search" mode for the Ferret engine uses spaces as word boundaries. This works well for English and other Latin-script languages, where users do not expect the query "cat fee" to match "reallocates coffee". It does not work well for CJK languages, where users do expect the query "猫费" to match "分配猫费咖啡".
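To make the difference between the two modes concrete, here is a minimal Python sketch. It is not how Ferret is implemented; `term_match` and `substring_match` are hypothetical helpers that only model the behavior described above, assuming every query term must match.

```python
def term_match(query: str, text: str) -> bool:
    """Term mode (assumed model): each query term must equal a whole
    space-delimited word in the text."""
    words = set(text.lower().split())
    return all(term in words for term in query.lower().split())


def substring_match(query: str, text: str) -> bool:
    """Substring mode (assumed model): each query term must occur
    somewhere in the text, ignoring word boundaries."""
    haystack = text.lower()
    return all(term in haystack for term in query.lower().split())


if __name__ == "__main__":
    for query, text in [("cat fee", "reallocates coffee"), ("猫费", "分配猫费咖啡")]:
        print(query, "|", text, "|", term_match(query, text), substring_match(query, text))
```

With these definitions, "cat fee" misses "reallocates coffee" in term mode, which is what English users want, but "猫费" also misses "分配猫费咖啡" because the text contains no spaces to split on. Substring mode matches both.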

As an initial approximation, we can:

- Detect if each term contains CJK characters.
- Imply the ~ prefix if it does, putting the term in substring mode.
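The two steps above could be sketched roughly as follows. This is an illustration in Python, not the code from D18634: the function names are hypothetical, and the set of Unicode ranges treated as "CJK" is an assumption that covers only a few common blocks.

```python
import re

# Assumption: "CJK" is approximated by a handful of common Unicode blocks.
# A real implementation may well use a different or larger set of ranges.
CJK_PATTERN = re.compile(
    "["
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul Syllables
    "\uf900-\ufaff"  # CJK Compatibility Ideographs
    "]"
)


def contains_cjk(term: str) -> bool:
    """Return True if the term contains at least one character in the ranges above."""
    return CJK_PATTERN.search(term) is not None


def apply_cjk_default(term: str) -> str:
    """Imply the ~ (substring) prefix for terms that contain CJK characters.

    Terms that already start with ~ are returned unchanged. Any other
    query operators are not considered in this sketch.
    """
    if term.startswith("~"):
        return term
    if contains_cjk(term):
        return "~" + term
    return term


if __name__ == "__main__":
    for term in ["cat", "猫费", "~fee", "한국어"]:
        print(term, "->", apply_cjk_default(term))
```

Checking each term independently means a mixed query keeps term mode for its Latin-script words and switches only the CJK terms to substring mode.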

This can almost certainly be refined, but at least one Chinese user reports success with it (or, at a minimum, better results than with term search) in PHI76.

Additionally, substring terms are not highlighted in results. Substring with no highlight:

Screen Shot 2017-09-21 at 5.03.02 AM.png (725×492 px, 55 KB)

Term search with highlight:

Screen Shot 2017-09-21 at 5.03.47 AM.png (604×489 px, 42 KB)

It would be helpful for all languages to highlight substrings. This is somewhat adjacent to T8646.
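One way to picture substring highlighting is the Python sketch below. It is an assumption-laden illustration, not the approach taken in D18635: the `highlight` function and its bracket markers are hypothetical, and a real result view would emit markup rather than brackets.

```python
import re


def highlight(text: str, terms: list[str], open_mark: str = "[", close_mark: str = "]") -> str:
    """Wrap every case-insensitive occurrence of any term in the given markers.

    Matching ignores word boundaries, so substrings inside longer words
    are highlighted too.
    """
    terms = [t for t in terms if t]
    if not terms:
        return text
    # Try longer terms first so a long term wins over a shorter one
    # that starts at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: open_mark + m.group(0) + close_mark, text)


if __name__ == "__main__":
    print(highlight("分配猫费咖啡", ["猫费"]))
    print(highlight("reallocates coffee", ["cat", "fee"]))
```

Because the pattern is built from escaped literal terms, the same routine serves CJK and Latin-script queries alike.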

Revisions and Commits

rPHU libphutil
	D18634	rPHUe48b86c85efc Default CJK query terms to "substring" mode, not "term" mode
rP Phabricator
	D18635	rP1ac52c09e757 Improve search highlighting for CJK and substring queries

Related Objects

Mentioned Here: T8646: Provide more context for search results, particularly wiki documents

Event Timeline

epriestley created this task. Sep 21 2017, 12:05 PM

Herald added subscribers: eadler, revi. Sep 21 2017, 12:05 PM

epriestley added a revision: D18634: Default CJK query terms to "substring" mode, not "term" mode. Sep 22 2017, 2:25 PM

epriestley added a revision: D18635: Improve search highlighting for CJK and substring queries. Sep 22 2017, 3:15 PM

epriestley closed this task as Resolved by committing rP1ac52c09e757: Improve search highlighting for CJK and substring queries. Sep 22 2017, 6:34 PM

epriestley added a commit: rPHUe48b86c85efc: Default CJK query terms to "substring" mode, not "term" mode.

epriestley added a commit: rP1ac52c09e757: Improve search highlighting for CJK and substring queries.

	F5190248: Screen Shot 2017-09-21 at 5.03.02 AM.png
	Sep 21 2017, 12:05 PM

	F5190255: Screen Shot 2017-09-21 at 5.03.47 AM.png
	Sep 21 2017, 12:05 PM
