Maniphest T12003

Explain to users how fulltext queries are parsed and executed
Closed, Resolved · Public

Assigned To

None

Authored By

	epriestley
	Dec 13 2016, 12:36 PM

Description

Stopwords, short tokens, and non-Latin characters are currently ignored without explanation.

When you search for "the qx あああああ dog", we should provide a hint that "the" was ignored as a stopword, "qx" was ignored as a short token, and "あああああ" was ignored as non-latin (does this actually work with InnoDB fulltext?).

Ideally, we should provide an option to perform substring search instead.
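
The kind of hint described above can be sketched roughly as follows. This is only an illustration of the idea, not Phabricator's actual implementation: the function name `explain_query`, the stopword list, and the minimum token length of 3 are all assumptions made for the example, and the real values depend on how the fulltext engine is configured.

```python
import re

# Hypothetical values for illustration only. The real stopword list and
# minimum token length come from the fulltext engine's configuration.
STOPWORDS = {"the", "a", "an", "and", "of", "to"}
MIN_TOKEN_LENGTH = 3


def explain_query(query):
    """Return (token, status) pairs describing how each token is treated."""
    results = []
    for token in query.split():
        if not re.fullmatch(r"[A-Za-z0-9]+", token):
            # Anything outside basic Latin letters and digits.
            status = "ignored: non-latin"
        elif token.lower() in STOPWORDS:
            status = "ignored: stopword"
        elif len(token) < MIN_TOKEN_LENGTH:
            status = "ignored: short token"
        else:
            status = "searched"
        results.append((token, status))
    return results


for token, status in explain_query("the qx あああああ dog"):
    print(f"{token}: {status}")
```

With these assumed settings, only "dog" would be passed to the search engine, and each of the other three tokens gets a reason the interface could show to the user. Whether non-Latin tokens really need to be dropped depends on the storage engine, which is the open question raised in the description.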

Revisions and Commits

rPHU libphutil
	D17669	rPHU6fe33623cda6 Make the query compiler emit intermediate tokens
rP Phabricator
	D17672	rP3245e74f16bb Show users how fulltext search queries are parsed and executed; don't query…
	D17670	rPcb49acc2ca71 Update Phabricator to use intermediate tokens from the query compiler

Related Objects

Status	Assigned	Task
Resolved	None	T12003 Explain to users how fulltext queries are parsed and executed
Resolved	epriestley	T6892 Invalid search result when I input less than 2 Korean character.
Resolved	epriestley	T2632 MyISAM fulltext does not support non-latin languages and we don't warn you about it
Resolved	joshuaspence	T5282 Provide documentation on setting up ElasticSearch

Event Timeline

epriestley created this task. · Dec 13 2016, 12:36 PM

Herald added a subscriber: eadler. · Dec 13 2016, 12:36 PM

I can find this task by searching for "あああああ" so the CJK stuff may not really be much of an issue anymore? Maybe just an issue with Chinese because the index doesn't know how to tokenize words?

epriestley mentioned this in T12443: Applying fulltext limits first causes missing results. · Mar 23 2017, 1:54 AM

20after4 added a subscriber: 20after4. · Mar 23 2017, 3:01 AM

Elasticsearch has much better support for non-Latin language analysis. See https://www.elastic.co/guide/en/elasticsearch/guide/current/icu-tokenizer.html, which discusses its ability to properly tokenize Thai, Chinese, and Japanese text.

epriestley closed subtask T6892: Invalid search result when I input less than 2 Korean character. as Resolved. · Mar 26 2017, 12:28 PM

epriestley moved this task from Backlog to v2 on the Search board. · Mar 26 2017, 12:32 PM

epriestley mentioned this in T10640: Allow application queries to be promoted as global search modes. · Mar 26 2017, 12:41 PM

epriestley mentioned this in T12450: New Search Configuration Errata. · Mar 26 2017, 12:44 PM

20after4 mentioned this in D17564: Address some New Search Configuration Errata. · Mar 27 2017, 2:41 PM

20after4 mentioned this in rP699228c73b74: Address some New Search Configuration Errata. · Mar 28 2017, 8:19 PM

epriestley added a revision: D17669: Make the query compiler emit intermediate tokens. · Apr 12 2017, 10:43 PM

epriestley added a revision: D17670: Update Phabricator to use intermediate tokens from the query compiler.

epriestley added a revision: D17672: Show users how fulltext search queries are parsed and executed; don't query stopwords or short tokens. · Apr 13 2017, 12:04 AM

epriestley closed subtask T2632: MyISAM fulltext does not support non-latin languages and we don't warn you about it as Resolved. · Apr 13 2017, 12:18 AM

epriestley closed this task as Resolved by committing rP3245e74f16bb: Show users how fulltext search queries are parsed and executed; don't query…. · Apr 13 2017, 2:06 AM

epriestley added a commit: rPHU6fe33623cda6: Make the query compiler emit intermediate tokens.