
MyISAM fulltext does not support non-latin languages and we don't warn you about it
Closed, Resolved, Public

Description

The default MyISAM fulltext index does not support non-latin languages (like Chinese, Japanese and Korean). Although there are some hacks to make it work, they seem prohibitively messy. Instead, we should:

  • Verify that ElasticSearch produces reasonable results for non-latin languages (the documentation implies it does).
  • Verify that our documentation / instructions are sufficient for an install to move to ElasticSearch and reindex its existing corpora.
  • Detect when users search for non-latin glyphs and present a useful instruction screen: "the current search engine only supports English and other latin languages. See such-and-such documentation for instructions to upgrade to ElasticSearch".
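
A minimal sketch of what that detection could look like, assuming a hypothetical helper built on PCRE's Unicode script classes (not existing Phabricator code):

```php
<?php

// Hypothetical helper (sketch only): returns true if a query contains
// characters outside the Latin script, which the default MyISAM fulltext
// engine will not index usefully.
function query_contains_nonlatin_text($query) {
  // \p{Latin} matches Latin-script letters; \p{Common} covers digits,
  // punctuation, and whitespace shared across scripts. Anything else
  // (Han, Hiragana, Katakana, Hangul, Cyrillic, ...) should trigger the
  // instruction screen.
  return (bool)preg_match('/[^\p{Latin}\p{Common}]/u', $query);
}

// query_contains_nonlatin_text('hello world');  // false
// query_contains_nonlatin_text('汉字');          // true
```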

Original Description

I can't search for Chinese words in Maniphest. When I type a Chinese word in the search field, the result is always empty.

Does Maniphest not support CJK?

Event Timeline

Hm, for the search field itself it returns all tasks for me instead of filtering. It apparently treats the query as empty; I also had problems when creating a new task with Chinese characters. The title "汉字/漢字 project" was considered to collide with "Project", so it would not allow me to create it. I tested searching for {T2633}.

This does not happen with the Typeahead though. It recognizes Chinese characters fine.

@epriestley
Any idea how UTF-8 characters might be converted differently for the Search field query than for the Typeahead?

Tbh, I suspect unfiltered queries. I queried with some Chinese characters in the Task ID field and got this: https://secure.phabricator.com/maniphest/view/custom/?tasks=%E6%B1%89%E5%AD%97%E6%BC%A2%E5%AD%97 (sanitized of unnecessary GET parameters; still reproduces).

I think there are several different problems here:

  1. UTF-8 characters in URIs (see T1466). When we need to generate URIs from strings (for example, when naming Phriction pages) we are currently very aggressive about sanitizing the characters, and pretty much remove everything but a small subset of latin characters. We must remove or encode some characters (like "?" and "%"), and removing them normally generates better URIs than encoding them (for a Phriction page named "What is the best PHP function?", a URI like "/what_is_the_best_php_function/" is cleaner than "what_is_the_best_php_function%3F"). It's very easy to stop stripping these characters, but since I don't have any experience with UTF-8 in URIs I'm not sure what the consequences of doing this are. In particular, can we safely generate a URI like /汉/ and have it work in all browsers/email/etc., or do we need to encode it (/%E6%B1%89/)? If we do need to encode it, it becomes nearly unreadable, so maybe we should try to transliterate (or even just use random IDs, since /xf1/ is probably better than /%E6%B1%89/). My major concern here is that we generate unencoded URIs and then 6 months later learn that there are 200 subtle reasons they don't actually work, and possibly end up with a bunch of semi-broken data we can't reasonably migrate. Do you have experience with the interactions between URIs and non-latin characters? (A small encoding example follows this list.)
  2. Project name collisions. This is basically the same as (1). When we create a project, we generate a name for it based on the Phriction URI, and this name must be unique (we create project pages under /w/project/THE_PROJECT_NAME/ or similar).
  3. Truncation of non-base-plane characters (see T1191). Not addressed here, but somewhat related. This is a MySQL limitation -- T1191 has a good discussion.
  4. Exception when searching by task ID. This is just some kind of minor input filtering issue; it repros for some normal latin strings too, like this one or this one.
  5. Searching for non-latin text. My guess on this is that it's a limitation of the default MyISAM FULLTEXT engine we use -- see this article, for example. It offers a possible hackaround that we might be able to use, but it's probably a lot of work that involves a lot of special casing for character ranges. A better approach might be to detect that a query contains non-latin characters and raise a warning like "The default engine only supports latin text. To search for text in other languages, configure the ElasticSearch engine." and then link to the documentation (which we might need to write first), since it at least ostensibly supports a bunch of languages. We could look into Senna, too, or there might be other options.
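
To make the encoding question in (1) concrete, here is a tiny sketch using stock PHP functions (the slug value is just an example):

```php
<?php

// Round trip of a CJK slug through percent-encoding with stock PHP functions.
$slug = '汉字';

$encoded = rawurlencode($slug);     // "%E6%B1%89%E5%AD%97" -- valid everywhere, but unreadable
$decoded = rawurldecode($encoded);  // "汉字" again

echo "/w/{$slug}/\n";     // /w/汉字/                -- readable, relies on correct UTF-8 handling downstream
echo "/w/{$encoded}/\n";  // /w/%E6%B1%89%E5%AD%97/  -- safe, but nearly opaque to humans
```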

My general rule of thumb is: Encode anything that's a special token or not in the visible ASCII range.

Providing plain links with UTF-8, à la http://en.wiktionary.org/wiki/汉字, works for modern browsers (even curl), too, but I'd worry about mail clients and other programs which may interpret these characters as ASCII and not convert the UTF-8 accordingly, so clutter/weird letters appear in the URL, and in worse but unlikely cases, also undesirable characters like ? or %. We should avoid emitting plain UTF-8 characters.

We have to URL-encode certain characters like ?, &, (space), +, and % (think of #, too, when you want to handle it server-side) as they are special tokens which will be interpreted by browsers and servers: ? denotes the start of the query string, while & separates the query parameters, and so on. I think you are familiar with that.
Apart from these few cases and some unreasonable additions like ", anything in the visible 7-bit ASCII range can be put plainly into the URL: .../wiki/bla:=fge//+~*!'$bla@
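
As a sketch of that rule of thumb (a hypothetical helper, not Phabricator code): percent-encode the reserved tokens and anything outside the visible ASCII range, and pass everything else through unchanged.

```php
<?php

// Hypothetical sketch: encode reserved tokens and non-visible-ASCII bytes,
// leave the rest of visible ASCII plain.
function encode_uri_component_loosely($string) {
  $reserved = array('?', '&', '%', '+', '#', ' ', '"');
  $result = '';
  foreach (str_split($string) as $byte) {
    $ord = ord($byte);
    if ($ord < 0x21 || $ord > 0x7E || in_array($byte, $reserved, true)) {
      $result .= sprintf('%%%02X', $ord);
    } else {
      $result .= $byte;
    }
  }
  return $result;
}

// encode_uri_component_loosely('bla:=fge//+~*!$bla@')
//   => "bla:=fge//%2B~*!$bla@"   (only "+" is encoded)
// encode_uri_component_loosely('汉')
//   => "%E6%B1%89"               (multibyte UTF-8 is encoded byte by byte)
```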

Truncating ? or % from Phriction slugs is fine imo, since they aren't part of natural language and their removal usually doesn't disrupt anything. Maybe the user is confused about its disappearance, but they'll get over it quickly, since nobody types full URL paths into the address bar apart from plain domains like google.com or facebook.com.
We should keep + and &, though, since they usually add something meaningful to the name, even if it makes the URL ugly (consider Kitty + moar kitty = 4wsum vs kitty_moar_kitty_4wsum).
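
For illustration, a hypothetical slug generator along the lines of the current aggressive behavior (sketch only; the character list is an assumption, not the exact set Phriction strips today):

```php
<?php

// Hypothetical sketch of aggressive slug sanitization: drop URI-special
// characters, collapse whitespace into underscores.
function generate_slug($title) {
  $slug = strtolower($title);
  // Replace characters that are special in URIs (or otherwise stripped today)
  // with spaces.
  $slug = preg_replace('/[?%+&=#"]/', ' ', $slug);
  // Collapse runs of whitespace and underscores into single underscores.
  $slug = preg_replace('/[\s_]+/', '_', trim($slug));
  return trim($slug, '_');
}

// generate_slug('Kitty + moar kitty = 4wsum');
//   => "kitty_moar_kitty_4wsum"
// generate_slug('What is the best PHP function?');
//   => "what_is_the_best_php_function"
```

Keeping + and & instead would mean percent-encoding them rather than dropping them, which preserves meaning at the cost of uglier URLs.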

All reasonably modern browsers I know of are pretty smart about this and display it just fine in the address and status bar. That means people who aren't using IE5 won't see clutter like .../wiki/%E6%B1%89%E5%AD%97; a readable .../wiki/汉字 will appear in their address bar instead.

So allowing CJK should be no problem at all; the clutter only appears when links are posted elsewhere.

Regarding (3) and (5)... I never dove into that.
The warning is reasonable; Phabricator is used by devs, not ordinary mortals, so they should be able to comprehend whatever we throw at them.

epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.

T1695 is related / a duplicate. (How do I merge?)

Ah, thanks! There's a "Merge Duplicates" action in the upper right to perform merges.

epriestley renamed this task from "Custom Query of Maniphest can't search for CJK" to "MyISAM fulltext does not support non-latin languages and we don't warn you about it". (Apr 11 2013, 3:53 AM)
epriestley triaged this task as Normal priority.
epriestley updated the task description. (Show Details)
epriestley changed the visibility from "All Users" to "Public (No Login Required)". (Feb 10 2014, 2:58 AM)
chad added a subscriber: chad.

This has recently had some Support Impact. A setup warning seems reasonable?

The task description has a little spec that I prefer over a setup warning. The Support Impact part is true given the recent merge action.

Is there a temporary way to resolve this issue? (Just for wiki search.)

Thank you for your answer, but I got this error message after running "search index --all":

[HTTP/404] Not Found
{"error":"IndexMissingException[[phabricator] missing]","status":404}

Does this need more configuration?

Just run

./bin/search init

first. This will create the needed Elasticsearch index.

Thanks @fabe.
I resolved that issue after running search init and changing the MySQL config (STRICT option).
It works well, but I found T6892.

epriestley claimed this task.

Searching for ひらがな or 漢字汉字 or 漢字汉 now finds this task. Searching for 漢字 or 汉字 does not, but I think that's because they're under the minimum token length.

So I think this works fine in some cases, although you may need to upgrade to get InnoDB fulltext, and if you want to be able to search for 2-character terms, adjust innodb_ft_min_token_size to 2 and use bin/search index --all --force to rebuild the search index.
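
For reference, the my.cnf change described above would look something like this (assuming a stock MySQL install where fulltext settings live in the [mysqld] section):

```ini
# my.cnf -- allow 2-character terms (e.g. 漢字) in InnoDB fulltext indexes
[mysqld]
innodb_ft_min_token_size = 2
```

The setting only affects indexes built after the change, so restart MySQL and then rebuild with bin/search index --all --force as noted above.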

If there are remaining issues here (which I think is likely), please file a new task with specific reproduction instructions -- this task is extremely old and discusses about a dozen different issues, many of which are now resolved.

I suspect there is probably still an issue where CJK languages are not tokenized correctly, because they don't use spaces and word-based indexes are sort of meaningless for them, but please file a new report with a specific example of what you're trying to do if you're impacted by this. We can pursue substring search and ngram indexing, and already have substantial support for this elsewhere (T9979).
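
As an illustration of the ngram idea (a hypothetical helper, not the implementation referenced in T9979):

```php
<?php

// Hypothetical sketch: character-level ngrams for text without space-delimited
// words. Indexing these lets a substring query match CJK text.
function generate_character_ngrams($text, $n = 2) {
  $characters = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
  $ngrams = array();
  for ($ii = 0; $ii + $n <= count($characters); $ii++) {
    $ngrams[] = implode('', array_slice($characters, $ii, $n));
  }
  return $ngrams;
}

// generate_character_ngrams('漢字汉字');
//   => array('漢字', '字汉', '汉字')
// A query for "汉字" can then match against the bigram index even though the
// source text contains no spaces.
```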