
MyISAM fulltext does not support non-latin languages and we don't warn you about it
Closed, Resolved, Public

Description

The default MyISAM fulltext index does not support non-latin languages (like Chinese, Japanese and Korean). Although there are some hacks to make it work, they seem prohibitively messy. Instead, we should:

  • Verify that ElasticSearch produces reasonable results for non-latin languages (the documentation implies it does).
  • Verify that our documentation / instructions are sufficient for an install to move to ElasticSearch and reindex its existing corpora.
  • Detect when users search for non-latin glyphs and present a useful instruction screen: "the current search engine only supports English and other latin languages. See such-and-such documentation for instructions to upgrade to ElasticSearch".
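
A minimal sketch of what that detection could look like, assuming a hypothetical helper built on PCRE's Unicode script classes (not existing Phabricator code):

```php
<?php

// Hypothetical helper (sketch only): returns true if a query contains
// characters outside the Latin script, which the default MyISAM fulltext
// engine will not index usefully.
function query_contains_nonlatin_text($query) {
  // \p{Latin} matches Latin-script letters; \p{Common} covers digits,
  // punctuation, and whitespace shared across scripts. Anything else
  // (Han, Hiragana, Katakana, Hangul, Cyrillic, ...) should trigger the
  // instruction screen.
  return (bool)preg_match('/[^\p{Latin}\p{Common}]/u', $query);
}

// query_contains_nonlatin_text('hello world');  // false
// query_contains_nonlatin_text('汉字');          // true
```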

Original Description

I can't search for Chinese words in Maniphest. When I type a Chinese word in the search field, the result is always empty.

Does Maniphest not support CJK?

Event Timeline

Hm, for the search field itself it returns all tasks for me instead of filtering. It apparently treats the query as empty; I also had problems when creating a new task with Chinese characters. The title "汉字/漢字 project" was considered to collide with "Project", so it would not allow me to create it. I tested searching for {T2633}.

This does not happen with the Typeahead though. It recognizes Chinese characters fine.

@epriestley
Any idea how UTF-8 characters might be converted differently for the Search field query than for the Typeahead?

Tbh, I suspect unfiltered queries. I queried with some Chinese characters in the Task ID field and got this: https://secure.phabricator.com/maniphest/view/custom/?tasks=%E6%B1%89%E5%AD%97%E6%BC%A2%E5%AD%97 (sanitized of unnecessary GET parameters; still reproduces).

I think there are several different problems here:

  1. UTF-8 characters in URIs (see T1466). When we need to generate URIs from strings (for example, when naming Phriction pages) we are currently very aggressive about sanitizing the characters, and pretty much remove everything but a small subset of latin characters. We must remove or encode some characters (like "?" and "%"), and removing them normally generates better URIs than encoding them (for a Phriction page named "What is the best PHP function?", a URI like "/what_is_the_best_php_function/" is cleaner than "what_is_the_best_php_function%3F"). It's very easy to stop stripping these characters, but since I don't have any experience with UTF-8 in URIs I'm not sure what the consequences of doing this are. In particular, can we safely generate a URI like /汉/ and have it work in all browsers/email/etc., or do we need to encode it (/%E6%B1%89/)? If we do need to encode it, it becomes nearly unreadable, so maybe we should try to transliterate (or even just use random IDs, since /xf1/ is probably better than /%E6%B1%89/). My major concern here is that we generate unencoded URIs and then 6 months later learn that there are 200 subtle reasons they don't actually work, and possibly end up with a bunch of semi-broken data we can't reasonably migrate. Do you have experience with the interactions between URIs and non-latin characters? (A small encoding example follows this list.)
  2. Project name collisions. This is basically the same as (1). When we create a project, we generate a name for it based on the Phriction URI, and this name must be unique (we create project pages under /w/project/THE_PROJECT_NAME/ or similar).
  3. Truncation of non-base-plane characters (see T1191). Not addressed here, but somewhat related. This is a MySQL limitation -- T1191 has a good discussion.
  4. Exception when searching by task ID. This is just some kind of minor input filtering issue; it repros for some normal latin strings too, like this one or this one.
  5. Searching for non-latin text. My guess on this is that it's a limitation of the default MyISAM FULLTEXT engine we use -- see this article, for example. It offers a possible hackaround that we might be able to use, but it's probably a lot of work that involves a lot of special casing for character ranges. A better approach might be to detect that a query contains non-latin characters and raise a warning like "The default engine only supports latin text. To search for text in other languages, configure the ElasticSearch engine." and then link to the documentation (which we might need to write first), since it at least ostensibly supports a bunch of languages. We could look into Senna, too, or there might be other options.
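
To make the encoding question in (1) concrete, here is a tiny sketch using stock PHP functions (the slug value is just an example):

```php
<?php

// Round trip of a CJK slug through percent-encoding with stock PHP functions.
$slug = '汉字';

$encoded = rawurlencode($slug);     // "%E6%B1%89%E5%AD%97" -- valid everywhere, but unreadable
$decoded = rawurldecode($encoded);  // "汉字" again

echo "/w/{$slug}/\n";     // /w/汉字/                -- readable, relies on correct UTF-8 handling downstream
echo "/w/{$encoded}/\n";  // /w/%E6%B1%89%E5%AD%97/  -- safe, but nearly opaque to humans
```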

My general rule of thumb is: Encode anything that's a special token or not in the visible ASCII range.

Providing plain links with UTF-8, à la http://en.wiktionary.org/wiki/汉字, works for modern browsers (even curl), too, but I'd worry about mail clients and other programs which may interpret these characters as ASCII and not convert the UTF-8 accordingly, so clutter/weird letters appear in the URL, and in worse but unlikely cases, also undesirable characters like ? or %. We should avoid emitting plain UTF-8 characters.

We have to URL-encode certain characters like ?, &, (space), +, and % (think of #, too, when you want to handle it server-side) as they are special tokens which will be interpreted by browsers and servers: ? denotes the start of the query string, while & separates the query parameters, and so on. I think you are familiar with that.
Apart from these few cases and some unreasonable additions like ", anything in the visible 7-bit ASCII range can be put plainly into the URL: .../wiki/bla:=fge//+~*!'$bla@
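
As a sketch of that rule of thumb (a hypothetical helper, not Phabricator code): percent-encode the reserved tokens and anything outside the visible ASCII range, and pass everything else through unchanged.

```php
<?php

// Hypothetical sketch: encode reserved tokens and non-visible-ASCII bytes,
// leave the rest of visible ASCII plain.
function encode_uri_component_loosely($string) {
  $reserved = array('?', '&', '%', '+', '#', ' ', '"');
  $result = '';
  foreach (str_split($string) as $byte) {
    $ord = ord($byte);
    if ($ord < 0x21 || $ord > 0x7E || in_array($byte, $reserved, true)) {
      $result .= sprintf('%%%02X', $ord);
    } else {
      $result .= $byte;
    }
  }
  return $result;
}

// encode_uri_component_loosely('bla:=fge//+~*!$bla@')
//   => "bla:=fge//%2B~*!$bla@"   (only "+" is encoded)
// encode_uri_component_loosely('汉')
//   => "%E6%B1%89"               (multibyte UTF-8 is encoded byte by byte)
```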

Truncating ? or % from Phriction slugs is fine imo, since they aren't part of natural language and their removal usually doesn't disrupt anything. Maybe the user is confused about its disappearance, but they'll get over it quickly, since nobody types full URL paths into the address bar apart from plain domains like google.com or facebook.com.
We should keep + and &, though, since they usually add something meaningful to the name, even if it makes the URL ugly (consider Kitty + moar kitty = 4wsum vs kitty_moar_kitty_4wsum).
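
For illustration, a hypothetical slug generator along the lines of the current aggressive behavior (sketch only; the character list is an assumption, not the exact set Phriction strips today):

```php
<?php

// Hypothetical sketch of aggressive slug sanitization: drop URI-special
// characters, collapse whitespace into underscores.
function generate_slug($title) {
  $slug = strtolower($title);
  // Replace characters that are special in URIs (or otherwise stripped today)
  // with spaces.
  $slug = preg_replace('/[?%+&=#"]/', ' ', $slug);
  // Collapse runs of whitespace and underscores into single underscores.
  $slug = preg_replace('/[\s_]+/', '_', trim($slug));
  return trim($slug, '_');
}

// generate_slug('Kitty + moar kitty = 4wsum');
//   => "kitty_moar_kitty_4wsum"
// generate_slug('What is the best PHP function?');
//   => "what_is_the_best_php_function"
```

Keeping + and & instead would mean percent-encoding them rather than dropping them, which preserves meaning at the cost of uglier URLs.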

All reasonably modern browsers I know of are pretty smart about this and display it just fine in the address and status bar. That means people who aren't using IE5 won't see clutter like .../wiki/%E6%B1%89%E5%AD%97; a readable .../wiki/汉字 will appear in their address bar instead.

So allowing CJK should be no problem at all; the clutter only appears when links are posted elsewhere.

Regarding (3) and (5)... I never dove into that.
The warning is reasonable; Phabricator is used by devs, not ordinary mortals, so they should be able to comprehend whatever we throw at them.

epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.

T1695 is related / a duplicate. (How do I merge?)

Ah, thanks! There's a "Merge Duplicates" action in the upper right to perform merges.

epriestley renamed this task from "Custom Query of Maniphest can't search for CJK" to "MyISAM fulltext does not support non-latin languages and we don't warn you about it". (Apr 11 2013, 3:53 AM)
epriestley triaged this task as Normal priority.
epriestley updated the task description. (Show Details)
epriestley changed the visibility from "All Users" to "Public (No Login Required)". (Feb 10 2014, 2:58 AM)
chad added a subscriber: chad.

This has recently had some Support Impact. A setup warning seems reasonable?

The task description has a little spec that I prefer over a setup warning. The Support Impact part is true given the recent merge action.

Is there a temporary way to resolve this issue? (Just for wiki search.)

Thank you for your answer, but I got this error message after running "search index --all":

[HTTP/404] Not Found
{"error":"IndexMissingException[[phabricator] missing]","status":404}

Does this need more configuration?

Just run

./bin/search init

first. This will create the needed Elasticsearch index.

Thanks @fabe.
I resolved that issue after running search init and changing the MySQL config (STRICT option).
It works well, but I found T6892.

epriestley claimed this task.

Searching for ひらがな or 漢字汉字 or 漢字汉 now finds this task. Searching for 漢字 or 汉字 does not, but I think that's because they're under the minimum token length.

So I think this works fine in some cases, although you may need to upgrade to get InnoDB fulltext, and if you want to be able to search for 2-character terms, adjust innodb_ft_min_token_size to 2 and use bin/search index --all --force to rebuild the search index.
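
For reference, the my.cnf change described above would look something like this (assuming a stock MySQL install where fulltext settings live in the [mysqld] section):

```ini
# my.cnf -- allow 2-character terms (e.g. 漢字) in InnoDB fulltext indexes
[mysqld]
innodb_ft_min_token_size = 2
```

The setting only affects indexes built after the change, so restart MySQL and then rebuild with bin/search index --all --force as noted above.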

If there are remaining issues here (which I think is likely), please file a new task with specific reproduction instructions -- this task is extremely old and discusses about a dozen different issues, many of which are now resolved.

I suspect there is probably still an issue where CJK languages are not tokenized correctly, because they don't use spaces and word-based indexes are sort of meaningless for them, but please file a new report with a specific example of what you're trying to do if you're impacted by this. We can pursue substring search and ngram indexing, and already have substantial support for this elsewhere (T9979).
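
As an illustration of the ngram idea (a hypothetical helper, not the implementation referenced in T9979):

```php
<?php

// Hypothetical sketch: character-level ngrams for text without space-delimited
// words. Indexing these lets a substring query match CJK text.
function generate_character_ngrams($text, $n = 2) {
  $characters = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
  $ngrams = array();
  for ($ii = 0; $ii + $n <= count($characters); $ii++) {
    $ngrams[] = implode('', array_slice($characters, $ii, $n));
  }
  return $ngrams;
}

// generate_character_ngrams('漢字汉字');
//   => array('漢字', '字汉', '汉字')
// A query for "汉字" can then match against the bigram index even though the
// source text contains no spaces.
```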