Default CJK query terms to "substring" mode, not "term" mode
ClosedPublic
Actions

Authored by epriestley on Sep 22 2017, 2:25 PM.

Details

Reviewers

amckinley

Maniphest Tasks

T12995: When search terms contain CJK characters, default to substring mode

Commits

rPHU9f4d688e576d: (stable) Promote 2017 Week 38
rPHUe48b86c85efc: Default CJK query terms to "substring" mode, not "term" mode

Summary

Ref T12995. This might be a little rough but should get us started on CJK support.

If a query term has CJK characters and doesn't have a query operator, default it to "substring" mode instead of "term" mode.

You can force a query term into "term" mode with the "+" operator, which is normally redundant. So this should just make the default behavior better without removing any capabilities.

Test Plan

Added some unit tests and made them pass.
Searched for "主羽岡" with no operators, found "距暮類供派主羽岡際問月勉央".

Diff Detail

Repository

rPHU libphutil

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

epriestley created this revision.Sep 22 2017, 2:25 PM

Harbormaster completed remote builds in B18518: Diff 44741.Sep 22 2017, 2:26 PM

amckinley accepted this revision.Sep 22 2017, 4:59 PM

amckinley added inline comments.

src/search/__tests__/PhutilSearchQueryCompilerTestCase.php
95	Do we try to avoid embedding actual Unicode characters in source?
src/utils/utf8.php
397	Is this a function that will need to change all the time? I have no idea how stable UTF8 is for CJK stuff.

This revision is now accepted and ready to land.Sep 22 2017, 4:59 PM

Do we try to avoid embedding actual Unicode characters in source?

Yeah, there's a lint rule that warns about non-ASCII source.

Although Phabricator now generally handles UTF8 fine (and, to a lesser degree, other non-UTF8 encodings), a lot of engineers at Facebook had editors configured to immediately mangle all UTF8 in any file they opened. This may be less prevalent now than it was in ~2007-2011, but the rule seems generally reasonable anyway, and Phabricator is at least theoretically open source open to contribution from anyone. This rule also covers a bunch of theoretical badness with ZWS, RTL markers, BOM sequences, etc (see T12822). The cost is mostly that a small number of unit tests are a bit harder to read, which doesn't seem too huge a price to pay.

Is this a function that will need to change all the time? I have no idea how stable UTF8 is for CJK stuff.

I doubt Unicode will change much, but I'm only like 90% confident that I got most of the CJK characters. It's also possible that we might want to include, say, emoji in this block -- if you search for 🐑 you might not care if it's a term search? Or patterns like /path/to/file.c might make sense to automatically execute as substring search. So I'd guess this will change a little, maybe, but probably because of human expectations rather than Unicode itself changing. The similar phutil_utf8_console_strlen(), above, has been stable since I wrote it and phutil_utf8_is_combining_character() has too. In both cases these methods are probably not perfect, but they may reasonably cover 100% of actual use cases real humans have.