Page MenuHomePhabricator

Default CJK query terms to "substring" mode, not "term" mode
ClosedPublic

Authored by epriestley on Sep 22 2017, 2:25 PM.
Tags
None
Referenced Files
F14245410: D18634.diff
Fri, Dec 13, 5:28 AM
F14233997: D18634.id44741.diff
Thu, Dec 12, 5:53 AM
Unknown Object (File)
Tue, Dec 10, 9:46 PM
Unknown Object (File)
Mon, Dec 9, 4:15 PM
Unknown Object (File)
Mon, Dec 9, 4:15 PM
Unknown Object (File)
Mon, Dec 9, 4:13 PM
Unknown Object (File)
Fri, Dec 6, 3:48 AM
Unknown Object (File)
Tue, Dec 3, 1:09 PM
Subscribers
None

Details

Summary

Ref T12995. This might be a little rough but should get us started on CJK support.

If a query term has CJK characters and doesn't have a query operator, default it to "substring" mode instead of "term" mode.

You can force a query term into "term" mode with the "+" operator, which is normally redundant. So this should just make the default behavior better without removing any capabilities.

Test Plan
  • Added some unit tests and made them pass.
  • Searched for "主羽岡" with no operators, found "距暮類供派主羽岡際問月勉央".

Diff Detail

Repository
rPHU libphutil
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

amckinley added inline comments.
src/search/__tests__/PhutilSearchQueryCompilerTestCase.php
95

Do we try to avoid embedding actual Unicode characters in source?

src/utils/utf8.php
397

Is this a function that will need to change all the time? I have no idea how stable UTF8 is for CJK stuff.

This revision is now accepted and ready to land.Sep 22 2017, 4:59 PM

Do we try to avoid embedding actual Unicode characters in source?

Yeah, there's a lint rule that warns about non-ASCII source.

Although Phabricator now generally handles UTF8 fine (and, to a lesser degree, other non-UTF8 encodings), a lot of engineers at Facebook had editors configured to immediately mangle all UTF8 in any file they opened. This may be less prevalent now than it was in ~2007-2011, but the rule seems generally reasonable anyway, and Phabricator is at least theoretically open source open to contribution from anyone. This rule also covers a bunch of theoretical badness with ZWS, RTL markers, BOM sequences, etc (see T12822). The cost is mostly that a small number of unit tests are a bit harder to read, which doesn't seem too huge a price to pay.

Is this a function that will need to change all the time? I have no idea how stable UTF8 is for CJK stuff.

I doubt Unicode will change much, but I'm only like 90% confident that I got most of the CJK characters. It's also possible that we might want to include, say, emoji in this block -- if you search for 🐑 you might not care if it's a term search? Or patterns like /path/to/file.c might make sense to automatically execute as substring search. So I'd guess this will change a little, maybe, but probably because of human expectations rather than Unicode itself changing. The similar phutil_utf8_console_strlen(), above, has been stable since I wrote it and phutil_utf8_is_combining_character() has too. In both cases these methods are probably not perfect, but they may reasonably cover 100% of actual use cases real humans have.

This revision was automatically updated to reflect the committed changes.