
Implement basic ngram search for Owners Package names
Closed, Public

Authored by epriestley on Dec 21 2015, 9:07 PM.

Details

Summary

Ref T9979. This uses ngrams (specifically, trigrams) to build a reasonably efficient index for substring matching. For example, for a package named "Example" with ID 123, we store rows like this:

< ex, 123>
<exa, 123>
<xam, 123>
<amp, 123>
<mpl, 123>
<ple, 123>
<le , 123>

When the user searches for "exam", we join against this table to find packages that have both the "exa" and "xam" tokens. MySQL can do this much more efficiently than it can process a LIKE "%exam%" query against a huge table.
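
The indexing and lookup described above can be sketched in a few lines of Python. This is an illustrative in-memory model, not the code in this diff: the function names (index_ngrams, query_ngrams, build_index, search) are hypothetical, a dict of sets stands in for the MySQL ngram table, and set intersection stands in for the joins. Splitting multi-word names on whitespace is also an assumption.

```python
from collections import defaultdict


def index_ngrams(name: str, n: int = 3) -> list[str]:
    """Ngrams stored for a package name. Each word is lowercased and padded
    with one space on each side, which reproduces the rows shown for "Example".
    (Splitting on whitespace is an assumption for multi-word names.)"""
    grams: list[str] = []
    for word in name.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram not in grams:
                grams.append(gram)
    return grams


def query_ngrams(query: str, n: int = 3) -> list[str]:
    """Ngrams of a search string. No padding, so matches may start mid-word."""
    text = query.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(packages: dict[int, str]) -> dict[str, set[int]]:
    """Map each ngram to the set of package IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for package_id, name in packages.items():
        for gram in index_ngrams(name):
            index[gram].add(package_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Candidate packages containing every ngram of the query.
    Queries shorter than three characters yield no ngrams here."""
    grams = query_ngrams(query)
    if not grams:
        return set()
    return set.intersection(*(index.get(gram, set()) for gram in grams))


index = build_index({123: "Example", 456: "Sample"})
print(index_ngrams("Example"))
print(search(index, "exam"))
```

Note that containing every ngram of the query is a necessary condition for a substring match, not a sufficient one, so the result is best read as a candidate set.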

When the user searches for a one-letter or two-letter string, we only search the beginnings of words. This is probably what they want, it is the only thing we can do quickly, and it is the behavior users expect from typeaheads.
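
One way to picture the short-query case: because each indexed word is padded with a leading space, a word that begins with "ex" produces the token " ex", so a one-letter or two-letter query can be answered by looking for tokens that start with a space followed by the query. The sketch below models that in memory. The names (word_ngrams, build_index, short_search) are hypothetical, and the prefix scan over tokens is an assumption about how such a lookup could work, not a description of the query this diff issues.

```python
from collections import defaultdict


def word_ngrams(name: str, n: int = 3) -> set[str]:
    """Trigrams of each lowercased word, padded with one space per side."""
    grams: set[str] = set()
    for word in name.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def build_index(packages: dict[int, str]) -> dict[str, set[int]]:
    """Map each trigram to the IDs of the packages that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for package_id, name in packages.items():
        for gram in word_ngrams(name):
            index[gram].add(package_id)
    return index


def short_search(index: dict[str, set[int]], query: str) -> set[int]:
    """Match one-letter or two-letter queries against word beginnings only.
    A token that starts with " " + query can only come from the start of a word."""
    if not 1 <= len(query) <= 2:
        raise ValueError("short_search expects a one- or two-character query")
    needle = " " + query.lower()
    matches: set[int] = set()
    for token, package_ids in index.items():
        if token.startswith(needle):
            matches |= package_ids
    return matches


index = build_index({123: "Example", 456: "Sample Exports", 789: "Index"})
print(short_search(index, "ex"))
print(short_search(index, "x"))
```

In this model, "x" matches nothing even though "Index" and "Example" both contain the letter, because no word begins with it.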

Test Plan
  • Ran storage upgrades and search indexer.
  • Searched for stuff with "name contains".
  • Used typeahead and got sensible results.
  • Searched for aabbccddeeffgghhiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz and saw only 16 joins.

Diff Detail

Repository: rP Phabricator
Lint: Lint Not Applicable
Unit: Tests Not Applicable