
Allow the Ferret engine to remove "common" ngrams from the index
ClosedPublic

Authored by epriestley on Oct 2 2017, 10:50 PM.
Tags
None
Subscribers
None

Details

Summary

Ref T13000. This adds support for tracking "common" ngrams, which occur in too many documents to be useful as part of the ngram index.

If an ngram is listed in the "common" table, it won't be written when indexing documents, or queried for when searching for them.
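As a rough sketch of that behavior (in Python rather than Phabricator's PHP; the function names and the `common` set are invented for illustration):

```python
# Illustrative sketch only -- not Phabricator's actual implementation.
# An ngram listed in the "common" table is skipped at index time,
# mirroring how the table suppresses both index writes and queries.

def trigrams(text):
    """Yield the 3-character ngrams of a normalized string."""
    text = text.lower()
    for i in range(len(text) - 2):
        yield text[i:i + 3]

def indexable_ngrams(text, common):
    """Distinct trigrams worth writing to the index: everything
    that is not listed in the common-ngrams table."""
    return set(trigrams(text)) - common

common = {"the", "he "}  # pretend these rows exist in the common table
print(sorted(indexable_ngrams("the xylophone", common)))
```

Common trigrams like "the" drop out, while rare ones like "xyl" and "oph" survive and keep their selectivity.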

In this change, nothing actually writes to the "common" table. I'll start writing to the table in a followup change.

Specifically, I plan to do this:

  • A new GC process periodically updates the "common" table, writing to it any ngram that appears in more than X% of documents (for some value of X), provided the corpus contains at least a minimum number of documents (maybe around 4,000).
  • A new GC process deletes ngrams listed in the "common" table from the existing indexes.

Hopefully, this will pare down the ngrams index to something reasonable over time without requiring any manual tuning.
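The proposed GC policy could look something like the sketch below. The threshold value and all names here are assumptions for illustration, not part of the patch:

```python
# Hypothetical sketch of the proposed GC policy: promote an ngram to
# the "common" table once it appears in more than X% of documents, but
# only after the corpus is large enough for frequencies to be meaningful.

MIN_DOCUMENTS = 4000      # minimum corpus size before promoting anything
COMMON_THRESHOLD = 0.15   # X, as a fraction of documents; the value is a guess

def find_common_ngrams(doc_counts, total_documents):
    """doc_counts maps ngram -> number of documents containing it.
    Returns the ngrams that should be promoted to the common table."""
    if total_documents < MIN_DOCUMENTS:
        return set()
    cutoff = total_documents * COMMON_THRESHOLD
    return {gram for gram, count in doc_counts.items() if count > cutoff}
```

Below the document minimum nothing is promoted, so small installs keep a full ngram index; above it, only genuinely widespread ngrams are pruned.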

Test Plan
  • Ran some queries and indexes.
  • Manually inserted the ngrams xxx and yyy into the common ngrams table, then searched and indexed, and saw them ignored as viable ngrams for search/indexing.

Diff Detail

Repository
rP Phabricator
Branch
common1
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 18612
Build 25070: Run Core Tests
Build 25069: arc lint + arc unit

Event Timeline

Basically, the idea here is that if you search for:

  • "the": It's pointless to use the ngrams table anyway since it has very little power to constrain this query, and falling back to LIKE on the ffield table is perfectly fine, since most documents match anyway.
  • "the xylophone": Rare ngrams like "xyl", "ylo", and "oph" will be selected, which is correct and desirable. This kind of query should actually perform better.
  • "the of in and to it is a my": This query is probably fine as a LIKE since all the terms are very common, but it's possible it performs a little worse.
  • "source code program bug": This sort of query, where all terms are common words but not incredibly common words, might get worse. If 2+ of the ngrams occur in a small number of documents, performance should be pretty great. If only 1, performance should be reasonable. If 0, this may be the worst case (we fall back to LIKE and need to grep a lot of text). The choice of a "common" threshold will determine how many words fall into this awkward spot before the ngrams index kicks in.

This change doesn't alter search results; it only changes the situations in which we try to use the ngrams table to constrain the amount of text we eventually grep through with LIKE. So the worst case is just that we fall back to a big LIKE grep, which isn't great, but isn't horrible either.
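A rough model of that planning decision (hypothetical Python, not the real query planner):

```python
# Hypothetical sketch: consult the ngram index only when the query
# contains at least one non-common trigram; otherwise fall back to a
# plain LIKE scan over the document text.

def plan_query(query, common):
    grams = {query.lower()[i:i + 3] for i in range(len(query) - 2)}
    usable = sorted(grams - common)
    if usable:
        return ("ngram-index", usable)
    return ("like-fallback", [])

common = {"the", "he "}
print(plan_query("the", common))        # no usable trigrams -> LIKE scan
print(plan_query("xylophone", common))  # constrained by rare trigrams
```

Either branch produces the same final results; the ngram index only narrows the set of rows the LIKE clause has to examine.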

A couple of bugfixes are coming: MySQL does not retain trailing whitespace (???) on CHAR(...) columns, so we have to fake it a bit.

  • Work around MySQL being creative in implementing "CHAR(X)".
  • Fix a variable name typo from refactoring.
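One way the CHAR(...) quirk can be "faked" (an assumption for illustration; the actual workaround in the patch may differ) is to re-pad values read back from MySQL so they compare equal to freshly computed ngrams:

```python
# MySQL strips trailing spaces from CHAR(N) values on read. A simple
# workaround (illustrative assumption, not necessarily the patch's fix)
# is to right-pad values back to the column width before comparing
# them in application code.

NGRAM_WIDTH = 3

def restore_char_padding(value, width=NGRAM_WIDTH):
    """Re-append the trailing spaces MySQL dropped on read."""
    return value.ljust(width)

# A trigram like "a  " comes back from a CHAR(3) column as "a";
# re-padding makes it match the computed trigram again.
print(restore_char_padding("a"))
```

As long as the same transform runs on both the write and read paths, lookups stay consistent.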

Do we need distinct, per-app tables for tracking common ngrams? I'm sure there's some variance in which ngrams are "common", but I'm not sure I can see the distribution affecting search result quality.

We don't strictly need separate tables, but it gives us more flexibility: we could change the thresholds for each document type independently, or scale the threshold to the size of the corpus, or add exceptions for particular document types, etc. We can also drop and rebuild one document type's index without having to rebuild everything else.

I don't know if we'll need to do any of that. The only advantage I can come up with for a single table is that we'd need slightly fewer queries when running searches, but the cost of those queries is trivial, and we could easily cache the results if they ever showed up in a profile.

This revision is now accepted and ready to land. Oct 3 2017, 4:45 PM
This revision was automatically updated to reflect the committed changes.