Allow the Ferret engine to remove "common" ngrams from the index
Closed, Public

Authored by epriestley on Oct 2 2017, 10:50 PM.

Details

Reviewers

amckinley

Maniphest Tasks

T13000: Sustained MySQL I/O overwhelmed db009 / huge Ferret engine ngrams table

Commits

rP1de130c9f5b2: Allow the Ferret engine to remove "common" ngrams from the index

Summary

Ref T13000. This adds support for tracking "common" ngrams, which occur in too many documents to be useful as part of the ngram index.

If an ngram is listed in the "common" table, it won't be written when indexing documents, or queried for when searching for them.
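As a rough illustration of that filtering step (not the actual Ferret engine code, which is PHP), the sketch below extracts trigrams from a term and drops any that appear in a set of common ngrams. The function names and the space-padding tokenization are assumptions made for the example.

```python
def ngrams(term, n=3):
    """Return the set of character n-grams in a term.

    Assumption: the term is lowercased and padded with one space on each
    side, so that word boundaries produce ngrams like " th" and "he ".
    """
    padded = " " + term.lower() + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def viable_ngrams(term, common):
    """Return the ngrams of a term that are not listed as common.

    The same filter would apply on both paths: common ngrams are not
    written when indexing and not queried for when searching.
    """
    return ngrams(term) - common


# Hypothetical contents of the "common" table.
common = {" th", "the", "he "}

print(viable_ngrams("the", common))
print(sorted(viable_ngrams("xylophone", common)))
```

With this hypothetical table, "the" has no viable ngrams left, while every ngram of "xylophone" survives.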

In this change, nothing actually writes to the "common" table. I'll start writing to the table in a followup change.

Specifically, I plan to do this:

A new GC process periodically updates the "common" table: once there are at least some minimum number of documents (maybe like 4,000), it writes any ngram which appears in more than X% of documents to the table, for some value of X.
A new GC process deletes ngrams that have been added to the common table from the existing indexes.
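The two planned GC steps could be sketched roughly as follows. This is only an in-memory illustration under stated assumptions: the function names are hypothetical, the index is modeled as a dict from document ID to a set of ngrams, and the threshold is left as a required argument because no value of X has been chosen. The 4,000 default comes from the "maybe like 4,000" figure above.

```python
from collections import Counter


def find_common_ngrams(doc_ngrams, threshold, min_documents=4000):
    """Return ngrams appearing in more than `threshold` (a fraction, 0..1)
    of documents, or an empty set if the corpus is too small to judge.

    doc_ngrams: dict mapping document ID -> set of ngrams (hypothetical
    stand-in for the real ngrams table).
    """
    total = len(doc_ngrams)
    if total < min_documents:
        return set()

    counts = Counter()
    for grams in doc_ngrams.values():
        # Count each ngram at most once per document.
        counts.update(set(grams))

    return {gram for gram, count in counts.items() if count / total > threshold}


def prune_index(doc_ngrams, common):
    """Return a copy of the index with common ngrams removed."""
    return {doc: set(grams) - common for doc, grams in doc_ngrams.items()}


# Tiny corpus, so the minimum is lowered for the demonstration.
docs = {1: {"the", "xyl"}, 2: {"the", "abc"}, 3: {"the"}, 4: {"zzz"}}
common = find_common_ngrams(docs, threshold=0.5, min_documents=4)
pruned = prune_index(docs, common)
```

Here "the" appears in 3 of 4 documents, so it is the only ngram marked common and removed from the index.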

Hopefully, this will pare down the ngrams index to something reasonable over time without requiring any manual tuning.

Test Plan

Ran some queries and indexes.
Manually inserted ngrams xxx and yyy into the "common" ngrams table, searched and indexed, saw them ignored as viable ngrams for search/index.

Diff Detail

Repository

rP Phabricator

Branch

common1

Lint

Lint Passed

Unit

Tests Passed

Build Status

Buildable 18613
Build 25072: Run Core Tests
Build 25071: arc lint + arc unit

Event Timeline

epriestley created this revision. Oct 2 2017, 10:50 PM

Harbormaster completed remote builds in B18612: Diff 44829. Oct 2 2017, 10:52 PM

Basically, the idea here is that if you search for:

"the": It's pointless to use the ngrams table anyway since it has very little power to constrain this query, and falling back to LIKE on the ffield table is perfectly fine, since most documents match anyway.
"the xylophone": Rare ngrams like "xyl", "ylo", and "oph" will be selected, which is correct and desirable. This kind of query should actually perform better.
"the of in and to it is a my": This query is probably fine as a LIKE since all the terms are very common, but it's possible it performs a little worse.
"source code program bug": This sort of query, where all terms are common words but not incredibly common words, might get worse. If 2+ of the ngrams occur in a small number of documents, performance should be pretty great. If only 1, performance should be reasonable. If 0, this may be the worst case (we fall back to LIKE and need to grep a lot of text). The choice of a "common" threshold will determine how many words fall into this awkward spot before the ngrams index kicks in.

This change doesn't alter search results. It just changes the situations in which we try to use the ngrams table to constrain the amount of text we eventually grep through with LIKE. So the worst case is just that we fall back to a big LIKE grep, which isn't great, but isn't horrible either.
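The cases above can be sketched as a small planning function. This is a simplified model, not the real query builder: the function names, the space-padded trigram tokenization, and the shape of the returned plan are all assumptions. It shows that the LIKE terms are always applied, and the ngram index only constrains the query when at least one non-common ngram remains.

```python
def ngrams(term, n=3):
    """Character n-grams of a lowercased, space-padded term (assumed tokenization)."""
    padded = " " + term.lower() + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def plan_search(terms, common):
    """Build a hypothetical query plan for a list of search terms.

    Rare (non-common) ngrams narrow the candidate documents first; the
    LIKE match on the terms is applied in every case, so results do not
    change, only how much text has to be scanned.
    """
    rare = set()
    for term in terms:
        rare |= ngrams(term) - common

    return {
        "ngram_constraints": sorted(rare),
        "like_terms": list(terms),
        "uses_ngram_index": bool(rare),
    }


# Hypothetical "common" table: every ngram of "the" is common.
common = ngrams("the")

print(plan_search(["the"], common)["uses_ngram_index"])
print(plan_search(["the", "xylophone"], common)["ngram_constraints"])
```

A search for "the" alone falls back to a plain LIKE scan, while "the xylophone" is constrained only by the ngrams of "xylophone".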

Couple of bugfixes coming; MySQL does not retain trailing whitespace (???) on CHAR(...) columns so we have to fake it a bit.

Work around MySQL being creative in implementing "CHAR(X)".
Fix a variable name typo from refactoring.
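For context on the CHAR(X) issue: by default MySQL strips trailing spaces from CHAR values on retrieval, so an ngram stored as "he " reads back as "he" and no longer matches the three-character ngram generated in application code. The sketch below shows one possible way to compensate by padding values back to full length after reading them. It is an illustration only, with hypothetical names, and is not necessarily how this revision works around the problem.

```python
NGRAM_LENGTH = 3


def restore_ngram(value, length=NGRAM_LENGTH):
    """Right-pad a value read from a CHAR column back to full ngram length.

    Only trailing spaces are lost on retrieval, so padding on the right
    is enough; leading spaces are preserved by the database.
    """
    return value.ljust(length)


def load_common_ngrams(rows):
    """Normalize raw column values (hypothetical query results) into a set."""
    return {restore_ngram(row) for row in rows}


# "he " was stored, but comes back from the CHAR(3) column as "he".
common = load_common_ngrams(["he", "the", " th"])
print("he " in common)
```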

Harbormaster completed remote builds in B18613: Diff 44830. Oct 2 2017, 11:36 PM

epriestley added a child revision: D18673: Add a workflow for populating (or depopulating) the common ngrams table. Oct 2 2017, 11:40 PM

Do we need distinct, per-app tables for tracking common ngrams? I'm sure there's some variance between which ngrams are "common", but I'm not sure I can see the distribution affecting search result quality.

We don't strictly need separate tables, but it gives us more flexibility: we could change the thresholds for each document type independently, or scale the threshold to the size of the corpus, or add exceptions for particular document types, etc. We can also drop and rebuild one document type's index without having to rebuild everything else.

I don't know if we'll need to do any of that, but the only advantage I can come up with for a single table is that we need slightly fewer queries when running searches. The cost of those queries is trivial, and we could easily cache the results if it ever showed up on a profile.

amckinley accepted this revision. Oct 3 2017, 4:45 PM

This revision is now accepted and ready to land. Oct 3 2017, 4:45 PM

Closed by commit rP1de130c9f5b2: Allow the Ferret engine to remove "common" ngrams from the index (authored by epriestley). Oct 3 2017, 8:27 PM

This revision was automatically updated to reflect the committed changes.