
Add a workflow for populating (or depopulating) the common ngrams table
Closed, Public

Authored by epriestley on Oct 2 2017, 11:40 PM.

Details

Summary

Depends on D18672. Ref T13000. This does an on-demand build of the common ngrams table.

The plan here is:

  • Push to secure.
  • Build the common ngrams table here.
  • See if stuff breaks?

If it looks okay on this dataset, we can build out the GC support and try it in production.
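
For reference, here is a minimal sketch of what the populate step could look like. The table names (task_ngrams, task_ngrams_common), the PDO wiring, and the threshold value are illustrative assumptions; the actual workflow lives in PhabricatorSearchManagementNgramsWorkflow.php and may differ.

  <?php
  // Hypothetical sketch; table and column names are assumptions, not the
  // actual Phabricator schema.
  $pdo = new PDO('mysql:host=localhost;dbname=phabricator_search', 'user', 'pass');

  // Total number of indexed documents, used as the frequency denominator.
  $total = (int)$pdo
    ->query('SELECT COUNT(DISTINCT documentID) FROM task_ngrams')
    ->fetchColumn();

  if ($total < 4096) {
    // With this few records it's fine to just scan everything with LIKE,
    // so there's no point maintaining a common ngrams table yet.
    exit(0);
  }

  // Record every ngram appearing in more than 15% of documents as "common"
  // so that queries stop constraining against it. 0.15 is the conservative
  // starting value settled on later in this review.
  $threshold = 0.15;
  $stmt = $pdo->prepare(
    'INSERT IGNORE INTO task_ngrams_common (ngram)
      SELECT ngram
        FROM task_ngrams
        GROUP BY ngram
        HAVING COUNT(DISTINCT documentID) > ?');
  $stmt->execute([(int)($threshold * $total)]);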

Test Plan
  • Locally, my dataset has a bunch of bin/lipsum tasks with similar, common words.
  • Verified that ipsum terms now skip ngrams. For "lorem ipsum", search performance actually improved by skipping the ngrams table (12s down to 9s); see the sketch after this list.
  • Queried for normal terms, got very fast results using the ngram table, as normal.
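
To illustrate why skipping can help (a sketch under assumed table and column names, not Phabricator's actual query construction): once every trigram of a term is listed as common, joining against the large ngrams table cannot narrow the result set, so the term is better served by a plain LIKE scan.

  <?php
  // Illustrative only; schema names are assumptions.
  function build_term_constraint(PDO $pdo, string $term): string {
    // Split the term into trigrams (padded with spaces, lowercased).
    $ngrams = [];
    $padded = ' '.strtolower($term).' ';
    for ($i = 0; $i < strlen($padded) - 2; $i++) {
      $ngrams[substr($padded, $i, 3)] = true;
    }

    // Drop trigrams that the common table says appear in most documents.
    $in = implode(', ', array_map([$pdo, 'quote'], array_keys($ngrams)));
    $common = $pdo
      ->query("SELECT ngram FROM task_ngrams_common WHERE ngram IN ({$in})")
      ->fetchAll(PDO::FETCH_COLUMN);
    $rare = array_diff(array_keys($ngrams), $common);

    if (!$rare) {
      // Every trigram is common ("lorem ipsum" territory): the ngrams
      // table cannot narrow the search, so fall back to a LIKE scan.
      return 'd.title LIKE '.$pdo->quote('%'.$term.'%');
    }

    // Otherwise, constrain through the ngrams index first.
    $parts = [];
    foreach ($rare as $ngram) {
      $parts[] = 'EXISTS (SELECT 1 FROM task_ngrams g'
        .' WHERE g.documentID = d.id AND g.ngram = '.$pdo->quote($ngram).')';
    }
    return implode(' AND ', $parts);
  }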

Diff Detail

Repository: rP Phabricator
Lint: Not Applicable
Unit: Tests Not Applicable

Event Timeline

amckinley added inline comments.
src/applications/search/management/PhabricatorSearchManagementNgramsWorkflow.php
27–28

I'm pretty comfortable hard-coding this. I think we know better than our users and probably shouldn't make this a configurable option.

This revision is now accepted and ready to land. Oct 3 2017, 4:39 PM

The 4096 doesn't matter too much (it's fine for us to just grep everything with LIKE if there are that few records), but I'm going to try to get a better sense of the magic 0.02 value on this host.

Specifically, I'm planning to write a script which looks at recent search queries, breaks them into ngrams, and checks their frequency, so we can get a sense of which queries would hit or miss the ngrams constraint at different values of the 0.02 threshold.
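
Roughly, a sketch of that analysis, assuming a hypothetical recent_search_log table as the source of queries (the real script would pull these from wherever we log searches):

  <?php
  // Sketch; recent_search_log and the ngram table names are assumptions.
  $pdo = new PDO('mysql:host=localhost;dbname=phabricator_search', 'user', 'pass');

  $total = (int)$pdo
    ->query('SELECT COUNT(DISTINCT documentID) FROM task_ngrams')
    ->fetchColumn();

  $queries = $pdo
    ->query('SELECT query FROM recent_search_log')
    ->fetchAll(PDO::FETCH_COLUMN);

  $freq = $pdo->prepare(
    'SELECT COUNT(DISTINCT documentID) FROM task_ngrams WHERE ngram = ?');

  foreach ($queries as $query) {
    // The ngrams constraint only helps a query if its rarest trigram
    // falls under the threshold, so report that minimum frequency.
    $rarest = 1.0;
    $padded = ' '.strtolower($query).' ';
    for ($i = 0; $i < strlen($padded) - 2; $i++) {
      $freq->execute([substr($padded, $i, 3)]);
      $rarest = min($rarest, ((int)$freq->fetchColumn()) / $total);
    }
    printf("%0.4f  %s\n", $rarest, $query);
  }

Sorting that output by the first column should show directly how many recent queries would stop benefiting from the index at any given threshold.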

Hopefully, most queries have at least some ngrams that appear in fewer than 2% of documents, and we're relatively free to decrease the size of the index.

But I'm worried that the long tail may contain a huge number of really obscure ngrams which each occur in exactly one document (for example, say one task somewhere has a long chunk of base64 and generates 5,000 totally unique ngrams). If the data looks more like that, we probably have to use a fairly conservative threshold and just get rid of all the "the" stuff, if this is even worth pursuing at all. But I should be able to get a clearer picture of this before we commit to it.
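
As a quick back-of-the-envelope check of that worry: a string of length N yields N - 2 trigrams, and base64 text draws from 64^3 = 262,144 possible trigrams, so nearly all trigrams of a ~5KB blob come out distinct. A tiny sketch, with random bytes standing in for real attachment data:

  <?php
  // A ~5.5KB base64 blob; random bytes stand in for a real attachment.
  $blob = base64_encode(random_bytes(4096));

  $seen = [];
  for ($i = 0; $i < strlen($blob) - 2; $i++) {
    $seen[substr($blob, $i, 3)] = true;
  }
  // Only ~5,460 draws from 262,144 possibilities, so collisions are rare:
  // this typically prints around 5,400 distinct trigrams.
  echo count($seen), " distinct trigrams\n";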

  • Based on "research" / guesswork in T13000, start with a conservative threshold of 0.15.
This revision was automatically updated to reflect the committed changes.