
Add a workflow for populating (or depopulating) the common ngrams table
Closed, Public

Authored by epriestley on Oct 2 2017, 11:40 PM.

Details

Summary

Depends on D18672. Ref T13000. This does an on-demand build of the common ngrams table.

Plan here is:

  • Push to secure.
  • Build the common ngrams table here.
  • See if stuff breaks?

If it looks okay on this dataset, we can build out the GC support and try it in production.
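
As a rough sketch of what the population pass might do (the table and column names below are hypothetical, not the actual Phabricator schema): count how many documents each ngram appears in, and record any ngram whose document frequency exceeds the threshold as "common", so queries skip it.

```
<?php
// Minimal sketch of the population pass, assuming a hypothetical
// MySQL schema; the real workflow and schema differ.
$pdo = new PDO('mysql:host=localhost;dbname=search', 'user', 'pass');

// Total number of indexed documents, so the threshold can be
// expressed as a fraction rather than a raw row count.
$totalDocuments = (int)$pdo
  ->query('SELECT COUNT(DISTINCT documentID) FROM task_ngrams')
  ->fetchColumn();

$threshold = 0.15; // "common" if present in >15% of documents

// Any ngram appearing in more than the threshold fraction of all
// documents gets recorded as common. Queries then skip these ngrams
// and fall back to LIKE for the terms they cover.
$stmt = $pdo->prepare(
  'INSERT IGNORE INTO task_ngrams_common (ngram)
     SELECT ngram
     FROM task_ngrams
     GROUP BY ngram
     HAVING COUNT(DISTINCT documentID) > :minCount');
$stmt->execute([':minCount' => (int)($totalDocuments * $threshold)]);
```

Depopulating would presumably be the inverse: empty the common table so every ngram participates in the index again.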

Test Plan
  • Locally, my dataset has a bunch of bin/lipsum tasks with similar, common words.
  • Verified that ipsum terms now skip the ngrams table. For "lorem ipsum", search performance actually improved by skipping ngrams (12s down to 9s).
  • Queried for normal terms, got very fast results using the ngram table, as normal.

Diff Detail

Repository: rP Phabricator
Lint: Not Applicable
Unit: Tests Not Applicable

Event Timeline

amckinley added inline comments.
src/applications/search/management/PhabricatorSearchManagementNgramsWorkflow.php
27–28

I'm pretty comfortable hard-coding this. I think we know better than our users and probably shouldn't make this a configurable option.

This revision is now accepted and ready to land. (Oct 3 2017, 4:39 PM)

The 4096 doesn't matter too much (it's fine for us to just grep everything with LIKE if there are that few records), but I'm going to try to get a better sense of the magic 0.02 value on this host.

Specifically, I'm planning to write a script which looks at recent search queries, breaks them into ngrams, and checks their frequency, so we can get a sense of which queries would hit or miss the ngrams constraint at different values of the 0.02 threshold.
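
Something like the following could serve as the core of that analysis (a sketch, not the actual script; the tokenization here is simplified and ignores word-boundary padding):

```
<?php
// Hypothetical analysis helpers: split a query into trigrams, then
// test whether any of them is rare enough for the ngram index to be
// useful at a candidate threshold.

function get_trigrams(string $text): array {
  $text = strtolower($text);
  $trigrams = array();
  $length = strlen($text);
  for ($ii = 0; $ii <= $length - 3; $ii++) {
    $trigrams[substr($text, $ii, 3)] = true;
  }
  return array_keys($trigrams);
}

// $frequency maps ngram => fraction of documents containing it,
// precomputed from the ngrams table.
function query_hits_index(
  string $query,
  array $frequency,
  float $threshold): bool {
  foreach (get_trigrams($query) as $trigram) {
    if (($frequency[$trigram] ?? 0.0) < $threshold) {
      // At least one rare ngram: the index can narrow this query.
      return true;
    }
  }
  // Every trigram is "common": the query falls through to LIKE.
  return false;
}
```

Running this over a log of recent queries at candidate thresholds (say 0.02, 0.05, 0.15) would show how many real queries lose index support at each value.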

Hopefully most queries have at least one ngram that appears in fewer than 2% of documents, and we're relatively free to decrease the size of the index.

But I'm worried that the long tail may include a lot of really obscure ngrams which occur in exactly one document (for example, say one task somewhere has a long chunk of base64 and generates 5,000 totally unique ngrams). If the data looks more like that, we probably have to use a fairly conservative threshold and just get rid of all the "the" stuff, if this is even worth pursuing. But I should be able to get a clearer picture of this before we commit to it.
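
To make that worry concrete, here is a toy illustration (not from the revision) of how many unique trigrams a single pasted blob can contribute:

```
<?php
// One blob of random base64 yields roughly one trigram per character,
// nearly all of them unique, so each sits at a document frequency of
// exactly one document.
$blob = base64_encode(random_bytes(4096)); // about 5,460 characters

$trigrams = array();
for ($ii = 0; $ii <= strlen($blob) - 3; $ii++) {
  $trigrams[substr($blob, $ii, 3)] = true;
}

// Prints a count in the thousands: thousands of index rows that will
// never help any other query.
echo count($trigrams), "\n";
```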

  • Based on "research" / guesswork in T13000, start with a conservative threshold of 0.15.
This revision was automatically updated to reflect the committed changes.