Details
- Locally, my dataset has a bunch of bin/lipsum-generated tasks containing similar, common words.
- Verified that common "ipsum" terms now skip the ngrams table. For "lorem ipsum", search performance actually IMPROVED by skipping the table (from 12s to 9s).
- Queried for ordinary terms and got very fast results via the ngrams table, as before.
Diff Detail
- Repository: rP Phabricator
- Branch: common2
- Lint: Lint Passed
- Unit: Tests Passed
- Build Status: Buildable 18618
  - Build 25081: Run Core Tests
  - Build 25080: arc lint + arc unit
Event Timeline
src/applications/search/management/PhabricatorSearchManagementNgramsWorkflow.php (inline comment on lines 27–28):
I'm pretty comfortable hard-coding this. I think we know better than our users and probably shouldn't make this a configurable option.
The 4096 doesn't matter too much (it's fine for us to just grep everything with LIKE if there are that few records), but I'm going to try to get a better sense of the magic 0.02 value on this host.
Specifically, I'm planning to write a script that looks at recent search queries, breaks them into ngrams, and checks each ngram's frequency, so we can get a sense of which queries would hit or miss the ngrams constraint at different values of the 0.02 threshold (see the sketch after this comment).
Hopefully, most queries have at least one ngram that appears in fewer than 2% of documents, and we're relatively free to decrease the size of the index.
But I'm worried that the long tail may hold a lot of really obscure ngrams which occur in exactly one document (for example, one task somewhere has a long chunk of base64 and generates 5,000 totally unique ngrams). If the data looks more like that, we probably have to use a fairly conservative threshold and just get rid of the "the"-like stuff, if this is even worth pursuing. But I should be able to get a clearer picture before we commit to it.
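To make that plan concrete, here is a rough sketch of such a script. Everything in it is an assumption for illustration: the function names, the shape of the inputs, and the restriction to trigrams; the real query log and ngram schema may differ.

```php
<?php

// Hypothetical analysis sketch: given recent search terms, a map of
// trigram => number of documents containing it, and a total document
// count, report how many queries would still hit the ngram index at
// several candidate thresholds.

function get_trigrams($term) {
  $term = strtolower($term);
  $trigrams = array();
  for ($ii = 0; $ii + 3 <= strlen($term); $ii++) {
    $trigrams[substr($term, $ii, 3)] = true;
  }
  return array_keys($trigrams);
}

// A query can still use the index at a given threshold if at least one
// of its trigrams occurs in fewer than (threshold * total) documents.
function query_hits_index($term, array $ngram_doc_counts, $total, $threshold) {
  foreach (get_trigrams($term) as $trigram) {
    $count = isset($ngram_doc_counts[$trigram])
      ? $ngram_doc_counts[$trigram]
      : 0;
    if (($count / $total) < $threshold) {
      return true;
    }
  }
  return false;
}

function sweep_thresholds(array $terms, array $ngram_doc_counts, $total) {
  foreach (array(0.02, 0.05, 0.15) as $threshold) {
    $hits = 0;
    foreach ($terms as $term) {
      if (query_hits_index($term, $ngram_doc_counts, $total, $threshold)) {
        $hits++;
      }
    }
    printf(
      "threshold %.2f: %d / %d queries hit the index\n",
      $threshold,
      $hits,
      count($terms));
  }
}
```

Running the sweep over a sample of real queries should show whether the hit rate stays high as the threshold tightens, or whether the base64-style long tail dominates.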
- Based on "research" / guesswork in T13000, start with a conservative threshold of 0.15.
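For reference, here is a minimal sketch of how the two hard-coded values might combine at that starting point. The constant and function names are invented for illustration; only the 4096 and 0.15 values come from the discussion above.

```php
<?php

// Hypothetical skip logic combining the two hard-coded values.

const NGRAM_MIN_DOCUMENTS = 4096;     // Below this, LIKE-scan everything.
const NGRAM_COMMON_THRESHOLD = 0.15;  // Ngrams in >15% of documents are "common".

function should_use_ngram($total_documents, $ngram_document_count) {
  if ($total_documents < NGRAM_MIN_DOCUMENTS) {
    // Small corpus: scanning every record with LIKE is cheap enough,
    // so the ngram index is not consulted at all.
    return false;
  }
  // Common ngrams barely narrow the result set, so joining against them
  // can cost more than it saves (as with "lorem ipsum" above).
  return ($ngram_document_count / $total_documents) <= NGRAM_COMMON_THRESHOLD;
}
```

With 100,000 documents indexed, for example, an ngram would have to appear in more than 15,000 of them before being treated as common and skipped.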