- Group Reviewers
- Restricted Diffusion Commit
rPa366f85c118b: Properly create Elasticsearch index
Deleted and created index multiple times to verify
proper index creation and usage.
I thought about doing something just like this. However, I didn't like the fact that every search request now does an additional Elasticsearch roundtrip to check if the index already exists (which it usually will, except on the first request). Most Elasticsearch client libraries do this, and I've no real idea how to avoid it, since when a user deletes the index it will be automatically recreated without Phabricator ever knowing... So if we don't check, it might not recreate the mapping.
I've no real idea how to avoid it
A setup check can run "does the index exist?" if ElasticSearch is configured, then tell the user to run bin/search init-engine or similar.
Example setup checks are in phabricator/src/config/check/. PhabricatorSetupCheckPygment is a simple check which is somewhat similar to this (if an external option is configured, check if it works, then tell the user what's wrong if it isn't in good shape). This setup check could also, e.g., check that the configured ElasticSearch server is reachable (and possibly check versions / capabilities).
It should then also check whether the mapping is correct, and warn if a mismatch is detected.
I think the typical problem would be that someone has a Phabricator install, enables Elasticsearch, and then some random task/commit gets indexed automatically and creates the index without the mapping, even if the user then deletes the index. In the time between running init and reindex, just a single object needs to be indexed to recreate the index without its mapping, and I don't like the idea of deleting it in an init/setup script. So a setup check telling the user how to set it up in a clean way is probably best.
@WikiChad wanna add the check? Otherwise I could do it if you like.
This looks good, and I don't really have any substantive feedback, but see some questions and notes inline.
(I'm also out of my depth on evaluating the actual index configuration, but am happy to trust you on that.)
This (bool) cast is unnecessary, as the result is unused.
(Should we throw here? Reaching this section would be unexpected and indicate some serious behavioral change in ElasticSearch?)
Offhand, do you know if this index will perform worse for non-English text than the default index does? That is, is this an improvement for English text but a setback for Chinese text?
Naively, I'd expect it not to have much of an effect, since it looks like they'll use the same tokenizer.
(It's not terribly important for the moment since i18n support is largely theoretical, but this should eventually use the install's default language rules when they exist, not be hard-coded as English.)
Why can't we?
You could safely remove this and we'll just no-op MySQL and have a more general implementation, right?
I think putting a couple of extra stubs in the base class is cleaner than having an ElasticSearch-specific chunk of code here, since it seems pretty reasonable that we may eventually support, e.g., Solr or AWS CloudSearch (although neither are on the horizon).
I meant to return here. I'm actually a little miffed I'm getting a 400 from this as it should return 200.
Actually, good catch. I kind of just assumed English since, like you said, i18n is theoretical. In our WMF setup, we actually fall back to default analysis when we don't have a language-specific implementation, rather than falling back to English.
Really there's no one-size-fits-all to this problem. CJK typically requires additional considerations (for Japanese and Chinese there are plugins far superior to what Lucene ships by default). It's not just a matter of dropping language names in (although we can do that to some extent).
Since some stuff is generally useful across all languages (like auto_expand_replicas: 0-1), we can probably factor this out a bit and build the array based on the language.
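A minimal sketch of that factoring, assuming hypothetical names (`COMMON_SETTINGS`, `build_index_settings`) and illustrative filter chains rather than the final choices discussed below:

```python
# Settings useful for every language, factored out once.
COMMON_SETTINGS = {
    'auto_expand_replicas': '0-1',
}

# Per-language token filter chains (illustrative, not final).
LANGUAGE_FILTERS = {
    'en': ['lowercase', 'kstem'],
    'default': ['lowercase'],
}

def build_index_settings(language):
    # Start from the shared settings, then merge in the
    # language-specific analysis chain.
    filters = LANGUAGE_FILTERS.get(language, LANGUAGE_FILTERS['default'])
    settings = dict(COMMON_SETTINGS)
    settings['analysis'] = {
        'analyzer': {
            'letter_stem': {
                'tokenizer': 'letter',
                'filter': filters,
            },
        },
    }
    return settings
```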
Work in progress :) Basically you need a "recursively compare the two arrays and be really lenient with booleans and integers" and I have a headache today :(
I wasn't sure if you wanted the stubs in the base class which is why I went this way. Can totally go that route though.
I can break down the non-English question by walking through the choices.
The english_possessive_stemmer is going to create false positive matches outside of English but is otherwise harmless. It's probably a good idea for most codebases that expect comments in English.
The lowercase filter is fine for all languages except Greek and Persian. They'd need to configure it to lowercase properly for them, since otherwise it makes mistakes that lead to somewhat unexpected results.
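As a concrete illustration of the Greek case above, the lowercase token filter can be configured per language (Elasticsearch ships greek, irish, and turkish variants; Persian needs different normalization filters instead). A minimal sketch of the analysis settings fragment, with illustrative names:

```python
# Hypothetical analysis settings fragment: a language-aware
# lowercase filter for Greek text, instead of the default filter.
GREEK_ANALYSIS = {
    'filter': {
        'greek_lowercase': {
            'type': 'lowercase',
            'language': 'greek',
        },
    },
}
```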
english_stop is kind of debatable. English speakers expect words like "is", "and", "but", and "the" to be optional in their searches. The way we accomplish this on WMF wikis is by analyzing everything twice: once with stopwords and stemming and such, and once without. Then we search both and rank results that match the more exact analysis form more highly. Another way to do it is to leave the stop words in the index (don't add the filter) and use something like the `cutoff_frequency` option to the match query and its kin to convert those matches to optional. That works well too. The disadvantage of removing stopwords (adding this filter) is that searches that contain only stop words will find nothing, and searches that contain stop words in phrases will find stuff that doesn't have the stop words. I vote for removing this filter if we aren't going to analyze twice. Then we have the option of using `cutoff_frequency` or making stopwords required. Of course, you could always make the default operator on search OR rather than AND, in which case stopwords are always optional.
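The `cutoff_frequency` alternative mentioned above could be sketched like this (a hypothetical query builder; `cutoff_frequency` was a match-query option in older Elasticsearch releases, and the threshold value here is just an example):

```python
# Sketch: keep stopwords in the index, and let cutoff_frequency
# demote matches on very common terms to optional at query time.
def build_match_query(field, text, cutoff=0.001):
    return {
        'query': {
            'match': {
                field: {
                    'query': text,
                    'operator': 'and',
                    # Terms occurring in more than `cutoff` of all
                    # documents are treated as optional.
                    'cutoff_frequency': cutoff,
                },
            },
        },
    }
```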
Searching for `I like eating apples` will search for `I AND "lik ike" AND "eat ati tin ing" AND "app ppl ple les"`. This'll find `I like eating apples` but also `ilikeeatingapples` and `IllIkeAtIngapPlesiosaur`. Whether this is OK all depends on your tolerance for false positives.
All those phrase queries might become expensive, especially if someone searches for a really long word: that'll generate a very large number of phrase queries. It _could_ be a DoS vector.
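The trigram expansion described above can be simulated with a few lines of Python (a rough approximation of an ngram filter with min_gram = max_gram = 3, after lowercasing; not the actual Lucene implementation):

```python
# Approximate a 3-gram token filter: emit every 3-character
# window of the lowercased token.
def trigrams(token):
    token = token.lower()
    if len(token) < 3:
        return [token]  # tokens shorter than min_gram pass through here
    return [token[i:i + 3] for i in range(len(token) - 2)]

# trigrams('eating') -> ['eat', 'ati', 'tin', 'ing']
```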
The way we go about this on the Wikipedias is to use the word_delimiter filter to cause artificial word breaks at case changes. We also use a character filter to transform things like `_` and `(` into ` `. This makes finding partial variable names possible without any phrase queries. It's not foolproof: searching for `randomthings` won't find `random_things`. In fact, it'll only find `randomThings` if you do the two-analyzers trick I mentioned above and leave out the word_delimiter filter on the exact matches. This confusion stems from Lucene's low-level contract for TokenStream: the tokens must come out in ascending order (repeats are OK) in terms of token position and start and end offset. That's usually a fine thing for prose but not so good for code.
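The splitting behavior described above can be approximated like this (a simplified stand-in for word_delimiter plus a punctuation-mapping char filter; the real filters handle many more cases):

```python
import re

# Approximate word_delimiter: break identifiers at underscores and
# at lowercase-to-uppercase transitions, then lowercase the parts.
def split_identifier(token):
    parts = re.split(r'_|(?<=[a-z])(?=[A-Z])', token)
    return [p.lower() for p in parts if p]

# split_identifier('randomThings') -> ['random', 'things']
```

This is why searching for `randomthings` (one token, no break points) behaves differently from `random_things` or `randomThings`, which both split into the same parts.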
I think removing the stopwords is no big deal. Getting no results when searching for stopwords only can indeed be counterintuitive. Elasticsearch's normalization will deal with the stopwords and sort the results properly. (Same thing usually goes for the ngrams results.)
In our install we are seeing quite good results with the ngrams filter. Especially getting search hits for class/method names, which tend to be large constructed words, can be tricky without it. We could increase the min/max setting to 4 or 5, though, if we see performance problems.
It might make sense to remodel the index in the future, for example to also enable search through the actual source code. Right now we are treating all documents equally (tasks, wiki, commits, diffs). But for now this mapping will vastly improve the current search.
In the MySQL search, we've reduced the stopword list from MySQL's builtin defaults (which has about 500 words) to a shorter list with just the 50 most common words (see D10258; https://secure.phabricator.com/diffusion/P/browse/master/resources/sql/stopwords.txt).
The 500-word list caused a significant amount of user confusion (people trying to search for terms on the list); the 50-word list seems to be doing OK.
I'm not immediately able to figure out what words are in the _english_ stopword list in ElasticSearch, but if they're similar to these for Lucene here:
...they're probably fine. If the default list is more extensive, we should be more cautious about enabling it.
- Cleaned up base class implementations
- Switched auto_expand_replicas to 0-2
- Made things work better for non-English installs:
- Dropped english possessive stemmer and english stop words
- Swapped english stemmer for kstem
Should be ready other than that last inline bit.
0-all tends to be needlessly redundant, as it would put a replica of each shard on every node. If an index is large enough, it could actually be disruptive to have that many replicas. 0-2 has worked well for us in production usage with large, high-traffic indexes, so that was my reasoning here.
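For reference, the setting being discussed is a plain index-level option; a minimal fragment of what the final choice looks like:

```python
# Index-level setting: automatically expand replicas between zero
# and two copies, rather than one per node (0-all).
INDEX_SETTINGS = {
    'index': {
        'auto_expand_replicas': '0-2',
    },
}
```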