
Properly create Elasticsearch index

Authored by WikiChad on Dec 10 2014, 4:08 AM.


Group Reviewers
Blessed Reviewers
Restricted Diffusion Commit
rPa366f85c118b: Properly create Elasticsearch index

When the index does not exist and auto_create_index isn't
enabled, running ./bin/index results in a failure. That's no good.

Instead, create the index properly. This also allows us to do
nice things like a proper mapping and analysis setup for
substring matching, as outlined by @fabe in T6552.

Test Plan

Deleted and created index multiple times to verify
proper index creation and usage.

Diff Detail

rP Phabricator
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

WikiChad retitled this revision from to Properly create Elasticsearch index.
WikiChad updated this object.
WikiChad edited the test plan for this revision. (Show Details)
WikiChad added a subscriber: fabe.

I thought about doing something just like this. However, I didn't like the fact that every search request now does an additional Elasticsearch round trip to check whether the index already exists (which it usually will, except on the first request). Most Elasticsearch client libraries do this, and I've no real idea how to avoid it, since when a user deletes the index it will be automatically recreated without Phabricator ever knowing. So if we don't check, we might not recreate the mapping.

I've no real idea how to avoid it

If Elasticsearch is configured, a setup check can ask "does the index exist?" and, if it doesn't, tell the user to run bin/search init-engine or similar.

Example setup checks are in phabricator/src/config/check/. PhabricatorSetupCheckPygment is a simple check which is somewhat similar to this (if an external option is configured, check that it works, then tell the user what's wrong if it isn't in good shape). This setup check could also, e.g., verify that the configured Elasticsearch server is reachable (and possibly check versions / capabilities).
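The check logic itself is small. As a rough sketch (in Python rather than Phabricator's PHP; the function name, injected `head_request` callable, and messages are all hypothetical, not actual Phabricator API):

```python
# Hypothetical sketch of the proposed setup check. The real check would be
# PHP in phabricator/src/config/check/; all names here are illustrative.

def check_elasticsearch_index(head_request, base_uri, index_name):
    """Return None if the index looks fine, else a user-facing message.

    `head_request` is an injected callable returning an HTTP status code,
    so the logic can be exercised without a live server.
    """
    try:
        status = head_request(base_uri + "/" + index_name)
    except OSError:
        return ("Unable to reach the configured Elasticsearch server at %s."
                % base_uri)
    if status == 404:
        return ("The Elasticsearch index '%s' does not exist. "
                "Run `bin/search init-engine` (or similar) to create it."
                % index_name)
    return None
```

Injecting the HTTP call keeps the check cheap to test and avoids tying the setup-check logic to a particular client library.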

It should then also check whether the mapping is correct, and warn if a mismatch is detected.
I think the typical problem would be that someone has a Phabricator install, enables Elasticsearch, and then some random task/commit gets indexed automatically, creating the index without the mapping. Even if the user then deletes the index, in the time between running init and reindex just a single object needs to be indexed to create the index without its mapping again, and I don't like the idea of deleting it in an init/setup script. So a setup check telling the user how to set things up in a clean way is probably best.
@WikiChad, want to add the check? Otherwise I could do it if you like.

WikiChad edited edge metadata.

Change checks to happen with other site checks instead of at index
time. Introduce new ./bin/search init for initializing the index.

epriestley added a reviewer: epriestley.

This looks good, and I don't really have any substantive feedback, but see some questions and notes inline.

(I'm also out of my depth on evaluating the actual index configuration, but am happy to trust you on that.)


This (bool) cast is unnecessary, as the result is unused.


(Should we throw here? Reaching this section would be unexpected and indicate some serious behavioral change in ElasticSearch?)


Offhand, do you know if this index will perform worse for non-English text than the default index does? That is, is this an improvement for English text but a setback for Chinese text?

Naively, I'd expect it not to have much of an effect, since it looks like they'll use the same tokenizer.

(It's not terribly important for the moment since i18n support is largely theoretical, but this should eventually use the install's default language rules when they exist, not be hard-coded as English.)


Why can't we?


You could safely remove this and we'll just no-op MySQL and have a more general implementation, right?

I think putting a couple of extra stubs in the base class is cleaner than having an ElasticSearch-specific chunk of code here, since it seems pretty reasonable that we may eventually support, e.g., Solr or AWS CloudSearch (although neither are on the horizon).

This revision now requires changes to proceed. Dec 11 2014, 2:46 AM

Replies inline.


I meant to return here. I'm actually a little miffed I'm getting a 400 from this as it should return 200.


Probably yeah.


Actually, good catch. I kind of just assumed English since, like you said, i18n is theoretical. In our WMF setup, we actually fall back to default analysis when we don't have a language-specific implementation, rather than falling back to English.

Really there's no one-size-fits-all to this problem. CJK typically requires additional considerations (for Japanese and Chinese there are plugins far superior to what Lucene ships by default). It's not just a matter of dropping language names in (although we can do that to some extent).

Since some stuff is generally useful (like auto_expand_replicas: 0-1) across all languages we can probably factor this out a bit and build the array based on the language.
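That factoring might look roughly like this (a sketch in Python rather than PHP; the filter names are real Elasticsearch token filters, but the per-language table and function are hypothetical):

```python
# Illustrative sketch only: language-independent settings are shared, and
# the token filter chain is chosen per language.

COMMON_SETTINGS = {
    # Useful regardless of language, as noted above.
    "index": {"auto_expand_replicas": "0-1"},
}

LANGUAGE_FILTERS = {
    # English gets a stemmer on top of lowercasing.
    "en": ["lowercase", "kstem"],
}

def build_analysis_settings(language):
    """Merge shared settings with a per-language analyzer definition."""
    filters = LANGUAGE_FILTERS.get(language, ["lowercase"])
    settings = dict(COMMON_SETTINGS)
    settings["analysis"] = {
        "analyzer": {
            "default": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": filters,
            },
        },
    }
    return settings
```

Languages without a specific entry fall back to a generic chain rather than to English, matching the WMF approach described above.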


Work in progress :) Basically you need to "recursively compare the two arrays and be really lenient with booleans and integers", and I have a headache today :(
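The lenient comparison could be sketched roughly like this (Python rather than PHP, and the exact coercion rules are a guess; the idea is that Elasticsearch may echo back `true` where you sent a boolean, or `"1"` where you sent an integer):

```python
def settings_match(expected, actual):
    """Recursively compare two settings structures, treating scalars that
    differ only in type (True vs "true", 1 vs "1") as equal."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        if set(expected) != set(actual):
            return False
        return all(settings_match(v, actual[k]) for k, v in expected.items())
    if isinstance(expected, list) and isinstance(actual, list):
        return (len(expected) == len(actual) and
                all(settings_match(e, a) for e, a in zip(expected, actual)))
    # Lenient scalar comparison: compare lowercased string forms, so that
    # True == "true" and 1 == "1".
    return str(expected).lower() == str(actual).lower()
```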


I wasn't sure if you wanted the stubs in the base class which is why I went this way. Can totally go that route though.

manybubbles added inline comments.

I can break down the non-English question by walking through the choices.
The standard tokenizer is based on the Unicode segmentation rules and does as good a job as you are going to get without a dictionary for word segmentation. It's what we were using before, and it's the right choice for a huge percentage of languages. It does a decent job of handling non-space-delimited languages even without a dictionary, but there are better choices for them.

The english_possessive_stemmer is going to create false positive matches outside of English but is otherwise harmless. It's probably a good idea for most codebases that expect comments in English.

The lowercase filter is fine for all languages except Greek and Persian. Those would need it configured to lowercase properly for them, as it would otherwise make mistakes, resulting in somewhat unexpected results.

english_stop is kind of debatable. English speakers expect words like "is", "and", "but", and "the" to be optional in their searches. The way we accomplish this on WMF wikis is by analyzing everything twice: once with stopwords and stemming and stuff, and once without. Then we search both and rank results that match the more exact analysis form more highly. Another way to do it is to leave the stop words in the index (don't add the filter) and use stuff like the `cutoff_frequency` option to the match query and its ilk to convert those matches to optional. That works well too. The disadvantage of removing stopwords (adding this filter) is that searches that contain only stop words will find nothing, and searches that contain stop words in phrases will find stuff that doesn't have the stop words. I vote for removing this if we aren't going to analyze twice. Then we have the option of using `cutoff_frequency` or making stopwords required. Of course, you could always make the default operator on search OR rather than AND, in which case stopwords are always optional.
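For reference, the `cutoff_frequency` approach looks roughly like this as a match query body (shown as a Python dict standing in for the JSON request; the `body` field name is just for illustration):

```python
# Sketch of a match query using cutoff_frequency: terms that appear in
# more than 1% of documents (which is what stopwords do) are demoted to
# optional at query time, instead of being stripped from the index.
# The "body" field name is hypothetical.
query = {
    "query": {
        "match": {
            "body": {
                "query": "where is the parser",
                "cutoff_frequency": 0.01,
            }
        }
    }
}
```

This keeps phrase searches containing stopwords exact, while stopword-only queries still return something.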

I'm not a fan of `english_stemmer` and I'd vote to just switch it to `kstem`. The English stemmer uses the porter stemmer, but kstem has fewer false positives.

trigrams_filter: this is going to do some funky stuff if you use it here. Maybe good funky stuff.

Searching for `I like eating apples` will search for `I AND "lik ike" AND "eat ati tin ing" AND "app ppl ple les"`. This'll find `I like eating apples`, but also `ilikeeatingapples` and `IllIkeAtIngapPlesiosaur`. Whether this is OK all depends on your tolerance for false positives.

All those phrase queries might become expensive, especially if someone searches for a really long word: that'll generate a very large number of phrase queries. It _could_ be a DoS vector.

The way we go about this on the Wikipedias is to use the word_delimiter filter to cause artificial word breaks at case changes. We also use a character filter to transform things like `_` and `(` into ` `. This makes finding partial variable names possible without any phrase queries. It's not foolproof: searching for `randomthings` won't find `random_things`. In fact, it'll only find `randomThings` if you do the two-analyzers trick I mentioned above and leave out the word_delimiter filter on the exact matches. This limitation stems from Lucene's low-level contract for TokenStream: the tokens must come out in ascending order (repeats are OK) in terms of token position and start and end offset. That's usually a fine thing for prose, but not so good for code.
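To make the trigram examples above concrete, a 3-gram filter emits every overlapping three-character window of each token, roughly:

```python
def trigrams(token, n=3):
    """Overlapping character n-grams, as an ngram token filter produces."""
    if len(token) < n:
        return [token]  # short tokens pass through whole
    return [token[i:i + n] for i in range(len(token) - n + 1)]

# "like"   -> ["lik", "ike"]
# "eating" -> ["eat", "ati", "tin", "ing"]
# "apples" -> ["app", "ppl", "ple", "les"]
```

Searching then requires the grams of each word to appear in order as a phrase, which is where the phrase queries (and their cost) come from.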




Oh foo! Markdown doesn't work. I don't have an edit button either.


I think removing the stopwords is no big deal. Getting no results when searching for stopwords only can indeed be counterintuitive. Elasticsearch's normalisation will deal with the stopwords and sort the results properly (same thing, usually, for the ngram results).
In general I think false positives are not such a big problem for this. Personally, I prefer to get results similar to my query rather than none.

In our install we are seeing quite good results with the ngrams filter. Getting search hits for class/method names, which tend to be large constructed words, can be tricky without it. We could increase the min/max setting to 4 or 5, though, if we see performance problems.
I'd like the user to be able to simply enter search words/phrases without having to deal with query syntax (such as wildcards or operators).

It might make sense to remodel the index in the future to, for example, also enable search through the actual source code. Right now we are treating all documents equally (tasks, wiki, commits, diffs). But for now this mapping will vastly improve the current search.

In the MySQL search, we've reduced the stopword list from MySQL's builtin defaults (about 500 words) to a shorter list with just the 50 most common words (see D10258).

The 500-word list caused a significant amount of user confusion (users trying to search for terms on the list); the 50-word list seems to be doing OK.

I'm not immediately able to figure out which words are in the _english_ stopword list in Elasticsearch, but if they're similar to the Lucene list here:

...they're probably fine. If the default list is more extensive, we should be more cautious about enabling it.

WikiChad edited edge metadata.
  • Cleaned up base class implementations
  • Switched auto_expand_replicas to 0-2
  • Made things work better for non-English installs:
    • Dropped English possessive stemmer and English stop words
    • Swapped English stemmer for kstem
WikiChad edited edge metadata.

Refer to init in check instead of init-index

Implement recursive settings comparison to check index sanity

Is this good to go? One inline, looks good to me otherwise.


From a cursory reading of the documentation, I would expect this to maybe be 0-all instead of 0-2. Can you walk me through the reasoning here?

Should be ready other than that last inline bit.


0-all tends to be needlessly redundant, as it would put a replica of each shard on every node. If an index is large enough, it could actually be disruptive to have that many replicas. 0-2 has worked well for us in production with large, high-traffic indexes, so that was my reasoning here.
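For reference, the setting under discussion is a single index setting (shown as a Python dict standing in for the JSON request body):

```python
# "0-2" means: zero replicas on a single-node install, expanding to at
# most two replicas as nodes join; "0-all" would instead put a copy of
# every shard on every node in the cluster.
index_settings = {
    "index": {
        "auto_expand_replicas": "0-2",
    },
}
```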

epriestley edited edge metadata.

Works for me.

This revision is now accepted and ready to land. Dec 22 2014, 9:10 PM
This revision was automatically updated to reflect the committed changes.