src/applications/search/engine/PhabricatorSearchEngineElastic.php
  try {
      $query->getParameter('query'), '+-&|!(){}[]^"~*?:\\'));
    $response = $this->executeRequest($uri, $this->buildSpec($query));
  }

  $phids = ipull($response['hits']['hits'], '_id');
  return $phids;
}
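The truncated call above passes the user's raw query together with the character set `+-&|!(){}[]^"~*?:\` — the characters that are reserved in Elasticsearch query syntax. The hidden portion of the statement presumably escapes them; as a rough, hedged sketch of that kind of sanitization (the helper name here is hypothetical, not Phabricator's API):

```python
# The reserved-character set from the PHP call above.
RESERVED = '+-&|!(){}[]^"~*?:\\'

def escape_reserved(query):
    """Backslash-escape query metacharacters so user input matches literally.

    This is an illustrative sketch of the sanitization step, not the actual
    Phabricator implementation (which is elided in the diff above).
    """
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in query)

print(escape_reserved('foo:bar'))   # prints foo\:bar
print(escape_reserved('plain'))     # prints plain
```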
public function indexExists() { | |||||
try { | |||||
(bool)$this->executeRequest('/_search/', array()); | |||||
} catch (HTTPFutureHTTPResponseStatus $e) { | |||||
if ($e->getStatusCode() == 404) { | |||||
return false; | |||||
} else if ($e->getStatusCode() == 400) { | |||||
return true; | |||||
} | |||||
throw $e; | |||||
} | |||||
} | |||||
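The status-code logic in `indexExists()` is worth spelling out: a 404 from `/_search/` means the index is missing, a 400 means the index exists but rejected the empty probe, and any other error is unexpected and should propagate. A minimal sketch of that decision, with a hypothetical `http_status` callable standing in for the HTTP layer:

```python
def index_exists(http_status):
    """Interpret the result of a probe request against the index.

    http_status: callable that issues the probe and returns the HTTP
    status code (a stand-in for executeRequest() in the PHP above).
    """
    status = http_status('/_search/')
    if status in (200, 400):
        # 400: the index exists but rejected the empty probe body.
        return True
    if status == 404:
        return False
    raise RuntimeError('unexpected status from Elasticsearch: %d' % status)

print(index_exists(lambda path: 404))  # prints False
print(index_exists(lambda path: 400))  # prints True
```

Note that the PHP version above, as reviewed, drops the success result rather than returning it; the reviewers flag exactly this below.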
public function createIndex() { | |||||
$data = array( | |||||
'settings' => array( | |||||
'auto_expand_replicas' => '0-1', | |||||
'analysis' => array( | |||||
'filter' => array( | |||||
'english_stop' => array( | |||||
'type' => 'stop', | |||||
'stopwords' => '_english_', | |||||
), | |||||
'english_possessive_stemmer' => array( | |||||
'type' => 'stemmer', | |||||
'language' => 'possessive_english', | |||||
), | |||||
'trigrams_filter' => array( | |||||
'min_gram' => 3, | |||||
'type' => 'ngram', | |||||
'max_gram' => 3, | |||||
), | |||||
'english_stemmer' => array( | |||||
'type' => 'stemmer', | |||||
'language' => 'english', | |||||
), | |||||
), | |||||
'analyzer' => array( | |||||
'english_trigrams' => array( | |||||
'type' => 'custom', | |||||
'filter' => array( | |||||
'english_possessive_stemmer', | |||||
'lowercase', | |||||
'english_stop', | |||||
'english_stemmer', | |||||
'trigrams_filter', | |||||
), | |||||
'tokenizer' => 'standard', | |||||
), | |||||
), | |||||
), | |||||
), | |||||
); | |||||
$types = array_keys( | |||||
PhabricatorSearchApplicationSearchEngine::getIndexableDocumentTypes()); | |||||
foreach ($types as $type) { | |||||
$data['mappings'][$type]['properties']['field']['properties']['corpus'] = | |||||
array( 'type' => 'string', 'analyzer' => 'english_trigrams' ); | |||||
} | |||||
$this->executeRequest('/', $data, true); | |||||
} | |||||
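Serialized, `createIndex()` PUTs a body like the following to the index root: the analysis settings define the trigram analyzer chain, and the `mappings` loop points every document type's `field.corpus` property at it. A Python sketch of the same structure (the document type names here are illustrative placeholders, not the real constants returned by `getIndexableDocumentTypes()`):

```python
import json

settings = {
    'settings': {
        'auto_expand_replicas': '0-1',
        'analysis': {
            'filter': {
                # Fixed-width trigrams; the other filters from the PHP
                # array (english_stop, english_stemmer,
                # english_possessive_stemmer) are elided here.
                'trigrams_filter': {
                    'type': 'ngram', 'min_gram': 3, 'max_gram': 3,
                },
            },
            'analyzer': {
                'english_trigrams': {
                    'type': 'custom',
                    'tokenizer': 'standard',
                    'filter': [
                        'english_possessive_stemmer', 'lowercase',
                        'english_stop', 'english_stemmer', 'trigrams_filter',
                    ],
                },
            },
        },
    },
}

# Illustrative document types; the PHP code asks the search application
# for the real list.
for doc_type in ('TASK', 'WIKI', 'CMIT'):
    settings.setdefault('mappings', {})[doc_type] = {
        'properties': {'field': {'properties': {'corpus': {
            'type': 'string', 'analyzer': 'english_trigrams',
        }}}},
    }

body = json.dumps(settings)
```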
private function executeRequest($path, array $data, $is_write = false) {
  $uri = new PhutilURI($this->uri);
  $uri->setPath($this->index);
epriestley: This `(bool)` cast is unnecessary, as the result is unused.
WikiChad: I meant to return here. I'm actually a little miffed I'm getting a 400 from this as it should return 200.
  $uri->appendPath($path);
  $data = json_encode($data);

  $future = new HTTPSFuture($uri, $data);
  if ($is_write) {
    $future->setMethod('PUT');
  }

  if ($this->getTimeout()) {
    $future->setTimeout($this->getTimeout());
epriestley: (Should we throw here? Reaching this section would be unexpected and indicate some serious behavioral change in ElasticSearch?)
WikiChad: Probably yeah.
  }

  list($body) = $future->resolvex();

  if ($is_write) {
epriestley: From a cursory reading of the documentation, I would expect this to maybe be `0-all` instead of `0-2`. Can you walk me through the reasoning here?
WikiChad: `0-all` tends to be needlessly redundant, as it would put a replica of each shard on each node. If an index is large enough, it could actually be disruptive to have that many replicas. `0-2` has worked well for us in production usage with large, high-traffic indexes, so that was my reasoning here.
manybubbles: 0-2?

    return null;
  }
  $body = json_decode($body, true);
  if (!is_array($body)) {
    throw new Exception('elasticsearch server returned invalid JSON!');
  }

  return $body;
}
}
epriestley: Offhand, do you know if this index will perform worse for non-English text than the default index does? That is, is this an improvement for English text but a setback for Chinese text? Naively, I'd expect it not to have much of an effect, since it looks like they'll use the same tokenizer. (It's not terribly important for the moment since i18n support is largely theoretical, but this should eventually use the install's default language rules when they exist, not be hard-coded as English.)
epriestley: Why can't we?
WikiChad: Actually, good catch. I kind of just assumed English since, like you said, i18n is theoretical. In our WMF setup we actually fall back to default analysis when we don't have a language-specific implementation, rather than falling back to English. Really, there's no one-size-fits-all to this problem. CJK typically requires additional considerations (for Japanese and Chinese there are plugins far superior to what Lucene ships by default), so it's not just a matter of dropping language names in (although we can do that to some extent). Since some stuff is generally useful across all languages (like `auto_expand_replicas: 0-1`), we can probably factor this out a bit and build the array based on the language.
WikiChad: Work in progress :) Basically you need a "recursively compare the two arrays and be really lenient with booleans and integers" and I have a headache today :(
manybubbles: I can break down the non-English question by walking through the choices.

The `english_possessive_stemmer` is going to create false positive matches outside of English but is otherwise harmless. It's probably a good idea for most codebases that expect comments in English.

The `lowercase` filter is fine for all languages except Greek and Persian; they'd need to configure it to lowercase properly for them, as otherwise it makes mistakes that produce somewhat unexpected results.

`english_stop` is kind of debatable. English speakers expect words like "is", "and", "but", and "the" to be optional in their searches. The way we accomplish this on WMF wikis is by analyzing everything twice: once with stopwords and stemming and stuff, and once without. Then we search both and rank results that match the more exact analysis form more highly. Another way to do it is to leave the stop words in the index (don't add the filter) and use something like the `cutoff_frequency` option on the match query and its kin to convert those matches to optional. That works well too. The disadvantage of removing stopwords (adding this filter) is that searches containing only stop words will find nothing, and searches containing stop words in phrases will find stuff that doesn't have the stop words. I vote for removing this filter if we aren't going to analyze twice; then we have the option of using `cutoff_frequency` or making stopwords required. Of course, you could always make the default search operator OR rather than AND, in which case stopwords are always optional.

I'm not a fan of `english_stemmer` and I'd vote to just switch it with `kstem`. `english_stemmer` uses the Porter stemmer, but kstem has fewer false positives.

Finally: searching for `I like eating apples` will search for `I AND "lik ike" AND "eat ati tin ing" AND "app ppl ple les"`. This'll find `I like eating apples`, but also `ilikeeatingapples` and `IllIkeAtIngapPlesiosaur`.

Whether this is OK all depends on your tolerance for false positives. All those phrase queries might become expensive, especially if someone searches for a really long word: that'll make a very large number of phrase queries. It _could_ be a DoS vector. The way we go about this on Wikipedias is to use the `word_delimiter` filter to cause artificial word breaks at case changes, and we also use a character filter to transform things like `_` and `(` into ` `. This makes finding partial variable names possible without any phrase queries. It's not foolproof: searching for `randomthings` won't find `random_things`. In fact, it'll only find `randomThings` if you do the two-analyzers trick I mentioned above and leave out the `word_delimiter` filter on the exact matches. This limitation stems from Lucene's low-level contract for TokenStream: the tokens must come out in ascending order (repeats are OK) in terms of token position and start and end offset. That's usually a fine thing for prose, but not so good for code.
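The trigram expansion manybubbles describes is easy to verify by hand. A small sketch of how an ngram filter with `min_gram = max_gram = 3` decomposes a token (a reimplementation for illustration, not the Lucene filter itself):

```python
def trigrams(token, n=3):
    """Emit the ngram filter's output for one token (min_gram == max_gram == n).

    Tokens shorter than min_gram produce no grams, as in the Lucene filter.
    """
    return [token[i:i + n] for i in range(len(token) - n + 1)]

# Matches the expansion quoted in the comment above:
print(trigrams('eating'))  # prints ['eat', 'ati', 'tin', 'ing']
print(trigrams('apples'))  # prints ['app', 'ppl', 'ple', 'les']
```

Every document containing all of a word's trigrams in sequence matches, which is exactly why `ilikeeatingapples` matches the phrase query for `eating`'s grams.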
manybubbles: Oh foo! Markdown doesn't work. I don't have an edit button either.
fabe: I think removing the stopwords is no big deal. Getting no results when searching for stopwords only can indeed be counterintuitive, and Elasticsearch's normalisation will deal with the stopwords and sort the results properly (same thing usually for the ngram results). In our install we are seeing quite good results with the ngrams filter; especially getting search hits for class/method names, which tend to be large constructed words, can be tricky without it. We could increase the min/max setting to 4 or 5, though, if we see performance problems. It might make sense to remodel the index in the future to, for example, also enable search through the actual source code; right now we are treating all documents equally (tasks, wiki, commits, diffs). But for now this mapping will vastly improve the current search.