src/applications/search/engine/PhabricatorSearchEngineElastic.php
  try {
      $query->getParameter('query'), '+-&|!(){}[]^"~*?:\\'));
    $response = $this->executeRequest($uri, $this->buildSpec($query));
  }

  $phids = ipull($response['hits']['hits'], '_id');
  return $phids;
}
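The truncated call above passes the user's raw query together with the character set `+-&|!(){}[]^"~*?:\` — the characters that are reserved in Elasticsearch query syntax. The hidden portion of the statement presumably escapes them; as a rough, hedged sketch of that kind of sanitization (the helper name here is hypothetical, not Phabricator's API):

```python
# The reserved-character set from the PHP call above.
RESERVED = '+-&|!(){}[]^"~*?:\\'

def escape_reserved(query):
    """Backslash-escape query metacharacters so user input matches literally.

    This is an illustrative sketch of the sanitization step, not the actual
    Phabricator implementation (which is elided in the diff above).
    """
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in query)

print(escape_reserved('foo:bar'))   # prints foo\:bar
print(escape_reserved('plain'))     # prints plain
```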
public function indexExists() { | |||||
try { | |||||
(bool)$this->executeRequest('/_search/', array()); | |||||
} catch (HTTPFutureHTTPResponseStatus $e) { | |||||
if ($e->getStatusCode() == 404) { | |||||
return false; | |||||
} else if ($e->getStatusCode() == 400) { | |||||
return true; | |||||
} | |||||
throw $e; | |||||
} | |||||
} | |||||
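The status-code logic in `indexExists()` is worth spelling out: a 404 from `/_search/` means the index is missing, a 400 means the index exists but rejected the empty probe, and any other error is unexpected and should propagate. A minimal sketch of that decision, with a hypothetical `http_status` callable standing in for the HTTP layer:

```python
def index_exists(http_status):
    """Interpret the result of a probe request against the index.

    http_status: callable that issues the probe and returns the HTTP
    status code (a stand-in for executeRequest() in the PHP above).
    """
    status = http_status('/_search/')
    if status in (200, 400):
        # 400: the index exists but rejected the empty probe body.
        return True
    if status == 404:
        return False
    raise RuntimeError('unexpected status from Elasticsearch: %d' % status)

print(index_exists(lambda path: 404))  # prints False
print(index_exists(lambda path: 400))  # prints True
```

Note that the PHP version above, as reviewed, drops the success result rather than returning it; the reviewers flag exactly this below.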
public function createIndex() { | |||||
$data = array( | |||||
'settings' => array( | |||||
'auto_expand_replicas' => '0-1', | |||||
'analysis' => array( | |||||
'filter' => array( | |||||
'english_stop' => array( | |||||
'type' => 'stop', | |||||
'stopwords' => '_english_', | |||||
), | |||||
'english_possessive_stemmer' => array( | |||||
'type' => 'stemmer', | |||||
'language' => 'possessive_english', | |||||
), | |||||
'trigrams_filter' => array( | |||||
'min_gram' => 3, | |||||
'type' => 'ngram', | |||||
'max_gram' => 3, | |||||
), | |||||
'english_stemmer' => array( | |||||
'type' => 'stemmer', | |||||
'language' => 'english', | |||||
), | |||||
), | |||||
'analyzer' => array( | |||||
'english_trigrams' => array( | |||||
'type' => 'custom', | |||||
'filter' => array( | |||||
'english_possessive_stemmer', | |||||
'lowercase', | |||||
'english_stop', | |||||
'english_stemmer', | |||||
'trigrams_filter', | |||||
), | |||||
'tokenizer' => 'standard', | |||||
), | |||||
), | |||||
), | |||||
), | |||||
); | |||||
$types = array_keys( | |||||
PhabricatorSearchApplicationSearchEngine::getIndexableDocumentTypes()); | |||||
foreach ($types as $type) { | |||||
$data['mappings'][$type]['properties']['field']['properties']['corpus'] = | |||||
array( 'type' => 'string', 'analyzer' => 'english_trigrams' ); | |||||
} | |||||
$this->executeRequest('/', $data, true); | |||||
} | |||||
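Serialized, `createIndex()` PUTs a body like the following to the index root: the analysis settings define the trigram analyzer chain, and the `mappings` loop points every document type's `field.corpus` property at it. A Python sketch of the same structure (the document type names here are illustrative placeholders, not the real constants returned by `getIndexableDocumentTypes()`):

```python
import json

settings = {
    'settings': {
        'auto_expand_replicas': '0-1',
        'analysis': {
            'filter': {
                # Fixed-width trigrams; the other filters from the PHP
                # array (english_stop, english_stemmer,
                # english_possessive_stemmer) are elided here.
                'trigrams_filter': {
                    'type': 'ngram', 'min_gram': 3, 'max_gram': 3,
                },
            },
            'analyzer': {
                'english_trigrams': {
                    'type': 'custom',
                    'tokenizer': 'standard',
                    'filter': [
                        'english_possessive_stemmer', 'lowercase',
                        'english_stop', 'english_stemmer', 'trigrams_filter',
                    ],
                },
            },
        },
    },
}

# Illustrative document types; the PHP code asks the search application
# for the real list.
for doc_type in ('TASK', 'WIKI', 'CMIT'):
    settings.setdefault('mappings', {})[doc_type] = {
        'properties': {'field': {'properties': {'corpus': {
            'type': 'string', 'analyzer': 'english_trigrams',
        }}}},
    }

body = json.dumps(settings)
```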
private function executeRequest($path, array $data, $is_write = false) {
  $uri = new PhutilURI($this->uri);
  $uri->setPath($this->index);
epriestley: This `(bool)` cast is unnecessary, as the result is unused.
WikiChad: I meant to return here. I'm actually a little miffed I'm getting a 400 from this as it should return 200.
  $uri->appendPath($path);
  $data = json_encode($data);

  $future = new HTTPSFuture($uri, $data);
  if ($is_write) {
    $future->setMethod('PUT');
  }

  if ($this->getTimeout()) {
    $future->setTimeout($this->getTimeout());
epriestley: (Should we throw here? Reaching this section would be unexpected and indicate some serious behavioral change in ElasticSearch?)
WikiChad: Probably yeah.
  }

  list($body) = $future->resolvex();

  if ($is_write) {
epriestley: From a cursory reading of the documentation, I would expect this to maybe be `0-all` instead of `0-2`. Can you walk me through the reasoning here?
WikiChad: `0-all` tends to be needlessly redundant, as it would put a replica of each shard on each node. If an index is large enough, it could actually be disruptive to have that many replicas. `0-2` has worked well for us in production usage with large, high-traffic indexes, so that was my reasoning here.
manybubbles: 0-2?

    return null;
  }
  $body = json_decode($body, true);
  if (!is_array($body)) {
    throw new Exception('elasticsearch server returned invalid JSON!');
  }

  return $body;
}
}
epriestley: Offhand, do you know if this index will perform worse for non-English text than the default index does? That is, is this an improvement for English text but a setback for Chinese text? Naively, I'd expect it not to have much of an effect, since it looks like they'll use the same tokenizer. (It's not terribly important for the moment since i18n support is largely theoretical, but this should eventually use the install's default language rules when they exist, not be hard-coded as English.)
epriestley: Why can't we?
WikiChad: Actually, good catch. I kind of just assumed English since, like you said, i18n is theoretical. In our WMF setup we actually fall back to default analysis when we don't have a language-specific implementation, rather than falling back to English. Really, there's no one-size-fits-all to this problem. CJK typically requires additional considerations (for Japanese and Chinese there are plugins far superior to what Lucene ships by default), so it's not just a matter of dropping language names in (although we can do that to some extent). Since some stuff is generally useful across all languages (like `auto_expand_replicas: 0-1`), we can probably factor this out a bit and build the array based on the language.
WikiChad: Work in progress :) Basically you need a "recursively compare the two arrays and be really lenient with booleans and integers" and I have a headache today :(
manybubbles: I can break down the non-English question by walking through the choices.

The `english_possessive_stemmer` is going to create false positive matches outside of English but is otherwise harmless. It's probably a good idea for most codebases that expect comments in English.

The `lowercase` filter is fine for all languages except Greek and Persian; they'd need to configure it to lowercase properly for them, as otherwise it makes mistakes that produce somewhat unexpected results.

`english_stop` is kind of debatable. English speakers expect words like "is", "and", "but", and "the" to be optional in their searches. The way we accomplish this on WMF wikis is by analyzing everything twice: once with stopwords and stemming and stuff, and once without. Then we search both and rank results that match the more exact analysis form more highly. Another way to do it is to leave the stop words in the index (don't add the filter) and use something like the `cutoff_frequency` option on the match query and its kin to convert those matches to optional. That works well too. The disadvantage of removing stopwords (adding this filter) is that searches containing only stop words will find nothing, and searches containing stop words in phrases will find stuff that doesn't have the stop words. I vote for removing this filter if we aren't going to analyze twice; then we have the option of using `cutoff_frequency` or making stopwords required. Of course, you could always make the default search operator OR rather than AND, in which case stopwords are always optional.

I'm not a fan of `english_stemmer` and I'd vote to just switch it with `kstem`. `english_stemmer` uses the Porter stemmer, but kstem has fewer false positives.

Finally: searching for `I like eating apples` will search for `I AND "lik ike" AND "eat ati tin ing" AND "app ppl ple les"`. This'll find `I like eating apples`, but also `ilikeeatingapples` and `IllIkeAtIngapPlesiosaur`.

Whether this is OK all depends on your tolerance for false positives. All those phrase queries might become expensive, especially if someone searches for a really long word: that'll make a very large number of phrase queries. It _could_ be a DoS vector. The way we go about this on Wikipedias is to use the `word_delimiter` filter to cause artificial word breaks at case changes, and we also use a character filter to transform things like `_` and `(` into ` `. This makes finding partial variable names possible without any phrase queries. It's not foolproof: searching for `randomthings` won't find `random_things`. In fact, it'll only find `randomThings` if you do the two-analyzers trick I mentioned above and leave out the `word_delimiter` filter on the exact matches. This limitation stems from Lucene's low-level contract for TokenStream: the tokens must come out in ascending order (repeats are OK) in terms of token position and start and end offset. That's usually a fine thing for prose, but not so good for code.
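The trigram expansion manybubbles describes is easy to verify by hand. A small sketch of how an ngram filter with `min_gram = max_gram = 3` decomposes a token (a reimplementation for illustration, not the Lucene filter itself):

```python
def trigrams(token, n=3):
    """Emit the ngram filter's output for one token (min_gram == max_gram == n).

    Tokens shorter than min_gram produce no grams, as in the Lucene filter.
    """
    return [token[i:i + n] for i in range(len(token) - n + 1)]

# Matches the expansion quoted in the comment above:
print(trigrams('eating'))  # prints ['eat', 'ati', 'tin', 'ing']
print(trigrams('apples'))  # prints ['app', 'ppl', 'ple', 'les']
```

Every document containing all of a word's trigrams in sequence matches, which is exactly why `ilikeeatingapples` matches the phrase query for `eating`'s grams.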
manybubbles: Oh foo! Markdown doesn't work. I don't have an edit button either.
fabe: I think removing the stopwords is no big deal. Getting no results when searching for stopwords only can indeed be counterintuitive, and Elasticsearch's normalisation will deal with the stopwords and sort the results properly (same thing usually for the ngram results). In our install we are seeing quite good results with the ngrams filter; especially getting search hits for class/method names, which tend to be large constructed words, can be tricky without it. We could increase the min/max setting to 4 or 5, though, if we see performance problems. It might make sense to remodel the index in the future to, for example, also enable search through the actual source code; right now we are treating all documents equally (tasks, wiki, commits, diffs). But for now this mapping will vastly improve the current search.