Implement partial / wildcard searching (Elasticsearch)
Closed, Resolved (Public)

Description

We were getting complaints that searching our Phabricator install was not straightforward: you always had to type words exactly as they appeared in a task's title. For example, searching for "charts" will not match a task whose title contains "graphingcharts".

Our install uses an Elasticsearch instance for indexing and searching. As far as I understand, this kind of wildcard query should be possible with Elasticsearch?

See also: T6740: Put some kind of stemmer on the MySQL search index

Event Timeline

GMTA created this task. Nov 14 2014, 9:03 AM
GMTA raised the priority of this task to Needs Triage.
GMTA updated the task description.
GMTA added a project: Search.
GMTA added a subscriber: GMTA.
qgil added a subscriber: demon. Nov 14 2014, 10:05 AM
qgil added a subscriber: qgil.

Same problem in our instance, and we also have Elasticsearch. See https://phabricator.wikimedia.org/T679

Also, for what it's worth: T4818: Advanced Query's "Contains Text" seems to expect entire words instead of substrings?

chad triaged this task as Normal priority. Nov 15 2014, 5:50 AM
chad added a subscriber: chad.

(I would like this tooooo)

fabe added a subscriber: fabe. Edited Dec 4 2014, 12:17 PM

One way of doing this is to adjust the mapping for the Elasticsearch index (this is actually far more powerful than a wildcard search).
I have switched to the mapping below (added as a template that expects the index to be named 'phabricator', with English language settings).
Feel free to change the min/max ngram settings (e.g. 4-letter instead of 3-letter ngrams) if you get too many or too few results.
If you would like to switch to another language, the docs for doing so are here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html
For it to take effect you have to delete the index and then reindex all objects (bin/search index --all).

{
  "template": "phabricator",
  "settings": {
    "analysis": {
      "filter": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_trigrams": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer",
            "trigrams_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "CMIT": {
      "properties": {
        "field": {
          "properties": {
            "corpus": {
              "type": "string",
              "analyzer": "english_trigrams"
            }
          }
        }
      }
    },
    "DREV": {
      "properties": {
        "field": {
          "properties": {
            "corpus": {
              "type": "string",
              "analyzer": "english_trigrams"
            }
          }
        }
      }
    },
    "MOCK": {
      "properties": {
        "field": {
          "properties": {
            "corpus": {
              "type": "string",
              "analyzer": "english_trigrams"
            }
          }
        }
      }
    },
    "PROJ": {
      "properties": {
        "field": {
          "properties": {
            "corpus": {
              "type": "string",
              "analyzer": "english_trigrams"
            }
          }
        }
      }
    },
    "TASK": {
      "properties": {
        "field": {
          "properties": {
            "corpus": {
              "type": "string",
              "analyzer": "english_trigrams"
            }
          }
        }
      }
    },
    "USER": {
      "properties": {
        "field": {
          "properties": {
            "corpus": {
              "type": "string",
              "analyzer": "english_trigrams"
            }
          }
        }
      }
    },
    "WIKI": {
      "properties": {
        "field": {
          "properties": {
            "corpus": {
              "type": "string",
              "analyzer": "english_trigrams"
            }
          }
        }
      }
    }
  }
}
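
To illustrate why a trigram filter allows substring matches, here is a minimal Python sketch. It is not how Elasticsearch implements the analyzer, and it only models the lowercase and 3-gram steps of the chain above (stemming and stop-word removal are left out). The helper name `trigrams` is made up for this example.

```python
def trigrams(token, n=3):
    """Return the set of n-character substrings of a lowercased token."""
    token = token.lower()
    return {token[i:i + n] for i in range(len(token) - n + 1)}


# Word as it appears in the task title, and the word the user types.
indexed = trigrams("graphingcharts")
query = trigrams("charts")

# Every trigram of the query also occurs among the trigrams of the
# indexed word, so a field analyzed this way can match on a substring.
print(sorted(query))
print(query <= indexed)
```

Because short fragments are shared by many unrelated words, smaller ngrams tend to produce more matches, which is presumably why raising the ngram size helps when there are too many results.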

The template is installed like this:
curl -XPUT 'http://elasticsearchhost:9200/_template/template_phabricator' -d @mapping.json
where mapping.json is the file containing the JSON above.
If you do not want that level of detail for certain types, you can remove them from the mapping and Elasticsearch will fall back to its defaults.
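
For anyone who would rather script the upload than use curl, the same PUT can be sketched in Python with the standard library. This is only a rough equivalent under some assumptions: the function name `build_template_request` is made up, the host and port are the placeholders from the curl command, and the Content-Type header is an addition that newer Elasticsearch versions are known to require. The request is built but not sent.

```python
import json
import urllib.request


def build_template_request(host, template):
    """Build (but do not send) a PUT request that installs an index template."""
    # Serializing through json.dumps guarantees the body is valid JSON.
    body = json.dumps(template).encode("utf-8")
    return urllib.request.Request(
        url="http://%s:9200/_template/template_phabricator" % host,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# A cut-down template for illustration; the real one is the full mapping.
request = build_template_request("elasticsearchhost", {"template": "phabricator"})
print(request.get_method(), request.full_url)

# To actually install it, pass the request to urllib.request.urlopen(request).
```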

qgil moved this task from Backlog to Important on the Wikimedia board.
qgil renamed this task from Implement partial / wildcard searching to Implement partial / wildcard searching (Elasticsearch). Mar 13 2015, 5:00 PM
qgil updated the task description.
qgil removed a project: Wikimedia.
rfergu added a subscriber: rfergu. Mar 31 2015, 2:42 AM
Pawka added a subscriber: Pawka. Sep 14 2015, 1:09 PM
svemir added a subscriber: svemir.
epriestley moved this task from Backlog to ElasticSearch on the Search board. Dec 8 2016, 7:01 PM
anda added a subscriber: anda. Jan 31 2017, 1:29 PM

As a general product decision, I do not expect search to be substring search by default -- searching for pricot on Google does not match documents containing apricot. But we can sort this out in the long run.

With the elasticsearch 'simple_query_string' query parser it only works if you use *pricot, for example, outside of quoted phrases.
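
As a rough illustration of what such a request could look like, the sketch below builds a `simple_query_string` search body in Python. The helper name is made up, and the field name `field.corpus` is taken from the mapping posted earlier in this thread; whether Phabricator itself queries that field in this way is an assumption.

```python
import json


def simple_query_string_body(text, fields=("field.corpus",)):
    """Build a search body that uses Elasticsearch's simple_query_string query."""
    return {
        "query": {
            "simple_query_string": {
                "query": text,           # the user's search text, wildcard included
                "fields": list(fields),  # assumed field name, from the mapping above
            }
        }
    }


body = simple_query_string_body("*pricot")
print(json.dumps(body, indent=2))
```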