Support ElasticSearch 2.0 - 5.1
Closed, Resolved
Public

Assigned To

Authored By

	epriestley
	Dec 3 2015, 2:07 PM

Description

ElasticSearch 2.0 was released recently (October 2015). We should support it.

NOTE: We do not currently support ElasticSearch 2.0+. Use it at your own risk.

Revisions and Commits

rP Phabricator
	D17384	rPe41c25de5050 Support multiple fulltext search clusters with 'cluster.search' config

Related Objects

Status	Assigned	Task
Resolved	20after4	T9893 Support ElasticSearch 2.0 - 5.1
Duplicate	None	T9889 Object selector dialog does not work in Elasticsearch 2.0
Resolved	20after4	T9779 ./bin/search init error with elasticsearch 2.0
Wontfix	ru31337	T9670 Conflicting field mappings for phabricator index when upgrading to Elasticsearch 2.0

Event Timeline

epriestley created this task. Dec 3 2015, 2:07 PM

epriestley raised the priority of this task from to Low.

epriestley updated the task description.

epriestley added a project: Search.

epriestley added subtasks: T9889: Object selector dialog does not work in Elasticsearch 2.0, T9779: ./bin/search init error with elasticsearch 2.0.

epriestley added a subtask: T9670: Conflicting field mappings for phabricator index when upgrading to Elasticsearch 2.0.

epriestley added a subscriber: epriestley.

As an alternative, I'd like to consider dropping upstream support for ElasticSearch entirely. We originally implemented it because Facebook wanted it and the project was more willing to implement and maintain things. However:

Engines like the ElasticSearch engine can now reasonably be maintained as extensions.
The upstream is severely lacking in ElasticSearch expertise and has essentially no idea how to maintain or test this code.
It's not clear that MySQL search really has any major scalability or quality limitations for most installs.

Some arguments for ElasticSearch:

It can do stemming, and MySQL currently can't (T6740), although this is fixable.
We have no pathway forward for searching non-Latin languages, nor do we ever expect to have one (T2632).
As far as I can tell, there's zero documentation on how to configure ElasticSearch but we still have installs doing it, so they must find it useful?

If you're using ElasticSearch on your install, can you let us know why you're choosing to configure it over the builtin MySQL search?

epriestley mentioned this in T5282: Provide documentation on setting up ElasticSearch. Dec 3 2015, 2:16 PM

epriestley mentioned this in T9889: Object selector dialog does not work in Elasticsearch 2.0.

Broadly, to set expectations, this is a very low priority for the upstream and neither of us really has any familiarity with Elasticsearch, so I don't expect to move this forward for a long time.

If you're familiar with Elasticsearch we're relatively willing to accept patches, but would be more interested if you wanted to pull support out of the upstream and maintain it as an extension. This involves a lot of overhead today, but after T5055 there may be a more reasonable pathway toward manageable extension support.

epriestley added a project: Elasticsearch. Dec 3 2015, 2:21 PM

• nwaf1990 closed this task as a duplicate of T9892: I want to be sent email when a build fails for any commit in a repository. Dec 3 2015, 3:50 PM

epriestley reopened this task as Open. Dec 3 2015, 3:53 PM

cburroughs added a subscriber: cburroughs. Dec 3 2015, 4:12 PM

joshuaspence added a subscriber: joshuaspence. Dec 3 2015, 7:37 PM

clippit added a subscriber: clippit. Dec 28 2015, 6:43 AM

If you're using ElasticSearch on your install, can you let us know why you're choosing to configure it over the builtin MySQL search?

At some point, Phabricator began to complain about my MySQL setup. The issues said, "If you later plan to configure ElasticSearch, you can also ignore this warning". I figured that instead of making ad-hoc tweaks to MySQL, I ought to do things properly, so I installed ElasticSearch.

As far as I can tell, there's zero documentation on how to configure ElasticSearch but we still have installs doing it, so they must find it useful?

The lack of documentation should have tipped me off that it's not the preferred search engine! Now that I know the issues weren't actually recommending that I use ElasticSearch with Phabricator and that the latest version won't be supported in the near future, I see no reason not to switch back to the builtin search.

tycho.tatitscheff added a subscriber: tycho.tatitscheff. Dec 30 2015, 6:08 AM

epriestley merged a task: T9889: Object selector dialog does not work in Elasticsearch 2.0. Jan 15 2016, 1:34 PM

epriestley added subscribers: chad, • Basheer.

T10161 is a case of another install that selected ElasticSearch without a specific reason, based on a general belief that it is better in some intangible way.

xiaogaozi added a subscriber: xiaogaozi. Feb 16 2016, 10:00 AM

• cuiwoming added a subscriber: • cuiwoming. Aug 1 2016, 7:25 AM

Herald added a subscriber: eadler. Aug 1 2016, 7:25 AM

sascha-egerer added a subscriber: sascha-egerer. Sep 9 2016, 9:29 AM

We actually have some problems with our Phabricator install.
We use it with MySQL, but in a Galera (PXC) cluster, and it does not work well with MyISAM tables.

So I was wondering if using ES would be a solution, but it seems not. :(

Wikimedia is considering another stab at Elasticsearch because we just hit a fairly serious scalability issue with MyISAM (https://phabricator.wikimedia.org/T146673) and it looks like InnoDB isn't a lot better. In addition to that, the MySQL fulltext engines don't seem to handle stemming or even simple prefix matching.

MZMcBride added a subscriber: MZMcBride. Sep 28 2016, 5:47 AM

With InnoDB now mostly finished indexing, search seems to be working and the service isn't getting killed by locked tables. More info at the link I pasted above ^

We're planning to put application-level stemming on the MySQL index (see T6740). Are there problems with the InnoDB search that you're aware of that aren't fixable with application adjustments?

It's probably very easy for us to make bin/storage adjust use InnoDB FULLTEXT instead of MyISAM FULLTEXT if it's available.

(I'm not sure what "simple prefix matching" means exactly in terms of search, can you give me an example? Maybe that's such a feature.)

aklapper added a subscriber: aklapper. Sep 28 2016, 12:41 PM

greggrossmeier added a subscriber: greggrossmeier. Sep 28 2016, 4:15 PM

@epriestley:
So far InnoDB is working. I can't really say whether it's worse than MyISAM; it seems pretty similar. Stemming would help a lot, but my main complaint is that the ranking algorithm seems really bad. It doesn't return the most relevant results first. MyISAM had the same problem, though.

What I meant by simple prefix matching is simply that a search for, e.g., "HTTPS" does not match "HTTPSFuture". Of course exact matches should come before prefix matches, but it'd be nice if it could do prefix/suffix matching. Stemming would take care of the most common use cases, but when searching for commits rather than tasks I often find myself wanting a somewhat fuzzier search.
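
To make the "HTTPS" versus "HTTPSFuture" request concrete, here is a minimal sketch of prefix matching that ranks exact matches first. It is only an illustration: `rank_matches` is a hypothetical helper, not anything in Phabricator or MySQL, and it works on a plain list of tokens rather than a real index.

```python
def rank_matches(query, tokens):
    """Return tokens matching `query`: exact matches first, then prefix matches.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    q = query.lower()
    exact = [t for t in tokens if t.lower() == q]
    prefix = [t for t in tokens if t.lower().startswith(q) and t.lower() != q]
    return exact + prefix


results = rank_matches("HTTPS", ["HTTPSFuture", "HTTP", "https", "HTTPSProxy"])
print(results)
```

Here "HTTP" is dropped because it is shorter than the query, the exact match "https" comes first, and the two longer tokens follow as prefix matches.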

epriestley moved this task from Backlog to ElasticSearch on the Search board. Dec 8 2016, 6:49 PM

BTW, it's probably time to make it 5.0, not 2.0.

@ramm: I'm already on it: https://phabricator.wikimedia.org/T155299

So, @epriestley: How do you feel about me taking over the Elasticsearch backend and moving it out to an extension rather than keeping it in Phabricator core? The other possibility is that I could just fork the elastic backend and name it something different, as it seems Phabricator already supports loading the engine extension dynamically.

Broadly, I'm supportive.

I'd maybe like to give you a cleaner way to separate the engines, though -- I think it would be somewhat tough to pull ElasticSearch out right now without leaving some pieces behind in the upstream. In particular, there's some hardcoded stuff like this:

if (elastic_search_is_configured()) {
  use_elastic_search();
} else {
  use_mysql_search();
}

That's probably not hugely difficult to fix up, but I think there are 4-5 of those that need adjustment.
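
As a rough sketch of what replacing those conditionals could look like, the example below assumes a small registry of engine classes that each report whether they are usable. The class names, the `is_enabled` method, and the `elasticsearch.host` lookup are hypothetical and are not Phabricator's actual API; Python is used only for illustration.

```python
class SearchEngine:
    """Hypothetical base class; each engine decides whether it is usable."""
    name = "base"

    def is_enabled(self, config):
        raise NotImplementedError


class MySQLEngine(SearchEngine):
    name = "mysql"

    def is_enabled(self, config):
        # The builtin engine is always available as a fallback.
        return True


class ElasticEngine(SearchEngine):
    name = "elasticsearch"

    def is_enabled(self, config):
        # Assumption: the engine counts as configured when a host is set.
        return bool(config.get("elasticsearch.host"))


def select_engine(engines, config):
    """Return the first enabled engine, in priority order."""
    for engine in engines:
        if engine.is_enabled(config):
            return engine
    raise RuntimeError("no search engine is enabled")


# Priority order lives in one list instead of scattered if/else checks.
ENGINES = [ElasticEngine(), MySQLEngine()]
```

With this shape, callers ask `select_engine(ENGINES, config)` and never mention a specific engine, so an engine could move to an extension by registering itself in the list.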

It would also be nice to be able to configure each engine to receive writes independently so that it's possible to keep the indexes in multiple engines up to date. I don't think there's a super strong use case for this, but it should be nearly free to do if everything else is modularized cleanly, and it would make stuff like "try ElasticSearch and see how it goes" a lot easier.

Ideally, maybe this panel (in Applications → Search) would expand so you can do these things:

Enable/disable writes on each engine; we write to all engines with writes enabled.
Configure options on each engine, like elasticsearch.host -- although this gets tricky because that config is locked, and dangerous to let administrators edit from the web UI, since it massively violates policies to send search indexes to an arbitrary host.
Choose which engine reads go to (or: enable reads for multiple engines, and have them fail over sequentially if one is down? Probably very YAGNI).
Maaaybe add multiple copies of engines like ElasticSearch, so you could, e.g., migrate from a v2 cluster to a v3 cluster without downtime by bringing up the second cluster, enabling double-writes, running the index, then swapping them. Probably very YAGNI again.

That's a whole lot of work, though, and those use cases are pretty far-future / hypothetical / completely-imagined. But if we don't support them now, we won't be able to later without backward compatibility breaks.
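
The write fan-out and sequential read failover described above could be sketched roughly as follows. Everything here is an assumption made for illustration: `InMemoryEngine`, `SearchCluster`, `EngineUnavailable`, and the `writable`/`readable`/`up` flags are hypothetical names, and the engines are in-memory stand-ins rather than real MySQL or ElasticSearch clients.

```python
class EngineUnavailable(Exception):
    """Raised when an engine cannot serve a read."""


class InMemoryEngine:
    """Toy stand-in for a search backend (hypothetical, for illustration)."""

    def __init__(self, name, writable=True, readable=True, up=True):
        self.name = name
        self.writable = writable
        self.readable = readable
        self.up = up  # For simplicity, only search() simulates an outage.
        self.docs = {}

    def index(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, query):
        if not self.up:
            raise EngineUnavailable(self.name)
        q = query.lower()
        return sorted(d for d, text in self.docs.items() if q in text.lower())


class SearchCluster:
    def __init__(self, engines):
        self.engines = list(engines)

    def index(self, doc_id, text):
        # Write to every engine that has writes enabled.
        for engine in self.engines:
            if engine.writable:
                engine.index(doc_id, text)

    def search(self, query):
        # Try readable engines in order; fail over if one is down.
        for engine in self.engines:
            if not engine.readable:
                continue
            try:
                return engine.search(query)
            except EngineUnavailable:
                continue
        raise EngineUnavailable("no readable engine is up")


es = InMemoryEngine("elasticsearch", up=False)
mysql = InMemoryEngine("mysql")
cluster = SearchCluster([es, mysql])
cluster.index("T9893", "Support ElasticSearch 2.0 - 5.1")
hits = cluster.search("elasticsearch")
```

In this run both engines receive the write, the first engine fails the read, and the query falls through to the second one. The same double-write arrangement is what the migration scenario would rely on: bring up a second cluster, enable writes on it, reindex, then reorder the list.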

@epriestley: This all sounds excellent. I'll start by fixing any instances of

if (elastic_search_is_configured()) { ... }

And I agree that it would be really great if multiple engines could receive writes at the same time; that would make migrating a lot easier.

Another thing I noticed, which I hacked my way around in the Wikimedia fork, is that the back-end is completely unaware of the user context when searching. That means the back-end can't do smart things like boosting objects that the user has interacted with in the past.

@epriestley: Would you be opposed to giving the back-ends access to the user's PHID during a search? I assume that you omitted that for a reason, e.g. you wanted to keep the back-ends as dumb as possible and leave the smarts to Phabricator. I may be entirely misguided in trying to make Elasticsearch do smart context-aware things, but so far it seems to work well enough without making too much of a mess architecturally.
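
For illustration only, a viewer-aware boost of the kind described here might look like the sketch below. It is not the Wikimedia fork's implementation: `boost_for_viewer`, the `interactions` mapping, the multiplier, and the example PHIDs are all hypothetical.

```python
def boost_for_viewer(results, viewer_phid, interactions, boost=2.0):
    """Re-rank (doc_phid, score) pairs, boosting documents the viewer touched.

    `interactions` maps a viewer PHID to the set of document PHIDs that the
    viewer has interacted with. All names here are hypothetical.
    """
    touched = interactions.get(viewer_phid, set())
    rescored = [
        (doc, score * boost if doc in touched else score)
        for doc, score in results
    ]
    # Highest score first; sorted() is stable, so ties keep backend order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


results = [("PHID-TASK-aaaa", 1.0), ("PHID-TASK-bbbb", 0.8)]
interactions = {"PHID-USER-xxxx": {"PHID-TASK-bbbb"}}
ranked = boost_for_viewer(results, "PHID-USER-xxxx", interactions)
```

The point of the sketch is only that the back-end needs the viewer's PHID as an input; without it, every viewer gets the same ordering.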

0 removed a subscriber: 0. Feb 2 2017, 4:43 AM