ElasticSearch 2.0 was released recently (October 2015). We should support it.
Status | Assigned | Task
---|---|---
Resolved | 20after4 | T9893 Support ElasticSearch 2.0 - 5.1
Duplicate | None | T9889 Object selector dialog does not work in Elasticsearch 2.0
Resolved | 20after4 | T9779 ./bin/search init error with elasticsearch 2.0
Wontfix | ru31337 | T9670 Conflicting field mappings for phabricator index when upgrading to Elasticsearch 2.0
Event Timeline
As an alternative, I'd like to consider dropping upstream support for ElasticSearch entirely. We originally implemented it because Facebook wanted it and the project was more willing to implement and maintain things. However:
- Engines like the ElasticSearch engine can now reasonably be maintained as extensions.
- The upstream is severely lacking in ElasticSearch expertise and has essentially no idea how to maintain or test this code.
- It's not clear that MySQL search really has any major scalability or quality limitations for most installs.
Some arguments for ElasticSearch:
- It can do stemming, and MySQL currently can't (T6740), although this is fixable.
- We have no pathway forward for searching non-Latin languages with the MySQL engine, nor do we ever expect to (T2632).
- As far as I can tell, there's zero documentation on how to configure ElasticSearch but we still have installs doing it, so they must find it useful?
If you're using ElasticSearch on your install, can you let us know why you're choosing to configure it over the builtin MySQL search?
Broadly, to set expectations, this is a very low priority for the upstream and neither of us really has any familiarity with Elasticsearch, so I don't expect to move this forward for a long time.
If you're familiar with Elasticsearch we're relatively willing to accept patches, but would be more interested if you wanted to pull support out of the upstream and maintain it as an extension. This involves a lot of overhead today, but after T5055 there may be a more reasonable pathway toward manageable extension support.
If you're using ElasticSearch on your install, can you let us know why you're choosing to configure it over the builtin MySQL search?
At some point, Phabricator began to complain about my MySQL setup. The issues said, "If you later plan to configure ElasticSearch, you can also ignore this warning". I figured that instead of making ad-hoc tweaks to MySQL, I ought to do things properly, so I installed ElasticSearch.
As far as I can tell, there's zero documentation on how to configure ElasticSearch but we still have installs doing it, so they must find it useful?
The lack of documentation should have tipped me off that it's not the preferred search engine! Now that I know the issues weren't actually recommending that I use ElasticSearch with Phabricator and that the latest version won't be supported in the near future, I see no reason not to switch back to the builtin search.
T10161 is a case of another install that selected ElasticSearch without a specific reason, based on a general belief that it is better in some intangible way.
We actually have some problems with our Phabricator install.
We use it with MySQL, but in a Galera (PXC) cluster, and it does not work well with MyISAM tables.
So I was wondering if using ES would be a solution, but it seems not. :(
Wikimedia is considering another stab at Elasticsearch because we just hit a fairly serious scalability issue with MyISAM (https://phabricator.wikimedia.org/T146673), and it looks like InnoDB isn't a lot better. In addition, the MySQL fulltext engines don't seem to handle stemming or even simple prefix matching.
With InnoDB now mostly finished indexing, search seems to be working and the service isn't getting killed by locked tables. More info at the link I pasted above ^
We're planning to put application-level stemming on the MySQL index (see T6740). Are there problems with the InnoDB search that you're aware of that aren't fixable with application adjustments?
It's probably very easy for us to make bin/storage adjust use InnoDB FULLTEXT instead of MyISAM FULLTEXT if it's available.
(I'm not sure what "simple prefix matching" means exactly in terms of search, can you give me an example? Maybe that's such a feature.)
@epriestley:
So far InnoDB is working. I can't really say whether it's worse than MyISAM; it seems pretty similar. Stemming would help a lot, but my main complaint is that the ranking algorithm seems really bad. It doesn't return the most relevant results first. MyISAM had the same problem, though.
What I meant by simple prefix matching is simply that a search for, e.g., "HTTPS" does not match "HTTPSFuture". Of course exact matches should come before prefix matches, but it'd be nice if it could do prefix/suffix matching. Stemming would take care of the most common use cases, but when searching for commits rather than tasks I often find myself wanting a slightly fuzzier search.
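To make the requested behavior concrete, here is a minimal Python sketch of prefix matching with exact matches ranked first. It is only an illustration of the idea; `rank_matches` is a hypothetical helper, not anything in Phabricator or MySQL, and it works on a plain list of terms rather than a real index.

```python
def rank_matches(query, terms):
    """Return terms that match the query, exact matches before prefix matches.

    Matching is case-insensitive. Terms that merely share a shorter prefix
    with the query (e.g. "HTTP" for the query "HTTPS") are not returned.
    """
    q = query.lower()
    exact = [t for t in terms if t.lower() == q]
    prefix = [t for t in terms if t.lower().startswith(q) and t.lower() != q]
    return exact + prefix


print(rank_matches("HTTPS", ["HTTPSFuture", "HTTP", "https", "Conduit"]))
```

With this ordering, a search for "HTTPS" would surface "HTTPSFuture" after any exact hit instead of dropping it.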
So, @epriestley: How do you feel about me taking over the Elasticsearch backend and moving it out to an extension rather than keeping it in Phabricator core? The other possibility is that I could just fork the Elasticsearch backend and name it something different, as it seems Phabricator already supports loading the engine extension dynamically.
Broadly, I'm supportive.
I'd maybe like to give you a cleaner way to separate the engines, though -- I think it would be somewhat tough to pull ElasticSearch out right now without leaving some pieces behind in the upstream. In particular, there's some hardcoded stuff like this:
  if (elastic_search_is_configured()) {
    use_elastic_search();
  } else {
    use_mysql_search();
  }
That's probably not hugely difficult to fix up, but I think there are 4-5 of those that need adjustment.
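One way such hardcoded checks could be replaced is with a list of engine classes that each report whether they are usable. The Python sketch below is an assumption about the shape of that refactoring, not the upstream code: `SearchEngine`, `MySQLEngine`, `ElasticEngine`, `ENGINE_CLASSES`, and `select_engine` are all hypothetical names, and the only identifier taken from this thread is the `elasticsearch.host` config key.

```python
class SearchEngine:
    """Hypothetical base class: each engine decides whether it is usable."""
    name = "base"

    def __init__(self, config):
        self.config = config

    def is_enabled(self):
        raise NotImplementedError


class MySQLEngine(SearchEngine):
    name = "mysql"

    def is_enabled(self):
        # The built-in engine is assumed to be always available.
        return True


class ElasticEngine(SearchEngine):
    name = "elasticsearch"

    def is_enabled(self):
        # Assumption: the engine counts as configured when a host is set.
        return bool(self.config.get("elasticsearch.host"))


# An extension would add its class to this list instead of editing call sites.
# Order expresses preference; the built-in fallback goes last.
ENGINE_CLASSES = [ElasticEngine, MySQLEngine]


def select_engine(config):
    """Return the first engine that reports itself as enabled."""
    for cls in ENGINE_CLASSES:
        engine = cls(config)
        if engine.is_enabled():
            return engine
    raise RuntimeError("no search engine is enabled")


print(select_engine({}).name)
print(select_engine({"elasticsearch.host": "http://localhost:9200"}).name)
```

Call sites then ask for "the selected engine" and never mention a specific backend, which is what would let the ElasticSearch code move out of the upstream.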
It would also be nice to be able to configure each engine to receive writes independently so that it's possible to keep the indexes in multiple engines up to date. I don't think there's a super strong use case for this, but it should be nearly free to do if everything else is modularized cleanly, and make stuff like "try ElasticSearch and see how it goes" a lot easier.
Ideally, maybe this panel (in Applications → Search) would expand so you can do these things:
- Enable/disable writes on each engine; we write to all engines with writes enabled.
- Configure options on each engine, like elasticsearch.host -- although this gets tricky because that config is locked, and dangerous to let administrators edit from the web UI, since it massively violates policies to send search indexes to an arbitrary host.
- Choose which engine reads go to (or: enable reads for multiple engines, and have them fail over sequentially if one is down? Probably very YAGNI).
- Maaaybe add multiple copies of engines like ElasticSearch, so you could, e.g., migrate from a v2 cluster to a v3 cluster without downtime by bringing up the second cluster, enabling double-writes, running the index, then swapping them. Probably very YAGNI again.
That's a whole lot of work, though, and those use cases are pretty far-future / hypothetical / completely-imagined. But if we don't support them now, we won't be able to later without backward compatibility breaks.
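As a rough illustration of the first and third items (per-engine write flags and sequential read failover), here is a small Python sketch. Everything in it is hypothetical: `InMemoryEngine` is a stand-in that stores documents in a dict, and `SearchCluster` is an invented coordinator, not a Phabricator class.

```python
class EngineUnavailable(Exception):
    """Raised when an engine cannot be reached."""


class InMemoryEngine:
    """Stand-in for a real engine; keeps documents in a dict."""

    def __init__(self, name, writable=True, up=True):
        self.name = name
        self.writable = writable  # whether this engine receives writes
        self.up = up              # simulates the engine being reachable
        self.docs = {}

    def index(self, doc_id, text):
        if not self.up:
            raise EngineUnavailable(self.name)
        self.docs[doc_id] = text

    def search(self, query):
        if not self.up:
            raise EngineUnavailable(self.name)
        q = query.lower()
        return sorted(d for d, t in self.docs.items() if q in t.lower())


class SearchCluster:
    def __init__(self, engines, read_order):
        self.engines = engines
        self.read_order = read_order  # engine names, tried in sequence

    def index(self, doc_id, text):
        # Write to every engine that has writes enabled.
        for engine in self.engines:
            if engine.writable:
                engine.index(doc_id, text)

    def search(self, query):
        # Fail over to the next engine if the preferred one is down.
        by_name = {e.name: e for e in self.engines}
        for name in self.read_order:
            try:
                return by_name[name].search(query)
            except EngineUnavailable:
                continue
        raise EngineUnavailable("all read engines are down")


mysql = InMemoryEngine("mysql")
elastic = InMemoryEngine("elasticsearch")
cluster = SearchCluster([mysql, elastic], ["elasticsearch", "mysql"])
cluster.index("doc-1", "Support ElasticSearch 2.0")
elastic.up = False
print(cluster.search("elasticsearch"))
```

Because both engines received the write, the read still succeeds from the second engine after the first goes down. The double-write migration in the last item would be the same mechanism with two copies of one engine type.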
@epriestley: This all sounds excellent. I'll start by fixing any instances of
if (elastic_search_is_configured()) { ... }
And I agree that it would be really great if multiple engines could receive writes at the same time; that would make migrating a lot easier.
Another thing I noticed, which I hacked my way around in the Wikimedia fork, is that the back-end is completely unaware of the user context when searching. That means the back-end can't do smart things like boosting objects that the user has interacted with in the past.
@epriestley: Would you be opposed to giving the back-ends access to the user's PHID during a search? I assume that you omitted that for a reason, e.g. you wanted to keep the back-ends as dumb as possible and leave the smarts to Phabricator. I may be entirely misguided in trying to make Elasticsearch do smart context-aware things, but so far it seems to work well enough without making too much of a mess architecturally.
Completely fine to make the engine aware of the viewer running the query, I think we just didn't have use cases for that when this stuff was written.
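For illustration, viewer-aware ranking could look roughly like the Python sketch below. This is an assumption about one possible approach, not the Wikimedia fork's implementation: `rank_for_viewer`, the `participants` field, the multiplicative `boost`, and the example PHIDs are all hypothetical.

```python
def rank_for_viewer(results, viewer_phid, boost=2.0):
    """Sort results by score, boosting objects the viewer has interacted with.

    Each result is a dict with 'phid', 'score', and 'participants'
    (a set of user PHIDs). The boost factor is an arbitrary example value.
    """
    def boosted(result):
        factor = boost if viewer_phid in result["participants"] else 1.0
        return result["score"] * factor

    return sorted(results, key=boosted, reverse=True)


results = [
    {"phid": "PHID-TASK-a", "score": 1.0, "participants": {"PHID-USER-x"}},
    {"phid": "PHID-TASK-b", "score": 1.5, "participants": set()},
]
print([r["phid"] for r in rank_for_viewer(results, "PHID-USER-x")])
```

The engine only needs the viewer's PHID passed in with the query; whether the boost is applied by the backend or in application code is a separate design choice.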