This is my attempt at writing an overview of the issues around
scalability and availability. I still need to add more detail,
especially the specifics about which versions are supported, but I'll
add that in another diff. Before I write that part, I want to do
some testing with Elasticsearch 1.x, assuming I can even get it
running locally.
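For reference, a minimal sketch of bringing up a local Elasticsearch 1.x node for that testing; it assumes a 1.x tarball (e.g. a 1.7.x release) has already been downloaded and that a Java 7+ runtime is on the PATH:

```
# Unpack the 1.x release and run it in the foreground on the default port (9200).
tar -xzf elasticsearch-1.7.6.tar.gz
cd elasticsearch-1.7.6
./bin/elasticsearch        # add -d to run as a daemon instead

# From another terminal, confirm the node is up and check the reported version.
curl http://localhost:9200/
```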
Details
- Reviewers: 20after4
- Group Reviewers: Blessed Reviewers
- Maniphest Tasks: T12450: New Search Configuration Errata
- Test Plan: read the docs
Diff Detail
- Repository: rP Phabricator
- Branch: master
- Lint: Lint Passed
- Unit: No Test Coverage
- Build Status: Buildable 16209
  - Build 21606: Run Core Tests
  - Build 21522: Run Core Tests
  - Build 21521: arc lint + arc unit
Event Timeline
A mild concern I have about this guidance is that, to my knowledge, we've only seen this scaling issue at WMF, and the failure was inexplicable (e.g., 10x our index size but 1000x the query time, or whatever the numbers were).
I currently have 250K documents in my local MySQL index with fast fulltext queries, and I believe I've had 2M+ in the past. The local size of the entire index (not just keys) appears to be <200MB, so "the index falls out of RAM" shouldn't explain it.
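If anyone wants to reproduce that measurement, one way to eyeball it from MySQL's own bookkeeping (a sketch; the phabricator_search schema name is the stock default and is an assumption here, so adjust it for installs with a custom storage namespace):

```
# Report on-disk data and index sizes, in MB, for the fulltext search tables.
mysql -e "
  SELECT TABLE_NAME,
         ROUND(DATA_LENGTH  / 1024 / 1024, 1) AS data_mb,
         ROUND(INDEX_LENGTH / 1024 / 1024, 1) AS index_mb
  FROM information_schema.TABLES
  WHERE TABLE_SCHEMA = 'phabricator_search';"
```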
It's possible that every other large install has moved to Elasticsearch, but given how generally broken it was for a very long time, I don't think that's the case.
I feel like we never really established why the WMF MySQL index was so slow (for example: a MariaDB bug?), so I'm hesitant to use it to motivate scaling guidance. We've seen other cases (notably T8588) where only one install sees some sort of weird performance issue that others do not.
Historically, we've seen installs switch to Elasticsearch when they didn't need to, out of a sort of general fear and uncertainty, and then have a harder time upgrading and maintaining Phabricator as a result, so I want to be cautious about how heavily we push Elasticsearch.
But I'm planning to do a touch-up pass on the docs anyway and can tweak the wording here when I do.
src/docs/user/cluster/cluster.diviner
Line 50: I'd call this "No Risk" for loss resistance, since there's no data stored only in the search index (bin/search init + bin/search index can always completely rebuild it even if all the data from the fulltext index has been completely lost). (And maybe "Moderate" for availability, since Repositories only get a "Moderate" -- the other two "High" values make Phabricator 100% inaccessible if they go down.)
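For reference, the rebuild that comment describes is just the following (a sketch; flag behavior may vary slightly by Phabricator version, and a large install will want to let the daemons work through the resulting queue):

```
# Recreate the index structure for whichever fulltext service is configured,
# then reindex every document from the primary MySQL data. This is safe to
# re-run: nothing lives only in the search index.
./bin/search init
./bin/search index --all
```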
I figured out that one component of the WMF scaling issue was the very pathological case of searching for a word that appears in very many documents, exacerbated by lots of simultaneous queries fired off by users repeatedly searching from the typeahead in the related-tasks editor. I agree that we still don't know the exact cause of the 100x slowdown.
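For context, that pathological shape is easy to approximate locally by firing a handful of common-word fulltext searches in parallel. This is only a sketch: the phabricator_search schema, the search_documentfield table, and the corpus column are assumptions based on a stock install, and the MATCH() column list has to match the table's actual fulltext key.

```
# One pathological query: a boolean-mode fulltext search for a very common term.
QUERY="SELECT COUNT(*) FROM search_documentfield
       WHERE MATCH(corpus) AGAINST ('+bug*' IN BOOLEAN MODE);"

# Fire 20 of them at once to roughly imitate typeahead traffic.
for i in $(seq 1 20); do
  mysql phabricator_search -e "$QUERY" &
done
wait
```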
(I'm going to pull this in so I can reference it in the guidance task; I'll counter-diff you for wordsmithing once I write things up.)