
Provide some guidance about elasticsearch in cluster docs
AbandonedPublic

Authored by epriestley on Mar 28 2017, 10:06 PM.
Tags
None
Subscribers

Details

Reviewers
20after4
Group Reviewers
Blessed Reviewers
Maniphest Tasks
T12450: New Search Configuration Errata
Summary

This is my attempt at writing an overview of the issues around
scalability & availability. I still need to add more detail,
especially the specifics about versions supported, but I'll add
that in another diff. Before I write that part, I want to do
some testing with Elasticsearch 1.x, assuming I can even get it
running locally.

Test Plan

Read the docs.

Diff Detail

Repository
rP Phabricator
Branch
master
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 16209
Build 21606: Run Core Tests
Build 21522: Run Core Tests
Build 21521: arc lint + arc unit

Event Timeline

A mild concern I have about this guidance is that, to my knowledge, we've only seen this scaling issue at WMF, and the failure was inexplicable (e.g., 10x our index size but 1000x the query time, or whatever the numbers were).

I currently have 250K documents in my local MySQL index with fast fulltext queries, and I believe I've had 2M+ in the past. The local size of the entire index (not just keys) appears to be <200MB, so "the index falls out of RAM" shouldn't explain it.

It's possible that every other large install has moved to ElasticSearch, but given how generally broken it was for a very long time I don't think this is the case.

I feel like we never really established why the WMF MySQL index was so slow (for example: MariaDB bug?) so I'm hesitant to use it to motivate scaling guidance. We've seen other cases (notably T8588) where only one install sees some sort of weird performance issue but others do not.

Historically, we've seen installs switch to ElasticSearch when they didn't need to, out of a sort of general fear and uncertainty, and then have a harder time upgrading and maintaining Phabricator as a result, so I want to be cautious about how heavily we push ElasticSearch.

But I'm planning to do a touch-up pass on the docs anyway and can tweak the wording here when I do.

src/docs/user/cluster/cluster.diviner
50

I'd call this "No Risk" for loss resistance, since there's no data stored only in the search index (bin/search init + bin/search index can always completely rebuild it even if all the data from the fulltext index has been completely lost).

(And maybe "Moderate" for availability, since Repositories only get a "Moderate" -- the other two "High" values make Phabricator 100% inaccessible if they go down.)
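The rebuild procedure mentioned above can be sketched roughly as follows. The two `bin/search` subcommands are the ones named in this review; the working directory and the `--all` flag are assumptions about a typical Phabricator install, not something specified here.

```shell
# Sketch of a full rebuild of the fulltext index after total loss.
# Assumes Phabricator is installed at /path/to/phabricator.
cd /path/to/phabricator

# Recreate the index structure in the configured search service.
./bin/search init

# Reindex all documents from the primary MySQL data, which remains
# the source of truth -- nothing lives only in the search index.
./bin/search index --all
```

Because the fulltext index is fully derivable from MySQL, losing it costs reindexing time, not data, which is the basis for the "No Risk" loss-resistance rating suggested above.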

This revision is now accepted and ready to land. Mar 30 2017, 3:18 PM

I figured out that one component of the WMF scaling issue was caused by the very pathological case of searching for a word that appears in very many documents, exacerbated by lots of simultaneous queries fired off by users repeatedly searching from the typeahead in the related tasks editor. I agree that we still don't know the exact cause of the 100x slowdown.

(I'm going to pull this so I can reference it in the guidance task; I'll send you a counter-diff for wordsmithing once I write things up.)

epriestley edited reviewers, added: 20after4; removed: epriestley.

Oh, this snuck into rP654f0f6043f.

This revision now requires review to proceed. Apr 2 2017, 6:18 PM