This is my attempt at writing an overview of the issues around
scalability and availability. I still need to add more detail,
especially the specifics about which versions are supported, but I'll
add that in another diff. Before I write that part, I want to do
some testing with Elasticsearch 1.x, assuming I can even get it
running locally.
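For reference, a minimal sketch of bringing up a local Elasticsearch 1.x node for that testing; it assumes a 1.x tarball (e.g. a 1.7.x release) has already been downloaded and that a Java 7+ runtime is on the PATH:

```
# Unpack the 1.x release and run it in the foreground on the default port (9200).
tar -xzf elasticsearch-1.7.6.tar.gz
cd elasticsearch-1.7.6
./bin/elasticsearch        # add -d to run as a daemon instead

# From another terminal, confirm the node is up and check the reported version.
curl http://localhost:9200/
```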
Details
- Reviewers: 20after4
- Group Reviewers: Blessed Reviewers
- Maniphest Tasks: T12450: New Search Configuration Errata
- Test Plan: read the docs
Diff Detail
- Repository: rP Phabricator
- Branch: master
- Lint: Lint Passed
- Unit: No Test Coverage
- Build Status: Buildable 16209
  - Build 21606: Run Core Tests
  - Build 21522: Run Core Tests
  - Build 21521: arc lint + arc unit
Event Timeline
A mild concern I have about this guidance is that, to my knowledge, we've only seen this scaling issue at WMF, and the failure was inexplicable (e.g., 10x our index size but 1000x the query time, or whatever the numbers were).
I currently have 250K documents in my local MySQL index with fast fulltext queries, and I believe I've had 2M+ in the past. The local size of the entire index (not just keys) appears to be <200MB, so "the index falls out of RAM" shouldn't explain it.
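If anyone wants to reproduce that measurement, one way to eyeball it from MySQL's own bookkeeping (a sketch; the phabricator_search schema name is the stock default and is an assumption here, so adjust it for installs with a custom storage namespace):

```
# Report on-disk data and index sizes, in MB, for the fulltext search tables.
mysql -e "
  SELECT TABLE_NAME,
         ROUND(DATA_LENGTH  / 1024 / 1024, 1) AS data_mb,
         ROUND(INDEX_LENGTH / 1024 / 1024, 1) AS index_mb
  FROM information_schema.TABLES
  WHERE TABLE_SCHEMA = 'phabricator_search';"
```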
It's possible that every other large install has moved to Elasticsearch, but given how generally broken it was for a very long time, I don't think that's the case.
I feel like we never really established why the WMF MySQL index was so slow (for example: a MariaDB bug?), so I'm hesitant to use it to motivate scaling guidance. We've seen other cases (notably T8588) where only one install sees some sort of weird performance issue that others do not.
Historically, we've seen installs switch to Elasticsearch when they didn't need to, out of a sort of general fear and uncertainty, and then have a harder time upgrading and maintaining Phabricator as a result, so I want to be cautious about how heavily we push Elasticsearch.
But I'm planning to do a touch-up pass on the docs anyway and can tweak the wording here when I do.
src/docs/user/cluster/cluster.diviner
Line 50: I'd call this "No Risk" for loss resistance, since there's no data stored only in the search index (bin/search init + bin/search index can always completely rebuild it even if all the data from the fulltext index has been completely lost). (And maybe "Moderate" for availability, since Repositories only get a "Moderate" -- the other two "High" values make Phabricator 100% inaccessible if they go down.)
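For reference, the rebuild that comment describes is just the following (a sketch; flag behavior may vary slightly by Phabricator version, and a large install will want to let the daemons work through the resulting queue):

```
# Recreate the index structure for whichever fulltext service is configured,
# then reindex every document from the primary MySQL data. This is safe to
# re-run: nothing lives only in the search index.
./bin/search init
./bin/search index --all
```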
I figured out that one component of the WMF scaling issue was the very pathological case of searching for a word that appears in very many documents, exacerbated by lots of simultaneous queries fired off by users repeatedly searching from the typeahead in the related-tasks editor. I agree that we still don't know the exact cause of the 100x slowdown.
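For context, that pathological shape is easy to approximate locally by firing a handful of common-word fulltext searches in parallel. This is only a sketch: the phabricator_search schema, the search_documentfield table, and the corpus column are assumptions based on a stock install, and the MATCH() column list has to match the table's actual fulltext key.

```
# One pathological query: a boolean-mode fulltext search for a very common term.
QUERY="SELECT COUNT(*) FROM search_documentfield
       WHERE MATCH(corpus) AGAINST ('+bug*' IN BOOLEAN MODE);"

# Fire 20 of them at once to roughly imitate typeahead traffic.
for i in $(seq 1 20); do
  mysql phabricator_search -e "$QUERY" &
done
wait
```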
(I'm going to pull this in so I can reference it in the guidance task; I'll counter-diff you for wordsmithing once I write things up.)