Details

Reviewers

epriestley

Group Reviewers

Blessed Reviewers

Maniphest Tasks

T9893: Support ElasticSearch 2.0 - 5.1

Commits

rPe41c25de5050: Support multiple fulltext search clusters with 'cluster.search' config

Summary

The goal is to make fulltext search back-ends more extensible, configurable and robust.

When this is finished it will be possible to have multiple search storage back-ends and
potentially multiple instances of each.

Individual instances can be configured with roles such as 'read', 'write' which control
which hosts will receive writes to the index and which hosts will respond to queries.

These two roles make it possible to have any combination of:

read-only
write-only
read-write
disabled

This 'roles' mechanism is extensible to add new roles should that be needed in the future.

In addition to supporting multiple elasticsearch and mysql search instances, this refactors
the connection health monitoring infrastructure from PhabricatorDatabaseHealthRecord and
utilizes the same system for monitoring the health of elasticsearch nodes. This will
allow Wikimedia's phabricator to be redundant across data centers (mysql already is,
elasticsearch should be as well).

The real-world use-case I have in mind here is writing to two indexes (two elasticsearch clusters
in different data centers) but reading from only one. Then toggling the 'read' property when
we want to migrate to the other data center (and when we migrate from elasticsearch 2.x to 5.x)

Hopefully this is useful in the upstream as well.

Remaining TODO:

test cases
documentation

Test Plan

This will most likely require the elasticsearch index to be deleted and re-created due to schema changes.

Tested with elasticsearch versions 2.4 and 5.2 using the following config:

  "cluster.search": [
    {
      "type": "elasticsearch",
      "hosts": [
        {
          "host": "localhost",
          "roles": { "read": true, "write": true }
        }
      ],
      "port": 9200,
      "protocol": "http",
      "path": "/phabricator",
      "version": 5
    },
    {
      "type": "mysql",
      "roles": { "write": true }
     }
  ]

Also deployed the same changes to Wikimedia's production Phabricator instance without any issues whatsoever.

Diff Detail

Repository

rP Phabricator

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Added index stas to status ui
Separate mysql status from elasticsearch status and show different set of columns appropriate to each cluster type.

Harbormaster completed remote builds in B15961: Diff 42055.Mar 10 2017, 3:49 PM

I'm pleased to report that this has been live on wikimedia's phabricator for about a week without any incidents whatsoever. Additionally, we are in the process of migrating from elasticsearch 2.x to 5.x and the ability to write to multiple clusters is really working out nicely for transition.

Move the stats definitions into the engine so the status UI remains engine agnostic.
Fix a bug where role => false was being treated like role => true in the UI

Harbormaster completed remote builds in B15998: Diff 42091.Mar 16 2017, 3:26 PM

Harbormaster completed remote builds in B15998: Diff 42091.

20after4 edited the test plan for this revision. (Show Details)Mar 16 2017, 3:27 PM

20after4 mentioned this in D17509: Updated PhabricatorElasticFulltextStorageEngine for elasticsearch 5.Mar 20 2017, 12:09 PM

20after4 added a parent revision: D17509: Updated PhabricatorElasticFulltextStorageEngine for elasticsearch 5.Mar 20 2017, 12:10 PM

rebased on top of D17509: Updated PhabricatorElasticFulltextStorageEngine for elasticsearch 5

Harbormaster completed remote builds in B16032: Diff 42116.Mar 20 2017, 12:20 PM

20after4 marked 8 inline comments as done.Mar 20 2017, 12:26 PM

20after4 added inline comments.

src/applications/config/check/PhabricatorExtraConfigSetupCheck.php
356–359	I guess they were WMF enhancements.
src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
62–110	see D17509: Updated PhabricatorElasticFulltextStorageEngine for elasticsearch 5

Just to make sure I haven't missed anything:

We currently write health checks but never read them, right? So there's no effect (other than the UI "Status" changing) when a service fails health checks? That seems fine for now, I just want to make sure I didn't miss a health check read somewhere.
This references a "Cluster: Search" document which hasn't been written yet (also fine, just making sure I didn't miss it). We should write this document before promoting to stable on Friday at the latest (ideally, hold this until that document is ready).
Each Host has its own Engine instance. I could imagine that in the future we might possibly want to change this so that each Cluster/Service has an Engine, and all hosts are passed the same Engine. This would let an Engine do more-stateful things by knowing about all hosts in a service. But I think this is fine as written for now (we have no use cases for anything fancier yet), and fairly easy to change in the future if we run into situations where we'd benefit form tweaking architecture.
I think the static $cache (see inline) thing is still a bug; pretty much everything else here is style stuff / edge cases / suggestions.

src/applications/config/check/PhabricatorExtraConfigSetupCheck.php
202	For consistency, prefer ending the sentence with a period.
356–359	Let's get rid of them in this patch, then.
src/applications/config/controller/PhabricatorConfigClusterSearchController.php
103–104	Unconventional indent formatting.
123–129	Remove, unused?
131–137	Remove, unused?
139–145	Remove, unused?
src/applications/config/option/PhabricatorClusterConfigOptions.php
43	Maybe "storage services" instead of "storage clusters"? I think that's slightly more consistent with other teminology.
44	Maybe just get rid of the MySQL/Elastic mention, since we'll probably forget to update this if/when we add more support and third-parties can add support for other engines but can't easily update this string (and maybe we'll spin out Elastic into an extension after Packages actually works). If you prefer to name the services explicitly, maybe say something like "MySQL, ElasticSearch, or other engines" so we have more room to add or remove support in the future without this text becoming misleading.
src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
479–488	This cache should still get nuked, I think?
src/infrastructure/cluster/search/PhabricatorSearchCluster.php
3 ↗	(On Diff #42116)	This can be `final`, right? This is enough of a pain that I probably wouldn't fix it here, but `PhabricatorSearchService` might be a more clear/consistent name for this class (e.g., the primary access method is `::getAllServices()`, but as named it returns a list of "cluster" objects, not a list of "service" objects).
181–184 ↗	(On Diff #42116)	Unconventional indent fomatting.
198–200 ↗	(On Diff #42116)	In this codebase, instead of "comment that this is impossible + continue", strongly prefer "throw".
247 ↗	(On Diff #42116)	This is largely philosophical, but two other approaches you could take here: Catch all of the exceptions, and throw `PhutilAggregateException` instead (probably best, but you lose the ability to catch specific exceptions). If you're throwing only one exception, consider throwing the first exception, not the last exception. I think the first exception tends to be the most relevant one more often (sometimes subsequent exceptions are really symptoms of the first exception, but the first exception is rarely/never a symptom of a later exception). But `AggregateException` is probably cleaner if no caller is trying to catch specific types of exceptions.

This revision is now accepted and ready to land.Mar 20 2017, 1:20 PM

In D17384#209987, @epriestley wrote:

Just to make sure I haven't missed anything:

We currently write health checks but never read them, right? So there's no effect (other than the UI "Status" changing) when a service fails health checks? That seems fine for now, I just want to make sure I didn't miss a health check read somewhere.

Yeah, I want to implement the health checks to at least avoid the wasted attempts to query a server that's known to be down. I just hadn't gotten to it yet.

This references a "Cluster: Search" document which hasn't been written yet (also fine, just making sure I didn't miss it). We should write this document before promoting to stable on Friday at the latest (ideally, hold this until that document is ready).

Noted.

Each Host has its own Engine instance. I could imagine that in the future we might possibly want to change this so that each Cluster/Service has an Engine, and all hosts are passed the same Engine. This would let an Engine do more-stateful things by knowing about all hosts in a service. But I think this is fine as written for now (we have no use cases for anything fancier yet), and fairly easy to change in the future if we run into situations where we'd benefit form tweaking architecture.

Agreed, it would make sense for the engine to be aware of hosts (and their health status). I might have gone that direction but this is already a large change and I'm trying to avoid making more major changes to the engine class. I'm especially avoiding more changes that would necessitate more refactoring of the mysql engine in order to maintain parity between the two.

I think the static $cache (see inline) thing is still a bug; pretty much everything else here is style stuff / edge cases / suggestions.

src/infrastructure/cluster/search/PhabricatorSearchCluster.php
3 ↗	(On Diff #42116)	This can be final, right? Yeah, I believe so. This is enough of a pain that I probably wouldn't fix it here, but PhabricatorSearchService might be a more clear/consistent name for this class Naming things is hard. It wouldn't be too much of a pain to change it, if you want me to.
198–200 ↗	(On Diff #42116)	Noted. I'll fix that.
247 ↗	(On Diff #42116)	I like the idea of aggregate exception, though the relevant failure case is almost always going to be network related. Either something from CURL failing to make the connection, DNS failure, or some low level network error. In this case we should probably show the user an error along the lines of "Unable to reach any search services, try again later...", irrespective of which exception is encountered.

Ok I've reworked this quite a bit and I may have messed up somewhere in the process.

Now the engine is more responsible for knowing about hosts and subclassing the
engine should be sufficient to customize the behavior without touching the built
in cluster infrastructure. This should make separation between upstream and
downstream implementations a lot cleaner in the long run.

I also renamed PhabricatorSearchCluster to PhabricatorSearchService.

Still TODO: write a page of docs about how to configure cluster.search

Harbormaster failed remote builds in B16078: Diff 42163!Mar 22 2017, 5:27 AM

Harbormaster failed remote builds in B16078: Diff 42163!

Cleaned up the elastic query and added comments describing the purpose of the clauses
a couple of bugfixes found by further testing

Harbormaster failed remote builds in B16079: Diff 42164!Mar 22 2017, 6:43 AM

Note: I'm not sure why harbormaster is failing?

address review feedback that I hadn't gotten to yet.

Harbormaster failed remote builds in B16080: Diff 42165!Mar 22 2017, 6:53 AM

Get rid of static.

Harbormaster failed remote builds in B16081: Diff 42166!Mar 22 2017, 7:00 AM

Ok I think I've eliminated the problematic parts like indexing project slugs.

Also, this revision no longer depends on D17509

20after4 removed a parent revision: D17509: Updated PhabricatorElasticFulltextStorageEngine for elasticsearch 5.Mar 22 2017, 7:03 AM

20after4 added inline comments.Mar 22 2017, 7:07 AM

src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
170–175	Hopefully this makes more sense now?

20after4 added inline comments.Mar 22 2017, 7:13 AM

src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
164	The main 'must' clause looks for all words to match against all fields, but the words don't have to be in order and it filters for stopwords, stemming, etc.
178–180	the `should` clause boosts the score of documents which have an exact match on one or more of these fields. The exact match is due to `analyzer` => `english_exact` and the boosting is due to the nature of adding a should clause to a bool query in elasticsearch.

Fix method signature un-final PhabricatorElasticFulltextStorageEngine

Harbormaster failed remote builds in B16098: Diff 42183!Mar 22 2017, 11:59 PM

epriestley mentioned this in T12443: Applying fulltext limits first causes missing results.Mar 23 2017, 1:48 AM

Fix searching relationships which I had inadvertantly broken.
Better elasticsearch 2.x and 5.x support
more optimized query

Harbormaster failed remote builds in B16103: Diff 42188!Mar 23 2017, 12:58 PM

Created diviner documentation: Cluster: Search
removed stray phlog

Harbormaster failed remote builds in B16106: Diff 42191!Mar 23 2017, 3:29 PM

@epriestley: I think this is ready to land but I want to give you one more chance to change your mind.

I've included a basic outline of the documentation in diviner, as well as various other improvements since you accepted this revision.

Would you like me to land this now or wait until after the promote to stable on Friday?

Haha, thanks. Let me finish testing one other change and then I'll give this another look.

I don't see any noteworthy issues here.

master has a bunch of sketchy stuff in it already so let's hold this until after it promotes (in ~30 hours), but I think this is good to go after that. I'll kick the tires a bit more thoroughly once it lands and can counter-diff you if I catch any other issues, but this looks like it's in pretty good shape to me.

@epriestley sweet, I'll land this as soon as I see that you've merged to stable.