Page MenuHomePhabricator

Support multiple fulltext search clusters with 'cluster.search' config
ClosedPublic

Authored by 20after4 on Feb 18 2017, 7:55 AM.
Referenced Files
F14046057: D17384.diff
Wed, Nov 13, 5:32 PM
F14039721: D17384.id41877.diff
Mon, Nov 11, 6:14 AM
F14039695: D17384.id41877.diff
Mon, Nov 11, 6:12 AM
F14037069: D17384.diff
Sun, Nov 10, 1:41 PM
F14032437: D17384.id.diff
Sat, Nov 9, 2:25 PM
F14032158: D17384.id.diff
Sat, Nov 9, 1:08 PM
F14021197: D17384.diff
Wed, Nov 6, 6:08 AM
F14017455: D17384.id41973.diff
Mon, Nov 4, 5:14 PM
Subscribers

Details

Summary

The goal is to make fulltext search back-ends more extensible, configurable and robust.

When this is finished it will be possible to have multiple search storage back-ends and
potentially multiple instances of each.

Individual instances can be configured with roles such as 'read', 'write' which control
which hosts will receive writes to the index and which hosts will respond to queries.

These two roles make it possible to have any combination of:

  • read-only
  • write-only
  • read-write
  • disabled

This 'roles' mechanism is extensible to add new roles should that be needed in the future.

In addition to supporting multiple elasticsearch and mysql search instances, this refactors
the connection health monitoring infrastructure from PhabricatorDatabaseHealthRecord and
utilizes the same system for monitoring the health of elasticsearch nodes. This will
allow Wikimedia's phabricator to be redundant across data centers (mysql already is,
elasticsearch should be as well).

The real-world use-case I have in mind here is writing to two indexes (two elasticsearch clusters
in different data centers) but reading from only one. Then toggling the 'read' property when
we want to migrate to the other data center (and when we migrate from elasticsearch 2.x to 5.x)

Hopefully this is useful in the upstream as well.

Remaining TODO:

  • test cases
  • documentation
Test Plan
This will most likely require the elasticsearch index to be deleted and re-created due to schema changes.

Tested with elasticsearch versions 2.4 and 5.2 using the following config:

  "cluster.search": [
    {
      "type": "elasticsearch",
      "hosts": [
        {
          "host": "localhost",
          "roles": { "read": true, "write": true }
        }
      ],
      "port": 9200,
      "protocol": "http",
      "path": "/phabricator",
      "version": 5
    },
    {
      "type": "mysql",
      "roles": { "write": true }
     }
  ]

Also deployed the same changes to Wikimedia's production Phabricator instance without any issues whatsoever.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
  • Added index stas to status ui
  • Separate mysql status from elasticsearch status and show different set of columns appropriate to each cluster type.

I'm pleased to report that this has been live on wikimedia's phabricator for about a week without any incidents whatsoever. Additionally, we are in the process of migrating from elasticsearch 2.x to 5.x and the ability to write to multiple clusters is really working out nicely for transition.

  • Move the stats definitions into the engine so the status UI remains engine agnostic.
  • Fix a bug where role => false was being treated like role => true in the UI
20after4 added inline comments.
src/applications/config/check/PhabricatorExtraConfigSetupCheck.php
356–359

I guess they were WMF enhancements.

src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
62–110

Just to make sure I haven't missed anything:

  • We currently write health checks but never read them, right? So there's no effect (other than the UI "Status" changing) when a service fails health checks? That seems fine for now, I just want to make sure I didn't miss a health check read somewhere.
  • This references a "Cluster: Search" document which hasn't been written yet (also fine, just making sure I didn't miss it). We should write this document before promoting to stable on Friday at the latest (ideally, hold this until that document is ready).
  • Each Host has its own Engine instance. I could imagine that in the future we might possibly want to change this so that each Cluster/Service has an Engine, and all hosts are passed the same Engine. This would let an Engine do more-stateful things by knowing about all hosts in a service. But I think this is fine as written for now (we have no use cases for anything fancier yet), and fairly easy to change in the future if we run into situations where we'd benefit form tweaking architecture.
  • I think the static $cache (see inline) thing is still a bug; pretty much everything else here is style stuff / edge cases / suggestions.
src/applications/config/check/PhabricatorExtraConfigSetupCheck.php
202

For consistency, prefer ending the sentence with a period.

356–359

Let's get rid of them in this patch, then.

src/applications/config/controller/PhabricatorConfigClusterSearchController.php
103–104

Unconventional indent formatting.

123–129

Remove, unused?

131–137

Remove, unused?

139–145

Remove, unused?

src/applications/config/option/PhabricatorClusterConfigOptions.php
43

Maybe "storage services" instead of "storage clusters"? I think that's slightly more consistent with other teminology.

44

Maybe just get rid of the MySQL/Elastic mention, since we'll probably forget to update this if/when we add more support and third-parties can add support for other engines but can't easily update this string (and maybe we'll spin out Elastic into an extension after Packages actually works). If you prefer to name the services explicitly, maybe say something like "MySQL, ElasticSearch, or other engines" so we have more room to add or remove support in the future without this text becoming misleading.

src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
479–488

This cache should still get nuked, I think?

src/infrastructure/cluster/search/PhabricatorSearchCluster.php
3 ↗(On Diff #42116)

This can be final, right?

This is enough of a pain that I probably wouldn't fix it here, but PhabricatorSearchService might be a more clear/consistent name for this class (e.g., the primary access method is ::getAllServices(), but as named it returns a list of "cluster" objects, not a list of "service" objects).

181–184 ↗(On Diff #42116)

Unconventional indent fomatting.

198–200 ↗(On Diff #42116)

In this codebase, instead of "comment that this is impossible + continue", strongly prefer "throw".

247 ↗(On Diff #42116)

This is largely philosophical, but two other approaches you could take here:

  • Catch all of the exceptions, and throw PhutilAggregateException instead (probably best, but you lose the ability to catch specific exceptions).
  • If you're throwing only one exception, consider throwing the first exception, not the last exception. I think the first exception tends to be the most relevant one more often (sometimes subsequent exceptions are really symptoms of the first exception, but the first exception is rarely/never a symptom of a later exception). But AggregateException is probably cleaner if no caller is trying to catch specific types of exceptions.
This revision is now accepted and ready to land.Mar 20 2017, 1:20 PM

Just to make sure I haven't missed anything:

  • We currently write health checks but never read them, right? So there's no effect (other than the UI "Status" changing) when a service fails health checks? That seems fine for now, I just want to make sure I didn't miss a health check read somewhere.

Yeah, I want to implement the health checks to at least avoid the wasted attempts to query a server that's known to be down. I just hadn't gotten to it yet.

  • This references a "Cluster: Search" document which hasn't been written yet (also fine, just making sure I didn't miss it). We should write this document before promoting to stable on Friday at the latest (ideally, hold this until that document is ready).

Noted.

  • Each Host has its own Engine instance. I could imagine that in the future we might possibly want to change this so that each Cluster/Service has an Engine, and all hosts are passed the same Engine. This would let an Engine do more-stateful things by knowing about all hosts in a service. But I think this is fine as written for now (we have no use cases for anything fancier yet), and fairly easy to change in the future if we run into situations where we'd benefit form tweaking architecture.

Agreed, it would make sense for the engine to be aware of hosts (and their health status). I might have gone that direction but this is already a large change and I'm trying to avoid making more major changes to the engine class. I'm especially avoiding more changes that would necessitate more refactoring of the mysql engine in order to maintain parity between the two.

  • I think the static $cache (see inline) thing is still a bug; pretty much everything else here is style stuff / edge cases / suggestions.
src/infrastructure/cluster/search/PhabricatorSearchCluster.php
3 ↗(On Diff #42116)

This can be final, right?

Yeah, I believe so.

This is enough of a pain that I probably wouldn't fix it here, but PhabricatorSearchService might be a more clear/consistent name for this class

Naming things is hard.

It wouldn't be too much of a pain to change it, if you want me to.

198–200 ↗(On Diff #42116)

Noted. I'll fix that.

247 ↗(On Diff #42116)

I like the idea of aggregate exception, though the relevant failure case is almost always going to be network related. Either something from CURL failing to make the connection, DNS failure, or some low level network error.

In this case we should probably show the user an error along the lines of "Unable to reach any search services, try again later...", irrespective of which exception is encountered.

20after4 marked an inline comment as done.

Ok I've reworked this quite a bit and I may have messed up somewhere in the process.

Now the engine is more responsible for knowing about hosts and subclassing the
engine should be sufficient to customize the behavior without touching the built
in cluster infrastructure. This should make separation between upstream and
downstream implementations a lot cleaner in the long run.

I also renamed PhabricatorSearchCluster to PhabricatorSearchService.

Still TODO: write a page of docs about how to configure cluster.search

  • Cleaned up the elastic query and added comments describing the purpose of the clauses
  • a couple of bugfixes found by further testing

Note: I'm not sure why harbormaster is failing?

20after4 marked 14 inline comments as done.

address review feedback that I hadn't gotten to yet.

Ok I think I've eliminated the problematic parts like indexing project slugs.

Also, this revision no longer depends on D17509

src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
170–175

Hopefully this makes more sense now?

src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
164

The main 'must' clause looks for all words to match against all fields, but the words don't have to be in order and it filters for stopwords, stemming, etc.

178–180

the should clause boosts the score of documents which have an exact match on one or more of these fields.

The exact match is due to analyzer => english_exact and the boosting is due to the nature of adding a should clause to a bool query in elasticsearch.

Fix method signature un-final PhabricatorElasticFulltextStorageEngine

  • Fix searching relationships which I had inadvertantly broken.
  • Better elasticsearch 2.x and 5.x support
  • more optimized query
  • Created diviner documentation: Cluster: Search
  • removed stray phlog

@epriestley: I think this is ready to land but I want to give you one more chance to change your mind.

I've included a basic outline of the documentation in diviner, as well as various other improvements since you accepted this revision.

Would you like me to land this now or wait until after the promote to stable on Friday?

Haha, thanks. Let me finish testing one other change and then I'll give this another look.

I don't see any noteworthy issues here.

master has a bunch of sketchy stuff in it already so let's hold this until after it promotes (in ~30 hours), but I think this is good to go after that. I'll kick the tires a bit more thoroughly once it lands and can counter-diff you if I catch any other issues, but this looks like it's in pretty good shape to me.

@epriestley sweet, I'll land this as soon as I see that you've merged to stable.

  • Improved the status monitoring UI in config/cluster/search/
  • Actually utilize the health monitoring cache to avoid connecting to downed servers.
  • actually, acutally utilize the health monitoring...

try to get harbormaster to build (push to staging?)

This revision was automatically updated to reflect the committed changes.