Page MenuHomePhabricator

Updated PhabricatorElasticFulltextStorageEngine for elasticsearch 5
AbandonedPublic

Authored by 20after4 on Mar 20 2017, 12:09 PM.

Details

Reviewers
epriestley
Group Reviewers
Blessed Reviewers
Summary

As requested in D17384: Support multiple fulltext search clusters with 'cluster.search' config,
this separates out the substantial changes to the elastic
index and query building. These changes will need to land before D17384 simply
because I am unable to test D17384 without these changes. Or more to the point,
elasticsearch support in the upstream is badly broken when using recent (or
even relatively outdated) versions of elasticsearch, so these changes unbreak
it to the point of usability in order to build further improvements on top.

I will rebase D17384 on top of this change and update that revision shortly.

Test Plan

indexed and ran queries with elasticsearch 5, saw reasonable results.

Diff Detail

Repository
rP Phabricator
Branch
master
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 16031
Build 21255: Run Core Tests
Build 21254: arc lint + arc unit

Event Timeline

Does this require a full bin/search index for installs using Elastic? It looks like the index structure changes...

src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
9

This comment probably doesn't bestow any great enlightenment on any reasonable reader.

57

$tagCache is never defined as a property, I think? Prefer to define properties even if your access patterns don't technically require that they be defined.

Actually, since it isn't defined as a static property, I don't think this works at all? Does self::$tagCache[$x] = $y on an un-initialized property create it as a static property?

On my system (PHP 7.1), it looks like it doesn't: I get a fatal error instead:

PHP Fatal error: Uncaught Error: Access to undeclared static property: C::$value in /Users/epriestley/test.php:7

Does this need to be cached? We normally (always?) hit this cache only once per page, right? So just running the query without caching would have the same performance?

If this must use a cache, it should use PhabricatorCaches::getRequestCache(), not static. In the future, we may run multiple different requests (for performance) or multiple different Phabricator instances (for reducing memory footprints) in the same process. In this case, the scope of a static variable might span different viewers or instances. We don't do this today, but I want to leave room to do it in the future if it makes sense to pursue it.

So the scope of static means that it's only safe to use to store data which does not depend on the viewer, instance, or any configuration. If data does depend on the viewer/instance/config, use PhabricatorCaches::getRequestCache() instead, which works like static but is guaranteed to persist for only the scope of a logical request (e.g., a single web request, or a single task execution in the daemons).

151

Oh, this is happening on the indexing pathway. The desire for a cache makes sense, then.

206

I think this is a change of semantics, causing a search for "apple" to find objects tagged with "apple". I don't want this change in the upstream: I think it's confusing on its own, exceptionally confusing when projects are renamed (renaming projects does not rebuild the index, making it stale), and is also a meaningful semantic change between the MySQL and ElasticSearch engines. As long as ElasticSearch is in the upstream, I'd like the different engines to work the same way as much as possible.

But, particularly, I think this leads to cases where it isn't obvious why results are present in the result set (e.g., the document is tagged with a project which previously had one of the query terms in one of its slugs).

I'm okay with an upstream change to make searching for #asdf mean "search for documents tagged asdf", not "search for documents containing text #asdf" (see T7860) but I think making a search for "differential" imply that the result set should include tagged documents isn't desirable, especially when the index isn't rename-resistant.

I'd even be potentially OK with a change like: do a second search for all the terms the user submitted to see if they could match projects, then provide the user some hint text about using "#" to search for projects, e.g. "Searched for the text 'differential'. To search for documents tagged with Differential, search for 'Differential' instead.", similar to the Google spelling corrector hint.

213

This also seems like a semantic change, making fields like "Authors" and "Owners" optional instead of required? Maybe I'm misunderstanding how SHOULD and MUST work. If I query for "authored by X", I think the results should only include results which are authored by X.

Again, I'd be fine with a change like a hint that says "nothing matched, try a more-general query, perhaps by removing the authors constraint, click here to query without 'authors: x'" if that's really what's going on here, but I'd like to see some evidence that users are really having trouble with this first.

(But I might just be totally misunderstanding what's going on here.)

Okay, from below, I think I have no clue what's happening here. Are you just trying to boost results with matches in other fields? Since none of the other fields have text in them (I think?) this seems non-useful? It just makes "phid" hit every document, "user" hit every document with an author, etc?

Or maybe I'm just wrong about that too.

20after4 added inline comments.
src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
15

Note: in the wmf fork, the version was configurable. I had assumed that was already upstreamed but apparently not.

Since D17384 reworks the config for this and other values, I've hard-coded it here, just to get this change merged cleanly and separately.

57

This method essentially just looks up the slugs for projects by phid so that we index searchable text rather than only the relationship phid references.

Now it occurs to me that I should document this in the source ;)

57

Sorry that was my mistake when splitting out the patches, it's supposed to be static class property.

Does this need to be cached? We normally (always?) hit this cache only once per page, right? So just running the query without caching would have the same performance?

It's called once per document when indexing.

If this must use a cache, it should use PhabricatorCaches::getRequestCache(), not static. In the future, we may run multiple different requests (for performance) or multiple different Phabricator instances (for reducing memory footprints) in the same process. In this case, the scope of a static variable might span different viewers or instances. We don't do this today, but I want to leave room to do it in the future if it makes sense to pursue it.

ok, I'll switch to using getRequestCache...

Does this require a full bin/search index for installs using Elastic? It looks like the index structure changes...

Yes currently it does, however, I think I can rework it to be compatible with the old index.

src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
213

Okay, from below, I think I have no clue what's happening here. Are you just trying to boost results with matches in other fields? Since none of the other fields have text in them (I think?) this seems non-useful? It just makes "phid" hit every document, "user" hit every document with an author, etc?

That wasn't my intent, however, after testing, it appears that you are right about this.

I think I'm going to pull out some of these changes into a subclass for wikimedia-specific experiments and I'll update this revision to be mostly similar to the upstream version, just with updated es5 syntax.

Yes currently it does, however, I think I can rework it to be compatible with the old index.

We can make users re-index fairly easily too, but we should either be backward compatible or create a "reindex" activity (context on / connected to T11932) for ElasticSearch users as part of this change so we issue the appropriate setup guidance. I'm completely fine with making installs reindex (I think ElasticSearch is relatively rare, and reindexing isn't usually a big deal) if backward compatibility is going to be a big pain.

So I've done a bit more thinking about how to implement the changes to the engine class, especially with regards to any bits that are not wanted in the upstream but are desirable for wikimedia's implementation.

I think the best way to go for this is to make the engine class configurable in the cluster.search json value...

So each cluster would specify the specific engine class name and downstream we can just subclass PhabricatorElasticFulltextStorageEngine and specify WMFElasticFulltextStorageEngine in our config. We could have a basic version of buildSpec in the upstream and Wikimedia / other adventurous installs can override that method with something more complex in the subclass.

Right now, I think type identifies a Host subclass, which knows which Engine subclass to use, but it should maybe really go the other way: have type identify an Engine subclass instead, and then the Engine could have a method like newHost() which knew which type of Host to build, or maybe Host could even become generic.

Then the upstream could have "type": "elastic" (to select BoringUpstreamElasticFulltextStorageEngine) and you could have "type": "elastic.wmf" or whatever to select WMFElasticFulltextStorageEngine.

One advantage to flipping that is that you can provide a custom Engine with one class (just the engine) instead of two (engine + host).

20after4 added inline comments.
src/applications/search/fulltextstorage/PhabricatorElasticFulltextStorageEngine.php
57

ok I addressed this in D17384 by removing the entire tags feature.

213

Okay, from below, I think I have no clue what's happening here. Are you just trying to boost results with matches in other fields? Since none of the other fields have text in them (I think?) this seems non-useful? It just makes "phid" hit every document, "user" hit every document with an author, etc?

This should be much clearer now in D17384#inline-54823