Page MenuHomePhabricator

New Search Configuration Errata
Open, NormalPublic

Description

D17384 has just landed, making search configuration much more powerful. Just filing this to keep track of followups.

  • @20after4 hit a case where "no readable hosts" throws a setup error?
  • @epriestley is locally seeing a -b not work.

Guidance

  • bin/search index should maybe guide you to bin/search init?

Future

  • When users search for fulltext results in Maniphest and hit too many documents, the current behavior is approximately silent failure (see T12443). D17384 has also lowered the ceiling for ElasticSearch, although previous changes lowered it for MySQL search. We should not fail silently, and ideally should build toward T12003.
  • When a search cluster is down, we probably don't degrade with much grace (unhandled exception)?
  • It would be nice to build a practical test suite instead, where we put specific documents into the index and then search for them. The upstream test could run against MySQL, and some bin/search test could run against a configured engine like ElasticSearch. This would make it easier to make sure that behavior was as uniform as possible across engine implementations.
  • Maybe let bin/search commands target a specific service.

  • Does every assigned task now match "user" in ElasticSearch?
  • PhabricatorElasticFulltextStorageEngine has a json_encode() which should be phutil_json_encode().
  • Has T8602 been resolved? (Presuming yes.)
  • PhabricatorSearchEngineTestCase is now pointless and only detects local misconfigurations.
  • PhabricatorSearchService throws an untranslated exception.
  • D17384 added a new "keywords" field, but MySQL does not search it (I think?). The behavior should be as consistent across MySQL and Elastic as we can make it. Likely cleaner is giving "Project" objects a body, with "slugs" and "description" separated by newlines?
  • bin/search index should have summary output.
  • The internal index version should not cause us to skip inserts into new engines.
  • Missing "protocol" in ElasticSearch doesn't hit a warning? (Resolved elsewhere, it seems.)
  • If bin/search index fails to write to one engine, it fails all engines? (Bad Elastic config, misses write to good MySQL config?)
  • indexExists() may not work correctly for ElasticSearch? (can't repro, maybe fixed elsewhere)
  • Write an "Upgrading: ..." guidance task with narrow instructions for installs that are upgrading.
  • Some minor upgrade/release guidance on indexing D17300 (bin/search index --type repository or whatever) would be nice in the changelog if we don't end up issuing something more grand.
  • We should more clearly detail exactly which versions of ElasticSearch are supported (for example, is ElasticSearch <2 no longer supported)? From T9893 it seems like we may only have supported ElasticSearch <2 before, so are the two regions of support totally nonoverlapping and all ElasticSearch users will need to upgrade? [ Uhhhhhh kinda just said "use 5" ]
  • Do we need to add an indexing activity (T11932) for installs with ElasticSearch? [ Task guidance instead? ]
  • Documentation should provide stronger guidance toward MySQL and away from Elastic for the vast majority of installs, because we've historically seen users choosing Elastic when they aren't actually trying to solve any specific problem.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I haven't been testing with elasticsearch < 2.0 so this might break backwards compatibility. It wouldn't be difficult to fix any compatibility issues though, with a tiny bit of testing.

  • Write an "Upgrading: ..." guidance task with narrow instructions for installs that are upgrading.

TODO

  • Do we need to add an indexing activity (T11932) for installs with ElasticSearch?

Yes, I think so

  • We should more clearly detail exactly which versions of ElasticSearch are supported (for example, is ElasticSearch <2 no longer supported)? From >T9893 it seems like we may only have supported ElasticSearch <2 before, so are the two regions of support totally nonoverlapping and all ElasticSearch users will need to upgrade?

Previously there were some minor issues with 2.x and 5.x was impossible. Now I suspect that there are minor issues with 1.x and 2.x-5.x work flawlessly.

  • Documentation should provide stronger guidance toward MySQL and away from Elastic for the vast majority of installs, because we've historically >seen users choosing Elastic when they aren't actually trying to solve any specific problem.

Agreed. Although I feel that elasticsearch provides a vastly superior query parser and fulltext scoring, it is definitely not something that people should default to immediately after installing phabricator.

  • Has T8602 been resolved?

I can not reproduce it on wikimedia's install.

epriestley updated the task description. (Show Details)
epriestley updated the task description. (Show Details)

Personal notes:

  • ElasticSearch requires Java.
  • ElasticSearch requires 2GB of free RAM to even start.

Okay:

$ ./bin/elasticsearch
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/elasticsearch/bootstrap/Elasticsearch : Unsupported major.minor version 52.0
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:442)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:64)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:354)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:348)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:347)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)

That's Java for "ElasticSearch requires Java 1.8". I resolved this by installing Java 1.8 and then removing Java 1.7, which I'm sure nothing on the system depends on.

max file descriptors [4096] for elasticsearch process is too low

"Change nofile system ulimit."

max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

"...perhaps in /etc/sysctl.conf"


I configured cluster.search like this, copying the mysql config:

[
    {
      "type": "elastic",
      "roles": {
        "read": true,
        "write": true
      }
    }
]

That got me a fatal error:

[2017-03-26 08:33:55] ERROR 8: Undefined index: elastic at [/Users/epriestley/dev/core/lib/phabricator/src/infrastructure/cluster/search/PhabricatorSearchService.php:208]
  • My type is not valid, but the error is not detected at runtime.
  • Phabricator's web UI fatals rather than detecting + repairing this error.
  • For now, I worked around this by typing "elasticsearch" correctly.

That got me this error:

EXCEPTION: (PhabricatorWorkerPermanentFailureException) Failed to update search index for document "PHID-TASK-gfvrhh4g6twcjonrprwb": [cURL/6] (phabricator/TASK/PHID-TASK-gfvrhh4g6twcjonrprwb/) <CURLE_COULDNT_RESOLVE_HOST> There was an error resolving the server hostname. Check that you are connected to the internet and that DNS is correctly configured. (Did you add the domain to `/etc/hosts` on some other machine, but not this one?) at [<phabricator>/src/applications/search/worker/PhabricatorSearchWorker.php:54]
  • This is because I have not specified a "host", but this engine can not possibly run in this mode and should abort with a useful error earlier.
  • I provided a "host" and "port", which got me this error:
[HTTP/500] Internal Server Error
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <title>Unrecognized Hostname</title>
...
  • This is because I did not specify "protocol".
  • But the install can't possibly work without "protocol", and this doesn't guide me to the issue.
  • The UI issues misleading guidance:
Configuration option 'cluster.search' has invalid value and was restored to the default: Search engine configuration has an invalid service specification (at index "0"): Got unexpected parameters: host.
  • This is actually not right; "host" is valid a top-level I think? bin/search index appears to be working with this config, but the UI says it isn't valid:
{
  "type": "elasticsearch",
  "host": "elastic001.epriestley.com",
  "port": 9200,
  "protocol": "http",
  "roles": {
    "read": true,
    "write": true
  }
}
  • I haven't run bin/search init, but bin/search index doesn't warn me that I may want to. This might be worth adding. The UI does warn me.
  • The "index incorrect" warning UI uses inconsistent title case.
  • The "index incorrect" warning UI could format the command to be run more cleanly (with addCommand(), I think).
  • bin/search init warns me that the index is "incorrect". It might be more clear to distinguish between "missing" and "incorrect", since it's more comforting to users to see "everything is as we expect, doing normal first-time setup now" than "something is wrong, fixing it".
  • CLI message "Initializing search service "ElasticSearch"" does not end with a period, which is inconsistent with other UI messages.
  • It might be nice to let bin/search commands like init and index select a specific service (or even service + host) to act on, as bin/storage --ref ... now does. You can generally get the result you want by fiddling with config.
  • When a service isn't writable, bin/search init reports "Search cluster has no hosts for role "write".". This is accurate but does not provide guidance: it might be more useful to the user to explain "This service is not writable, so we're skipping index check for it.".
  • Even with write off for MySQL, bin/search index --type task --trace still updates MySQL, I think? I may be misreading the trace output. But this behavior doesn't make sense if it is the actual behavior, and it seems like reindexAbstractDocument() uses "all services", not "writable services", and the MySQL engine doesn't make sure it's writable before indexing.

  • Searching for f*a*c*t*o*r*y*s*u*r*p*l*u*s*z*z*q*q*z*z*q*q produces nonsenical results (many results, when I would expect no results: the results do not contain that sequence of letters in order).
  • Searching or user fails to find task Grant users tokens when a mention is created, suggesting that stemming is not working.
  • Searching for users finds that task, but fails to find a task containing "per user per month" in a comment, also suggesting that stemming is not working.
  • Searching for maniphest fails to find task maniphest.query elephant, suggesting that tokenization is ElasticSearch is not as good as the MySQL tokenization for these words (see D17330).
  • Searching for users -blue returns a huge number of results: significantly more than users. Expected behavior: fewer results, omitting those results matching blue.
  • Searching for users blue returns more results than users or blue. Expected behavior: fewer results, because only results which match "users" AND "blue" are returned. The result set includes completely irrelevant results.

I ran into a lot of confusion because the versioned object indexes are not namespaced per-service. Basically, if you insert version 95 of a document into Elastic, the indexer thinks that version 95 doesn't need to go into MySQL, even though it does. So when you run bin/search index ..., you may get only a subset of the updates you actually need. The object index versions need to change to become engine-aware so they are stored per-service, not globally, and/or the whole mechanism needs to include a hash of cluster.search or just be turned off. Until this is fixed, it can be worked around with using --force everywhere.

bin/search index might reasonably provide summary output about this ("392 documents were not indexed because they haven't changed, use --force to update them.").

@epriestley: Thanks for the detailed feedback... I'll get to work ;)


  • Searching for f*a*c*t*o*r*y*s*u*r*p*l*u*s*z*z*q*q*z*z*q*q produces nonsenical results (many results, when I would expect no results: the results do not contain that sequence of letters in order).
  • Searching or user fails to find task Grant users tokens when a mention is created, suggesting that stemming is not working.
  • Searching for users finds that task, but fails to find a task containing "per user per month" in a comment, also suggesting that stemming is not working.
  • Searching for maniphest fails to find task maniphest.query elephant, suggesting that tokenization is ElasticSearch is not as good as the MySQL tokenization for these words (see D17330).
  • Searching for users -blue returns a huge number of results: significantly more than users. Expected behavior: fewer results, omitting those results matching blue.
  • Searching for users blue returns more results than users or blue. Expected behavior: fewer results, because only results which match "users" AND "blue" are returned. The result set includes completely irrelevant results.

I can't actually explain the search anomalies you've encountered, my experience with elasticsearch has been the opposite.. I will test these cases and try to identify the cause.

I ran into a lot of confusion because the versioned object indexes are not namespaced per-service. Basically, if you insert version 95 of a document into Elastic, the indexer thinks that version 95 doesn't need to go into MySQL, even though it does. So when you run bin/search index ..., you may get only a subset of the updates you actually need. The object index versions need to change to become engine-aware so they are stored per-service, not globally, and/or the whole mechanism needs to include a hash of cluster.search or just be turned off. Until this is fixed, it can be worked around with using --force everywhere.

bin/search index might reasonably provide summary output about this ("392 documents were not indexed because they haven't changed, use --force to update them.").

This explains something that I totally missed - I've been using --force a lot because of the same confusion.

f*a*c*t*o*r*y*s*u*r*p*l*u*s*z*z*q*q*z*z*q*q returns the same results as
f a c t o r y s u r p l u s z z q q z z q q so it appears to be treating those as individual single-letter tokens. strange.


  • Searching for f*a*c*t*o*r*y*s*u*r*p*l*u*s*z*z*q*q*z*z*q*q produces nonsenical results (many results, when I would expect no results: the results do not contain that sequence of letters in order).
  • Searching or user fails to find task Grant users tokens when a mention is created, suggesting that stemming is not working.
  • Searching for users finds that task, but fails to find a task containing "per user per month" in a comment, also suggesting that stemming is not working.
  • Searching for maniphest fails to find task maniphest.query elephant, suggesting that tokenization is ElasticSearch is not as good as the MySQL tokenization for these words (see D17330).

Downstream, a search for "phids to maniphest query" finds "add ids and phids to maniphest.query" so it's tokenizing maniphest.query as two separate tokens. Not sure what happened in your test case.

  • Searching for users -blue returns a huge number of results: significantly more than users. Expected behavior: fewer results, omitting those results matching blue.
  • Searching for users blue returns more results than users or blue. Expected behavior: fewer results, because only results which match "users" AND "blue" are returned. The result set includes completely irrelevant results.

Downstream search for "users -blocked -deprecated -anonymous" effectively removes results for blocked, deprecated and anonymous

I don't think this is due to any Wikimedia-specific configuration on our elasticsearch, especially because my local testing has been done with a vanilla docker image directly from elasticsearch upstream.

so it's tokenizing maniphest.query as two separate tokens

I think the commit message just has "maniphest" and "query" as separate words, and it's finding those -- not finding them from "maniphest.query". Here's a case showing this doesn't work on the WMF install:

I've updated D17564: Address some New Search Configuration Errata to address the tokenization and word stemming issues.

bin/search init warns me that the index is "incorrect". It might be more clear to distinguish between "missing" and "incorrect", since it's more comforting to users to see "everything is as we expect, doing normal first-time setup now" than "something is wrong, fixing it".

It looks like we already have logic for this, so maybe indexExists() doesn't detect that the ElasticSearch index is missing properly in some cases.

So I ran into one problem when deploying the latest code to WMF production. We now throw a setup error if we have a cluster with no readable hosts. It actually makes sense to have a cluster that is write-only so that setup error is bogus.

So I ran into one problem when deploying the latest code to WMF production. We now throw a setup error if we have a cluster with no readable hosts. It actually makes sense to have a cluster that is write-only so that setup error is bogus.

I couldn't immediately reproduce this -- I used a config like this:

{
  "type": "elasticsearch",
  "hosts": [
    {
      "host": "elastic001.epriestley.com",
      "port": 9200,
      "protocol": "http",
      "roles": {
        "read": false,
        "write": true
      }
    }
  ]
},

That didn't seem to raise a config/setup error or an error when actually searching -- do you have a copy of the error itself or a reproducing config?

(This is all wrong, see below.)

One note here is that elastic.search.namespace has been removed without replacement. Since this was added by WMF in D9798 and effectively removed by WMF in D17384, I presume it no longer has any use cases.

If that isn't the case, we should maybe consider piggy-backing on storage.default-namespace rather than having a separate setting, although cluster.search is sufficiently flexible that it probably doesn't matter too much.

If we do use storage.default-namespace (either as the entire setting, or as a default for the setting) we should include it in the index version hash added by D17598.

Oh, I'm wrong. It hasn't been removed without replacement: there's a new path option for Elastic. I think there's probably a bug in indexExists() then, since it looks for 'phabricator' by name:

$res = $this->executeRequest($host, $uri, array());
return isset($res['indices']['phabricator']);

Presumably that should actually be trim($path, '/') or something?

It would be nice to build a practical test suite instead, where we put specific documents into the index and then search for them.

@epriestley: It's straightforward to launch an Elasticsearch instance using Docker. I could write a test that fires up a new docker instance, configures Phabricator to use it, indexes a few specific documents and then does a search to make sure results look correct.

You came up with some pretty good test cases when you were testing my patches. The only thing I'm not really clear about is how to temporarily override phabricator's config to use the test container for search. Is there a config override mechanism for use in phabricator unit tests?

Maybe let bin/search commands target a specific service.

This is a good idea, and would be very helpful for operational use during a migration.

It might also be handy if bin/search could actually execute queries and return results, just for easy testing of the backend. Not sure about potential policy bypass though.

This reminds me of something else I have been thinking about.... If we add the view policy field to search indexes then it would be possible to implement a public search engine for logged out users without worrying about filtering results or pagination issues (at least with Elasticsearch, this is a simple filter on ViewPolicy:public). This obviously wouldn't be useful beyond identifying public documents - the search engine can't evaluate complex policy logic but it can match on static values.

If you can run bin/whatever, policies no longer matter since you can bin/storage shell or any other number of dangerous things.

If we add the view policy field to search indexes then it would be possible to implement a public search engine for logged out users without worrying about filtering results or pagination issues

Only if the objects aren't part of Spaces, don't later implement SpacesInterface, don't currently implement ExtendedPolicyInterface, don't later implement it, don't have dependent/implicit policies, no one turns off policy.allow-public, and so on. Offhand, this feels like a lot of complexity for pretty limited utility.