Paths

Table of Contentst

Add highlighting support to Elasticsearch fulltext engine
AbandonedPublic
Actions

Authored by 20after4 on Apr 4 2017, 2:10 AM.

Details

Reviewers

epriestley
chad

Group Reviewers

Blessed Reviewers

Maniphest Tasks

T12500: Render highlighted search result body snippets in global search results views

Summary

This utilizes the built-in highlight support from elastic to
display snippets of the search document body with matching
terms highlighted (where possible).

Note: I'm not sure if upstream is interested in this since it does not implement
equivalent functionality for MySQL results.

Feel free to ignore this if not interested.

Refs T12500

Test Plan

Ran several "global" searches, saw results with snippets of
the document body, highlighted

Diff Detail

Repository

rP Phabricator

Branch

master

Lint

Lint Passed

Unit

Tests Passed

Build Status

Buildable 16316
Build 21691: Run Core Tests
Build 21690: arc lint + arc unit

Event Timeline

20after4 created this revision.Apr 4 2017, 2:10 AM

Herald added a reviewer: chad. · View Herald TranscriptApr 4 2017, 2:10 AM

Herald added a reviewer: Blessed Reviewers. · View Herald Transcript

Herald added a subscriber: Korvin. · View Herald Transcript

Harbormaster completed remote builds in B16279: Diff 42352.Apr 4 2017, 2:11 AM

I'm not really comfortable trusting Elasticsearch to return a safe blob of HTML here: it seems like we're putting a lot of trust in it by doing this (willingly applying PhutilSafeHTML() to anything it returns) but not really getting very much benefit (basically just some highlighting, which we already have code for elsewhere) relative to the level of trust involved.

In theory there's almost no possible way Elasticsearch could get things wrong since the task is so simple, and I don't think this is likely to ever lead to a security issue, it just creates a very large amount of new attack surface which we don't really need to have.

It also changes the display behavior for Elasticsearch only, which is surprising and could easily be missed in testing. If we shipped a new layout here we could break this and never notice.

With the current project and repository indexers, "body" is probably no longer ideal to show to users since it's a mishmash of other things now (slugs / short names). That's okay if it's just an index, but if it's both index and display we should probably be more careful about it to avoid confusion.

This sets body content on all handles, even those which represent objects the user can not see. Although they are presumably filtered out later so this isn't actually a policy issue, it's somewhat dangerous because it creates handles which appear to have survived policy checks but contain a substantial amount of private information the viewer may not have permission to see. Normally, handles don't work like this, so code which accepts handles may have expectations about how it can safely handle them which don't align well with the handles this generates.

Finally, it overrides subtitles on handles which really have them. Today I think this is only Users, but could be more types of objects in the future.

I think the underlying goal is a worthy one, but we should pursue it in the context of broader search design changes (T8646) and make it apply to all engines, not just Elasticsearch (since nothing that's happening here should be Elasticsearch-specific, except that it's very slightly little easier to hook up this way).

Outside of the policy and XSS risks (which I think are both very minor) I don't see anything here that makes this a bad change if you want to maintain it locally until we get to T8646,

This revision now requires changes to proceed.Apr 4 2017, 12:10 PM

In D17608#211656, @epriestley wrote:

I'm not really comfortable trusting Elasticsearch to return a safe blob of HTML here: it seems like we're putting a lot of trust in it by doing this (willingly applying PhutilSafeHTML() to anything it returns) but not really getting very much benefit (basically just some highlighting, which we already have code for elsewhere) relative to the level of trust involved.

In theory there's almost no possible way Elasticsearch could get things wrong since the task is so simple, and I don't think this is likely to ever lead to a security issue, it just creates a very large amount of new attack surface which we don't really need to have.

I could have it return anything to mark the highlighted terms (see line 277). For example, I could have it return ** instead of <strong>, then process it as remarkup.

It also changes the display behavior for Elasticsearch only, which is surprising and could easily be missed in testing. If we shipped a new layout here we could break this and never notice.

This is the part that concerns me most.

With the current project and repository indexers, "body" is probably no longer ideal to show to users since it's a mishmash of other things now (slugs / short names). That's okay if it's just an index, but if it's both index and display we should probably be more careful about it to avoid confusion.

This is why I introduced a keywords field.

This sets body content on all handles, even those which represent objects the user can not see. Although they are presumably filtered out later so this isn't actually a policy issue, it's somewhat dangerous because it creates handles which appear to have survived policy checks but contain a substantial amount of private information the viewer may not have permission to see. Normally, handles don't work like this, so code which accepts handles may have expectations about how it can safely handle them which don't align well with the handles this generates.

Isn't there a way to avoid this? I could check the handles somehow?

Finally, it overrides subtitles on handles which really have them. Today I think this is only Users, but could be more types of objects in the future.

I could limit it to specific types? This is really only useful for tasks, currently...

I think the underlying goal is a worthy one, but we should pursue it in the context of broader search design changes (T8646) and make it apply to all engines, not just Elasticsearch (since nothing that's happening here should be Elasticsearch-specific, except that it's very slightly little easier to hook up this way).

Outside of the policy and XSS risks (which I think are both very minor) I don't see anything here that makes this a bad change if you want to maintain it locally until we get to T8646,

Yeah I'm pretty sure we will keep this in Wikimedia's fork and I'll look into resolving the issues you pointed out. Meanwhile I'll think about better ways to make it more generally applicable.

In D17608#211656, @epriestley wrote:

Finally, it overrides subtitles on handles which really have them. Today I think this is only Users, but could be more types of objects in the future.

I think it would make more sense to give handles a new property called $snippet (or something similar?) and render that separately, but this was a rough proof of concept.

I lean toward having this return a SearchResult kind of object which has methods like getHandle() and getBodySnippet() and whatever. I think we need this for T8646 anyway. It doesn't make sense to put stuff like "breadcrumb hierarchy" or "thumbnail document preview" or whatever on handles, since there's like a 99% chance that only global search results will ever use them.

(I recall thinking that such an object would make T8285 easier too, but don't recall the specifics, so I may be misremembering.)

In D17608#211696, @epriestley wrote:

I lean toward having this return a SearchResult kind of object which has methods like getHandle() and getBodySnippet() and whatever. I think we need this for T8646 anyway. It doesn't make sense to put stuff like "breadcrumb hierarchy" or "thumbnail document preview" or whatever on handles, since there's like a 99% chance that only global search results will ever use them.

That sounds like the right way to go... Let me see if I can come up with something.

Significant refactor to use a PhabricatorFulltextResult object to represent search
result hits. Results views are made extensible using the EngineExtension pattern:

PhabricatorSearchResultEngineExtension abstract base class models extensions
which can alter the way individual search results are displayed.

PhabricatorSearchResultHighlightsEngineExtension implements the snippet
highlighting functionality when the PhabricatorFulltextResult object has
highlight metadata.

Another notable improvement is that this no longer uses setSubhead() and instead appends the highlighting as a child node within the PHUIObjectItemView

Harbormaster completed remote builds in B16316: Diff 42386.Apr 5 2017, 6:58 AM

Warning: This code is still a bit messy, I will clean things up a bit more in a future diff.

@epriestley: I don't expect you to consider this for inclusion in the upstream as it stands currently, I'm mainly updating the diff in order to get feedback about the general direction that this is heading.

My thinking is this: if the PhabricatorSearchResultEngineExtension pattern seems reasonable to you, then I believe this will form a good foundation for a future change to address T8646: Provide more context for search results, particularly wiki documents.

There is quite a bit of work remaining to clean up the mess I made ~~integrating~~ crudely splicing this into the various places that previously dealt with handles.

I think this approach is at least somewhere in the realm of reasonable, yeah.

The extensions should probably get passed a list of all results at once, not individual results, so they can bulk-load additional data (for example, hierarchies for wiki documents or thumbnails for files or whatever) instead of needing to do 100 queries to show 100 items.

So maybe the API ends up looking like $extension->messAroundWithTheseViewsBeforeWeShowThemToTheUserIfYouWant($items), but general flow seems reasonable to me.

In D17608#211910, @epriestley wrote:

I think this approach is at least somewhere in the realm of reasonable, yeah.

The extensions should probably get passed a list of all results at once, not individual results, so they can bulk-load additional data (for example, hierarchies for wiki documents or thumbnails for files or whatever) instead of needing to do 100 queries to show 100 items.

So maybe the API ends up looking like $extension->messAroundWithTheseViewsBeforeWeShowThemToTheUserIfYouWant($items), but general flow seems reasonable to me.

That's a really good point, and it would definitely make a difference, performance-wise. I'll look at that further.

@chad and @epriestley: This is live now on https://phabricator.wikimedia.org if either of you want to take a look at the overall UX. I'm sure there is some room for improvement but it feels pretty usable to me.

webroot/rsrc/css/application/search/search-results.css
5	This turned out to look really bad. The highlighted search results shouldn't be red, at least not the body highlights.

@epriestley: Given that this is an elastic-only feature and not really wanted upstream, and given that you have implemented result sets as objects in rP3245e74f16bb: Show users how fulltext search queries are parsed and executed; don't query…, I think I should abandon this and submit a new revision with just the PhabricatorSearchResultEngineExtension infrastructure.

Are you comfortable considering that for inclusion in the upstream so that I can take the rest of this downstream without too many merge conflicts? :)

@epriestley: I'll submit a new revision that expands on the resultset class as I mentioned above. Of course this is without any expectations regarding when you'll have time to look at it :)