Page MenuHomePhabricator

Add highlighting support to Elasticsearch fulltext engine
AbandonedPublic

Authored by 20after4 on Apr 4 2017, 2:10 AM.

Details

Summary

This utilizes the built-in highlight support from elastic to
display snippets of the search document body with matching
terms highlighted (where possible).

Note: I'm not sure if upstream is interested in this since it does not implement
equivalent functionality for MySQL results.

Feel free to ignore this if not interested.

Refs T12500

Test Plan

Ran several "global" searches, saw results with snippets of
the document body, highlighted

Diff Detail

Repository
rP Phabricator
Branch
master
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 16316
Build 21691: Run Core Tests
Build 21690: arc lint + arc unit

Event Timeline

I'm not really comfortable trusting Elasticsearch to return a safe blob of HTML here: it seems like we're putting a lot of trust in it by doing this (willingly applying PhutilSafeHTML() to anything it returns) but not really getting very much benefit (basically just some highlighting, which we already have code for elsewhere) relative to the level of trust involved.

In theory there's almost no possible way Elasticsearch could get things wrong since the task is so simple, and I don't think this is likely to ever lead to a security issue, it just creates a very large amount of new attack surface which we don't really need to have.

It also changes the display behavior for Elasticsearch only, which is surprising and could easily be missed in testing. If we shipped a new layout here we could break this and never notice.

With the current project and repository indexers, "body" is probably no longer ideal to show to users since it's a mishmash of other things now (slugs / short names). That's okay if it's just an index, but if it's both index and display we should probably be more careful about it to avoid confusion.

This sets body content on all handles, even those which represent objects the user can not see. Although they are presumably filtered out later so this isn't actually a policy issue, it's somewhat dangerous because it creates handles which appear to have survived policy checks but contain a substantial amount of private information the viewer may not have permission to see. Normally, handles don't work like this, so code which accepts handles may have expectations about how it can safely handle them which don't align well with the handles this generates.

Finally, it overrides subtitles on handles which really have them. Today I think this is only Users, but could be more types of objects in the future.

I think the underlying goal is a worthy one, but we should pursue it in the context of broader search design changes (T8646) and make it apply to all engines, not just Elasticsearch (since nothing that's happening here should be Elasticsearch-specific, except that it's very slightly little easier to hook up this way).

Outside of the policy and XSS risks (which I think are both very minor) I don't see anything here that makes this a bad change if you want to maintain it locally until we get to T8646,

This revision now requires changes to proceed.Apr 4 2017, 12:10 PM

I'm not really comfortable trusting Elasticsearch to return a safe blob of HTML here: it seems like we're putting a lot of trust in it by doing this (willingly applying PhutilSafeHTML() to anything it returns) but not really getting very much benefit (basically just some highlighting, which we already have code for elsewhere) relative to the level of trust involved.

In theory there's almost no possible way Elasticsearch could get things wrong since the task is so simple, and I don't think this is likely to ever lead to a security issue, it just creates a very large amount of new attack surface which we don't really need to have.

I could have it return anything to mark the highlighted terms (see line 277). For example, I could have it return ** instead of <strong>, then process it as remarkup.

It also changes the display behavior for Elasticsearch only, which is surprising and could easily be missed in testing. If we shipped a new layout here we could break this and never notice.

This is the part that concerns me most.

With the current project and repository indexers, "body" is probably no longer ideal to show to users since it's a mishmash of other things now (slugs / short names). That's okay if it's just an index, but if it's both index and display we should probably be more careful about it to avoid confusion.

This is why I introduced a keywords field.

This sets body content on all handles, even those which represent objects the user can not see. Although they are presumably filtered out later so this isn't actually a policy issue, it's somewhat dangerous because it creates handles which appear to have survived policy checks but contain a substantial amount of private information the viewer may not have permission to see. Normally, handles don't work like this, so code which accepts handles may have expectations about how it can safely handle them which don't align well with the handles this generates.

Isn't there a way to avoid this? I could check the handles somehow?

Finally, it overrides subtitles on handles which really have them. Today I think this is only Users, but could be more types of objects in the future.

I could limit it to specific types? This is really only useful for tasks, currently...

I think the underlying goal is a worthy one, but we should pursue it in the context of broader search design changes (T8646) and make it apply to all engines, not just Elasticsearch (since nothing that's happening here should be Elasticsearch-specific, except that it's very slightly little easier to hook up this way).

Outside of the policy and XSS risks (which I think are both very minor) I don't see anything here that makes this a bad change if you want to maintain it locally until we get to T8646,

Yeah I'm pretty sure we will keep this in Wikimedia's fork and I'll look into resolving the issues you pointed out. Meanwhile I'll think about better ways to make it more generally applicable.

Finally, it overrides subtitles on handles which really have them. Today I think this is only Users, but could be more types of objects in the future.

I think it would make more sense to give handles a new property called $snippet (or something similar?) and render that separately, but this was a rough proof of concept.

I lean toward having this return a SearchResult kind of object which has methods like getHandle() and getBodySnippet() and whatever. I think we need this for T8646 anyway. It doesn't make sense to put stuff like "breadcrumb hierarchy" or "thumbnail document preview" or whatever on handles, since there's like a 99% chance that only global search results will ever use them.

(I recall thinking that such an object would make T8285 easier too, but don't recall the specifics, so I may be misremembering.)

I lean toward having this return a SearchResult kind of object which has methods like getHandle() and getBodySnippet() and whatever. I think we need this for T8646 anyway. It doesn't make sense to put stuff like "breadcrumb hierarchy" or "thumbnail document preview" or whatever on handles, since there's like a 99% chance that only global search results will ever use them.

That sounds like the right way to go... Let me see if I can come up with something.

20after4 edited edge metadata.

Significant refactor to use a PhabricatorFulltextResult object to represent search
result hits. Results views are made extensible using the EngineExtension pattern:

PhabricatorSearchResultEngineExtension abstract base class models extensions
which can alter the way individual search results are displayed.

PhabricatorSearchResultHighlightsEngineExtension implements the snippet
highlighting functionality when the PhabricatorFulltextResult object has
highlight metadata.

Another notable improvement is that this no longer uses setSubhead() and instead appends the highlighting as a child node within the PHUIObjectItemView

20after4 planned changes to this revision.EditedApr 5 2017, 7:03 AM

Warning: This code is still a bit messy, I will clean things up a bit more in a future diff.

@epriestley: I don't expect you to consider this for inclusion in the upstream as it stands currently, I'm mainly updating the diff in order to get feedback about the general direction that this is heading.

My thinking is this: if the PhabricatorSearchResultEngineExtension pattern seems reasonable to you, then I believe this will form a good foundation for a future change to address T8646: Provide more context for search results, particularly wiki documents.

There is quite a bit of work remaining to clean up the mess I made integrating crudely splicing this into the various places that previously dealt with handles.

I think this approach is at least somewhere in the realm of reasonable, yeah.

The extensions should probably get passed a list of all results at once, not individual results, so they can bulk-load additional data (for example, hierarchies for wiki documents or thumbnails for files or whatever) instead of needing to do 100 queries to show 100 items.

So maybe the API ends up looking like $extension->messAroundWithTheseViewsBeforeWeShowThemToTheUserIfYouWant($items), but general flow seems reasonable to me.

I think this approach is at least somewhere in the realm of reasonable, yeah.

The extensions should probably get passed a list of all results at once, not individual results, so they can bulk-load additional data (for example, hierarchies for wiki documents or thumbnails for files or whatever) instead of needing to do 100 queries to show 100 items.

So maybe the API ends up looking like $extension->messAroundWithTheseViewsBeforeWeShowThemToTheUserIfYouWant($items), but general flow seems reasonable to me.

That's a really good point, and it would definitely make a difference, performance-wise. I'll look at that further.

@chad and @epriestley: This is live now on https://phabricator.wikimedia.org if either of you want to take a look at the overall UX. I'm sure there is some room for improvement but it feels pretty usable to me.

webroot/rsrc/css/application/search/search-results.css
5

This turned out to look really bad. The highlighted search results shouldn't be red, at least not the body highlights.

@epriestley: Given that this is an elastic-only feature and not really wanted upstream, and given that you have implemented result sets as objects in rP3245e74f16bb: Show users how fulltext search queries are parsed and executed; don't query…, I think I should abandon this and submit a new revision with just the PhabricatorSearchResultEngineExtension infrastructure.

Are you comfortable considering that for inclusion in the upstream so that I can take the rest of this downstream without too many merge conflicts? :)

@epriestley: I'll submit a new revision that expands on the resultset class as I mentioned above. Of course this is without any expectations regarding when you'll have time to look at it :)