
Applying fulltext limits first causes missing results
Closed, Resolved (Public)

Description

D16944 changed fulltext search to apply the fulltext limit first, capping the fulltext matches at 1000 before filtering on other criteria. This can lead to missing results.

Take, for example, a query for "daemon" that I've narrowed down with other criteria I happen to remember. It finds no results because the particular task isn't among the 1000 found by the fulltext search.

Adding an additional word makes it find results. It is counter-intuitive that adding constraints would broaden the result set. The logic of "Better to return quickly and let the user refine their results" from D16944 does not make sense if the user sees no results, or at least not the result they are expecting -- no user will intuitively assume they need to further refine their query from such a results page.
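To make the failure mode concrete, here is a minimal sketch in Python. It is not Phabricator's code: the in-memory corpus, the `owner` field, the "lease" term, and the function names are all hypothetical, and id order stands in for whatever order the fulltext engine returns.

```python
FULLTEXT_CAP = 1000  # the cap described above


def fulltext_ids(docs, terms):
    """Ids of documents containing every term, in id order."""
    return [d["id"] for d in docs if all(t in d["text"] for t in terms)]


def search_limit_first(docs, terms, predicate, cap=FULLTEXT_CAP):
    """Cap the fulltext matches first, then apply the other constraints."""
    capped = set(fulltext_ids(docs, terms)[:cap])
    return [d["id"] for d in docs if d["id"] in capped and predicate(d)]


# Hypothetical corpus: 5000 tasks mention "daemon"; only the last one also
# mentions "lease" and is owned by "alice".
docs = [{"id": i, "text": "daemon", "owner": "bob"} for i in range(4999)]
docs.append({"id": 4999, "text": "daemon lease", "owner": "alice"})


def owned_by_alice(doc):
    return doc["owner"] == "alice"


# The matching task falls outside the first 1000 fulltext hits.
broad = search_limit_first(docs, ["daemon"], owned_by_alice)
# The extra term shrinks the fulltext set below the cap, so the task appears.
narrow = search_limit_first(docs, ["daemon", "lease"], owned_by_alice)
print(broad, narrow)
```

The broader query returns nothing while the narrower one returns the task, which is the counter-intuitive behavior described here.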

Event Timeline

jboning added a project: Restricted Project. Mar 23 2017, 1:33 AM

There's some additional context in D17384, where Elastic has an internal hard limit of 10K results.

See also T12353 for significant discussion.

I don't think we can reasonably "fix" this in the general case, but we should, at a minimum, provide more specific guidance to users about what's happening so they can reasonably figure out how to move forward.

In this particular case, the result set for all constraints other than fulltext is only 6 results, and we could imagine doing that half of the query first and then passing a "and only look at these documents" constraint to the search engine.

But it will always be possible to construct a non-fulltext constraint which matches a million documents and a fulltext constraint which matches a million different documents.
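A rough sketch of the "do the other half first" idea, and of where it stops working, might look like the following. This is Python over an in-memory list, not the actual query code: `max_ids`, the `owner` field, and the function name are hypothetical, and returning `None` is just one way to signal that the candidate set is too large to hand to the search engine.

```python
def search_filter_first(docs, terms, predicate, max_ids=1000):
    """Apply the non-fulltext constraints first, then restrict fulltext
    matching to those candidates.

    Returns None when there are too many candidates to pass along, which is
    the case this approach cannot handle.
    """
    candidates = [d for d in docs if predicate(d)]
    if len(candidates) > max_ids:
        return None
    return [d["id"] for d in candidates if all(t in d["text"] for t in terms)]


# Hypothetical corpus: 4994 unrelated "daemon" tasks, plus 6 tasks owned by
# "alice", only one of which mentions "daemon".
docs = [{"id": i, "text": "daemon", "owner": "bob"} for i in range(4994)]
docs += [{"id": i, "text": "unrelated", "owner": "alice"} for i in range(4994, 4999)]
docs.append({"id": 4999, "text": "daemon", "owner": "alice"})

# Six candidates survive the owner constraint; one matches the fulltext term.
selective = search_filter_first(docs, ["daemon"], lambda d: d["owner"] == "alice")
# A constraint that matches everything exceeds max_ids and cannot be handled.
unselective = search_filter_first(docs, ["daemon"], lambda d: True)
print(selective, unselective)
```

The second call is the general case described above: once both halves match a very large number of documents, neither ordering helps.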

T12003 is also related to this (providing more explanatory, contextual help for users executing searches which may not be doing what they want for technical reasons).

I think it would make a lot of sense to construct the two queries separately (and in parallel) with a short timeout, then handle the timeout gracefully, allowing the user to refine their query further. This would avoid the denial-of-service situation that hit Wikimedia more than once, when users repeatedly executed really expensive searches until MySQL fell over from the load.
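As a sketch of that shape, assuming each half of the query can be wrapped in a plain callable that returns document ids (the callables and `run_both` are hypothetical names), the parallel-with-timeout pattern could look like this in Python. Note that abandoning a future does not stop the work already running in its thread, so a real deployment would also need the backend to enforce its own query timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_both(fulltext_query, filter_query, timeout_s):
    """Run both halves in parallel and intersect them.

    Returns (ids, timed_out). On timeout the caller can ask the user to
    refine the query instead of waiting on an expensive search.
    """
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(fulltext_query), pool.submit(filter_query)]
    deadline = time.monotonic() + timeout_s
    try:
        parts = [
            set(f.result(timeout=max(0.0, deadline - time.monotonic())))
            for f in futures
        ]
    except FutureTimeout:
        return [], True
    finally:
        # Do not block on stragglers (cancel_futures needs Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)
    return sorted(parts[0] & parts[1]), False


def fast_fulltext():
    return [1, 2, 3]


def fast_filter():
    return [2, 3, 4]


def slow_filter():
    time.sleep(0.5)  # stands in for an expensive query
    return [2, 3, 4]


print(run_both(fast_fulltext, fast_filter, timeout_s=1.0))
print(run_both(fast_fulltext, slow_filter, timeout_s=0.1))
```

Intersecting the two id sets in the application only works when both sets are complete, so this sketch does not by itself remove the need for limits on either half.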

It's also possible to pass the constraints on to Elasticsearch so that it can handle all of the filters, not just the fulltext part. That is, however, quite a bit more complex and requires indexing more fields, and it's doubly complex to support that on top of supporting MySQL constraint-based search filters.

In this specific situation it seems like it would make sense to automatically repeat the search without the fulltext portion and give the 6 results, along with a warning that the fulltext portion wasn't being applied. I don't know if that's practical to implement though.
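One way that fallback could be structured is sketched below, in Python over an in-memory list. All names are hypothetical. The sketch also assumes the retry should only happen when the fulltext half actually hit its cap, since otherwise an empty result is accurate; that condition is a design choice here, not something stated above.

```python
def search_with_fallback(docs, terms, predicate, cap=1000):
    """Limit-first search that retries without the fulltext terms.

    Returns (ids, warning). warning is None when the fulltext terms were
    applied to the returned results.
    """
    hits = [d["id"] for d in docs if all(t in d["text"] for t in terms)][:cap]
    capped = set(hits)
    results = [d["id"] for d in docs if d["id"] in capped and predicate(d)]
    truncated = len(hits) >= cap  # assumption: hitting the cap means truncation
    if results or not terms or not truncated:
        return results, None
    fallback = [d["id"] for d in docs if predicate(d)]
    if not fallback:
        return [], None
    return fallback, "Fulltext terms were not applied to these results."


# Hypothetical corpus: 5000 "daemon" tasks, the last 6 owned by "alice".
docs = [{"id": i, "text": "daemon", "owner": "bob"} for i in range(4994)]
docs += [{"id": i, "text": "daemon", "owner": "alice"} for i in range(4994, 5000)]

ids, warning = search_with_fallback(docs, ["daemon"], lambda d: d["owner"] == "alice")
print(ids, warning)
```

The returned results may include documents that do not match the fulltext terms at all, which is why the warning has to accompany them.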

epriestley claimed this task.

This is resolved by the Ferret engine, which can execute all parts of the query logic in MySQL.