When Ferret queries match a large number of documents, execution may be dominated by result ordering. From PHI1938, a particular query matching 185,000 results takes 0.07s to return the first 100 results, and 4m40s to return the "best" 100 results.
In most cases, these queries are short and not particularly useful (queries like "test" or "the code"), and users likely do not benefit from ordering anyway: they will probably refine their query when they see how many results matched, not browse through the results.
Ranking Function
Result ordering is currently O(N) on the size of the result set. The most common ranking function is:
```sql
IF(`ft_rank`.termCorpus LIKE '% xx1 %', 2, 0) +
  IF(`ft_rank`.normalCorpus LIKE '% xx2 %', 1, 0) +
  IF(`ft_rank`.termCorpus LIKE '% yy1 %', 2, 0) +
  IF(`ft_rank`.normalCorpus LIKE '% yy2 %', 1, 0) +
  0 AS `_ft_rank`,
```
...where xx1, yy2, etc., are variations of search terms. Roughly, documents get 2 points for each term appearing exactly as queried in the raw text of the title and 1 point for each normalized term appearing in the normalized text of the title.
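For concreteness, here is a sketch of the overall query shape for a hypothetical two-term search for quick fox. The table name ft_field, the WHERE clause, and the terms are assumptions for illustration; the real query matches documents through Ferret's ngram tables. The point is that the ORDER BY forces MySQL to evaluate `_ft_rank` for every matching row before it can return anything:

```sql
-- Hypothetical schema: one `ft_field` row per document title.
-- MySQL must compute `_ft_rank` for every matching row before it can sort.
SELECT documentID,
       IF(termCorpus LIKE '% quick %', 2, 0) +
         IF(normalCorpus LIKE '% quick %', 1, 0) +
         IF(termCorpus LIKE '% fox %', 2, 0) +
         IF(normalCorpus LIKE '% fox %', 1, 0) AS `_ft_rank`
FROM ft_field
WHERE normalCorpus LIKE '% quick %' OR normalCorpus LIKE '% fox %'
ORDER BY `_ft_rank` DESC, documentID
LIMIT 100;
```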
There is a maximum possible score (3 * the number of terms), so in principle MySQL could stop once it finds one full page of results with that maximum score. It has the information needed to compute this, but expecting the optimizer to figure it out is not realistic. In practice, this means it must compute a rank for every row in the result set.
Can we express this query in a way that MySQL can execute more efficiently?
In theory, we could imagine including a term like _ft_has_max_score and somehow hinting to MySQL that once it finds a full page of results with the maximum score, it may stop searching. That is, for the ordering <has_max_score, score, id> it should examine the table row-by-row in id order, stopping once it finds a page of results with a maximum score. I don't know any way to do this in MySQL, and suspect one does not exist.
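The term itself is easy to express; the missing piece is the hint. Here is a sketch of the shape, reusing the hypothetical two-term query from above (maximum score 6, at 3 points per term):

```sql
SELECT documentID, `_ft_rank`,
       IF(`_ft_rank` = 6, 1, 0) AS `_ft_has_max_score`
FROM (
  SELECT documentID,
         IF(termCorpus LIKE '% quick %', 2, 0) +
           IF(normalCorpus LIKE '% quick %', 1, 0) +
           IF(termCorpus LIKE '% fox %', 2, 0) +
           IF(normalCorpus LIKE '% fox %', 1, 0) AS `_ft_rank`
  FROM ft_field
  WHERE normalCorpus LIKE '% quick %' OR normalCorpus LIKE '% fox %'
) ranked
-- The desired ordering <has_max_score, score, id>. There is no way to tell
-- MySQL it may stop scanning once it has 100 rows with
-- `_ft_has_max_score` = 1, so it still ranks and sorts the whole result set.
ORDER BY `_ft_has_max_score` DESC, `_ft_rank` DESC, documentID
LIMIT 100;
```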
Unless a mechanism for this emerges, I don't think this is an option.
Degrading Ranking
We can degrade to a cheap ranking function if we determine that ranking will not be worthwhile. Degrading is the easy part; determining that ranking won't be worthwhile is a bit trickier.
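The degraded path can be as cheap as a constant rank with ID ordering (effectively "newest first"), which MySQL can often satisfy from an index without ranking the whole result set. A sketch, using the same hypothetical schema as above:

```sql
-- Degraded ordering: skip ranking entirely and order by document ID. With a
-- suitable index, MySQL can stop after finding the first 100 matching rows.
SELECT documentID, 0 AS `_ft_rank`
FROM ft_field
WHERE normalCorpus LIKE '% quick %' OR normalCorpus LIKE '% fox %'
ORDER BY documentID DESC
LIMIT 100;
```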
There is a set of cases where we can cheaply and easily guess that ranking is not worthwhile:
- The query produces no ngrams (for example, a very short query like "e").
- A "common ngram" threshold is configured and the query has no uncommon ngrams (for example, a query built only from common words, like "the and a").
In other cases, I think we have to either pre-query (issue a cheap query first to test the result size) or time out the well-ranked query.
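A pre-query does not need an exact count; a bounded count is enough to decide whether ranking is worthwhile. A sketch, with the 1,000-row threshold chosen arbitrarily for illustration:

```sql
-- Count matches, but stop counting at 1,001 rows. If we hit the cap, the
-- result set is large enough that full ranking is unlikely to be worthwhile.
SELECT COUNT(*) AS matches
FROM (
  SELECT 1
  FROM ft_field
  WHERE normalCorpus LIKE '% quick %' OR normalCorpus LIKE '% fox %'
  LIMIT 1001
) sample;
```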
I'm inclined to do this:
- Handle timeouts better, showing the user a more helpful error message with guidance about what happened and why.
- Expose "Relevance" and at least one other cheap ranking explicitly, ideally in a UI control that allows the UI to communicate that relevance wasn't available as a ranking for a given query (e.g., a modal multi-button control).
- Time out queries more quickly (5s or 10s); one possible MySQL-level mechanism is sketched after this list.
- If not unduly difficult, degrade from "Relevance" when cheap predictions indicate it won't work.
- If not unduly difficult, degrade from "Relevance" when queries time out.
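For the faster timeout, one possible mechanism (if we want MySQL rather than the client to enforce it) is the MAX_EXECUTION_TIME optimizer hint available in MySQL 5.7+; whether this fits our query layer is an open question:

```sql
-- Abort the ranked query if it runs longer than 5,000ms. On timeout, MySQL
-- raises ER_QUERY_TIMEOUT (error 3024), which the application can catch and
-- use as the signal to reissue the query with the degraded ordering.
SELECT /*+ MAX_EXECUTION_TIME(5000) */
       documentID,
       IF(termCorpus LIKE '% quick %', 2, 0) +
         IF(normalCorpus LIKE '% quick %', 1, 0) AS `_ft_rank`
FROM ft_field
WHERE normalCorpus LIKE '% quick %'
ORDER BY `_ft_rank` DESC, documentID
LIMIT 100;
```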
This leaves some improvable behavior: the set of queries that match many documents, but which we don't guess are expensive before we issue them.
I don't want to spend a ton of time getting this perfect, since these queries are almost always useless anyway: in almost all cases, no response to a query matching 185,000 documents is particularly useful.