
Datasource queries with many "JOIN ... LIKE" can have explosive complexity
Open, Low, Public

Description

In a few cases, notably the Project tokenizer datasource, we currently satisfy queries by issuing a single query with one JOIN ... LIKE clause per search term. For example, if a user types cat dog pig bee into a tokenizer, we issue a query like this:

SELECT * FROM project p
  JOIN project_datasourcetoken t1 ON t1.projectID = p.id AND t1.token LIKE 'cat%'
  JOIN project_datasourcetoken t2 ON t2.projectID = p.id AND t2.token LIKE 'dog%'
  JOIN project_datasourcetoken t3 ON t3.projectID = p.id AND t3.token LIKE 'pig%'
  JOIN project_datasourcetoken t4 ON t4.projectID = p.id AND t4.token LIKE 'bee%'
  ...

In a couple of cases, we've seen installs report that this query takes thousands of years to run (see PHI47).

In D18506, I artificially limited this specific query (to join a maximum of 5 terms) as a workaround.

The problem cases involve very long searches with 40+ terms. I haven't been successful in reproducing these issues locally.
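
If someone can reproduce this (or run diagnostics on an affected install), a plain EXPLAIN on the generated query would show which join order and row estimates the optimizer settles on. This is just generic MySQL diagnostics, not something the datasource code emits; a two-term version of the query above:

EXPLAIN
SELECT * FROM project p
  JOIN project_datasourcetoken t1 ON t1.projectID = p.id AND t1.token LIKE 'cat%'
  JOIN project_datasourcetoken t2 ON t2.projectID = p.id AND t2.token LIKE 'dog%';

On modern MySQL, EXPLAIN FORMAT=JSON additionally includes the optimizer's cost estimates.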

Some possible causes:

  • This might just be bad key cardinality, resolved by ANALYZE TABLE, as with ngrams in T12819 (see the check sketched after this list). If so, this is probably effectively solved already by the new behavior of bin/storage upgrade (to auto-analyze all tables on every upgrade).
  • This might be a legitimate sort of badness in MySQL with LIKE, or with a large enough number of joins (ngrams never use more than 16). Either could explain why the similar query pattern in ngrams is fine.
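
As a rough check on the cardinality theory, an affected install could compare the index statistics MySQL holds for the token table against the true distinct-token count, then refresh them. This is generic MySQL, and it is essentially what the new bin/storage upgrade behavior automates:

-- The Cardinality column here is the optimizer's estimate, not an exact count.
SHOW INDEXES FROM project_datasourcetoken;

-- Exact number of distinct tokens, for comparison.
SELECT COUNT(DISTINCT token) FROM project_datasourcetoken;

-- Refresh the statistics; auto-analyzing on upgrade does this for every table.
ANALYZE TABLE project_datasourcetoken;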

Some possible solutions:

  • Maybe just revert D18506 if we come to believe this is a key cardinality problem.
  • Maybe use something more like ngrams if we believe this is a problem with LIKE.
  • Maybe write a nameTokens sort of field to each object, similar to how "terms" are denormalized by Ferret, and build the query with a lot of LIKEs and no JOINs, since none of the underlying tables are ever very large (sketched after this list).
  • Maybe do a combination of things if the major problem is just the number of JOINs.
  • Unlimited joins eventually hit the table-per-query limit anyway (MySQL allows at most 61 tables in a single join).
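
A minimal sketch of the denormalized approach from the nameTokens bullet above, assuming a hypothetical nameTokens column that stores each object's tokens as a single space-separated string with a leading space (no such column exists today):

SELECT * FROM project
  WHERE nameTokens LIKE '% cat%'
    AND nameTokens LIKE '% dog%'
    AND nameTokens LIKE '% pig%'
    AND nameTokens LIKE '% bee%';

The leading-wildcard LIKE can't use an index, but there are no joins for the optimizer to reorder, and since these tables are never very large a scan should stay cheap no matter how many terms the user types.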

None of this is too urgent since the workaround's behavior is almost impossible to distinguish from the correct behavior.

Event Timeline

Followup SHOW INDEXES output from the case in PHI47 showed no evidence of key cardinality issues, so it now seems less likely that this is a cardinality problem.

See also T13102/PHI442. The MySQL setting optimizer_search_depth=0 may fix the weird explosive complexity here.
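
For context, optimizer_search_depth controls how deeply MySQL searches the space of possible join orders while planning a query; the default (62) is effectively exhaustive, which is exactly where planning time can explode as the number of joined tables grows, while 0 tells the server to pick a depth on its own. A minimal sketch of experimenting with it (a server-level tuning knob, not something Phabricator sets):

SHOW VARIABLES LIKE 'optimizer_search_depth';

-- 0 = let the server choose a search depth automatically instead of
-- exhaustively evaluating join orders during planning.
SET GLOBAL optimizer_search_depth = 0;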