There are a number of situations where users would like to be able to search for "File Name Contains:" or "Package Name Contains:" or similar:
- See T9964 for searching Owners packages by name.
- See D14411 for searching Files by name.
- See T6721 for searching tasks by title (only).
- More than twice in my life, I've wanted to search for Pastes by title.
- We have an existing "Contains Words" query in Maniphest.
We currently have these implementations available to us when a user searches for, say, "orange":
- WHERE Prefix: Issue a WHERE name LIKE "orange%" query.
- WHERE Substring: Issue a WHERE name LIKE "%orange%" query.
- Fulltext Intersect: Query for the term in the fulltext index. Then issue a separate query in the main index, constrained by the PHIDs from the fulltext index.
These implementations all have major downsides and drawbacks:
WHERE Prefix: This is fast, but very limited. It can only find text that starts with the search term. This is not what users expect, and not very useful in most
WHERE Substring: This is slow (already >200ms per query on this host's dataset with 1m files). It has a reasonable level of power, but won't do what users expect with multi-word terms, at least on its own.
Fulltext Intersect: This method currently opens us up to a lot of weird failures when we perform the intersection, is difficult to observe and understand, relies on the fulltext index, is very complex, and mitigating the issues implies greater complexity. For example: if you search for a very common term (say, "task") and then specify very narrow additional constraints (say, in project X, assigned to Y), the results might not include matches if the fulltext query matches more than 1,000 results.
While we can mitigate this through the UI and through smarter infrastructure, the nature of this approach means that if both engines (the fulltext engine and the applicationsearch engine) produce large result lists, we need to intersect them in PHP, and we must hold both result sets in memory to intersect them. This is an upper limit on scalability, and this approach will have poor performance long before we get there.
Broadly, I have three concerns about picking a not-so-good solution here:
- we don't have a clear pathway forward if users hit scalability limits, other than "tell all your users not to use this field", which is extremely silly;
- when we do have a clear pathway forward, it may have different semantics than the approach we might pick today, so the meaning of the field could change, which would be confusing; and
- I don't want to implement a solution which is confusing/surprising on its own (and I think the "prefix" approach is pretty weird and confusing).
Some approaches forward might be:
- Dedicated index tables for text columns which we LIKE "%orange%". I'm not sure if we can meaningfully improve the performance of these queries by letting MySQL issue them against a dedicated table, but the total amount of text to be searched is not particularly large so it could help in theory. This is easy to test, at least.
- Build our own bigram or trigram indexes? A trigram index breaks "orange" into "ora", "ran", "ang", "nge" and stores those in a table, which allows us to reduce the number of rows we're examining. These are somewhat complex and probably do not work well for non-latin languages (at least, without adjustment). I've implemented on once, but generally don't have enough experience to be confident this is a pathway forward. This is generally a simple way to do substring matches quickly, though.
- Build our own token (whole word) indexes? We probably need to dip our toes into the water here for T6740 anyway. This is generally better than trigrams for "fulltext" search, but probably not as good for "filename / package name contains" search. I would expect fulltext search to find "orange" and "orangery" if I search for "oranges", but would not expect "Filename Contains: oranges" to give me "orange.jpg".
My inclination is:
- Try the dedicated table on the files dataset and see how happy MySQL is about that (if it's better but still not great, we might still do this in conjunction with bigrams/trigrams).
- Prototype trigrams and see what the performance looks like.
If one of those give us a good result, I think that gives us a pathway forward for "substring search" ("file name contains", "package name contains"), and we can commit to substring semantics there indefinitely because fulltext semantics probably never make sense.