Ref T8788. Allow querying for files by name. This currently only performs exact string matching. In the future, it would be nice to support partial string matching.
Details
- Reviewers: epriestley
- Group Reviewers: Blessed Reviewers
- Maniphest Tasks: T8788: Allow querying for files by name
Search for a bunch of files by name.
Diff Detail
- Repository: rP Phabricator
- Branch: master
- Lint: Lint Passed
- Unit: Tests Passed
- Build Status: Buildable 8648
  - Build 10017: Run Core Tests
  - Build 10016: arc lint + arc unit
Event Timeline
I don't want to bring this upstream because it doesn't scale and the utility seems very marginal to me.
This install has a small amount of data (~1M files) but %...% queries take ~150-200ms to execute. I'm not immediately sure what the best strategy for providing these kinds of queries is, but I strongly suspect that just issuing LIKE against string columns in main tables isn't it. Some possibilities include:
- Maybe we can reduce the cost of the scan by pulling the data into a separate <id, string> table with just the data we want to LIKE?
- Build an index in MySQL (token/digraph/trigraph?).
- Build an index in the SearchEngine (but JOINs are hard?).
- Build some other sort of dedicated index.
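As a rough illustration of the trigram option in the list above, here is a minimal in-memory sketch in Python. It is not Phabricator code and not the MySQL-backed design under discussion; the `NgramIndex` class and its method names are hypothetical. The idea is to map each three-character substring of a file name to the set of file IDs containing it, intersect those sets for the query's trigrams, and then confirm each candidate with a real substring check.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of lowercase n-character substrings of text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramIndex:
    """Hypothetical in-memory n-gram index over <id, name> pairs."""

    def __init__(self, n=3):
        self.n = n
        self.names = {}                   # file id -> original name
        self.postings = defaultdict(set)  # n-gram -> ids containing it

    def add(self, file_id, name):
        self.names[file_id] = name
        for gram in ngrams(name, self.n):
            self.postings[gram].add(file_id)

    def search(self, term):
        grams = ngrams(term, self.n)
        if not grams:
            # Query shorter than n characters: no n-grams, so scan everything.
            candidates = set(self.names)
        else:
            # Only names containing every n-gram of the query can match.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # Sharing all n-grams is necessary but not sufficient; verify.
        needle = term.lower()
        return sorted(i for i in candidates if needle in self.names[i].lower())


index = NgramIndex()
index.add(1, "screenshot.png")
index.add(2, "shot-list.txt")
index.add(3, "notes.txt")
```

With this data, `index.search("shot")` narrows the candidates to files 1 and 2 without touching file 3. The trade-off is the one raised in the discussion: the index has to be built and kept up to date for every indexed row.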
I'd like to assess approaches, then implement support for a standard approach here before proceeding (e.g., a way to tag columns for submission to a separate LIKE index on DAO objects).
This implementation is also problematic: if a user searches for % or _, the character will be interpreted as a LIKE wildcard rather than literally. See this recent post on the GitHub engineering blog:
http://githubengineering.com/like-injection/
Use the %~ (LIKE substring), %> (LIKE prefix) and %< (LIKE suffix) conversions to safely escape a LIKE clause in qsprintf(), not %s.
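To make the wildcard problem concrete, here is a small self-contained sketch using Python's `sqlite3` module rather than Phabricator's `qsprintf()`. The table name `file`, the helper names, and the sample rows are all hypothetical, and the escaping shown is a generic approach that only approximates what a dedicated LIKE conversion would do.

```python
import sqlite3


def escape_like(term, escape="\\"):
    """Escape LIKE metacharacters so they match literally."""
    return (
        term.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def find_naive(conn, term):
    # Unsafe: % and _ in the user's term act as wildcards.
    rows = conn.execute(
        "SELECT name FROM file WHERE name LIKE ? ORDER BY name",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]


def find_by_substring(conn, term):
    # Safe: metacharacters are escaped, and ESCAPE names the escape character.
    rows = conn.execute(
        "SELECT name FROM file WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        ("%" + escape_like(term) + "%",),
    )
    return [row[0] for row in rows]


def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO file (name) VALUES (?)",
        [("report_2015.txt",), ("report-2015.txt",), ("100%.png",), ("1000.png",)],
    )
    return conn


conn = make_demo_db()
```

Searching for `_` with `find_naive` returns every row, because `_` matches any single character, while `find_by_substring` returns only the name that actually contains an underscore. Note that binding the term as a parameter prevents SQL injection but does nothing about LIKE wildcards, which is why the separate escaping step is needed.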
| File | Line | Comment |
|---|---|---|
| src/applications/files/query/PhabricatorFileSearchEngine.php | 54 | Loose check prevents searching for "0". |
This can be built with proper substring matching using Ngrams now (see D14846 for the implementation) but I'd like to give it some time to settle first because it requires reindexing all files and installs may get grumpy if they have to do that multiple times.
T9979 has more discussion. There are People and Projects implementations of Ngrams forthcoming to swap typeahead stuff over to it, and those might provide better examples (and work out some of the kinks).