We currently support repository text search in Git and Mercurial by running git grep and hg grep (or similar) against the repositories on disk.
This isn't especially fast (although it actually looks like 90% of the page time goes to highlighting the search results, not to the search itself), and it doesn't work at all for SVN because there's no local repository on disk.
To improve performance and support SVN, we need to build an index of the codebase and implement our own search. However, the only thing we ship with which can reasonably do this is MySQL, and running LIKE queries is probably not going to be especially fast, at least for large repositories.
The state of the art here seems to involve building a trigram index (e.g., http://swtch.com/~rsc/regexp/regexp4.html), which is the approach taken by Etsy's recently released code search tool Hound (https://github.com/etsy/Hound). It is possible we can implement this algorithm in PHP, put the index in MySQL, and get reasonable-enough performance for a range of repositories.
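For reference, here is a minimal in-memory sketch of the trigram idea for plain substring queries, written in Python only for brevity. The names `trigrams` and `TrigramIndex` are hypothetical, and this is not how Hound or the linked article implement it (those also handle regular expressions and store the index on disk). The point is just that the index narrows the search to candidate files, which then still need to be verified against the actual content.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical in-memory index: trigram -> set of file paths."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.documents = {}  # path -> content, kept for the verification pass

    def add(self, path, content):
        self.documents[path] = content
        for gram in trigrams(content):
            self.postings[gram].add(path)

    def search(self, needle):
        grams = trigrams(needle)
        if not grams:
            # Needles shorter than 3 characters have no trigrams, so the
            # index cannot narrow anything down: fall back to a full scan.
            candidates = set(self.documents)
        else:
            # A file can only contain the needle if it contains every
            # trigram of the needle, so intersect the posting lists.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # Sharing all trigrams does not guarantee a match (they may appear
        # in different places), so confirm each candidate.
        return sorted(p for p in candidates if needle in self.documents[p])
```

In a MySQL-backed version the posting lists would presumably become rows in a (trigram, file) table and the intersection a join or grouped query, but that schema is an assumption and would need to be measured.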
The more brute-force solution (keep the whole thing in memory and use a bunch of threads to search it, possibly across a bunch of machines) is straightforward, but nothing we currently ship with is a reasonable foundation for a daemon like this, so it implies adding more dependencies.
I think we'll likely:
- make code search engines modular;
- implement a LIKE-based MySQL brute-force engine;
- if it's excessively terrible, evaluate adding a trigram index and see if we get anything out of that;
- support binding to some external index like Hound, or a custom in-memory thing (Facebook had a custom service, and we've heard from at least one other company with a custom service);
- maybe some day ship a C/Java/Go-based in-memory index on a dedicated daemon.
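As a rough sketch of what the first two items could look like together: a small engine interface plus a LIKE-based brute-force engine behind it. This is written in Python with the standard-library `sqlite3` module standing in for MySQL purely so the example is runnable; the class names, the `source_file` table, and the method names are all hypothetical, and a real implementation would live in PHP against MySQL.

```python
import sqlite3
from abc import ABC, abstractmethod


class CodeSearchEngine(ABC):
    """Hypothetical interface every search engine would implement."""

    @abstractmethod
    def index_file(self, path, content):
        """Add or update one file in the engine's index."""

    @abstractmethod
    def search(self, needle):
        """Return the sorted paths of files containing needle."""


def escape_like(needle):
    """Escape LIKE wildcards so the needle is matched literally."""
    return (needle.replace("\\", "\\\\")
                  .replace("%", "\\%")
                  .replace("_", "\\_"))


class LikeSearchEngine(CodeSearchEngine):
    """Brute-force engine: store file contents, scan them with LIKE."""

    def __init__(self, connection):
        self.db = connection
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS source_file ("
            "path TEXT PRIMARY KEY, content TEXT NOT NULL)"
        )

    def index_file(self, path, content):
        # INSERT OR REPLACE is SQLite syntax; MySQL would use REPLACE INTO
        # or INSERT ... ON DUPLICATE KEY UPDATE.
        self.db.execute(
            "INSERT OR REPLACE INTO source_file (path, content) VALUES (?, ?)",
            (path, content),
        )

    def search(self, needle):
        pattern = "%" + escape_like(needle) + "%"
        # Case sensitivity of LIKE depends on the database and collation
        # (SQLite is case-insensitive for ASCII by default).
        rows = self.db.execute(
            "SELECT path FROM source_file "
            "WHERE content LIKE ? ESCAPE '\\' ORDER BY path",
            (pattern,),
        )
        return [row[0] for row in rows]
```

Callers would depend only on `CodeSearchEngine`, so a trigram-backed engine or a binding to an external service could later be swapped in without touching them. Usage would look something like:

```python
engine = LikeSearchEngine(sqlite3.connect(":memory:"))
engine.index_file("src/a.php", "$rate = '100%';")
engine.search("100%")
```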