
Search repositories using a dedicated index for performance (and SVN support)
Open, Low, Public

Description

We currently support repository text search in git and hg by running git grep and hg locate (or similar) against the repositories on disk.

This isn't especially fast (although it actually looks like we're spending 90% of the time on the page highlighting the search results), and doesn't work at all for SVN because there's no local repository on disk.

To improve performance and support SVN, we need to build an index of the codebase and implement our own search. However, the only thing we ship with which can reasonably do this is MySQL, and running LIKE queries is probably not going to be especially fast, at least for large repositories.

The state of the art here seems to involve building a trigram index (e.g., http://swtch.com/~rsc/regexp/regexp4.html), which is the approach behind Etsy's recent Hound frontend (https://github.com/etsy/Hound). We may be able to implement this algorithm in PHP, put the index in MySQL, and get reasonable-enough performance for a range of repositories.
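To make the trigram idea concrete, here is a minimal sketch in Python (not PHP, and in memory rather than in MySQL; both are simplifications for illustration). It handles only literal substrings, not the regular expression queries the linked article covers. The names `TrigramIndex`, `trigrams`, `add`, and `search` are hypothetical and do not correspond to anything in Phabricator or Hound.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy in-memory trigram index over (path -> contents) documents."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of paths
        self.documents = {}               # path -> full contents

    def add(self, path, contents):
        self.documents[path] = contents
        for gram in trigrams(contents):
            self.postings[gram].add(path)

    def search(self, needle):
        """Return sorted paths whose contents contain the literal needle."""
        grams = trigrams(needle)
        if not grams:
            # Needle shorter than 3 characters: the index cannot narrow
            # anything, so fall back to scanning every document.
            candidates = set(self.documents)
        else:
            # A document can only match if it contains every trigram.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Trigram overlap is necessary but not sufficient, so confirm
        # each candidate with a real substring check.
        return sorted(p for p in candidates if needle in self.documents[p])


index = TrigramIndex()
index.add("src/a.php", "function renderSearchResults() {}")
index.add("src/b.php", "function renderHeader() {}")
print(index.search("SearchResults"))  # prints ['src/a.php']
```

The point of the index is that the expensive substring check only runs against files that share every trigram with the query, which is usually a small fraction of the repository.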

The more brute-force solution (keep the whole thing in memory and search it with a bunch of threads, possibly across a bunch of machines) is straightforward, but nothing we ship with is a reasonable foundation for a daemon like that, so it implies adding more dependencies.
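For reference, the shape of the brute-force approach is roughly the following. This is a sketch only, assuming the corpus is already split into in-memory shards (plain dicts of path to contents); `search_shard` and `brute_force_search` are hypothetical names. A real daemon would more likely be written in a language with cheaper parallelism, since Python threads will not speed up CPU-bound scanning much.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, needle):
    """Scan one shard (a dict of path -> contents) for a literal needle."""
    return [path for path, contents in shard.items() if needle in contents]


def brute_force_search(shards, needle, max_workers=4):
    """Fan the query out over all shards and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda shard: search_shard(shard, needle), shards)
        return sorted(path for hits in results for path in hits)
```

Spreading shards across machines would replace the thread pool with network calls, but the fan-out and merge structure stays the same.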

I think we'll likely:

  • make code search engines modular;
  • implement a LIKE-based MySQL brute-force engine;
  • if it's excessively terrible, evaluate adding a trigram index and see if we get anything out of that;
  • support binding to some external index like Hound, or a custom in-memory thing (Facebook had a custom service, and we've heard from at least one other company with a custom service);
  • maybe some day ship a C/Java/Go-based in-memory index on a dedicated daemon.
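The first two items in that plan could look something like the sketch below. It is written in Python with SQLite standing in for MySQL so that it runs on its own; the class names, the `source_file` table, and its columns are all hypothetical, not existing Phabricator code. Note that the `ESCAPE` clause and case sensitivity of `LIKE` behave differently in MySQL (backslash handling in string literals, collation-dependent matching), so the query would need adjusting there.

```python
import sqlite3
from abc import ABC, abstractmethod


class CodeSearchEngine(ABC):
    """Hypothetical pluggable engine interface; not a Phabricator class."""

    @abstractmethod
    def index_file(self, path, contents): ...

    @abstractmethod
    def search(self, needle): ...


def escape_like(needle, escape="\\"):
    """Escape LIKE wildcards so the needle is matched literally."""
    # The escape character itself must be handled first.
    for ch in (escape, "%", "_"):
        needle = needle.replace(ch, escape + ch)
    return needle


class LikeSearchEngine(CodeSearchEngine):
    """Brute-force engine: one row per file, searched with LIKE."""

    def __init__(self, connection):
        self.db = connection
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS source_file "
            "(path TEXT PRIMARY KEY, contents TEXT NOT NULL)"
        )

    def index_file(self, path, contents):
        self.db.execute(
            "INSERT OR REPLACE INTO source_file (path, contents) VALUES (?, ?)",
            (path, contents),
        )

    def search(self, needle):
        pattern = "%" + escape_like(needle) + "%"
        # SQLite syntax; the SQL sent is: ... LIKE ? ESCAPE '\'
        rows = self.db.execute(
            "SELECT path FROM source_file "
            "WHERE contents LIKE ? ESCAPE '\\' ORDER BY path",
            (pattern,),
        )
        return [path for (path,) in rows]
```

A trigram engine or a binding to an external service like Hound would then be another subclass of the same interface, so callers would not need to know which one is configured.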

Event Timeline

jcdoll created this task. Mar 6 2015, 3:26 AM
jcdoll assigned this task to kbrownlees.
jcdoll raised the priority of this task from to Needs Triage.
jcdoll updated the task description.
jcdoll added a project: Diffusion.
jcdoll added subscribers: kbrownlees, epriestley, jcdoll.
chad removed kbrownlees as the assignee of this task. Mar 6 2015, 3:43 AM
chad removed a subscriber: kbrownlees.
epriestley merged a task: Restricted Maniphest Task. Mar 6 2015, 12:13 PM
epriestley added subscribers: datr, kbrownlees, cburroughs and 3 others.
epriestley renamed this task from "Full text svn search" to "Search repositories using a dedicated index for performance (and SVN support)". Mar 6 2015, 12:15 PM
epriestley triaged this task as Low priority.
epriestley removed a project: Subversion.
epriestley updated the task description. Mar 6 2015, 12:41 PM

We could also implement a very sketchy version of this by maintaining an SVN working copy on disk and running normal grep against it. This is terrible, but maybe only moderately terrible.

tolbrino removed a subscriber: tolbrino. Aug 24 2015, 7:34 AM
hskiba added a subscriber: hskiba. Sep 18 2017, 3:10 AM
pasik added a subscriber: pasik. Jun 1 2019, 10:53 AM