Search repositories using a dedicated index for performance (and SVN support)
Open, Low
Public

Assigned To

None

Authored By

	jcdoll
	Mar 6 2015, 3:26 AM

Description

We currently support repository text search in git and hg by running git grep and hg locate (or similar) against the repositories on disk.

This isn't especially fast (although it actually looks like we're spending 90% of the time on the page highlighting the search results), and doesn't work at all for SVN because there's no local repository on disk.

To improve performance and support SVN, we need to build an index of the codebase and implement our own search. However, the only thing we ship with which can reasonably do this is MySQL, and running LIKE queries is probably not going to be especially fast, at least for large repositories.

The state of the art here seems to involve building a trigram index (e.g., http://swtch.com/~rsc/regexp/regexp4.html), as in Etsy's recently released code search tool Hound (https://github.com/etsy/Hound). We may be able to implement this algorithm in PHP, put the index in MySQL, and get reasonable-enough performance for a range of repositories.
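To make the trigram idea concrete, here is a minimal in-memory sketch in Python (not the PHP/MySQL design discussed above, and not how Hound is implemented). It handles plain substring queries only, not regular expressions, and the names `TrigramIndex`, `add`, and `search` are hypothetical. The index maps each three-character sequence to the files containing it; a query intersects the posting lists for its trigrams, then confirms each candidate with a real substring check, because a file can contain every trigram without containing the query itself.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical in-memory trigram index over file contents."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of paths
        self.documents = {}               # path -> file content

    def add(self, path, content):
        self.documents[path] = content
        for gram in trigrams(content):
            self.postings[gram].add(path)

    def search(self, query):
        grams = trigrams(query)
        if not grams:
            # Queries shorter than three characters cannot use the index,
            # so every document is a candidate.
            candidates = set(self.documents)
        else:
            # A matching file must contain every trigram of the query.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # The index only narrows the candidates; verify each one.
        return sorted(p for p in candidates if query in self.documents[p])
```

In a MySQL-backed version, the posting lists would presumably become rows in a table keyed by trigram, with the verification step run only against the candidate files.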

The more brute-force solution (keep the whole thing in memory and use a bunch of threads to search it, possibly across a bunch of machines) is straightforward, but we don't ship with anything we could reasonably build such a daemon out of, so it implies adding more dependencies.

I think we'll likely:

make code search engines modular;
implement a LIKE-based MySQL brute-force engine;
if it's excessively terrible, evaluate adding a trigram index and see if we get anything out of that;
support binding to some external index like Hound, or a custom in-memory thing (Facebook had a custom service, and we've heard from at least one other company with a custom service);
maybe some day ship a C/Java/Go-based in-memory index on a dedicated daemon.
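The first two items in this plan could look roughly like the following sketch. It is written in Python with SQLite standing in for MySQL so that it is self-contained; the `SearchEngine` interface, the `LikeSearchEngine` class, the `file_content` table, and the `ENGINES` registry are all hypothetical names, not existing Phabricator code. Note that the query has to be escaped so that `%` and `_` in the search text are not treated as LIKE wildcards.

```python
import sqlite3
from abc import ABC, abstractmethod


class SearchEngine(ABC):
    """Hypothetical interface that every code search engine implements."""

    @abstractmethod
    def index_file(self, path, content):
        """Add or replace one file in the index."""

    @abstractmethod
    def search(self, query):
        """Return the sorted paths of files containing query."""


class LikeSearchEngine(SearchEngine):
    """Brute-force engine: one row per file, substring match via LIKE."""

    def __init__(self):
        # SQLite is an assumption made for this sketch; the task proposes MySQL.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE file_content (path TEXT PRIMARY KEY, content TEXT)"
        )

    def index_file(self, path, content):
        self.db.execute(
            "INSERT OR REPLACE INTO file_content VALUES (?, ?)", (path, content)
        )

    def search(self, query):
        # Escape the escape character first, then the LIKE wildcards.
        escaped = (
            query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
        )
        rows = self.db.execute(
            "SELECT path FROM file_content "
            "WHERE content LIKE ? ESCAPE '\\' ORDER BY path",
            ("%" + escaped + "%",),
        )
        return [row[0] for row in rows]


# Registry that lets an install choose an engine by name; a trigram engine
# or a binding to an external service would be added here.
ENGINES = {"like": LikeSearchEngine}


def build_engine(name):
    return ENGINES[name]()
```

Every LIKE query with a leading `%` scans the full table, which is why this engine is expected to be slow on large repositories. Case sensitivity also differs between databases (SQLite's LIKE ignores ASCII case by default, while MySQL's behavior depends on the column collation), so a real engine would need to pin that down.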

Revisions and Commits

rP Phabricator
	D12141	rP0efae2858ecc Don't syntax highlight codebase pattern search results

Related Objects

Mentioned In: D20650: Add Ferret support to Paste
T10900: Repository content search with `hg grep` returns matches found in old versions of files
T12974: Upgrading: "Ferret" Fulltext Engine
T12819: InnoDB FULLTEXT appears to fail catastrophically once it reaches a moderate size
T12010: Untangle the Gordian Knot of iterating on Differential/Diffusion/Arcanist
T4610: Implement robots.txt
T7701: Surface Code/File search better in Diffusion

Event Timeline

jcdoll created this task. Mar 6 2015, 3:26 AM

jcdoll assigned this task to kbrownlees.

jcdoll set the priority of this task to Needs Triage.

jcdoll updated the task description.

jcdoll added a project: Diffusion.

jcdoll added subscribers: kbrownlees, epriestley, jcdoll.

chad removed kbrownlees as the assignee of this task. Mar 6 2015, 3:43 AM

chad removed a parent task: T4070: Feature Request: full text search of repositories.

chad removed a subscriber: kbrownlees.

chad added a project: Subversion. Mar 6 2015, 3:54 AM

epriestley merged a task: Restricted Maniphest Task. Mar 6 2015, 12:13 PM

epriestley added subscribers: datr, kbrownlees, cburroughs and 3 others.

epriestley renamed this task from Full text svn search to Search repositories using a dedicated index for performance (and SVN support). Mar 6 2015, 12:15 PM

epriestley triaged this task as Low priority.

epriestley removed a project: Subversion.

epriestley updated the task description. Mar 6 2015, 12:41 PM

We could also implement a very sketchy version of this by maintaining an SVN working copy on disk and running normal grep against it. This is terrible, but maybe only moderately terrible.
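As a rough illustration of the working-copy approach, the sketch below walks a checked-out tree and reports matching lines, much like a recursive grep. It is written in Python for brevity; the function name `grep_working_copy` is hypothetical, and it assumes the working copy is kept up to date by some other process (for example, a periodic `svn update`). The `.svn` metadata directories are skipped so that only real file content is searched.

```python
import os


def grep_working_copy(root, needle):
    """Return (relative path, line number, line) for each matching line."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune SVN metadata directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d != ".svn"]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # Undecodable bytes are replaced rather than raising, so
                # binary files do not abort the search.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            matches.append(
                                (os.path.relpath(path, root), number,
                                 line.rstrip("\n"))
                            )
            except OSError:
                # Unreadable files are skipped.
                continue
    return matches
```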

MarcLindenberg awarded a token. Mar 6 2015, 3:10 PM

epriestley added a revision: D12141: Don't syntax highlight codebase pattern search results. Mar 24 2015, 12:59 PM

epriestley added a commit: rP0efae2858ecc: Don't syntax highlight codebase pattern search results. Mar 24 2015, 7:47 PM

epriestley mentioned this in T7701: Surface Code/File search better in Diffusion. Mar 30 2015, 9:04 PM

joshuaspence added a subscriber: joshuaspence. Mar 30 2015, 9:10 PM

epriestley mentioned this in T4610: Implement robots.txt. Jul 6 2015, 3:24 PM

tolbrino removed a subscriber: tolbrino. Aug 24 2015, 7:34 AM

greggrossmeier added a subscriber: greggrossmeier. Dec 2 2015, 6:53 PM

tycho.tatitscheff awarded a token. Dec 26 2015, 1:18 AM

tycho.tatitscheff added a subscriber: tycho.tatitscheff.

epriestley mentioned this in T12010: Untangle the Gordian Knot of iterating on Differential/Diffusion/Arcanist. Dec 13 2016, 5:06 PM

epriestley mentioned this in T12819: InnoDB FULLTEXT appears to fail catastrophically once it reaches a moderate size. Jun 12 2017, 4:49 PM