Page MenuHomePhabricator

Basic filename search support for Diffusion

Authored by vlada on Jan 29 2014, 1:30 PM.
Referenced Files
F11794357: D8093.diff
Fri, Jun 2, 10:15 PM
Unknown Object (File)
Tue, May 30, 5:41 PM
Unknown Object (File)
Tue, May 30, 12:41 AM
Unknown Object (File)
Fri, May 12, 6:44 AM
Unknown Object (File)
May 1 2023, 11:38 AM
Unknown Object (File)
Apr 13 2023, 12:58 PM
Unknown Object (File)
Apr 5 2023, 12:40 PM
Unknown Object (File)
Apr 5 2023, 4:57 AM
"Grey Medal" token, awarded by epriestley.


Group Reviewers
Blessed Reviewers
Maniphest Tasks
Restricted Maniphest Task
Restricted Diffusion Commit
rP2d27324bef95: Basic filename search support for Diffusion

Ref T156. Adds basic filename search support for Diffusion,
currently only for Git repositories.

This is preliminary, and it's up for discussion:

  • is the UI in the right place;
  • what should the search query syntax be (e.g. whether to put *s in the beginning and end of it);
  • how to best approach it for Mercurial and/or SVN;
  • what's the cleanest result format for lsquery (I went for the minimum necessary change to DiffusionBrowseSearchController).
Test Plan

Browse to a repository in Diffusion, and use both
Search File Names and Search File Content.

Diff Detail

Lint Skipped
Tests Skipped

Event Timeline

  • I think those lint errors are some mixture of the copy of arc you are running being very out of date, and what I'm guessing might be HHVM bugs. Modern arc has a different message for the "Unknown Symbol" error, and no version of arc should raise symbol errors for functions which exist.
  • The UI is a little funky, but it's good enough for now. We can refine that later.
  • The name lsquery is specific to Git. Let's call this diffusion.querypaths instead (I might eventually merge it into diffusion.browsequery, but that can happen later).
  • The Git command doesn't work as intended. git ls-files doesn't take a commit/tree, so this is searching for files matching either "the commit hash" or "path*pattern*". If you search for a file which exists at HEAD, and the working copy has a checkout, this will work OK. However, it will produce incorrect results in other cases, and no results if the working copy is bare (at HEAD in the upstream, modern Phabricator prefers bare repositories). Instead, we should use git ls-tree, which takes a "tree-ish", but passing a commit hash should also work.
  • I can't come up with any reason that a caller would want to handle the git command failing differently than they handle a generic failure. If we don't have an example of why this would be useful, let's let the exception escape (it will be dealt with reasonably at top level). If we come up with a reason later, we can always add a more specific error code back in.
  • Some inlines.
3–5 ↗(On Diff #18311)

You can remove this, Diviner can figure this out on its own shortly.

6 ↗(On Diff #18311)

Rename to diffusion.querypaths as per comment.

10 ↗(On Diff #18311)

Wrap this in pht().

14 ↗(On Diff #18311)

More precisely, this should be 'list<string>'.

20 ↗(On Diff #18311)

Let's call this 'commit'.

21 ↗(On Diff #18311)

Let's call this 'query' or 'pattern'?

27–38 ↗(On Diff #18311)

Remove special error handling, as per comment: this is fine and simpler as a generic error, unless we can think of something a caller might want to do differently in this specific situation.

57 ↗(On Diff #18311)

Change to ls-tree.

Specifically, I think this will have to look something like:

ls-tree -r -z %s -- %s,

...and then manually doing glob matching.

73 ↗(On Diff #18311)

You can skip this for now, but it's going to be based on hg manifest. I can implement this if you don't really feel like it.

vlada updated this revision to Unknown Object (????).Feb 1 2014, 4:03 PM

git ls-tree, preliminary hg support, address comments

So far it's using regexes instead of globs. It fits the
Enter a regular expression message already on the page, but I
can implement a glob-to-regex transformation if you want.

I adapted the Mercurial support from browsequery, it seems to work.

I'm wondering about efficiency. Would it be worth caching the output of
ls-tree or manifest? I can imagine it taking a long time for a large
enough repo.

Thanks for the linter tips! I tried the modern version of arc and got a different formatting of errors, but it still failed. I've since set up Phabricator in a VM and I think I'm good to go.


Is this the right way to handle delimiters?

Cool, this looks good to me. We could look into caching if this is too slow in practice; I'd guess it will perform well on small-to-moderately-sized repositories and be hard to cache on large repositories without special hardware / software.

One strategy which might be easy is to figure out how to write a runtime-loadable Mercurial extension which can do the prefix filtering inside of hg, so we can run hg manifest <path> and not need to read every path in, process it, and then throw it away. Not sure how much of an impact that would have, though, and obviously it's hg-only.

I can imagine some other strategies here, but most of the approaches I can come up with that would make this fast in the general case and can deal with SVN involve deploying separate indexing servers.


A very small improvement might be to use LinesOfALargeExecFuture and PhutilCallbackFilterIterator to do the filtering lazily. However, I'd guess it will be rare that we can produce results without processing a large amount of the input, and the overhead of another iterator and a callback per line might overwhelm the benefits and end up being slower.


I believe this is correct, yes. A very small improvement on this might be:

  • Do this, and preg_match() it against the empty string. If that returns false exactly, the user has entered an invalid regular expression.
  • At that point, we could either fail with an error, or use preg_quote() to escape all the metacharacters and perform a literal string match.
  • We could also do a preg_match() test first with exactly their input, to let them enter an expression like /quack/i, to match case-insenstively.
  • The only dangerous things they can do are DOS us (which is a hard problem, and there are plenty of other things an attacker can do to achieve the same end) and potentially use the /e flag with preg_replace. We're OK here because preg_match doesn't have any "evaluate arbitrary code" flags.

For the moment, I don't think it's a big deal to produce no results on an invalid regular expression. This is easy to clean up later on, and the vast majority of use cases should be straightforward.

Closed by commit rP2d27324bef95 (authored by @vlada, committed by @epriestley).