Page MenuHomePhabricator

Update "diffusion.searchquery" to handle multiple path arguments and various weird cases
Open, NormalPublic

Description

See PHI1330. Currently, diffusion.searchquery performs git grep or hg grep. It accepts an optional path argument. PHI1330 would like it to accept multiple path arguments.

Both git grep and hg grep accept multiple path arguments, so this part isn't problematic technically.

However, there are a lot of weird edge cases possible here that we don't currently handle, and this is probably generally a fairly good test accommodating many of the unusual cases mentioned in PHI13337. In particular, these situations may occur and we probably don't handle them well today:

  • The caller may want to search for a binary pattern.
  • The caller may want to search in a list of binary paths.
  • The caller may want to search an arbitrarily long pathname.
  • The caller may want to search an arbitrarily large number of paths.
  • The grep operation may take arbitrarily long to run.
  • The caller may pass an arbitrarily large offset.
  • The caller may pass an arbitrarily large limit.
  • The pattern may match in files with arbitrarily long path names.
  • The pattern may match on an arbitrarily long line. For example, the pattern xxx may match on a line which is 3GB of the character x. git grep, at least, does not appear to limit the output.
  • The pattern may match on a line with data that can not be represented in JSON.

These cases we handle reasonably well:

  • The grep operation may produce an arbitrarily large number of matches. We use LinesOfALargeExecFuture in all cases to stream the results, although we hold the entire result set in memory until we find offset + limit results today, so this can still problems for a very large offset value.

The solutions here are probably:

  • No searching for binary patterns. Patterns must be representable in JSON.
  • No searching for binary paths. Paths must be representable in JSON.
  • Add a global 4MB request payload limit and reject requests which are larger than 4MB. No searching for paths where the encoding + envelope is larger than 4MB. If you want to search for more than 4MB of total paths, break them into smaller pieces and perform multiple searches. ("4MB" is arbitrary, but basically the file storage engine chunk size limit, which has good properties as a general request size limit.)
  • Add a time limit to the grep operation and a way to communicate to the caller that we aborted the request after reaching the time limit.
  • Possibly, limit the maximum value of offset.
  • Limit the maximum value of limit.
  • Add a limit to LinesOfALarge that discards the streaming buffer after storing some prefix, and a way for callers to test if they are receiving the complete line or a truncated prefix of the line.
  • As we iterate over the LinesOfALarge output, discard results we are not returning (instead of buffering them and discarding them later).
  • Return matches in a "readable" vs "base64" format, or perhaps just accept that this method will only return JSON-approximated prefixes of matches which are suitable for human readers but that callers which are doing something especially insane should fetch the raw file and read the line from it directly. Possibly, we could flag results as "this is exactly the line as it appears in the file" vs "this is an adulterated, human-readable approximation of the line".
  • Likewise, return path names in some "readable" variation with a prefix marker. Probably accept that this method can not usefully identify results in files with exceptionally long path names, although we could technically provide an index into the repository_path table which stores LONGTEXT raw paths for especially ambitious callers working in insane repositories.

Event Timeline

epriestley triaged this task as Normal priority.Jul 12 2019, 5:04 PM
epriestley created this task.
epriestley added a project: Diffusion.
epriestley updated the task description. (Show Details)

See also https://phabricator.wikimedia.org/T197935.

In addition to everything else above, a given line may match a term more than once, and the matches may overlap. In the general case, I suppose that the matches may even be strict substrings of one another, e.g. A.*B matched against AAAB.

Our highlighting of multiple terms currently appears to be broken, and our highlighting of overlapping terms is probably no better.

See also https://discourse.phabricator-community.org/t/diffusion-search-how-to-do-case-insensitive-search/2854/.

These searches are case sensitive, but that's not the best mode for human use. Case insensitive would be a better default. We could provide some kind of flag/mode support, but it's not clear to me that it would be of much use.