Page MenuHomePhabricator

Build support for ngram indexes for substring searches (e.g., file, paste, package, task titles)
Closed, ResolvedPublic

Description

There are a number of situations where users would like to be able to search for "File Name Contains:" or "Package Name Contains:" or similar:

  • See T9964 for searching Owners packages by name.
  • See D14411 for searching Files by name.
  • See T6721 for searching tasks by title (only).
  • More than twice in my life, I've wanted to search for Pastes by title.
  • We have an existing "Contains Words" query in Maniphest.

We currently have these implementations available to us when a user searches for, say, "orange":

  • WHERE Prefix: Issue a WHERE name LIKE "orange%" query.
  • WHERE Substring: Issue a WHERE name LIKE "%orange%" query.
  • Fulltext Intersect: Query for the term in the fulltext index. Then issue a separate query in the main index, constrained by the PHIDs from the fulltext index.

These implementations all have major downsides and drawbacks:

WHERE Prefix: This is fast, but very limited. It can only find text that starts with the search term. This is not what users expect, and not very useful in most

WHERE Substring: This is slow (already >200ms per query on this host's dataset with 1m files). It has a reasonable level of power, but won't do what users expect with multi-word terms, at least on its own.

Fulltext Intersect: This method currently opens us up to a lot of weird failures when we perform the intersection, is difficult to observe and understand, relies on the fulltext index, is very complex, and mitigating the issues implies greater complexity. For example: if you search for a very common term (say, "task") and then specify very narrow additional constraints (say, in project X, assigned to Y), the results might not include matches if the fulltext query matches more than 1,000 results.

While we can mitigate this through the UI and through smarter infrastructure, the nature of this approach means that if both engines (the fulltext engine and the applicationsearch engine) produce large result lists, we need to intersect them in PHP, and we must hold both result sets in memory to intersect them. This is an upper limit on scalability, and this approach will have poor performance long before we get there.


Broadly, I have three concerns about picking a not-so-good solution here:

  • we don't have a clear pathway forward if users hit scalability limits, other than "tell all your users not to use this field", which is extremely silly;
  • when we do have a clear pathway forward, it may have different semantics than the approach we might pick today, so the meaning of the field could change, which would be confusing; and
  • I don't want to implement a solution which is confusing/surprising on its own (and I think the "prefix" approach is pretty weird and confusing).

Some approaches forward might be:

  • Dedicated index tables for text columns which we LIKE "%orange%". I'm not sure if we can meaningfully improve the performance of these queries by letting MySQL issue them against a dedicated table, but the total amount of text to be searched is not particularly large so it could help in theory. This is easy to test, at least.
  • Build our own bigram or trigram indexes? A trigram index breaks "orange" into "ora", "ran", "ang", "nge" and stores those in a table, which allows us to reduce the number of rows we're examining. These are somewhat complex and probably do not work well for non-latin languages (at least, without adjustment). I've implemented on once, but generally don't have enough experience to be confident this is a pathway forward. This is generally a simple way to do substring matches quickly, though.
  • Build our own token (whole word) indexes? We probably need to dip our toes into the water here for T6740 anyway. This is generally better than trigrams for "fulltext" search, but probably not as good for "filename / package name contains" search. I would expect fulltext search to find "orange" and "orangery" if I search for "oranges", but would not expect "Filename Contains: oranges" to give me "orange.jpg".

My inclination is:

  • Try the dedicated table on the files dataset and see how happy MySQL is about that (if it's better but still not great, we might still do this in conjunction with bigrams/trigrams).
  • Prototype trigrams and see what the performance looks like.

If one of those give us a good result, I think that gives us a pathway forward for "substring search" ("file name contains", "package name contains"), and we can commit to substring semantics there indefinitely because fulltext semantics probably never make sense.

Related Objects

Event Timeline

epriestley claimed this task.
epriestley raised the priority of this task from to Normal.
epriestley updated the task description. (Show Details)
epriestley changed the edit policy from "All Users" to "Community (Project)".
epriestley added subscribers: epriestley, nickz, joshuaspence.

Thanks for thinking through all of these! Let me know if there is anything I can help with.

Try the dedicated table on the files dataset and see how happy MySQL is about that (if it's better but still not great, we might still do this in conjunction with bigrams/trigrams).

MySQL doesn't really seem any happier about putting this stuff in a dedicated table. I tried a dedicated utf8mb4_general_ci table, a dedicated utf8mb4_bin table, and a dedicated varbinary table, but they all appear to have performance roughly comparable to the performance of the main file table.

I'll try prototyping a trigram index next.

I created an ngrams table like this:

CREATE TABLE `ngrams` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `objectID` int(10) unsigned NOT NULL,
  `ngram` char(3) COLLATE utf8mb4_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `objectID` (`objectID`),
  KEY `ngram` (`ngram`,`objectID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

...and I'm filling it in with this script:

<?php

require_once 'scripts/__init_script__.php';

$ngram_table = 'ngrams';

$table = new PhabricatorFile();
$conn_w = $table->establishConnection('w');
foreach (new LiskMigrationIterator($table) as $file) {
  $name = $file->getName();
  $id = $file->getID();

  $ngrams = get_ngrams($name);

  $visible = array();
  foreach ($ngrams as $ngram) {
    $visible[] = '<'.$ngram.'>';
  }

  echo "{$name} -> ".implode(', ', $visible)."\n";

  $sql = array();
  foreach ($ngrams as $ngram) {
    $sql[] = qsprintf(
      $conn_w,
      '(%d, %s)',
      $file->getID(),
      $ngram);
  }

  queryfx(
    $conn_w,
    'DELETE FROM %T WHERE objectID = %d',
    $ngram_table,
    $id);

  if (!$sql) {
    continue;
  }

  queryfx(
    $conn_w,
    'INSERT INTO %T (objectID, ngram) VALUES %Q',
    $ngram_table,
    implode(', ', $sql));
}

function get_ngrams($str) {
  if (!strlen($str)) {
    return array();
  }

  $str = phutil_utf8_strtolower($str);
  $str = preg_replace('/ +/', ' ', $str);
  $str = ' '.$str.' ';
  $str = phutil_utf8v($str);

  $ngrams = array();
  $count = (count($str) - 2);
  for ($ii = 0; $ii < $count; $ii++) {
    $ngram = $str[$ii].$str[$ii + 1].$str[$ii + 2];
    $ngrams[$ngram] = $ngram;
  }

  return $ngrams;
}

While that populates, I'll write a little query script.

Okay, here's a query script:

<?php

require_once 'scripts/__init_script__.php';

$ngram_table = 'ngrams';

$table = new PhabricatorFile();
$conn_r = $table->establishConnection('r');

$query = $argv[1];

$ngrams = get_ngrams($query);

$joins = array();
$idx = 0;
foreach ($ngrams as $ngram) {
  $alias = 'ngram_'.$idx++;
  $joins[] = qsprintf(
    $conn_r,
    'JOIN %T %T ON %T.objectID = file.id AND %T.ngram = %s',
    $ngram_table,
    $alias,
    $alias,
    $alias,
    $ngram);
}

$full_ngrams = qsprintf(
  $conn_r,
  'SELECT file.id FROM file %Q
    WHERE name LIKE %~
    GROUP BY file.id',
  implode(' ', $joins),
  $query);

$full_like = qsprintf(
  $conn_r,
  'SELECT file.id FROM file
    WHERE name LIKE %~',
    $query);

$queries = array(
  'ngram' => $full_ngrams,
  'like' => $full_like,
);

$results = array();
foreach ($queries as $key => $raw_query) {
  echo ">>> ".$raw_query."\n";

  $t_start = microtime(true);
  $rows = queryfx_all($conn_r, '%Q', $raw_query);
  $t_end = microtime(true);

  $rows = ipull($rows, 'id');
  sort($rows);
  $results[$key] = array(
    'time' => ($t_end - $t_start),
    'rows' => $rows,
  );
}


if ($results['ngram']['rows'] !== $results['like']['rows']) {
  var_dump($results);
  throw new Exception('Different results!');
}

echo "Got ".count($results['ngram']['rows'])." results.\n";
echo "Same results.\n";
echo "LIKE Time: ".((int)(1000000 * $results['like']['time']))."us\n";
echo "NGRAM Time: ".((int)(1000000 * $results['ngram']['time']))."us\n";


function get_ngrams($str) {
  if (!strlen($str)) {
    return array();
  }

  $str = phutil_utf8_strtolower($str);
  $str = preg_replace('/ +/', ' ', $str);
//  $str = ' '.$str.' ';
  $str = phutil_utf8v($str);

  $ngrams = array();
  $count = (count($str) - 2);
  for ($ii = 0; $ii < $count; $ii++) {
    $ngram = $str[$ii].$str[$ii + 1].$str[$ii + 2];
    $ngrams[$ngram] = $ngram;
  }

  return $ngrams;
}

Results seem pretty promising:

$ php -f query.php rainbow
>>> SELECT file.id FROM file JOIN `ngrams` `ngram_0` ON `ngram_0`.objectID = file.id AND `ngram_0`.ngram = 'rai' JOIN `ngrams` `ngram_1` ON `ngram_1`.objectID = file.id AND `ngram_1`.ngram = 'ain' JOIN `ngrams` `ngram_2` ON `ngram_2`.objectID = file.id AND `ngram_2`.ngram = 'inb' JOIN `ngrams` `ngram_3` ON `ngram_3`.objectID = file.id AND `ngram_3`.ngram = 'nbo' JOIN `ngrams` `ngram_4` ON `ngram_4`.objectID = file.id AND `ngram_4`.ngram = 'bow'
    WHERE name LIKE '%rainbow%'
    GROUP BY file.id
>>> SELECT file.id FROM file
    WHERE name LIKE '%rainbow%'
Got 23 results.
Same results.
LIKE Time: 77180us
NGRAM Time: 1049us

So it's almost 100x faster for a "reasonable" query like that.

We have a lot of files named things like profile.jpg, and tokens from those files are the most common ones in the index:

mysql> select ngram, count(*) N from ngrams group by ngram order by N desc limit 10;
+-------+-------+
| ngram | N     |
+-------+-------+
| ile   | 76440 |
| fil   | 76352 |
| pro   | 74372 |
| rof   | 73567 |
| ofi   | 73491 |
| .jp   | 68672 |
| jpg   | 68472 |
| pg    | 68281 |
| le.   | 68131 |
| -pr   | 67232 |
+-------+-------+
10 rows in set (0.91 sec)

Performance is roughly 2x worse for these:

$ php -f query.php ile
>>> SELECT file.id FROM file JOIN `ngrams` `ngram_0` ON `ngram_0`.objectID = file.id AND `ngram_0`.ngram = 'ile'
    WHERE name LIKE '%ile%'
    GROUP BY file.id
>>> SELECT file.id FROM file
    WHERE name LIKE '%ile%'
Got 76439 results.
Same results.
LIKE Time: 146003us
NGRAM Time: 280315us

...but can get even worse in pathological cases -- it's 10x worse here:

$ php -f query.php profile.jpg
>>> SELECT file.id FROM file JOIN `ngrams` `ngram_0` ON `ngram_0`.objectID = file.id AND `ngram_0`.ngram = 'pro' JOIN `ngrams` `ngram_1` ON `ngram_1`.objectID = file.id AND `ngram_1`.ngram = 'rof' JOIN `ngrams` `ngram_2` ON `ngram_2`.objectID = file.id AND `ngram_2`.ngram = 'ofi' JOIN `ngrams` `ngram_3` ON `ngram_3`.objectID = file.id AND `ngram_3`.ngram = 'fil' JOIN `ngrams` `ngram_4` ON `ngram_4`.objectID = file.id AND `ngram_4`.ngram = 'ile' JOIN `ngrams` `ngram_5` ON `ngram_5`.objectID = file.id AND `ngram_5`.ngram = 'le.' JOIN `ngrams` `ngram_6` ON `ngram_6`.objectID = file.id AND `ngram_6`.ngram = 'e.j' JOIN `ngrams` `ngram_7` ON `ngram_7`.objectID = file.id AND `ngram_7`.ngram = '.jp' JOIN `ngrams` `ngram_8` ON `ngram_8`.objectID = file.id AND `ngram_8`.ngram = 'jpg'
    WHERE name LIKE '%profile.jpg%'
    GROUP BY file.id
>>> SELECT file.id FROM file
    WHERE name LIKE '%profile.jpg%'
Got 64650 results.
Same results.
LIKE Time: 144378us
NGRAM Time: 1389069us

However, if I make the query a little more realistic and add ORDER BY file.id DESC LIMIT 100 we start getting very good results even for common queries. Here's roughly what I'm seeing, with some weird queries to avoid hitting the query cache:

QueryLikeNgramSpeedup
epriest959ms1ms~1000x
rofile.jpg1ms5ms5x SLOWER
.gif28ms6ms~5x
screenshot155ms16ms~10x
halifax91ms1ms100x
woiwqehfqnewlff100ms1ms100x
engine.php161ms3m50x
supercalafragalistic......expialadocious-pneumonoultramicroscopicsillicovulcanioconosisError, MySQL can only use 61 tables in a joinlol

So this looks faster in every case except totally degenerate ones, where it's still really fast, just slightly slower than the other method. I need to add some kind of limit so we only join a maximum of a dozen or so ngrams but it looks like this is generally a viable approach.

Upshot is that this seems very promising to me. I think the pathway forward here is something like:

As general infrastructure cleanup, I would like to split the existing SearchEngine and DestructionEngine components apart into "Extensions", following the ApplicationSearchEngine and EditEngine models. (I'd also like to rename SearchEngine to FulltextEngine or something so that SearchEngine as a term has less ambiguity between ApplicationSearchEngine and SearchEngine.) Specifically, both SearchEngine and DestructionEngine currently have big instanceof blocks where they test for capabilities, and those should get modularized into "Extension" classes so third-party code can add new indexing and destruction effects.

As part of this cleanup, we should also fix T9890.

After things are cleaned up, we can add ngram indexes to objects (pastes, files, tasks, packages) and have the search/fulltext engine regenerate them during normal indexing (using new extension stuff), then teach the Query stuff and ApplicationSearch about them. The 61-table limit is maybe going to get a bit weird eventually (we may already join a significant number of tables to do custom field queries) but if we just limit queries to, say, 16 ngrams per field I think we should be fine. No current objects have more than one ngram-indexable field anyway.

Maybe IndexEngine is a better name than FulltextEngine.

epriestley renamed this task from Plan the future of string-within-string searches to Build support for ngram indexes for substring searches (e.g., file, paste, package, task titles).Dec 17 2015, 4:40 PM

As general infrastructure cleanup, I would like to split the existing SearchEngine and DestructionEngine components apart...
As part of this cleanup, we should also fix T9890.

These should both be done once all the stuff above lands.

The ngram stuff is now unblocked, although I don't have a specific design for it yet. I'm probably going to spend a few minutes on it and see how straightforward it is, but if it looks like a fair amount of work I'm going to skip it for now so I can move T10010 forward.

D14846 should fix the package case, but I'm going to leave this open for followup work: Files, Pastes, Maniphest should get indexes; and we should replace the powerful systems in Projects and People.

Looking through the code, looks like Maniphest doesn't use ngrams yet? Any reason?

Partly, just haven't gotten there yet.

Maniphest also has an existing "Contains Words" field, and it's maybe-odd to have separate "Title" and "Contains Words" fields, but "Title" would not be a strict superset of "Contains Words". We could index everything to make it a strict superset, but that would change the semantics.

Also generally not sure this is exactly what we want in Maniphest -- exact substring search does not seem like the right search to use for Maniphest. It's possible that a stemmed, ngram-backed search is, after T6740, but I'm not yet sure if ngrams + stemming will mesh cleanly.

Essentially all of the places where we've deployed ngrams so far are places where exact substring search is unambiguously or almost-unambiguously correct. Owners packages are maybe kinda arguable but I think they probably don't often have a lot of wordy English prose in them.

I don't think this is quite the same issue, but seems related, so I thought I'd err on the side of not creating a dupe.

The current search seems to have trouble matching full words from the title in some cases. Example from secure (query string typed into global search from home page):

Maybe because of the punctuation in the title?

They are not related. That search uses fulltext indexes (see T11922 / T11741 for recent discussion), not ngram indexes.

Not directly related, but it's much easier for us to deal with duplicate tasks by merging (which takes a couple of seconds) than de-duplicate tasks by splitting them (which is much more involved, and leaves unrelated discussion on an unrelated task), so we'd actually prefer if you err on the side of filing duplicates.

Feel free to file a new task if you'd like us to follow up.

I'm going to call this resolved; T9979 is the survivor for Files and the Maniphest case is probably more involved (T6721).