Build support for ngram indexes for substring searches (e.g., file, paste, package, task titles)
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	epriestley
	Dec 14 2015, 11:50 AM

Description

There are a number of situations where users would like to be able to search for "File Name Contains:" or "Package Name Contains:" or similar:

See T9964 for searching Owners packages by name.
See D14411 for searching Files by name.
See T6721 for searching tasks by title (only).
More than twice in my life, I've wanted to search for Pastes by title.
We have an existing "Contains Words" query in Maniphest.

We currently have these implementations available to us when a user searches for, say, "orange":

WHERE Prefix: Issue a WHERE name LIKE "orange%" query.
WHERE Substring: Issue a WHERE name LIKE "%orange%" query.
Fulltext Intersect: Query for the term in the fulltext index. Then issue a separate query in the main index, constrained by the PHIDs from the fulltext index.

These implementations all have major downsides and drawbacks:

WHERE Prefix: This is fast, but very limited. It can only find text that starts with the search term. This is not what users expect, and not very useful in most

WHERE Substring: This is slow (already >200ms per query on this host's dataset with 1m files). It has a reasonable level of power, but won't do what users expect with multi-word terms, at least on its own.

Fulltext Intersect: This method currently opens us up to a lot of weird failures when we perform the intersection, is difficult to observe and understand, relies on the fulltext index, is very complex, and mitigating the issues implies greater complexity. For example: if you search for a very common term (say, "task") and then specify very narrow additional constraints (say, in project X, assigned to Y), the results might not include matches if the fulltext query matches more than 1,000 results.

While we can mitigate this through the UI and through smarter infrastructure, the nature of this approach means that if both engines (the fulltext engine and the applicationsearch engine) produce large result lists, we need to intersect them in PHP, and we must hold both result sets in memory to intersect them. This is an upper limit on scalability, and this approach will have poor performance long before we get there.

Broadly, I have three concerns about picking a not-so-good solution here:

we don't have a clear pathway forward if users hit scalability limits, other than "tell all your users not to use this field", which is extremely silly;
when we do have a clear pathway forward, it may have different semantics than the approach we might pick today, so the meaning of the field could change, which would be confusing; and
I don't want to implement a solution which is confusing/surprising on its own (and I think the "prefix" approach is pretty weird and confusing).

Some approaches forward might be:

Dedicated index tables for text columns which we LIKE "%orange%". I'm not sure if we can meaningfully improve the performance of these queries by letting MySQL issue them against a dedicated table, but the total amount of text to be searched is not particularly large so it could help in theory. This is easy to test, at least.
Build our own bigram or trigram indexes? A trigram index breaks "orange" into "ora", "ran", "ang", "nge" and stores those in a table, which allows us to reduce the number of rows we're examining. These are somewhat complex and probably do not work well for non-latin languages (at least, without adjustment). I've implemented on once, but generally don't have enough experience to be confident this is a pathway forward. This is generally a simple way to do substring matches quickly, though.
Build our own token (whole word) indexes? We probably need to dip our toes into the water here for T6740 anyway. This is generally better than trigrams for "fulltext" search, but probably not as good for "filename / package name contains" search. I would expect fulltext search to find "orange" and "orangery" if I search for "oranges", but would not expect "Filename Contains: oranges" to give me "orange.jpg".

My inclination is:

Try the dedicated table on the files dataset and see how happy MySQL is about that (if it's better but still not great, we might still do this in conjunction with bigrams/trigrams).
Prototype trigrams and see what the performance looks like.

If one of those give us a good result, I think that gives us a pathway forward for "substring search" ("file name contains", "package name contains"), and we can commit to substring semantics there indefinitely because fulltext semantics probably never make sense.

Revisions and Commits

rP Phabricator
	D14846	rP96fe8c0b83cf Implement basic ngram search for Owners Package names
	D14843	rP23c42486e457 Rename "SearchEngine" to "FulltextStorageEngine"
	D14842	rP99c9df96b4ff Convert all "DocumentIndexers" into "FulltextEngines"
	D14841	rP99bd12b98d18 Lift Conpherence indexing up out of the Fulltext index
	D14837	rPecc3314a2527 Modularize transaction/comment indexing in the FulltextEngine
	D14836	rPaab1574e336e Remove TYPE_SEARCH_DIDUPDATEINDEX event
	D14835	rP02f82c2af5c4 Modularize fulltext indexing of Projects, Subscriptions and Custom Fields
	D14834	rP2447d9bdf270 Begin improving modularity of IndexEngine, add locks
	D14832	rP4bba3fd4c147 Fully modularize DestructionEngine
	D14831	rP674388ce6aee Prepare DestructionEngine to be modularized

Related Objects
Search...

Status	Assigned	Task
Resolved	epriestley	T12974 Upgrading: "Ferret" Fulltext Engine
Resolved	epriestley	T6721 Would like ability to search tasks by title
Resolved	epriestley	T9979 Build support for ngram indexes for substring searches (e.g., file, paste, package, task titles)

Event Timeline

epriestley created this task.Dec 14 2015, 11:50 AM

epriestley claimed this task.

epriestley raised the priority of this task from to Normal.

epriestley updated the task description. (Show Details)

epriestley changed the edit policy from "All Users" to "Community (Project)".

epriestley added projects: ApplicationSearch, Files, Owners.

epriestley added subscribers: epriestley, nickz, joshuaspence.

Thanks for thinking through all of these! Let me know if there is anything I can help with.

epriestley updated the task description. (Show Details)Dec 16 2015, 9:48 PM

Try the dedicated table on the files dataset and see how happy MySQL is about that (if it's better but still not great, we might still do this in conjunction with bigrams/trigrams).

MySQL doesn't really seem any happier about putting this stuff in a dedicated table. I tried a dedicated utf8mb4_general_ci table, a dedicated utf8mb4_bin table, and a dedicated varbinary table, but they all appear to have performance roughly comparable to the performance of the main file table.

I'll try prototyping a trigram index next.

I created an ngrams table like this:

CREATE TABLE `ngrams` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `objectID` int(10) unsigned NOT NULL,
  `ngram` char(3) COLLATE utf8mb4_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `objectID` (`objectID`),
  KEY `ngram` (`ngram`,`objectID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

...and I'm filling it in with this script:

<?php

require_once 'scripts/__init_script__.php';

$ngram_table = 'ngrams';

$table = new PhabricatorFile();
$conn_w = $table->establishConnection('w');
foreach (new LiskMigrationIterator($table) as $file) {
  $name = $file->getName();
  $id = $file->getID();

  $ngrams = get_ngrams($name);

  $visible = array();
  foreach ($ngrams as $ngram) {
    $visible[] = '<'.$ngram.'>';
  }

  echo "{$name} -> ".implode(', ', $visible)."\n";

  $sql = array();
  foreach ($ngrams as $ngram) {
    $sql[] = qsprintf(
      $conn_w,
      '(%d, %s)',
      $file->getID(),
      $ngram);
  }

  queryfx(
    $conn_w,
    'DELETE FROM %T WHERE objectID = %d',
    $ngram_table,
    $id);

  if (!$sql) {
    continue;
  }

  queryfx(
    $conn_w,
    'INSERT INTO %T (objectID, ngram) VALUES %Q',
    $ngram_table,
    implode(', ', $sql));
}

function get_ngrams($str) {
  if (!strlen($str)) {
    return array();
  }

  $str = phutil_utf8_strtolower($str);
  $str = preg_replace('/ +/', ' ', $str);
  $str = ' '.$str.' ';
  $str = phutil_utf8v($str);

  $ngrams = array();
  $count = (count($str) - 2);
  for ($ii = 0; $ii < $count; $ii++) {
    $ngram = $str[$ii].$str[$ii + 1].$str[$ii + 2];
    $ngrams[$ngram] = $ngram;
  }

  return $ngrams;
}

While that populates, I'll write a little query script.

Okay, here's a query script:

<?php

require_once 'scripts/__init_script__.php';

$ngram_table = 'ngrams';

$table = new PhabricatorFile();
$conn_r = $table->establishConnection('r');

$query = $argv[1];

$ngrams = get_ngrams($query);

$joins = array();
$idx = 0;
foreach ($ngrams as $ngram) {
  $alias = 'ngram_'.$idx++;
  $joins[] = qsprintf(
    $conn_r,
    'JOIN %T %T ON %T.objectID = file.id AND %T.ngram = %s',
    $ngram_table,
    $alias,
    $alias,
    $alias,
    $ngram);
}

$full_ngrams = qsprintf(
  $conn_r,
  'SELECT file.id FROM file %Q
    WHERE name LIKE %~
    GROUP BY file.id',
  implode(' ', $joins),
  $query);

$full_like = qsprintf(
  $conn_r,
  'SELECT file.id FROM file
    WHERE name LIKE %~',
    $query);

$queries = array(
  'ngram' => $full_ngrams,
  'like' => $full_like,
);

$results = array();
foreach ($queries as $key => $raw_query) {
  echo ">>> ".$raw_query."\n";

  $t_start = microtime(true);
  $rows = queryfx_all($conn_r, '%Q', $raw_query);
  $t_end = microtime(true);

  $rows = ipull($rows, 'id');
  sort($rows);
  $results[$key] = array(
    'time' => ($t_end - $t_start),
    'rows' => $rows,
  );
}


if ($results['ngram']['rows'] !== $results['like']['rows']) {
  var_dump($results);
  throw new Exception('Different results!');
}

echo "Got ".count($results['ngram']['rows'])." results.\n";
echo "Same results.\n";
echo "LIKE Time: ".((int)(1000000 * $results['like']['time']))."us\n";
echo "NGRAM Time: ".((int)(1000000 * $results['ngram']['time']))."us\n";


function get_ngrams($str) {
  if (!strlen($str)) {
    return array();
  }

  $str = phutil_utf8_strtolower($str);
  $str = preg_replace('/ +/', ' ', $str);
//  $str = ' '.$str.' ';
  $str = phutil_utf8v($str);

  $ngrams = array();
  $count = (count($str) - 2);
  for ($ii = 0; $ii < $count; $ii++) {
    $ngram = $str[$ii].$str[$ii + 1].$str[$ii + 2];
    $ngrams[$ngram] = $ngram;
  }

  return $ngrams;
}

Results seem pretty promising:

$ php -f query.php rainbow
>>> SELECT file.id FROM file JOIN `ngrams` `ngram_0` ON `ngram_0`.objectID = file.id AND `ngram_0`.ngram = 'rai' JOIN `ngrams` `ngram_1` ON `ngram_1`.objectID = file.id AND `ngram_1`.ngram = 'ain' JOIN `ngrams` `ngram_2` ON `ngram_2`.objectID = file.id AND `ngram_2`.ngram = 'inb' JOIN `ngrams` `ngram_3` ON `ngram_3`.objectID = file.id AND `ngram_3`.ngram = 'nbo' JOIN `ngrams` `ngram_4` ON `ngram_4`.objectID = file.id AND `ngram_4`.ngram = 'bow'
    WHERE name LIKE '%rainbow%'
    GROUP BY file.id
>>> SELECT file.id FROM file
    WHERE name LIKE '%rainbow%'
Got 23 results.
Same results.
LIKE Time: 77180us
NGRAM Time: 1049us

So it's almost 100x faster for a "reasonable" query like that.

We have a lot of files named things like profile.jpg, and tokens from those files are the most common ones in the index:

mysql> select ngram, count(*) N from ngrams group by ngram order by N desc limit 10;
+-------+-------+
| ngram | N     |
+-------+-------+
| ile   | 76440 |
| fil   | 76352 |
| pro   | 74372 |
| rof   | 73567 |
| ofi   | 73491 |
| .jp   | 68672 |
| jpg   | 68472 |
| pg    | 68281 |
| le.   | 68131 |
| -pr   | 67232 |
+-------+-------+
10 rows in set (0.91 sec)

Performance is roughly 2x worse for these:

$ php -f query.php ile
>>> SELECT file.id FROM file JOIN `ngrams` `ngram_0` ON `ngram_0`.objectID = file.id AND `ngram_0`.ngram = 'ile'
    WHERE name LIKE '%ile%'
    GROUP BY file.id
>>> SELECT file.id FROM file
    WHERE name LIKE '%ile%'
Got 76439 results.
Same results.
LIKE Time: 146003us
NGRAM Time: 280315us

...but can get even worse in pathological cases -- it's 10x worse here:

$ php -f query.php profile.jpg
>>> SELECT file.id FROM file JOIN `ngrams` `ngram_0` ON `ngram_0`.objectID = file.id AND `ngram_0`.ngram = 'pro' JOIN `ngrams` `ngram_1` ON `ngram_1`.objectID = file.id AND `ngram_1`.ngram = 'rof' JOIN `ngrams` `ngram_2` ON `ngram_2`.objectID = file.id AND `ngram_2`.ngram = 'ofi' JOIN `ngrams` `ngram_3` ON `ngram_3`.objectID = file.id AND `ngram_3`.ngram = 'fil' JOIN `ngrams` `ngram_4` ON `ngram_4`.objectID = file.id AND `ngram_4`.ngram = 'ile' JOIN `ngrams` `ngram_5` ON `ngram_5`.objectID = file.id AND `ngram_5`.ngram = 'le.' JOIN `ngrams` `ngram_6` ON `ngram_6`.objectID = file.id AND `ngram_6`.ngram = 'e.j' JOIN `ngrams` `ngram_7` ON `ngram_7`.objectID = file.id AND `ngram_7`.ngram = '.jp' JOIN `ngrams` `ngram_8` ON `ngram_8`.objectID = file.id AND `ngram_8`.ngram = 'jpg'
    WHERE name LIKE '%profile.jpg%'
    GROUP BY file.id
>>> SELECT file.id FROM file
    WHERE name LIKE '%profile.jpg%'
Got 64650 results.
Same results.
LIKE Time: 144378us
NGRAM Time: 1389069us

However, if I make the query a little more realistic and add ORDER BY file.id DESC LIMIT 100 we start getting very good results even for common queries. Here's roughly what I'm seeing, with some weird queries to avoid hitting the query cache:

Query	Like	Ngram	Speedup
epriest	959ms	1ms	~1000x
rofile.jpg	1ms	5ms	5x SLOWER
.gif	28ms	6ms	~5x
screenshot	155ms	16ms	~10x
halifax	91ms	1ms	100x
woiwqehfqnewlff	100ms	1ms	100x
engine.php	161ms	3m	50x
supercalafragalistic...	...expialadocious-pneumonoultramicroscopicsillicovulcanioconosis	Error, MySQL can only use 61 tables in a join	lol

So this looks faster in every case except totally degenerate ones, where it's still really fast, just slightly slower than the other method. I need to add some kind of limit so we only join a maximum of a dozen or so ngrams but it looks like this is generally a viable approach.

Upshot is that this seems very promising to me. I think the pathway forward here is something like:

As general infrastructure cleanup, I would like to split the existing SearchEngine and DestructionEngine components apart into "Extensions", following the ApplicationSearchEngine and EditEngine models. (I'd also like to rename SearchEngine to FulltextEngine or something so that SearchEngine as a term has less ambiguity between ApplicationSearchEngine and SearchEngine.) Specifically, both SearchEngine and DestructionEngine currently have big instanceof blocks where they test for capabilities, and those should get modularized into "Extension" classes so third-party code can add new indexing and destruction effects.

As part of this cleanup, we should also fix T9890.

After things are cleaned up, we can add ngram indexes to objects (pastes, files, tasks, packages) and have the search/fulltext engine regenerate them during normal indexing (using new extension stuff), then teach the Query stuff and ApplicationSearch about them. The 61-table limit is maybe going to get a bit weird eventually (we may already join a significant number of tables to do custom field queries) but if we just limit queries to, say, 16 ngrams per field I think we should be fine. No current objects have more than one ngram-indexable field anyway.

Maybe IndexEngine is a better name than FulltextEngine.

epriestley renamed this task from Plan the future of string-within-string searches to Build support for ngram indexes for substring searches (e.g., file, paste, package, task titles).Dec 17 2015, 4:40 PM

epriestley added a parent task: T6721: Would like ability to search tasks by title.

epriestley mentioned this in T10010: Implement Subprojects.Dec 19 2015, 5:19 PM

johnny-bit added a subscriber: johnny-bit.Dec 19 2015, 5:46 PM

epriestley added a revision: D14831: Prepare DestructionEngine to be modularized.Dec 20 2015, 3:54 PM

epriestley added a revision: D14832: Fully modularize DestructionEngine.Dec 20 2015, 4:33 PM

epriestley mentioned this in D14828: Add basic NUX support to SearchEngines.Dec 20 2015, 5:02 PM

epriestley added a revision: D14834: Begin improving modularity of IndexEngine, add locks.Dec 21 2015, 1:50 PM

epriestley added a revision: D14835: Modularize fulltext indexing of Projects, Subscriptions and Custom Fields.Dec 21 2015, 2:49 PM

epriestley added a revision: D14836: Remove TYPE_SEARCH_DIDUPDATEINDEX event.Dec 21 2015, 2:58 PM

epriestley added a revision: D14837: Modularize transaction/comment indexing in the FulltextEngine.Dec 21 2015, 3:23 PM

epriestley added a revision: D14841: Lift Conpherence indexing up out of the Fulltext index.Dec 21 2015, 5:56 PM

epriestley added a revision: D14842: Convert all "DocumentIndexers" into "FulltextEngines".Dec 21 2015, 6:18 PM

epriestley added a revision: D14843: Rename "SearchEngine" to "FulltextStorageEngine".Dec 21 2015, 6:29 PM

As general infrastructure cleanup, I would like to split the existing SearchEngine and DestructionEngine components apart...
As part of this cleanup, we should also fix T9890.

These should both be done once all the stuff above lands.

The ngram stuff is now unblocked, although I don't have a specific design for it yet. I'm probably going to spend a few minutes on it and see how straightforward it is, but if it looks like a fair amount of work I'm going to skip it for now so I can move T10010 forward.

epriestley added a revision: D14846: Implement basic ngram search for Owners Package names.Dec 21 2015, 9:07 PM

D14846 should fix the package case, but I'm going to leave this open for followup work: Files, Pastes, Maniphest should get indexes; and we should replace the powerful systems in Projects and People.

epriestley added a commit: rP674388ce6aee: Prepare DestructionEngine to be modularized.Dec 22 2015, 1:03 AM

epriestley added a commit: rP4bba3fd4c147: Fully modularize DestructionEngine.

epriestley added a commit: rP2447d9bdf270: Begin improving modularity of IndexEngine, add locks.

epriestley added a commit: rP02f82c2af5c4: Modularize fulltext indexing of Projects, Subscriptions and Custom Fields.

epriestley added a commit: rPaab1574e336e: Remove TYPE_SEARCH_DIDUPDATEINDEX event.Dec 22 2015, 1:24 AM

epriestley added a commit: rPecc3314a2527: Modularize transaction/comment indexing in the FulltextEngine.

epriestley added a commit: rP99bd12b98d18: Lift Conpherence indexing up out of the Fulltext index.

epriestley added a commit: rP99c9df96b4ff: Convert all "DocumentIndexers" into "FulltextEngines".

epriestley added a commit: rP23c42486e457: Rename "SearchEngine" to "FulltextStorageEngine".Dec 22 2015, 1:26 AM

epriestley added a commit: rP96fe8c0b83cf: Implement basic ngram search for Owners Package names.Dec 22 2015, 4:00 PM

epriestley mentioned this in D14411: Allow querying for files by name.Dec 22 2015, 4:14 PM

epriestley mentioned this in Blog Post: Development Notes (2015 Week 52).Dec 26 2015, 2:27 PM

epriestley mentioned this in T10690: Allow badges to be searched by name.Mar 29 2016, 2:24 PM

eadler added a subscriber: eadler.Apr 14 2016, 4:58 AM

epriestley mentioned this in T11675: "Text" type custom field in Maniphest not displayed in Maniphest Advanced Search form.Sep 21 2016, 2:23 PM

chad mentioned this in T11905: Search shouldn't use word boundaries.Nov 21 2016, 10:03 PM

Looking through the code, looks like Maniphest doesn't use ngrams yet? Any reason?

Partly, just haven't gotten there yet.

Maniphest also has an existing "Contains Words" field, and it's maybe-odd to have separate "Title" and "Contains Words" fields, but "Title" would not be a strict superset of "Contains Words". We could index everything to make it a strict superset, but that would change the semantics.

Also generally not sure this is exactly what we want in Maniphest -- exact substring search does not seem like the right search to use for Maniphest. It's possible that a stemmed, ngram-backed search is, after T6740, but I'm not yet sure if ngrams + stemming will mesh cleanly.

Essentially all of the places where we've deployed ngrams so far are places where exact substring search is unambiguously or almost-unambiguously correct. Owners packages are maybe kinda arguable but I think they probably don't often have a lot of wordy English prose in them.

epriestley mentioned this in D17018: Add a `title` field to the `maniphest.query` conduit method.Dec 9 2016, 8:23 PM

I don't think this is quite the same issue, but seems related, so I thought I'd err on the side of not creating a dupe.

The current search seems to have trouble matching full words from the title in some cases. Example from secure (query string typed into global search from home page):

Term: phabricator_maniphest.maniphest_transaction
url: /search/query/b12dYvnVN1rL/#R
Result: empty
Expected: T6771: Create index for transactionType in phabricator_maniphest.maniphest_transaction

Maybe because of the punctuation in the title?

They are not related. That search uses fulltext indexes (see T11922 / T11741 for recent discussion), not ngram indexes.

Not directly related, but it's much easier for us to deal with duplicate tasks by merging (which takes a couple of seconds) than de-duplicate tasks by splitting them (which is much more involved, and leaves unrelated discussion on an unrelated task), so we'd actually prefer if you err on the side of filing duplicates.

Feel free to file a new task if you'd like us to follow up.

ftdysa mentioned this in Z1336: General Chat.Jan 20 2017, 9:43 PM

ftdysa added a subscriber: ftdysa.Jan 20 2017, 10:17 PM

ftdysa mentioned this in T12137: MySQL fulltext search returns no results in boolean mode when a query includes +"x", where "x" is a stopword or short token.Jan 26 2017, 3:19 PM

epriestley moved this task from Backlog to v3 on the Files board.Mar 30 2017, 7:31 PM

I'm going to call this resolved; T9979 is the survivor for Files and the Maniphest case is probably more involved (T6721).

epriestley mentioned this in T8788: Allow querying for files by name.Apr 9 2017, 11:24 AM

epriestley mentioned this in T2632: MyISAM fulltext does not support non-latin languages and we don't warn you about it.Apr 13 2017, 12:18 AM

Build support for ngram indexes for substring searches (e.g., file, paste, package, task titles)Closed, ResolvedPublicActions

Description

Revisions and Commits

Related ObjectsSearch...

Event Timeline

Build support for ngram indexes for substring searches (e.g., file, paste, package, task titles)
Closed, ResolvedPublic
Actions

Related Objects
Search...