
InnoDB FULLTEXT appears to fail catastrophically once it reaches a moderate size
Closed, Resolved (Public)

Description

Placeholder for now, but we've seen two of these so this is likely a real thing.

Revisions and Commits

Restricted Differential Revision
rPHU libphutil: D18492
rP Phabricator (Closed): D18594, D18593, D18592, D18590, D18589, D18588, D18587, D18586, D18581, D18580, D18579, D18576, D18573, D18572, D18569, D18568, D18567, D18566, D18565, D18564, D18556, D18555, D18554, D18553, D18552, D18551, D18550, D18548, D18547, D18544, D18540, D18539, D18536, D18534, D18533, D18513, D18503, D18502, D18500, D18499, D18498, D18497, D18487, D18484

Event Timeline


Unclear if building a search engine by just doing a lot of JOINs actually scales or not. Seems OK here.
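
As a rough sketch of the query shape this implies (assuming trigrams and the fngrams schema shown later in this task; the document-side table and column names here are hypothetical, and the real queries the Ferret engine generates will differ in detail):

/* Each trigram of the search term becomes one JOIN against the ngram
   table, narrowing the candidate set before a final LIKE pass verifies
   an actual substring match. "lorem" yields the trigrams lor/ore/rem. */
SELECT doc.id
  FROM phriction_document doc            /* hypothetical document table */
  JOIN phriction_document_fngrams ng0
    ON ng0.documentID = doc.id AND ng0.ngram = 'lor'
  JOIN phriction_document_fngrams ng1
    ON ng1.documentID = doc.id AND ng1.ngram = 'ore'
  JOIN phriction_document_fngrams ng2
    ON ng2.documentID = doc.id AND ng2.ngram = 'rem'
 WHERE doc.content LIKE '%lorem%'        /* hypothetical corpus column */
 ORDER BY doc.id DESC                    /* ORDER constraints can dominate cost */
 LIMIT 100;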

I generated and indexed 65,000 tasks locally with bin/lipsum. The lipsum corpus is very small (approximately 60 "words") so this "should" be a more stressful test of the index than real data, because almost every task has the same words and the same ngrams (that is, the ngram index isn't very useful for distinguishing between tasks). The query "lorem" matches 28,000 tasks and resolves in ~5-10s for me locally. This drops to ~0.01s if I remove the ORDER constraints and just let MySQL return whatever part of the result set it wants to. This does a pile of I/O but memory doesn't do anything crazy, which is consistent with spending time doing a big grep over the primary index while processing the LIKE constraints, which is what I was hoping for in the worst case.

For uncommon words (e.g., any word not part of the lipsum corpus) the ngram index is effective and results are instantaneous. For sufficiently constrained queries (e.g., adding a priority and a tag so we hit a few hundred results instead of 28,000) the indexes seem to get used sensibly and results are also instantaneous.

The actual indexes are fairly large (3GB for ~500MB of task data, although lipsum tasks have an exceptionally large amount of text).

None of this is perfect, but it all seems better than InnoDB fulltext in terms of having mostly reasonable behaviors and not completely exploding for no reason.

I briefly hit a bizarre case where a Ferret engine query took 10 seconds to find a document in 163 projects. However, running ANALYZE TABLE on the ngrams table resolved this completely. I suspect the ngrams join may require some tweaking (and maybe a bin/storage analyze). Analyzing the Maniphest table actually seems to improve performance by ~50% too (???) although that's not a very scientific measurement.

Here's a closer look at what's probably happening:

mysql> show indexes from phriction_document_fngrams;
+----------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table                      | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| phriction_document_fngrams |          0 | PRIMARY    |            1 | id          | A         |        3560 |     NULL | NULL   |      | BTREE      |         |               |
| phriction_document_fngrams |          1 | key_ngram  |            1 | ngram       | A         |        3560 |     NULL | NULL   |      | BTREE      |         |               |
| phriction_document_fngrams |          1 | key_ngram  |            2 | documentID  | A         |        3560 |     NULL | NULL   |      | BTREE      |         |               |
| phriction_document_fngrams |          1 | key_object |            1 | documentID  | A         |        3560 |     NULL | NULL   |      | BTREE      |         |               |
+----------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
4 rows in set (0.00 sec)

mysql> analyze table phriction_document_fngrams;
+--------------------------------------------+---------+----------+----------+
| Table                                      | Op      | Msg_type | Msg_text |
+--------------------------------------------+---------+----------+----------+
| local_phriction.phriction_document_fngrams | analyze | status   | OK       |
+--------------------------------------------+---------+----------+----------+
1 row in set (0.01 sec)

mysql> show indexes from phriction_document_fngrams;
+----------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table                      | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| phriction_document_fngrams |          0 | PRIMARY    |            1 | id          | A         |        3560 |     NULL | NULL   |      | BTREE      |         |               |
| phriction_document_fngrams |          1 | key_ngram  |            1 | ngram       | A         |        3560 |     NULL | NULL   |      | BTREE      |         |               |
| phriction_document_fngrams |          1 | key_ngram  |            2 | documentID  | A         |        3560 |     NULL | NULL   |      | BTREE      |         |               |
| phriction_document_fngrams |          1 | key_object |            1 | documentID  | A         |          46 |     NULL | NULL   |      | BTREE      |         |               |
+----------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
4 rows in set (0.00 sec)

Note that the cardinality estimate for the key_object index on documentID drops from 3560 (inaccurately estimating a unique value for each row) to 46 (more accurately estimating a unique value for each document) after analysis.
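
As a sanity check along these lines, a sketch of how to inspect and refresh these estimates by hand (information_schema exposes the same cardinality figures as SHOW INDEXES; the schema name matches the local_phriction database shown above):

-- List the optimizer's cardinality estimates for the Ferret ngram indexes;
-- a key_object cardinality near the table's total row count suggests the
-- stale statistics shown above.
SELECT table_name, index_name, column_name, cardinality
  FROM information_schema.statistics
 WHERE table_schema = 'local_phriction'
   AND table_name LIKE '%fngrams';

-- Refresh the estimates; this is what resolved the slow query here.
ANALYZE TABLE phriction_document_fngrams;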

All of the search that was previously driven by InnoDB FULLTEXT is now driven by the Ferret engine on this install.

> The actual indexes are fairly large (3GB for ~500MB of task data, although lipsum tasks have an exceptionally large amount of text).

D18580 improved this somewhat, but the tables are still large. On this install, 319MB of task data produces 1329MB of index data, i.e. the indexes are about 4x larger than the tasks.
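
For reference, a sketch of one way to take this kind of measurement from InnoDB's own size estimates (the schema name local_maniphest is an assumption, following the local_phriction naming shown above):

-- Approximate row data vs. index size per table, largest first.
SELECT table_name,
       ROUND(data_length  / 1048576, 1) AS data_mb,
       ROUND(index_length / 1048576, 1) AS index_mb
  FROM information_schema.tables
 WHERE table_schema = 'local_maniphest'
 ORDER BY (data_length + index_length) DESC;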

This is generally fine -- trading away disk space, which is cheap, to get an index which doesn't explode all the time is broadly desirable (that is, any time we can trade disk space for basically any other benefit, we're probably eager to do so).

This data is also extremely compressible, and the backup size increases by a much smaller amount (3.6GB before Ferret engine indexing to 3.9GB afterwards on this install). Much of the increase is index data, not row data, which MySQL constructs implicitly and which is thus not present explicitly in the backup at all. The largest tables (like the changeset data) also aren't indexed and don't expand, so even at runtime this is far from a 4x increase in size. Finally, we'll be able to drop the old InnoDB tables later, which will reclaim some space, although they are not exceptionally large.

However, this presents a possible deployment problem for the Phacility cluster, as some of the older shards may not have the free space to accommodate this increase in data size (see T12932). We're also already populating these indexes for new objects, and thus rocketing toward our doom in some sense.

I expect to complete T12932 this week since that's the lowest hanging change, but the actual deployment schedule for the Ferret engine may end up being complicated by this: I don't want to unprototype it if the cluster isn't prepared to upgrade into it. We have two well-defined pathways available to resolve this by increasing the available space (either moving instances to separate shards as in T12817, or upgrading the volume) but both involve a meaningful amount of downtime for instances on those shards so I'd prefer to find less disruptive solutions if they're reasonably available.

Per T12932, I don't think that's an especially promising way forward, so I'm going to take an operations pathway instead (see T12978) and expect to move ahead with unprototyping the Ferret engine here.

epriestley closed this task as Resolved. Edited Sep 15 2017, 5:16 PM

This has been promoted to stable. See T12974 for upgrade guidance. See T12985 for one followup change. This may still have some bugs (or it may scale very differently on real-world workloads and need substantial additional work), but I think we can handle those separately.