Ferret may exhaust AUTO_INCREMENT ID space of "ngrams" table after many reindexes
Closed, Resolved
Public

Assigned To

Authored By

	epriestley
	Nov 19 2020, 8:24 PM

Description

See PHI1934. An active install reports imminent ID space exhaustion of the revision_fngrams and commit_fngrams tables.

When Phabricator indexes a Ferret document, it completely deletes the old document and then inserts an entirely new document. This is simple to implement, but if a document has, say, 10K ngrams, each reindex consumes 10K AUTO_INCREMENT ID slots. Under reasonable use, this may eventually reach the 4.2B maximum value for a 32-bit autoincrement ID.
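
For a rough sense of scale, the arithmetic can be sketched as below. The 10K ngrams-per-document figure is the example from the description. The sketch assumes the ID column is an unsigned 32-bit integer (consistent with the 4.2B figure) and that IDs freed by deletes are not handed out again. The function name is hypothetical.

```python
# Largest value an unsigned 32-bit AUTO_INCREMENT column can hold (about 4.29 billion).
MAX_UINT32_ID = 2**32 - 1


def reindexes_until_exhaustion(ngrams_per_document: int, ids_already_used: int = 0) -> int:
    """Return how many delete-and-reinsert reindexes fit in the remaining ID space.

    Assumes every reindex inserts `ngrams_per_document` fresh rows and that
    IDs released by deleted rows are not reused.
    """
    remaining = MAX_UINT32_ID - ids_already_used
    return max(remaining, 0) // ngrams_per_document


print(reindexes_until_exhaustion(10_000))  # 429496
```

Under these assumptions, a table of 10K-ngram documents supports roughly 429K document reindexes in total, summed across every document in the table, before the ID space runs out.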

The ID is not meaningful and not referenced elsewhere: it just has to be unique, and is only used to make table manipulation easier. So possible solutions include things like:

1. Make the column 64-bit.
2. Remove the column entirely and rewrite any code which uses it (this isn't much code, and may be no code at all).
3. As an immediate remedy, defragment the table. (I wrote a script for this in PHI1934.)
4. Change the reindex logic to selectively insert/delete instead of just throwing everything out.

I'm inclined to pursue (4) here since I think it's likely fairly simple and has the fewest entanglements with other things.
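
A minimal sketch of the selective insert/delete idea, not the actual Phabricator implementation: compare the ngram rows already stored for a document against the ngrams the new version needs, keep the rows that still apply, and only delete or insert the difference. The function name and the row representation (a mapping from row ID to ngram) are assumptions made for illustration.

```python
from collections import Counter


def plan_ngram_update(existing_rows, new_ngrams):
    """Decide which stored rows to delete and which ngrams to insert.

    existing_rows: dict mapping row ID -> ngram currently stored for the document.
    new_ngrams: iterable of ngrams the freshly indexed document should have.
    Returns (ids_to_delete, ngrams_to_insert). Rows not listed for deletion
    are reused as-is, so they consume no new AUTO_INCREMENT IDs.
    """
    needed = Counter(new_ngrams)
    ids_to_delete = []
    for row_id, ngram in sorted(existing_rows.items()):
        if needed[ngram] > 0:
            needed[ngram] -= 1  # an existing row already covers this ngram
        else:
            ids_to_delete.append(row_id)  # the new document no longer has it
    # Whatever is still needed has no existing row and must be inserted.
    ngrams_to_insert = sorted(needed.elements())
    return ids_to_delete, ngrams_to_insert


existing = {1: "abc", 2: "bcd", 3: "cde"}
print(plan_ngram_update(existing, ["abc", "bcd", "def"]))  # ([3], ['def'])
```

In this example the update consumes one new ID instead of three. A document that is reindexed without changes consumes none.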

Revisions and Commits

rP Phabricator
	D21495	rP4f647fb6be2b When updating a Ferret search index document, reuse existing rows where possible
	D21560	rP6703fec3e27d When documents are indexed, record the indexer version (versus the object…

Related Objects

Mentioned In: 2021 Week 8 (Late February)

Event Timeline

epriestley triaged this task as Normal priority. Nov 19 2020, 8:24 PM

epriestley created this task.

epriestley added a revision: D21495: When updating a Ferret search index document, reuse existing rows where possible. Nov 19 2020, 9:36 PM

This has stalled for a while because a bug in the updated index logic would be moderately expensive to recover from: rebuilding all document indexes is costly, and it's difficult to identify the set of documents that need to be reindexed.

This could be made easier by adding an indexVersion and/or epochIndexed field to the fdocument table. An indexVersion is more robust, but difficult to capture completely because index behavior depends on a large amount of code (extensions, stemmers, etc). Modifying fdocument tables is also a big pain. Still, maybe this is worthwhile.

The existing SearchIndexVersion table (which stores document versions) may reasonably be able to store index versions too. This limits the need to apply changes to fdocument.
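
To illustrate why a recorded indexer version makes recovery cheaper, here is a small sketch under stated assumptions: each indexed document has a record holding both the object version and the version of the indexing code that built its index. The record layout, field names, and sample data are all hypothetical and are not Phabricator's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRecord:
    # All field names here are hypothetical, not Phabricator's actual schema.
    object_phid: str      # which document was indexed
    object_version: str   # version of the object content at index time
    indexer_version: str  # version of the indexing code that built the index
    index_epoch: int      # Unix timestamp of when the index was built


def phids_indexed_by(records, indexer_version):
    """Return PHIDs of documents whose index was built by `indexer_version`."""
    return [r.object_phid for r in records if r.indexer_version == indexer_version]


# Made-up sample records for illustration only.
records = [
    IndexRecord("PHID-DREV-aaaa", "v7", "2021-02-16-A", 1613520000),
    IndexRecord("PHID-CMIT-bbbb", "v3", "2021-03-01-A", 1614556800),
    IndexRecord("PHID-DREV-cccc", "v1", "2021-02-16-A", 1613523600),
]
print(phids_indexed_by(records, "2021-02-16-A"))
```

With a record like this, finding the documents built by a buggy indexer is a simple filter on the stored version, so only those documents need to be rebuilt.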

epriestley added a revision: D21560: When documents are indexed, record the indexer version (versus the object version) and index epoch. Feb 16 2021, 11:59 PM

epriestley added a commit: rP4f647fb6be2b: When updating a Ferret search index document, reuse existing rows where possible. Feb 17 2021, 12:09 AM

epriestley added a commit: rP6703fec3e27d: When documents are indexed, record the indexer version (versus the object….

If something goes wrong with this, the patch which fixes the problem can now change the indexer version and then all mis-indexed documents can be reindexed with:

$ bin/search index --version 2021-02-16-A

I've deployed these changes to secure, so hopefully any issues will present themselves.

(Triskaidekaphobia.)

epriestley mentioned this in 2021 Week 8 (Late February). Feb 19 2021, 7:03 PM

Nothing new has arisen for a while, so presuming this is resolved.
