Ferret may exhaust AUTO_INCREMENT ID space of "ngrams" table after many reindexes
Closed, Resolved, Public

Description

See PHI1934. An active install reports imminent ID space exhaustion of the revision_fngrams and commit_fngrams tables.

When Phabricator indexes a Ferret document, it completely deletes the old document and then inserts an entirely new document. This is simple to implement, but if a document has, say, 10K ngrams, each reindex consumes 10K AUTO_INCREMENT ID slots. Under reasonable use, this may eventually reach the maximum value for an unsigned 32-bit AUTO_INCREMENT ID, which is roughly 4.29B (2^32 - 1).
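To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes the ID column is an unsigned 32-bit integer and that every reindex burns one new ID per ngram; the function name is made up for illustration and is not part of Phabricator.

```python
# Largest value an unsigned 32-bit AUTO_INCREMENT column can hold (assumed column type).
MAX_UNSIGNED_INT32 = 2**32 - 1  # 4,294,967,295


def reindexes_until_exhaustion(ngrams_per_reindex, max_id=MAX_UNSIGNED_INT32):
    """Return how many full delete-and-reinsert reindexes fit in the ID space.

    Hypothetical helper: assumes each reindex consumes exactly
    `ngrams_per_reindex` fresh IDs and that IDs are never reused.
    """
    return max_id // ngrams_per_reindex


# Table-wide budget for documents averaging 10K ngrams each.
print(reindexes_until_exhaustion(10_000))  # 429496
```

The budget is shared by every document in the table, so an install with many large, frequently edited documents can spend it much faster than the per-document figure suggests.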

The ID is not meaningful and not referenced elsewhere: it just has to be unique, and is only used to make table manipulation easier. So possible solutions include things like:

  1. Make the column 64-bit.
  2. Remove the column entirely and rewrite any code which uses it (this isn't much code, and may be no code at all).
  3. As an immediate remedy, defragment the table. (I wrote a script for this in PHI1934.)
  4. Change the reindex logic to selectively insert/delete instead of just throwing everything out.

I'm inclined to pursue (4) here since I think it's likely fairly simple and has the fewest entanglements with other things.
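As a rough illustration of option (4), the sketch below computes which ngrams to insert and which to delete by comparing the old and new ngram sets, so unchanged rows keep their existing IDs. It is written in Python for brevity and is not the actual Phabricator change; the function name and the idea of passing ngrams as plain string lists are assumptions.

```python
def plan_ngram_update(old_ngrams, new_ngrams):
    """Return (to_insert, to_delete) for a selective reindex.

    Hypothetical helper: `old_ngrams` are the ngrams currently stored for a
    document and `new_ngrams` are the ngrams of its updated text. Only the
    difference touches the table, so only `to_insert` consumes new IDs.
    """
    old = set(old_ngrams)
    new = set(new_ngrams)
    to_insert = sorted(new - old)  # ngrams that appear only in the new text
    to_delete = sorted(old - new)  # ngrams that no longer appear
    return to_insert, to_delete


# A small edit changes only a few ngrams, so few IDs are consumed.
old = ["fer", "err", "rre", "ret"]
new = ["fer", "err", "rre", "ret", "ets"]
print(plan_ngram_update(old, new))  # (['ets'], [])
```

With this approach, a reindex that changes nothing consumes no IDs at all, whereas the delete-and-reinsert approach consumes one ID per ngram every time.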

Event Timeline

epriestley triaged this task as Normal priority. Nov 19 2020, 8:24 PM
epriestley created this task.

This has stalled for a while because it's moderately expensive to recover from if the updated index logic has a bug: rebuilding all document indexes is expensive, and it's difficult to identify the set of documents that need to be reindexed if a bug is present.

This could be made easier by adding an indexVersion and/or epochIndexed field to the fdocument table. An indexVersion is more robust, but difficult to capture completely because index behavior depends on a large amount of code (extensions, stemmers, etc). Modifying fdocument tables is also a big pain. Still, maybe this is worthwhile.

The existing SearchIndexVersion table (which stores document versions) may reasonably be able to store index versions too. This limits the need to apply changes to fdocument.

If something goes wrong with this, the patch which fixes the problem can now change the indexer version and then all mis-indexed documents can be reindexed with:

$ bin/search index --version 2021-02-16-A
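The selection step behind a version-targeted reindex might look roughly like the Python sketch below. This is only one reading of the command: it assumes each document records the indexer version that last indexed it, and that the flag selects documents whose recorded version matches the given value. The thread does not spell out the exact semantics, and the record layout and object names here are hypothetical.

```python
def documents_to_reindex(index_versions, target_version):
    """Return identifiers of documents last indexed by `target_version`.

    Hypothetical helper: `index_versions` maps a document identifier to the
    indexer version string recorded when it was last indexed.
    """
    return sorted(
        doc_id
        for doc_id, version in index_versions.items()
        if version == target_version
    )


# Made-up records: two documents were indexed by the suspect version.
recorded = {
    "PHID-DREV-aaaa": "2021-02-16-A",
    "PHID-DREV-bbbb": "2021-03-01-A",
    "PHID-CMIT-cccc": "2021-02-16-A",
}
print(documents_to_reindex(recorded, "2021-02-16-A"))
```

The point of recording a version per document is that a buggy indexer release can be repaired by reindexing only the affected documents instead of rebuilding every index.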

I've deployed these changes to secure, so hopefully any issues will present themselves.

(Triskaidekaphobia.)

Nothing new has arisen for a while, so presuming this is resolved.