Compute an internal content hash of large files using hash-of-hashes
Open, Wishlist, Public
Assigned To

None

Authored By

	epriestley
	Apr 27 2020, 2:47 PM

Description

See PHI1708. Currently, only text changes can generate an "effect hash" for changesets in Differential.

For binary changes (like images), we can reasonably use a content hash of the new state as an effect hash.

However, we don't currently compute a usable content hash for files larger than one chunk. For chunked files, the "contentHash" property is randomly generated and does not reflect the file content.

Since we may receive large files as a series of out-of-order chunks (e.g., a 10GB file arriving as 2,500 x 4MB chunks) there's currently no place we can pipe the entire content of the file through a hash algorithm to compute a well-known hash (like SHA256).

We could do this in the daemons, but it's not clear that doing an extra 10GB of I/O after uploading a 10GB chunked file is terribly useful. We'd also have to wait for the daemons to actually do this I/O and compute the hash before we could use it as a content hash.

A more reasonable hash to compute during a chunked file "finalize" step is a hash of all the 4MB chunk content hashes. This only requires us to load 2,500 rows from the database for a 10GB file, which will certainly fit in memory and should compute in a very reasonable amount of time. The only real downsides here are:

- third-party clients can't easily compute it for verification; and
- the hash is dependent on the chunk size.

Neither of these seems like a major issue. "Divide the file into 4MB chunks, hash them, then hash all the hashes" isn't that hard for third-party clients to implement, and there's no obvious need for them to do it anyway. The 4MB chunk size never (or very rarely) needs to be changed.
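
As a rough illustration of what a third-party client would have to do, here is a minimal sketch in Python. It assumes details this task does not specify: SHA256 is used for both the per-chunk hashes and the outer hash, and the outer hash is taken over the raw chunk digests concatenated in file order. The name `hash_of_hashes` is hypothetical.

```python
import hashlib
import io

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, the chunk size discussed in this task


def hash_of_hashes(stream, chunk_size=CHUNK_SIZE):
    """Return a hex SHA256 over the per-chunk SHA256 digests of a stream.

    Assumption (not confirmed by the task): the outer hash consumes the raw
    32-byte digest of each chunk, in file order. The stream is expected to be
    a buffered binary stream, so read() only returns short at end of file.
    """
    outer = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Hash the chunk, then feed that digest into the outer hash.
        outer.update(hashlib.sha256(chunk).digest())
    return outer.hexdigest()


# Usage: a tiny in-memory "file" with a deliberately small chunk size.
small = hash_of_hashes(io.BytesIO(b"abcdef"), chunk_size=4)
other = hash_of_hashes(io.BytesIO(b"abcdef"), chunk_size=3)
```

The two results differ even though the content is identical, which is the chunk-size dependence listed above as a downside.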

The steps here are probably:

- Can we store this in contentHash? It seems like we should be able to. Otherwise, add a new realContentHashOMEGALUL field.
- For files that are not chunked, copy the SHA256 contentHash over.
- For other files, when chunks finalize, compute the hash-of-hashes and store it.
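
The last two steps could look roughly like the sketch below, written in Python for illustration only. Everything named here is hypothetical: `finalize_content_hash`, the `(byte_start, chunk_sha256_hex)` row shape, and the `whole_file_sha256` parameter are assumptions, not the actual storage schema. The one detail that matters is that chunks may arrive out of order, so the rows are sorted by offset before hashing.

```python
import hashlib


def finalize_content_hash(chunk_rows, whole_file_sha256=None):
    """Pick the content hash to store when a file is finalized.

    chunk_rows: list of (byte_start, chunk_sha256_hex) tuples, in any order.
    This row shape is an assumption made for the sketch.
    """
    if not chunk_rows:
        # Not chunked: copy the existing SHA256 of the content over.
        return whole_file_sha256

    outer = hashlib.sha256()
    # Sort by byte offset so out-of-order uploads hash identically.
    for _byte_start, chunk_hex in sorted(chunk_rows):
        # Assumption: the outer hash consumes raw digests, not hex text.
        outer.update(bytes.fromhex(chunk_hex))
    return outer.hexdigest()
```

For a 10GB file this touches about 2,500 short rows and no file content, which is why it is cheap enough to run during finalize.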