When we write an edge transaction to the database, we currently write the entire old and new edge lists into the record.
For example, if the old project list was "A, B, C, D" and the new project list is "A, C, D, E", we write the complete lists, for a total of 8 PHIDs. Today, all readers compute changes from the lists anyway, so writing "A", "C" and "D" to both lists is redundant. The transaction behaves identically if we write just "B" as the old value and just "E" as the new value: it will still render as "alice changed projects; added: E; removed: B".
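To illustrate why the shared entries are redundant, here is a minimal sketch in Python (not the actual implementation; `edge_changes` is a hypothetical name) of how a reader might derive "added" and "removed" from the two lists. It gives the same answer for the full lists and for the minimal ones:

```python
def edge_changes(old, new):
    """Return (added, removed) PHIDs, preserving list order."""
    old_set, new_set = set(old), set(new)
    added = [phid for phid in new if phid not in old_set]
    removed = [phid for phid in old if phid not in new_set]
    return added, removed

# Full lists, as written today (8 PHIDs).
full = edge_changes(["A", "B", "C", "D"], ["A", "C", "D", "E"])

# Only the modified PHIDs (2 PHIDs).
minimal = edge_changes(["B"], ["E"])

print(full)     # (['E'], ['B'])
print(minimal)  # (['E'], ['B'])
```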
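This doesn't seem like it should be a very big deal since we're just talking about a handful of extra PHIDs, but some repositories do merges in a way that causes the same commits to be mentioned hundreds and hundreds of times. Because each transaction rewrites both complete lists, the cost of a single write grows with the size of the list: when we add the 101st mention, we're writing 201 PHIDs (100 old plus 101 new). When we add the 1001st mention, we're writing 2001 PHIDs. The total written across all of those transactions therefore grows roughly with the square of the number of mentions. We have one hosted instance with approximately 130GB of "commit X mentioned commit Y" data in the repository_audit.audit_transaction table because of this.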
(This is a test instance which is importing an especially large, well-known repository so no actual work is directly impacted, although other instances on the same shard may be suffering.)
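The arithmetic behind those numbers can be checked with a small sketch. It assumes the simplest case, where every transaction adds exactly one mention and removes none; the function names are made up for illustration:

```python
def phids_written(mention_number):
    """PHIDs written when adding the k-th mention in the full-list format."""
    # The old list holds k - 1 PHIDs and the new list holds k PHIDs.
    return (mention_number - 1) + mention_number

def total_phids_written(mentions):
    """PHIDs written across all transactions up to and including the n-th mention."""
    return sum(phids_written(k) for k in range(1, mentions + 1))

print(phids_written(101))         # 201
print(phids_written(1001))        # 2001
print(total_phids_written(1001))  # 1002001
```

Under the same assumption, writing only the modified PHIDs would cost one PHID per transaction, or 1001 in total for the same history.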
The verbose storage format makes this worse. Each edge in a list is written as a full record, keyed by the PHID at the other end of the edge:
"PHID-CMIT-xxx": { "src":"PHID-CMIT-yyy", "type":"51", "dst":"PHID-CMIT-xxx", "dateCreated":"123456", "seq":"0", "dataID":null, "data":[] }
- We no longer use dataID or data, and these can be removed.
- I think there is no value in writing seq. We don't rely on edge ordering and don't support reordering. If we did in the future, we could treat "no seq data present means it was added at the end" safely.
- There is no value in writing dateCreated, since this is always the same as the transaction date.
- The type is always the same for all transactions in a group, and always present in edge:type metadata.
- src is always the object PHID.
- dst is always the dictionary key (the other end of the edge).
So I think we can write something like this record instead:
{ "dst": [...] }
...where the list contains only the modified PHIDs (the removed PHIDs in the old value, the added PHIDs in the new value). This could probably even just be [...] but if we wrap it in {"dst":[...]} we might be a little more future-proof.
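As a rough sketch of what that conversion could look like, here is a Python version (the real code would live in the edge transaction handling; `compact_edge_values` and `legacy_edge` are hypothetical names, and the PHIDs are placeholders). It takes old and new values in the verbose format and produces the proposed compact values:

```python
import json

def compact_edge_values(old, new):
    """Convert verbose old/new edge dictionaries to the proposed compact form."""
    # Verbose values are dictionaries keyed by destination PHID, so key
    # membership is enough to tell which edges actually changed.
    removed = [phid for phid in old if phid not in new]
    added = [phid for phid in new if phid not in old]
    return {"dst": removed}, {"dst": added}

def legacy_edge(src, dst):
    # Mirrors the verbose record format; the field values are placeholders.
    return {"src": src, "type": "51", "dst": dst, "dateCreated": "123456",
            "seq": "0", "dataID": None, "data": []}

src = "PHID-CMIT-yyy"
old = {p: legacy_edge(src, p) for p in ["PHID-CMIT-a", "PHID-CMIT-b"]}
new = {p: legacy_edge(src, p) for p in ["PHID-CMIT-a", "PHID-CMIT-b", "PHID-CMIT-c"]}

compact_old, compact_new = compact_edge_values(old, new)
print(json.dumps(compact_old), json.dumps(compact_new))
# {"dst": []} {"dst": ["PHID-CMIT-c"]}
```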
I'm planning to:
- Put a translation layer into the edge transaction handling.
- Define all reads in terms of the translation layer; get test coverage on the reads.
- Add support for a new compact format.
- Start writing the compact format.
- Build a tool for compacting old rows and run it in production.
Note for the eventual migration: I removed a piece of code which hid very old metadata-only edge transactions in Differential (they used to be written as a side effect of accepting a revision). Any eventual migration should probably just delete these transactions since they've never had any user-visible effect (when written, they were redundant with associated "accepted this revision" transactions).
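The translation layer from the plan above could look something like the following sketch. This is an illustration in Python under assumptions, not the actual code: the function names are hypothetical, and treating a bare list as a valid value (for example, an empty legacy value serialized as []) is a guess. The point is that readers go through one function and get the same answer for either format:

```python
def read_dst_phids(value):
    """Return destination PHIDs from either the verbose or the compact format."""
    if isinstance(value, dict) and isinstance(value.get("dst"), list):
        # Compact format: {"dst": [...]}.
        return list(value["dst"])
    if isinstance(value, dict):
        # Verbose format: a dictionary keyed by destination PHID.
        return list(value.keys())
    if isinstance(value, list):
        # Assumption: a bare list, such as an empty value stored as [].
        return list(value)
    raise ValueError("unrecognized edge transaction value")

def read_edge_changes(old_value, new_value):
    """Return (added, removed) PHIDs regardless of storage format."""
    old = read_dst_phids(old_value)
    new = read_dst_phids(new_value)
    old_set, new_set = set(old), set(new)
    added = [phid for phid in new if phid not in old_set]
    removed = [phid for phid in old if phid not in new_set]
    return added, removed

# Verbose rows (edge records elided to empty dictionaries) and compact rows
# describing the same change.
verbose = read_edge_changes(
    {"A": {}, "B": {}, "C": {}, "D": {}},
    {"A": {}, "C": {}, "D": {}, "E": {}},
)
compact = read_edge_changes({"dst": ["B"]}, {"dst": ["E"]})

print(verbose)  # (['E'], ['B'])
print(compact)  # (['E'], ['B'])
```

With reads defined this way, the compaction tool can rewrite old rows one at a time without readers needing to know which rows have been converted.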