Change Details

When we write an edge transaction to the database, we currently write the entire old and new edge lists into the record. For example, if the old project list was "A, B, C, D" and the new project list is "A, C, D, E", we write the complete lists, for a total of 8 PHIDs.

Today, all //readers// compute changes from the lists anyway, so writing "A", "C" and "D" to both lists has no effect. The transaction has identical behavior if we write just "B" and "E", respectively -- it will still render as "alice changed projects; added: E; removed: B.".

This doesn't seem like it should be a very big deal since we're just talking about a handful of extra PHIDs, but some repositories do merges in a way that causes the same commits to be mentioned hundreds and hundreds of times. When we add the 101st mention, we're writing 201 PHIDs. When we add the 1001st mention, we're writing 2001 PHIDs.

We have one hosted instance with approximately 130GB of "commit X mentioned commit Y" data in the `repository_audit.audit_transaction` table because of this. (This is a test instance which is importing an especially large, well-known repository so no actual work is directly impacted, although other instances on the same shard may be suffering.)

This is also made worse because the actual storage format is very verbose:

```
"PHID-CMIT-xxx": {
  "src":"PHID-CMIT-yyy",
  "type":"51",
  "dst":"PHID-CMIT-xxx",
  "dateCreated":"123456",
  "seq":"0",
  "dataID":null,
  "data":[]
}
```

- We no longer use `dataID` or `data`, and these can be removed.
- I think there is no value in writing `seq`. We don't rely on edge ordering and don't support reordering. If we did in the future, we could treat "no seq data present means it was added at the end" safely.
- There is no value in writing `dateCreated`, since this is always the same as the transaction date.
- The `type` is always the same for all transactions in a group, and always present in `edge:type` metadata.
- Either `src` or `dst` is always the object PHID.

So I think we can write this record instead:

```
{
  "src": [...],
  "dst": [...]
}
```

...where each key is an optional list of the changed edges where the `src` or `dst` is not the object PHID, respectively.

I'm planning to:

- Put a translation layer into the edge transaction handling.
- Define all reads in terms of the translation layer; get test coverage on the reads.
- Add support for a new compact format.
- Start writing the compact format.
- Build a tool for compacting old rows and run it in production.
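To make the arithmetic above concrete, here is a minimal sketch in Python (not the actual implementation). It reduces an old and new edge list to just the changed PHIDs and counts how many PHIDs the current full-list format writes for the Nth mention. The function names `compact_edge_change` and `phids_written_full` are hypothetical and only serve the illustration.

```python
def compact_edge_change(old, new):
    """Return (removed, added): only the PHIDs that differ between the lists."""
    old_set, new_set = set(old), set(new)
    # Keep each list's original order while dropping PHIDs present in both.
    removed = [phid for phid in old if phid not in new_set]
    added = [phid for phid in new if phid not in old_set]
    return removed, added


def phids_written_full(mention_number):
    """PHIDs stored by the full-list format when the Nth mention is added."""
    # The old list holds N - 1 PHIDs and the new list holds N.
    return (mention_number - 1) + mention_number


removed, added = compact_edge_change(["A", "B", "C", "D"], ["A", "C", "D", "E"])
print(removed, added)            # ['B'] ['E']
print(phids_written_full(101))   # 201
print(phids_written_full(1001))  # 2001
```

Storing only `removed` and `added` keeps the per-transaction cost proportional to the size of the change rather than the size of the edge list, which is the point of the compact format.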

When we write an edge transaction to the database, we currently write the entire old and new edge lists into the record. For example, if the old project list was "A, B, C, D" and the new project list is "A, C, D, E", we write the complete lists, for a total of 8 PHIDs.

Today, all //readers// compute changes from the lists anyway, so writing "A", "C" and "D" to both lists has no effect. The transaction has identical behavior if we write just "B" and "E", respectively -- it will still render as "alice changed projects; added: E; removed: B.".

This doesn't seem like it should be a very big deal since we're just talking about a handful of extra PHIDs, but some repositories do merges in a way that causes the same commits to be mentioned hundreds and hundreds of times. When we add the 101st mention, we're writing 201 PHIDs. When we add the 1001st mention, we're writing 2001 PHIDs.

We have one hosted instance with approximately 130GB of "commit X mentioned commit Y" data in the `repository_audit.audit_transaction` table because of this. (This is a test instance which is importing an especially large, well-known repository so no actual work is directly impacted, although other instances on the same shard may be suffering.)

This is also made worse because the actual storage format is very verbose:

```
"PHID-CMIT-xxx": {
  "src":"PHID-CMIT-yyy",
  "type":"51",
  "dst":"PHID-CMIT-xxx",
  "dateCreated":"123456",
  "seq":"0",
  "dataID":null,
  "data":[]
}
```

- We no longer use `dataID` or `data`, and these can be removed.
- I think there is no value in writing `seq`. We don't rely on edge ordering and don't support reordering. If we did in the future, we could treat "no seq data present means it was added at the end" safely.
- There is no value in writing `dateCreated`, since this is always the same as the transaction date.
- The `type` is always the same for all transactions in a group, and always present in `edge:type` metadata.
- `src` is always the object PHID.
- `dst` is always the dictionary key (the other end of the edge).

So I think we can write something like this record instead:

```
{
  "dst": [...]
}
```

...where the list is just modified PHIDs. This could probably even just be `[...]` but if we wrap it in `{"dst":[...]}` we might be a little more future-proof.

I'm planning to:

- Put a translation layer into the edge transaction handling.
- Define all reads in terms of the translation layer; get test coverage on the reads.
- Add support for a new compact format.
- Start writing the compact format.
- Build a tool for compacting old rows and run it in production.
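The translation layer in the plan above could look roughly like the following Python sketch, which normalizes both storage formats to a plain list of destination PHIDs so readers never see the difference. This is an illustration under assumptions, not the real code: the names `read_edge_phids` and `write_compact` are hypothetical, and the rule used to tell the formats apart (a `dst` key holding a list means compact) is a guess at one workable approach.

```python
import json


def read_edge_phids(raw):
    """Normalize a stored edge value to a list of destination PHIDs."""
    value = json.loads(raw)
    # Assumption: a bare list is either the `[...]` variant or an empty value.
    if isinstance(value, list):
        return list(value)
    # Assumption: the compact form is a dictionary with a "dst" list.
    if isinstance(value.get("dst"), list):
        return list(value["dst"])
    # Otherwise treat it as the verbose form, keyed by destination PHID.
    return list(value.keys())


def write_compact(phids):
    """Serialize a list of changed PHIDs in the wrapped compact form."""
    return json.dumps({"dst": list(phids)})


verbose = json.dumps({
    "PHID-CMIT-xxx": {
        "src": "PHID-CMIT-yyy",
        "type": "51",
        "dst": "PHID-CMIT-xxx",
        "dateCreated": "123456",
        "seq": "0",
        "dataID": None,
        "data": [],
    }
})

print(read_edge_phids(verbose))                           # ['PHID-CMIT-xxx']
print(read_edge_phids(write_compact(["PHID-CMIT-xxx"])))  # ['PHID-CMIT-xxx']
```

Because both formats read back to the same list, a compaction tool for old rows could in principle be little more than `write_compact(read_edge_phids(raw))` applied row by row, once the reads have test coverage.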
