Blame operations (git blame, svn blame, hg blame) are slow, especially for large codebases. It would generally be good to make these operations faster, although this is somewhat difficult.
Here is a design for a blame cache. It may not be the best design, but it seems workable.
The blame information for a file has commit and author information for each line. However, we only really need commit information, since we can derive author information from it (this is available in fast storage). Further, we are guaranteed to have 32-bit integers (commitID) assigned for all commits already, so we can store the blame for a file as a sequence of 32-bit integers -- one for each line in the file. This will be fairly compact.
Storing blame information for every file at every commit isn't practical since it would be enormous, but we can cache blame every time a file is modified. We still need to do a somewhat-expensive log operation to figure out the last modification time of a file, which is somewhat slow, but this reduces a very slow blame operation to a somewhat slow log operation with a very simple cache mechanism. Log slowness is a general problem that causes other performance issues; addressing it is desirable but it's really a different/larger problem.
So we end up with a table like this:
commitID | pathID | blame: uint32 uint32 uint32 ... |
To read this cache:
- Use log to find the commit when a file was last modified.
- Look up blame information with that commitID and the file's pathID.
- Decode the list of integers.
- Look up the commits with those commitIDs.
To fill this cache:
- Use log to find the commit when the file was last modified.
- Look up blame information with that commitID and the file's `pathID'.
- Fail to find a cache entry.
- Run blame normally.
- Look up the commitIDs for the blamed commits.
- Encode the list of commit IDs.
- Shove them into the cache.
We could store this cache more efficiently -- maybe by gzipping or using run-lenght encoding, or by creating a lookup table at the head of the blame info -- but this probably won't be an issue, and the improvement probably isn't huge. We could add an extra format column to the table if we anticipate changing this later (or not, and just drop the old cache).