Page MenuHomePhabricator

Build a blame cache
Closed, ResolvedPublic

Description

Blame operations (git blame, svn blame, hg blame) are slow, especially for large codebases. It would generally be good to make these operations faster, although this is somewhat difficult.

Here is a design for a blame cache. It may not be the best design, but it seems workable.

The blame information for a file has commit and author information for each line. However, we only really need commit information, since we can derive author information from it (this is available in fast storage). Further, we are guaranteed to have 32-bit integers (commitID) assigned for all commits already, so we can store the blame for a file as a sequence of 32-bit integers -- one for each line in the file. This will be fairly compact.

Storing blame information for every file at every commit isn't practical since it would be enormous, but we can cache blame every time a file is modified. We still need to do a somewhat-expensive log operation to figure out the last modification time of a file, which is somewhat slow, but this reduces a very slow blame operation to a somewhat slow log operation with a very simple cache mechanism. Log slowness is a general problem that causes other performance issues; addressing it is desirable but it's really a different/larger problem.

So we end up with a table like this:

commitIDpathIDblame: uint32 uint32 uint32 ...

To read this cache:

  • Use log to find the commit when a file was last modified.
  • Look up blame information with that commitID and the file's pathID.
  • Decode the list of integers.
  • Look up the commits with those commitIDs.

To fill this cache:

  • Use log to find the commit when the file was last modified.
  • Look up blame information with that commitID and the file's `pathID'.
  • Fail to find a cache entry.
  • Run blame normally.
  • Look up the commitIDs for the blamed commits.
  • Encode the list of commit IDs.
  • Shove them into the cache.

We could store this cache more efficiently -- maybe by gzipping or using run-lenght encoding, or by creating a lookup table at the head of the blame info -- but this probably won't be an issue, and the improvement probably isn't huge. We could add an extra format column to the table if we anticipate changing this later (or not, and just drop the old cache).

Event Timeline

epriestley added a subscriber: epriestley.

Let me know if you have any questions on this. You can find instructions on developing database patches (e.g., to add new tables) in this document:

http://www.phabricator.com/docs/phabricator/article/Database_Schema.html

awesometown

This is a really exciting one...! :D

The general goal is to improve the performance of all blame operations, whether they're web-based or not. Having a blame cache will let us implement faster CLI operations ("arc blame"), but most of the benefit will probably come just from making the web view faster. For now, the integration point you should focus on is the web interface, since the other stuff doesn't exist yet.

The actual changes you'll make to interact with the blame cache should probably happen in DiffusionFileContentQuery and/or its subclasses. This class is responsible for issuing all the blame queries we currently execute. Since the cache should be VCS-agnostic, it should probably be implemented in the base class.

Basically, DiffusionFileContentQuery->loadFileContent() looks like this right now:

$content = $this->executeQuery();
// ... do encoding junk ...
return $content;

Instead, you want it to look like this (greatly simplified):

$content = $this->readCache();
if (!$content) {
  $content = $this->executeQuery();
  $this->writeCache($content);
}
// ... do encoding junk ...
return $content;

DiffusionFileContentQuery operates on a single path, so you shouldn't need to worry about mapping commits to paths. You can also ignore the "cache only when changed" part for now -- we'll have a less effective cache without it, but it will be fairly easy to add that later and it reduces the amount of work you need to do to get it working. It might be easier to get started by implementing the cache for only one VCS (e.g., implement in DiffusionGitFileContentQuery->executeQuery()) and then generalizing it once it works.

Does that make sense? Let me know if anything is unclear -- this code is structured a bit weirdly.

epriestley added a subscriber: vedharaju.

Had some free time today and implemented this. The improvement however is not as big as i expected.
Doing a blame on phabricators __phutil_library_map__.php will take ~20sec without and ~15sec with the cache on my dev vm.

Right now the cache is only updated when DiffusionFileContentQuery gets called with a needs_blame = true.
Still needs to be integrated into the commit parsing pipeline.

chad removed blc as the assignee of this task.Jan 15 2015, 5:11 PM
chad removed a project: OS Students 2013.
chad changed the visibility from "All Users" to "Public (No Login Required)".

I would be interested in helping with benchmarks as desired. I have a Mercurial repository with commits dating back to 2004, ~97k commits in total with currently ~7.5k files. Though, I tried running hg blame on a few files which contain unmodified lines since as far back as 2009 and the blame happens near instantaneous.