Build a blame cache
Closed, ResolvedPublic
Actions

Assigned To

None

Authored By

	epriestley
	Jan 29 2013, 12:48 PM

Description

Blame operations (git blame, svn blame, hg blame) are slow, especially for large codebases. It would generally be good to make these operations faster, although this is somewhat difficult.

Here is a design for a blame cache. It may not be the best design, but it seems workable.

The blame information for a file has commit and author information for each line. However, we only really need commit information, since we can derive author information from it (this is available in fast storage). Further, we are guaranteed to have 32-bit integers (commitID) assigned for all commits already, so we can store the blame for a file as a sequence of 32-bit integers -- one for each line in the file. This will be fairly compact.

Storing blame information for every file at every commit isn't practical since it would be enormous, but we can cache blame every time a file is modified. We still need to do a somewhat-expensive log operation to figure out the last modification time of a file, which is somewhat slow, but this reduces a very slow blame operation to a somewhat slow log operation with a very simple cache mechanism. Log slowness is a general problem that causes other performance issues; addressing it is desirable but it's really a different/larger problem.

So we end up with a table like this:

commitID

pathID

blame: uint32 uint32 uint32 ...

To read this cache:

Use log to find the commit when a file was last modified.
Look up blame information with that commitID and the file's pathID.
Decode the list of integers.
Look up the commits with those commitIDs.

To fill this cache:

Use log to find the commit when the file was last modified.
Look up blame information with that commitID and the file's `pathID'.
Fail to find a cache entry.
Run blame normally.
Look up the commitIDs for the blamed commits.
Encode the list of commit IDs.
Shove them into the cache.

We could store this cache more efficiently -- maybe by gzipping or using run-lenght encoding, or by creating a lookup table at the head of the blame info -- but this probably won't be an issue, and the improvement probably isn't huge. We could add an extra format column to the table if we anticipate changing this later (or not, and just drop the old cache).

Revisions and Commits

rP Phabricator
	D14964	rP438100691d42 Don't let blame run for longer than 15 seconds
	D14963	rP9ab1b5a22db6 Make mundane performance improvements to Diffusion browse views
	D10600	rPe8d3071452f0 Implement a git blame cache
	D14962	rP0759b84d77c0 Improve construction of commit queries from blame lookups
	D14958	rP9728c65e9358 Drive blame generation through `diffusion.blame`

Related Objects
Search...

Status	Assigned	Task
Open	None	T2452 Implement "arc blame"
Resolved	None	T2451 Build a `diffusion.blame` Conduit method
Wontfix	epriestley	T1743 Blame-based automatic Herald rule
Wontfix	blc	T2453 Populate the blame cache when parsing commits
Resolved	None	T2450 Build a blame cache

Event Timeline

epriestley triaged this task as Normal priority.Jan 29 2013, 12:48 PM

epriestley added projects: Diffusion, OS Students 2013.

epriestley added a subscriber: epriestley.

epriestley edited this Maniphest Task.Jan 29 2013, 12:50 PM

vedharaju claimed this task.Feb 6 2013, 9:17 PM

Let me know if you have any questions on this. You can find instructions on developing database patches (e.g., to add new tables) in this document:

http://www.phabricator.com/docs/phabricator/article/Database_Schema.html

awesometown

This is a really exciting one...! :D

btrahan added a subscriber: btrahan.Feb 9 2013, 1:58 AM

The general goal is to improve the performance of all blame operations, whether they're web-based or not. Having a blame cache will let us implement faster CLI operations ("arc blame"), but most of the benefit will probably come just from making the web view faster. For now, the integration point you should focus on is the web interface, since the other stuff doesn't exist yet.

The actual changes you'll make to interact with the blame cache should probably happen in DiffusionFileContentQuery and/or its subclasses. This class is responsible for issuing all the blame queries we currently execute. Since the cache should be VCS-agnostic, it should probably be implemented in the base class.

Basically, DiffusionFileContentQuery->loadFileContent() looks like this right now:

$content = $this->executeQuery();
// ... do encoding junk ...
return $content;

Instead, you want it to look like this (greatly simplified):

$content = $this->readCache();
if (!$content) {
  $content = $this->executeQuery();
  $this->writeCache($content);
}
// ... do encoding junk ...
return $content;

DiffusionFileContentQuery operates on a single path, so you shouldn't need to worry about mapping commits to paths. You can also ignore the "cache only when changed" part for now -- we'll have a less effective cache without it, but it will be fairly easy to add that later and it reduces the amount of work you need to do to get it working. It might be easier to get started by implementing the cache for only one VCS (e.g., implement in DiffusionGitFileContentQuery->executeQuery()) and then generalizing it once it works.

Does that make sense? Let me know if anything is unclear -- this code is structured a bit weirdly.

epriestley removed vedharaju as the assignee of this task.May 27 2013, 8:04 PM

epriestley added a subscriber: vedharaju.

blc claimed this task.Jun 5 2013, 3:00 AM

epriestley edited this Maniphest Task.Oct 18 2013, 12:50 AM

Had some free time today and implemented this. The improvement however is not as big as i expected.
Doing a blame on phabricators __phutil_library_map__.php will take ~20sec without and ~15sec with the cache on my dev vm.

Right now the cache is only updated when DiffusionFileContentQuery gets called with a needs_blame = true.
Still needs to be integrated into the commit parsing pipeline.

fabe added a revision: D10600: Implement a git blame cache.Sep 28 2014, 3:59 PM

chad removed blc as the assignee of this task.Jan 15 2015, 5:11 PM

chad removed a project: OS Students 2013.

chad changed the visibility from "All Users" to "Public (No Login Required)".

joshuaspence added a subscriber: joshuaspence.Jan 23 2015, 11:39 AM

joshuaspence added a parent task: T2453: Populate the blame cache when parsing commits.Feb 25 2015, 9:49 AM

chad mentioned this in T2451: Build a `diffusion.blame` Conduit method.Jul 3 2015, 5:21 AM

chad mentioned this in T2453: Populate the blame cache when parsing commits.

edibiase added a subscriber: edibiase.Oct 6 2015, 6:53 PM

Herald added a subscriber: cspeckmim. · View Herald TranscriptOct 6 2015, 6:53 PM

benstiglitz added a subscriber: benstiglitz.Oct 6 2015, 6:53 PM

I would be interested in helping with benchmarks as desired. I have a Mercurial repository with commits dating back to 2004, ~97k commits in total with currently ~7.5k files. Though, I tried running hg blame on a few files which contain unmodified lines since as far back as 2009 and the blame happens near instantaneous.

epriestley mentioned this in D14957: Implement a dedicated "diffusion.blame" API method.Jan 6 2016, 1:17 PM

epriestley added a revision: D14958: Drive blame generation through `diffusion.blame`.Jan 6 2016, 2:20 PM

epriestley mentioned this in rPf561dc172d49: Implement a dedicated "diffusion.blame" API method.Jan 6 2016, 5:24 PM

epriestley added a commit: rP9728c65e9358: Drive blame generation through `diffusion.blame`.

epriestley added a revision: D14962: Improve construction of commit queries from blame lookups.Jan 6 2016, 11:40 PM

epriestley added a revision: D14963: Make mundane performance improvements to Diffusion browse views.Jan 7 2016, 1:56 AM

epriestley added a revision: D14964: Don't let blame run for longer than 15 seconds.Jan 7 2016, 2:10 AM

epriestley mentioned this in D14964: Don't let blame run for longer than 15 seconds.Jan 7 2016, 2:17 AM

epriestley closed this task as Resolved by committing rP438100691d42: Don't let blame run for longer than 15 seconds.Jan 7 2016, 2:43 AM

epriestley added a commit: rP0759b84d77c0: Improve construction of commit queries from blame lookups.

epriestley added a commit: rPe8d3071452f0: Implement a git blame cache.

epriestley added a commit: rP9ab1b5a22db6: Make mundane performance improvements to Diffusion browse views.