
Store Diffusion lint information against a commit hash instead of a branch
Closed, Invalid · Public

Description

I've been converting the lint save runner to a Harbormaster step, and saving against a branch causes a few problems:

  • It's quite subject to race conditions; someone can push a commit while the analysis is running. The only sane way to make sure it happens in order is to have a "wait for previous builds" step, but this is reasonably terrible since it forces every run to happen sequentially (and since running lint on an entire repo is slow, this would result in a backlog of builds).
  • File deletes are very hard to detect; this was easy in the command line runner because it could use the Arcanist APIs to find out what files were deleted. With most of the logic moving into Harbormaster and calling arc remotely, this information is much harder to get.
  • No historical information.

This lint information really needs to be stored against a commit, with Diffusion showing the latest information for a file (when the file's last change occurred on or before the latest commit that has lint information) or a "?" when newer lint information is not yet available.
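As a rough illustration of that rule, here is a minimal SQL sketch; the lint_message table, the lintCommitEpoch column, and both parameters are hypothetical names invented for this example (not existing Phabricator schema), and commit epochs stand in for proper ancestry ordering:

```sql
-- Hypothetical lookup: return lint messages for a path only when the lint
-- data is at least as new as the file's last change. An empty result means
-- Diffusion should render "?" rather than stale information.
SELECT m.lineNumber, m.code, m.severity
FROM lint_message m
WHERE m.pathID = :pathID
  AND m.lintCommitEpoch >= :lastChangeEpoch  -- file unchanged since the scan
ORDER BY m.lineNumber;
```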

Event Timeline

hach-que raised the priority of this task to Needs Triage.
hach-que updated the task description.
hach-que added a project: Diffusion.
hach-que added subscribers: hach-que, epriestley.
epriestley triaged this task as Wishlist priority. May 9 2014, 1:06 AM

If we save all lint information at every commit, this won't scale. For example, say Facebook has a repository with 1M commits and about 10K unresolved lint messages: we end up with a table with around 10B rows. If each row takes only 16 bytes (<commitID, pathID, lineNumber, lintID>), that is 160GB of lint data. This greatly exceeds the size of the entire repository, and these numbers are conservative relative to the real ones.

If we save only changed lint messages at every commit, we no longer have the size problem, but it becomes prohibitively slow to compute aggregations in Git and Mercurial. To compute the total number of lint messages in the repository at commit X, we need to log every file, find the most recent change in an ancestor of X, and then compile the data. This could be done offline and iteratively, but it turns a simple, relatively quick process into a very complex one. Some queries we can currently issue would also become impossible (e.g., find all issues of type Y at HEAD).

This isn't necessarily impractical, but seems hard to prioritize. I suspect "find all the lint messages of type Y at HEAD" is far more useful than having historical information. We could provide limited historical information cheaply by saving a few aggregations into some side table -- enough to draw some trend graphs or whatever. This is probably the only utility of the historical information anyway?

> The only sane way to make sure it happens in order is to have a "wait for previous builds" step

We can introduce a "run build repeatedly on HEAD" mechanism to address this; we need it anyway for other things, like "publish documentation", where saving data from the process at every commit doesn't make sense.

I'd expect that we'd be using the garbage collector mechanism to clean up records that are older than X commits or X days. No one needs this information forever, but you do want to be able to calculate recent trends (is our technical debt increasing or decreasing?).

For aggregation, we can end up with a table that is just the date and the total number of lint messages across the whole repo, potentially grouped by type. This can happen just prior to garbage collection and should be a pretty simple COUNT(id) WHERE commitId = <the one being garbage collected>. We can limit this to occurring only if the previous aggregation is more than one day old.
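As a minimal sketch of that step (the lint_aggregate side table and all column names here are hypothetical, chosen only for illustration):

```sql
-- Hypothetical aggregation pass, run just before garbage collection:
-- record the message count per lint code for the commit about to be
-- collected, so trend graphs survive the deletion of the raw rows.
INSERT INTO lint_aggregate (repositoryID, dateCreated, lintCode, messageCount)
SELECT m.repositoryID, UNIX_TIMESTAMP(), m.code, COUNT(m.id)
FROM lint_message m
WHERE m.commitID = :gcCommitID
GROUP BY m.repositoryID, m.code;
```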

Plan of attack for this task:

  • A new table, repository_history. This contains ID, repositoryID, commitHash, historyType. History type can be (1) base, (2) incremental, or (3) aggregated. (A schema sketch follows after this list.)
  • Both base and incremental entries use tables similar to the ones that exist today, except that branchID becomes historyID. Incremental entries specify only the new lint messages for files, while a base entry is expected to be the full set of information as of the commit it maps to.
  • The aggregator / garbage collection daemon is responsible for flattening incremental records into the base record when needed. Before flattening into a newer base, it creates aggregated history records, which instead reference separate tables that might hold information like "issues per type per commit" (I say tables because the aggregated data might end up spread across multiple tables).
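For concreteness, here is a sketch of what the proposed schema might look like in MySQL terms; the column types, the dateCreated column, and the shape of the child table are guesses for illustration, not a final design:

```sql
-- Proposed history table. historyType: 1 = base, 2 = incremental,
-- 3 = aggregated (matching the list above).
CREATE TABLE repository_history (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  repositoryID INT UNSIGNED NOT NULL,
  commitHash VARCHAR(40) NOT NULL,
  historyType TINYINT UNSIGNED NOT NULL,
  dateCreated INT UNSIGNED NOT NULL,
  KEY key_repository_commit (repositoryID, commitHash)
);

-- Existing-style lint message storage, with branchID replaced by historyID.
CREATE TABLE repository_history_lintmessage (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  historyID INT UNSIGNED NOT NULL,
  pathID INT UNSIGNED NOT NULL,
  lineNumber INT UNSIGNED NOT NULL,
  code VARCHAR(32) NOT NULL,
  severity VARCHAR(16) NOT NULL,
  KEY key_history_path (historyID, pathID)
);
```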

Once this is done:

  • Update the changes made in D9022 and D9113 to target the new tables (but make the changes in a separate diff).
  • Run a Harbormaster task with the new lint code to verify that it creates incremental / base records as appropriate in the history table.
  • Hook up the Diffusion UI to pull the information from the new data tables.

Then onto the hard part:

  • Implement the aggregation / garbage collection daemon for this stuff, aggregating data over time. Ensure that both the commit limit and the day limit are configurable so that @epriestley can configure the main install to use a commit limit of 1.. :) (A sketch of the cleanup query follows below.)
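A rough sketch of the cleanup half, under the schema assumptions above (the :retentionSeconds parameter stands in for the configurable day limit; a real daemon would also honor the commit limit):

```sql
-- Hypothetical cleanup: drop base/incremental history older than the
-- configured horizon, after aggregates for those commits have been written.
-- Aggregated records (historyType = 3) are kept so trends stay computable.
DELETE FROM repository_history
WHERE historyType IN (1, 2)
  AND dateCreated < UNIX_TIMESTAMP() - :retentionSeconds;
```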

Concerns:

The incremental records make it much easier to store new data without doing a full scan, but I worry that resolving the resulting set of information (i.e., tracing incremental sets back to the base) might be expensive in terms of resolving commit parents. A sketch of that walk is below.
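To make the cost concrete, here is one way the resolution walk could be expressed as a recursive query. Everything here is an assumption for illustration: it presumes a parent-edge table (repository_parents with childCommitID / parentCommitID columns), that repository_history also stores the internal commit ID alongside the hash, recursive-CTE support (which MySQL lacked at the time), and single-parent history (merges are ignored):

```sql
-- Hypothetical resolution of lint state at commit :commitID: enumerate
-- ancestors, join against history records, and apply incrementals on top
-- of the nearest base row. The ancestry walk itself is the expensive part.
WITH RECURSIVE ancestry AS (
  SELECT :commitID AS commitID, 0 AS depth
  UNION ALL
  SELECT p.parentCommitID, a.depth + 1
  FROM ancestry a
  JOIN repository_parents p ON p.childCommitID = a.commitID
)
SELECT h.id, h.historyType, a.depth
FROM ancestry a
JOIN repository_history h ON h.commitID = a.commitID
ORDER BY a.depth;
```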

hach-que claimed this task.

With the changes in D9130, this is less relevant than it was originally. In particular, we don't need any special storage for aggregated lint or coverage data, as it can be sent directly to Fact and stored generically.

The incremental and base storage is also much less of an issue; although it might be preferable to wake the Fact processing daemon right after a lint scan completes, the Fact daemon already runs on a five-minute cycle, which is usually a fine-grained enough interval for aggregated data over time (and certainly enough to see trends across a day or week of code changes).