Page MenuHomePhabricator

Provide a rough, unstable API for reporting coverage into Diffusion
ClosedPublic

Authored by epriestley on May 9 2014, 2:01 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Dec 26, 2:30 PM
Unknown Object (File)
Mon, Dec 23, 9:43 AM
Unknown Object (File)
Mon, Dec 23, 9:43 AM
Unknown Object (File)
Mon, Dec 23, 9:43 AM
Unknown Object (File)
Mon, Dec 23, 8:12 AM
Unknown Object (File)
Mon, Dec 23, 2:47 AM
Unknown Object (File)
Tue, Dec 17, 10:47 AM
Unknown Object (File)
Tue, Dec 17, 4:49 AM

Details

Summary

Ref T4994. This stuff works:

  • You can dump a blob of coverage information into diffusion.updatecoverage. This wipes existing coverage information and replaces it.
  • It shows up when viewing files.
  • It shows up when viewing commits.

This stuff does not work:

  • When viewing files, the Javascript hover interaction isn't tied in yet.
  • We always show this information, even if you're behind the commit where it was generated.
  • You can't do incremental updates.
  • There's no aggregation at the file (this file has 90% coverage), diff (the changes in this commit are 90% covered), or directory (the code in this directory has 90% coverage) levels yet.
  • This is probably not the final form of the UI, storage, or API, so you should expect occasional changes over time. I've marked the method as "Unstable" for now.
Test Plan
  • Ran save_lint.php to check for collateral damage; it worked fine.
  • Ran save_lint.php on a new branch to check creation.
  • Published some fake coverage information.
  • Viewed an affected commit.
  • Viewed an affected file.

Screen_Shot_2014-05-09_at_6.49.59_AM.png (817×1 px, 157 KB)

Screen_Shot_2014-05-09_at_6.49.44_AM.png (817×1 px, 159 KB)

Diff Detail

Repository
rP Phabricator
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

epriestley retitled this revision from to Provide a rough, unstable API for reporting coverage into Diffusion.
epriestley updated this object.
epriestley edited the test plan for this revision. (Show Details)
epriestley added a reviewer: btrahan.
epriestley added subscribers: zeeg, jhurwitz.

Also:

  • This only requires "View" permission on the repository, not "Edit" or "Push".
  • There's no reliable log for these updates.
  • So, a malicious user who otherwise has access can mess with this information with less accountability than they probably should have. To what end, I'm not sure.
  • I'd plan to put this in the push log or similar, probably.
epriestley edited edge metadata.
  • Slightly smaller patch.
btrahan edited edge metadata.

Seems like a good start.

This revision is now accepted and ready to land.May 9 2014, 7:40 PM

Thanks @epriestley for knocking this out so quickly! I have one small request -- if you don't want to address it in this diff, I can file a task.

We always show this information, even if you're behind the commit where it was generated.

This is possibly a launch blocker for us, but it also seems easy to fix.

The repository_branch table has a lintCommit column to store the latest commit for which Diffusion has lint information. I'd propose adding a similar column coverageCommit and then modifying the behavior as follows: in DiffusionBrowseFileController, show coverage (exactly as you're already doing in this diff) if the file has not been modified since the branch's coverageCommit; if the file has been modified, show no coverage information (and, optionally, show a message saying why).

How does this sound?

Hm, I guess it's actually slightly more complicated than that. The Conduit endpoint needs another field for the commit. In our case, we're not guaranteed to be hitting this endpoint in increasing commit order. (I don't know Phabricator well enough to know if this is easy -- if it's not, you can ignore me.) A nice-to-have would be for the Conduit endpoint to ignore the request and not update the coverage database if it already has coverage data for a more recent commit.

if the file has not been modified since the branch's coverageCommit

I think this approach is correct, but testing this is currently slow. There is a tiny amount of discussion in T2683 (which is essentially the task for this), and more discussion elsewhere that I can't immediately dig up -- fishing around in the dependencies might reveal it.

The short version is that if we can make this operation fast (essentially, git log -n 1 --branch <whatever> -- <path>, and the equivalent for Mercurial; we can already query this efficiently in Subversion) we can reduce essentially everything else that is slow (like blame, this check, an equivalent check for lint, and all the places in the UI where we issue this directly) to some small set of fast operations on top of this operation.

This is difficult because:

  1. No one has come up with a way to represent this efficiently in MySQL yet. It's easy in SVN, but not easy in Git/Mercurial where we need to walk a graph. Walking the graph by issuing queries for each step is slower than asking git / hg to do it, in the absence of some really clever way to structure the data and query it. I've thought about structuring a lot, but not much about query construction. It's possible that, e.g., walking the graph by doing hundreds of joins in a single query would work fine, or that we could write a stored procedure to walk the graph efficiently. So this isn't necessarily a lost cause, but I think it's unlikely that we'll find the solution in clever storage or query construction.
  2. We could put this in a dedicated graph server and get access to the data efficiently, but I'm hesitant to require you to install a graph server (even as an optional component), especially a big complicated one (Neo4j seems to be the most common one in this space?). We could write a server fairly easily that could serve this data, since it basically just needs to hold a datastructure in RAM and serve exactly one kind of query, but loading the data in seems like a significant undertaking. For example, if it thrashes the server loading gigs of stuff every time it starts up, that's not so awesome.
  3. D5257 and similar use an approach where we store large chunks of the graph in MySQL and then walk the graph in PHP. This is a reasonable solution for small repositories, but did not work as well at Facebook scale.
  4. The "viable attack" that the margins of T2683 were too small to contain is to load the graph into APC instead, and fake a "graph server" in a lazy way. I prototyped this and it seemed to have reasonable behaviors and scalability, but haven't actually built it yet.

I'd ideally like to make this query cheap (hopefully by implementing (4)) before we start issuing the query on these pages. I don't believe we currently issue it when rendering either of these pages (file detail, change detail), so beginning to issue it would make them slower for everyone in order to make the coverage results slightly better when they're present.

Some approaches we could take:

  • Accept this rule for now, try to build (4), and figure out how to proceed after we're convinced it does or doesn't work.
  • Switch to an alternate cheap rule: show coverage only if the current commit is the branch HEAD. I think we can check this for free on both these pages. This will probably be wrong less often than the previous rule, but also just won't show coverage information most of the time unless the regeneration process runs constantly and branch heads move slowly.
  • Look for coverage, and issue the log query only if we find some coverage info. This will slow things down when coverage is present, but not impact other users.
  • Bring coverage in via Ajax. I don't favor this approach because it involves a lot of complexity which would be made obsolete if (4) above works.

So maybe let's do this:

  • I'll add commit information to the API and tables. We would like to have this eventually anyway, and I don't think it hurts anything to put it in now.
  • I'll leave the coverage display rule the same for this diff, but issuing the query after finding that we have coverage information feels like a reasonable approach to me in the short term if you want to pursue it.
  • I'll make an attempt to build (4) above in the short term so we can get a better sense about whether it represents a viable approach or not.

Sound reasonable?

Actually, I think I might need (4) to make the best decision about where to put commit information. It can either go as coverageCommit on the branch, or as commitID on the coverage information. These have tradeoffs (some discussion in T5001):

  • On the branch, we can't have historic information. This seems nice-to-have.
  • On the branch, we can do queries across the repository at a given commit more easily. But this seems useless for coverage (it's useful for lint, to find all instances of a given message, e.g.).
  • On the branch, we can't do incremental updates. This seems pretty useless, though.
  • On the branch, we don't have an opportunity to improve performance by issuing one test ("what's the last time this file was changed?") instead of two ("what's the last time this file was changed?" + "is that newer or older than the commit this coverage was generated for?"). If we can implement (4) above and make it very fast in the average case, this optimization might be beneficial, especially if we also allow historic data.
  • On the branch, data size is slightly smaller, I guess.

So here's a revised plan:

  • I'll add this to the API.
  • I won't do anything with it yet; it will just be ignored (or maybe sanity checked for existing and being on the branch).
  • I'll pursue (4) and use that to motivate a decision about storage.

If I don't make any progress on (4) soon we can come up with a better plan, but it's probably the biggest open technical question in the product and it would be good to get a resolution for it.

Sounds good. I've been reading through the diffs and tasks you've CCed me on, and I like the direction you're headed in. I'll probably stay silent and lurk in the shadows for now unless I have further questions.

I look forward to using all of the improvements you're trying out -- thanks!

(Before landing, the colspan of the lint messages needs to be adjusted when we emit coverage information. There's an existing bug with this in T5044, but this will exacerbate it without adjustment.)

epriestley edited edge metadata.
  • Add "commit" to the API, but don't do much with it yet.
  • Fix some colspan issues for lint messages (fixes T5044).
epriestley updated this revision to Diff 21784.

Closed by commit rPa74545c9da4b (authored by @epriestley).