
Loading differential revision slow when lots of unit test messages exist
Open, Needs Triage, Public

Description

Case in point:

Our test suite produces a lot of messages, which cause noticeable slowdowns when trying to view a revision. (It takes ~13 seconds with xhprof enabled).

I've pinpointed the issue to (amongst others) HarbormasterUnitStatus::getUnitStatusSort, which accounts for the bulk of the page load time.

Sample xhprof profile:

Event Timeline

This comment was removed by epriestley.

We don't currently have a sortable column on the unit message table, since pass, fail, etc., aren't naturally sortable by any key MySQL can construct.

Consequently, we pull everything and then sort it in the UI. This doesn't scale gracefully to large result sizes.

Building a sortable key is kind of iffy because we may introduce new results in the future or want to change the display order of results (that is, there's no unambiguously correct ordering of pass, fail, skip, broken in the way that there is a single unambiguous ordering of 1, 2, 3), but I think we just have to do our best. The consequence of getting this ordering wrong is minor: changes to the ordering just won't be reflected in older builds.
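For example, the surrogate key could be a small integer with deliberate gaps, so result types added later can be slotted into the display order without renumbering old rows. A sketch in Python (the specific values and the handling of unknown results are assumptions, not Harbormaster's actual constants):

```python
# Illustrative mapping from unit result types to a sortable integer key.
# Gaps between values leave room for result types added later.
UNIT_RESULT_SORT = {
    'broken': 100,   # most severe: the harness itself failed
    'fail': 200,
    'skip': 400,
    'pass': 500,     # least interesting; sorts last
}

def unit_status_sort_key(result):
    # Unknown results sort near failures rather than raising an error.
    return UNIT_RESULT_SORT.get(result, 250)

rows = [{'name': 't1', 'result': 'pass'},
        {'name': 't2', 'result': 'fail'},
        {'name': 't3', 'result': 'skip'}]
rows.sort(key=lambda r: unit_status_sort_key(r['result']))
print([r['name'] for r in rows])  # → ['t2', 't3', 't1']
```

Because the key is written at insert time, changing the mapping later only affects new builds, which is exactly the "minor consequence" described above.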

I think the pathway forward here is:

  1. Add an ordering column with an appropriate key.
  2. Generate a surrogate ordering value which MySQL can sort when writing test data.
  3. Start doing limits in queries instead of pulling all the data.
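Once step (2) persists a sortable key, step (3) becomes an ordinary indexed `ORDER BY ... LIMIT` query; a sketch using sqlite3 as a stand-in for MySQL (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE unit_message (name TEXT, result TEXT, sort_key INTEGER)')
conn.execute('CREATE INDEX idx_sort ON unit_message (sort_key)')
rows = [('t_fail', 'fail', 200), ('t_skip', 'skip', 400)] + \
       [('t_pass_%d' % i, 'pass', 500) for i in range(10000)]
conn.executemany('INSERT INTO unit_message VALUES (?, ?, ?)', rows)

# The key point: ordering by an indexed surrogate key plus LIMIT means
# the database returns one page of rows instead of the whole result set.
page = conn.execute(
    'SELECT name, result FROM unit_message ORDER BY sort_key LIMIT 100'
).fetchall()
print(len(page), page[0])  # → 100 ('t_fail', 'fail')
```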

None of this is too tricky, although step (1) may be a hefty migration on installs with lots of data and step (3) may be very disruptive to the ordering of older builds. We might need to add a bin/harbormaster fix-all-the-old-order-surrogate-keys or similar to help with that.
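Whatever that tool ends up being called, its core would be a batched backfill over old rows; a hedged sketch of the shape (an in-memory stand-in for the table — a real tool would commit between batches and go through Phabricator's migration machinery):

```python
# Hypothetical backfill: assign surrogate sort keys to old rows in
# fixed-size batches so the migration never holds a huge transaction.
SORT_KEYS = {'broken': 100, 'fail': 200, 'skip': 400, 'pass': 500}

def backfill(rows, batch_size=1000):
    updated = 0
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            if row.get('sort_key') is None:
                row['sort_key'] = SORT_KEYS.get(row['result'], 250)
                updated += 1
        # A real tool would COMMIT here, between batches.
    return updated

old = [{'result': 'pass', 'sort_key': None} for _ in range(2500)]
print(backfill(old))  # → 2500
```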

T9704 also discusses slowness on inserting the tests. That may be unnecessarily slow right now, but I don't expect clients to submit unlimited numbers of tests in one API call. Instead, submit tests in chunks (say, of 1K tests per page or whatever) by calling harbormaster.sendmessage repeatedly with a work status.

Ideally, you can do this as the tests run, providing a stream of results to the user sooner.
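A chunked reporter can be a thin generator over the result list; a sketch of the payload construction only (the PHID is a placeholder, each yielded dict would be POSTed as one Conduit call, and the exact `unit` fields accepted by harbormaster.sendmessage should be checked against the Conduit documentation on your install):

```python
def unit_message_chunks(target_phid, results, chunk_size=1000):
    """Yield parameter dicts for successive harbormaster.sendmessage calls.
    Intermediate chunks use type 'work' to keep the build target open; the
    final chunk carries the overall pass/fail verdict."""
    any_fail = any(r['result'] != 'pass' for r in results)
    for start in range(0, len(results), chunk_size):
        is_last = start + chunk_size >= len(results)
        yield {
            'buildTargetPHID': target_phid,
            'type': ('fail' if any_fail else 'pass') if is_last else 'work',
            'unit': results[start:start + chunk_size],
        }

results = [{'name': 'test_%d' % i, 'result': 'pass'} for i in range(2500)]
msgs = list(unit_message_chunks('PHID-HMBT-example', results))
print(len(msgs), msgs[0]['type'], msgs[-1]['type'])  # → 3 work pass
```

Streaming as the tests run means the harness yields a chunk whenever it has accumulated `chunk_size` results, rather than buffering the whole suite first.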

In the extreme case where you have, say, a million tests, I'd expect there is probably little value in reporting each test into Harbormaster as a "pass", and your harness might want to summarize all passes into "999,998 additional tests passed" and submit two failures and one aggregate-pass. We haven't started to formally explore this yet so there may be more support in the future.

Broadly, the expectations are:

  • Reporting ten million tests per build should not impact the usability of the UI. It currently does; the steps described above will fix this.
  • Reporting ten million tests per build will require multiple calls to harbormaster.sendmessage.
  • Reporting ten million tests per build may take a long time.
  • Reporting ten million tests per build may cause storage to grow at an uncomfortable rate.
  • If you have ten million tests, some aggregation mechanism for passing tests is probably desirable as the value of recording individual passes is probably far lower than the cost of storing them, but we don't have a formal recommendation on where this should go or what it should look like yet.
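Such client-side aggregation is simple to do in the harness; a sketch (the aggregate record's name is made up, not a Harbormaster convention):

```python
def summarize_results(results):
    """Collapse all passing tests into one aggregate record, keeping each
    non-pass result individually. Purely a client-side convention."""
    passes = sum(1 for r in results if r['result'] == 'pass')
    summary = [r for r in results if r['result'] != 'pass']
    if passes:
        summary.append({
            'name': '{:,} additional tests passed'.format(passes),
            'result': 'pass',
        })
    return summary

results = ([{'name': 'test_%d' % i, 'result': 'pass'} for i in range(999998)]
           + [{'name': 'test_a', 'result': 'fail'},
              {'name': 'test_b', 'result': 'fail'}])
out = summarize_results(results)
print(len(out), out[-1]['name'])  # → 3 999,998 additional tests passed
```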

> In the extreme case where you have, say, a million tests, I'd expect there is probably little value in reporting each test into Harbormaster as a "pass", and your harness might want to summarize all passes into "999,998 additional tests passed" and submit two failures and one aggregate-pass.

This is pretty much what I've done right now. The basic unit tests are reported via arc (~2000 separate cases), and only the failing tests from CI are reported back via `harbormaster.sendmessage`.

thoughtpolice moved this task from Backlog to Details on the Haskell.org board.May 13 2016, 10:05 PM
eadler added a project: Restricted Project.Aug 5 2016, 4:45 PM
avivey added a comment.Sep 2 2016, 5:35 PM

For people who suffer from this: note that D16483 might actually have made things a little worse, by searching the DB for old messages (though not loading them); I tested with 1M messages on an SSD and didn't see any change, but YMMV.