Page MenuHomePhabricator

Skip copied code detection for changes that are too large for it to be useful

Authored by epriestley on Oct 19 2018, 8:27 PM.
Referenced Files
F13354618: D19748.id47177.diff
Mon, Jun 24, 6:34 AM
F13334679: D19748.diff
Wed, Jun 19, 12:28 AM
F13311967: D19748.diff
Tue, Jun 11, 5:57 AM
F13301837: D19748.id47179.diff
Sat, Jun 8, 3:29 AM
F13301835: D19748.id47177.diff
Sat, Jun 8, 3:29 AM
F13301834: D19748.diff
Sat, Jun 8, 3:29 AM
F13300673: D19748.diff
Fri, Jun 7, 9:26 AM
F13269620: D19748.diff
Wed, May 29, 6:31 AM



Ref T13210. See PHI944. When parsing certain large diffs (the case in PHI944 is an 2.5-million line JSON diff), we spend ~66% of runtime and ~80% of memory doing copy detection (the little yellow bar which shows up to give you a hint that code was moved around within a diff).

This is pretty much pointless and copy hints are almost certainly never useful on large changes. Instead, just bail if the change is larger than some arbitrary "probably too big for copy hints to ever be useful" threshold (here, 65535 lines).

Test Plan

Roughly, ran this against a 2.5 million line JSON diff:

$changes = id(new ArcanistDiffParser())->parseDiff($raw_diff);
$diff = DifferentialDiff::newFromRawChanges($viewer, $changes);

Before the changes, it took 20s + 2.5GB RAM to parse. After the changes, it took 7s + 500MB RAM to parse.

Diff Detail

rP Phabricator
Lint Not Applicable
Tests Not Applicable

Event Timeline

amckinley added inline comments.

Should this hook into any of the HUGE/TRULY_CRAZY_HUGE diff detection stuff that got fixed a few months ago instead of defining a new constant here?

This revision is now accepted and ready to land.Oct 19 2018, 9:26 PM

Should this hook into any of the HUGE/TRULY_CRAZY_HUGE diff detection stuff that got fixed a few months ago instead of defining a new constant here?

I think my gut reaction here is sort of a "maybe". The diff stuff is large enough that it has quite a few different magic constants which are somewhat similar to this one (e.g., intradiff bailout in T1246, which must be in arcanist/; the highlighting limit in T7895, which is used by both Differential and Diffusion; the "ENORMOUS" stuff discussed in T13143; various other limits like "maximum number of things we show in the web UI before degrading"). I think this stuff is spread out enough that it probably doesn't necessarily make sense to try to group it all together -- it tends to affect different workflows and different areas of the code, even though many things are implementing similar ideas.

There's some kind of argument we could make that we should have no hard-coded magic numbers and all constants should be configurable, but the number of actual cases we run into where installs want to change one of these numbers is very small. Many of these constants can be difficult to fully understand the implications of, and mis-configuring them can cause big messes which are hard to support (e.g., PHI944 took an hour or so of digging to end up here), and the cases where installs do want to change these numbers often feel questionable (e.g., the driving use case is somewhat sketchy or abusive). And these constants are ultimately configurable since you can just edit them if you really want, as long as you're accepting the cost/risk of maintaining a one-line patch which adjusts a number.

The thing I like least about this patch is really that we don't explicitly tell the user we skipped the detection. I think a "better" version of this patch would probably write a metadata property like copy:lines:skipped-because-it-is-too-big = 2243921 (see inline) and then the UI would say "Copied code detection was skipped for for this diff because it is very large (2,243,921 lines). To improve performance, copied code is detected only for small revisions (65,535 or fewer lines)." However, I think this bailout is very likely above the threshold where anyone will ever actually notice or care, and adding that notice would make the change a bit larger and more complex, and the notice itself is probably just clutter in close to 100% of the cases where it would appear.



This revision was automatically updated to reflect the committed changes.