Page MenuHomePhabricator

Ability to pass more info to custom linters
Closed, ResolvedPublic

Description

As far as I can tell, custom linters receive only the modified filenames, one file at a time. Certainly script-and-regex linters have this limitation.

This makes certain kinds of linters that you might reasonably want to write very difficult. For example, I would like to add to LLVM a linter that checks that all lines modified are formatted according to clang-format. There are scripts [1] [2] that can take as input a git commit range or a unified diff and reformat them -- using either these, it would be simple to construct a linter, if one had access to either a diff or a commit range. But with just a list of files, there's not much to be done.

Apologies if this already exists; I read the docs a few times and didn't see anything that looked promising.

[1] https://llvm.org/svn/llvm-project/cfe/trunk/tools/clang-format/git-clang-format
[2] https://llvm.org/svn/llvm-project/cfe/trunk/tools/clang-format/clang-format-diff.py

Event Timeline

  • Emit messages for every line with a problem.
  • If you raise them at "warning" severity, they will only be shown if the user modified the line.

Does this solve your problem?

Thank you for your quick response.

It seems nontrivial to map a diff (the result of applying clang-format to a file) to a list of errors mapped to specific lines that the user may or may not have touched. As one example, suppose the original code is

Foo::Foo() : bar(1),
  baz(2) {}

The user then modifies this to

Foo::Foo() : bar(1),
  baz(2), foobar(3) {}

And clang-format reformats this to

Foo::Foo() :
  bar(1),
  baz(2),
  foobar(3) {}

Note that although the user didn't touch the first line, it still got reformatted, and I think we want the linter to suggest the full change. But how are we supposed to figure out that the formatting change on the first line was a result of the user's change on the second? (The way git-clang-format figures it out is by passing a list of lines to clang-format; the tool may then reformat more than was provided.)

In addition, there's also the difficulty of outputting this information in a human-friendly way. It's not clear to me how to readably show a diff as a map from line to warnings. But maybe that's not as big of a deal -- perhaps the line-by-line errors could be consumed by arc, and then the tool could separately output a diff.

The expectation is that this will just work properly already. We have a number of similar rules in the upstream.

Have you tried actually writing this rule and run into problems?

Specifically, you emit a message that says "the block XYZ, starting at line 13, offset 12, is not great, and should probably be replaced by the different block QWERTY". arc handles everything else.

(Where "XYZ" and "QWERTY" can be arbitrarily long sequences and may each contain zero or more newlines, and may each be of arbitrary length -- you can replace a single character with a huge block, or vice versa, etc. -- lint is not line-oriented internally.)

I haven't tried to write this rule, and you've almost convinced me to try, but -- forgive me -- I don't quite believe it's possible to do this correctly (or, as correctly as clang-format) in the context of formatting C++.

Suppose we have the following original code

foo( );

which clang-format formats as

foo();

then the user changes the original code to

foo( );
bar( );

which clang-format formats as

foo();
bar();

The correct thing is for the linter to say that we should reformat bar but not foo. clang-format's suggestion after the user's modifications comes in as a single lint error covering foo and bar, but maybe you can split it apart and figure out that, originally, clang-format made an identical suggestion for foo(), so we can ignore that one and only show a suggestion for bar() to the user.

But that rule -- "don't show changes to the user that were present in the unmodified file" -- isn't quite correct. Suppose we have something like

foo(a, b, c,
 d, e,
 f, g);

Suppose clang-format wants to format this as

foo(a, b, c,
    d, e,
    f, g);

Now suppose I add an arg to the original code:

foo(a, b, c,
 d, e,
 f, g, h);

and suppose clang-format wants to format this as:

foo(a, b, c,
    d, e,
    f, g, h);

Using the heuristic above, we'd suggest:

foo(a, b, c,
  d, e,
    f, g, h);

because the second line has an identical lint suggestion before and after the user's change. In contrast, git-clang-format, which understands that this is all one function call, suggests the right thing:

foo(a, b, c,
    d, e,
    f, g, h);

I don't see how a system that doesn't semantically understand C++ can get both of the examples here right; they see indistinguishable.

Here's what I did:

I created a file, example.clang, which looks like this:

$ cat example.clang 
foo( );

foo(a, b, c,
 d, e,
 f, g);

This file contains has both of your "before" examples. I committed this file on a local branch as commit1:

$ git add example.clang
$ git commit -m 'commit1'
[clang1 5c008c4] commit1
 1 file changed, 5 insertions(+)
 create mode 100644 example.clang

I changed the file to look like this and committed the new version as commit2, on top of commit1:

foo( );
bar( );

foo(a, b, c,
 d, e,
 f, g, h);

The specific operations I performed were:

  • Add bar( ) on line 2.
  • Add , h on line 6.

Here is the diff between the two files:

diff --git a/example.clang b/example.clang
index 7f6cc77..8e83808 100644
--- a/example.clang
+++ b/example.clang
@@ -1,5 +1,6 @@
 foo( );
+bar( );
 
 foo(a, b, c,
  d, e,
- f, g);
+ f, g, h);

I wrote this "linter", which hard-codes the correct messages to emit for this file:

SyntheticClangLinter.php
...

  public function lintPath($path) {
    if ($path != 'example.clang') {
      return;
    }

    $this->raiseLintAtLine(
      1,
      1,
      'synthetic-1',
      'synthetic-1',
      'foo( );',
      'foo();');

    $this->raiseLintAtLine(
      2,
      1,
      'synthetic-2',
      'synthetic-2',
      'bar( );',
      'bar();');

    $this->raiseLintAtLine(
      4,
      1,
      'synthetic-3',
      'synthetic-3',
      "foo(a, b, c,\n d, e,\n f, g, h);",
      "foo(a, b, c,\n    d, e\n    f, g, h);");
  }

...

I positioned HEAD at commit2, and ran arc lint --rev HEAD^ to lint the most recent change (i.e., only the changes made by commit2). Here is the command output:

$ arc lint --rev HEAD^
>>> Lint for example.clang:


   Warning  (CHMODsynthetic-2) Second Synthetic Fix
    synthetic-2

               1 foo( );
    >>> -      2 bar( );
        +        bar();
               3 
               4 foo(a, b, c,
               5  d, e,

   Warning  (CHMODsynthetic-3) Third Synthetic Fix
    synthetic-3

               1 foo( );
               2 bar( );
               3 
               4 foo(a, b, c,
    >>> -      5  d, e,
        -      6  f, g, h);
        +            d, e
        +           
--- /Users/epriestley/dev/core/lib/arcanist/example.clang	2016-01-29 03:26:17.000000000 -0800
+++ /private/var/folders/pj/3dbkgfxd783_9yvv7c1kss9m0000gn/T/b8yweogejbww8o8w/2896-H6t6u9	2016-01-29 03:38:09.000000000 -0800
@@ -1,6 +1,6 @@
 foo( );
-bar( );
+bar();
 
 foo(a, b, c,
- d, e,
- f, g, h);
+    d, e
+    f, g, h);


    Apply this patch to example.clang? [Y/n]

Note that:

  • A correction to foo( ) is not suggested (the diff did not touch that line).
  • The proper correction to bar( ) is suggested.
  • The proper correction to foo(a, b, ...) is suggested, although there is a rendering glitch (see T9846).
  • The aggregate patch arc suggests is correct.
  • The linter has zero knowledge of the diff.

I answered "Y" at the prompt and arc adjusted the file to have this content:

$ cat example.clang 
foo( );
bar();

foo(a, b, c,
    d, e
    f, g, h);

Specifically, here is the diff between commit2 and the version arc produced:

$ git diff HEAD
diff --git a/example.clang b/example.clang
index 8e83808..4ac521c 100644
--- a/example.clang
+++ b/example.clang
@@ -1,6 +1,6 @@
 foo( );
-bar( );
+bar();
 
 foo(a, b, c,
- d, e,
- f, g, h);
+    d, e
+    f, g, h);

I think the piece of information that differs between your model and the actual behavior of arc is that the linter does not (and must not) emit three separate corrections to fix foo(a, b, ...) one line at a time. It should emit a single correction for the entire block. When patching, arc takes the whole correction or none of it, so a suggestion which indents only the f, g, h); line is not possible. It offers the correction to the user if any subsequence of the affected block lies on a modified line.

Breaking patches into line-by-line fixes would be worse for implementing linters (which would have to split all corrections into multiple lines), worse for human users (who would have to deal with more messages, many of which would be similar), introduce ambiguity (as above), make textual descriptions of messages relatively meaningless, and open users up to the possibility of applying half a patch. All of these are undesirable, so arc does not work like this.

(Oh, there's a missing comma after e in some of the output above, but that's just because I typed the replacement text for the patch wrong in my "linter".)

I think the piece of information that differs between your model and the actual behavior of arc is that the linter does not (and must not) emit three separate corrections to fix foo(a, b, ...) one line at a time. It should emit a single correction for the entire block.

Yes. The problem is, the output of running clang-format on a file is a new file, or a *single* diff. The 'foo ( )' and 'bar( )' changes are right next to each other, so they'd appear together in one hunk. The tool that is translating this information into the format the linter wants cannot know that the 'foo( )' correction should appear in a separate warning from the 'bar( )' correction, while the two changes for 'foo(a, .., h)' should appear in a single correction, unless it itself has semantic knowledge of C++.

Maybe clang-format could be changed to give an output that's properly "split up" -- I don't know how feasible it would be, but given how fiddly it is (the whole tool is a collection of special cases), I expect that would be Hard. Again the way that other tools work around this is by feeding a set of line numbers to clang-format; the tool will then reformat those lines and any others incidentally affected.

Maybe clang-format could be changed to give an output that's properly "split up" -- I don't know how feasible it would be, but given how fiddly it is (the whole tool is a collection of special cases), I expect that would be Hard.

From 30 seconds of looking at the manual page, it looks like -output-replacements-xml does this?

Yeah, I checked before writing that last post, and I thought the same thing, but it's not split up the way we want. e.g.

    $ cat foo.cpp
    int foo() {
      foo( );
      bar( );
    }

    int a = foo(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb,
	  ccccccccccccccccccccccccccccccccc, ddddddddddddddddddddddddddddddd,
	  eeeeeeeeeeeeeeeeeeeeeee, fffffffffffffffffffffffffffffffff);

    $ clang-format foo.cpp  
    int foo() {  
      foo();  
      bar();  
    }  
      
    int a =  
        foo(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb,  
	    ccccccccccccccccccccccccccccccccc, ddddddddddddddddddddddddddddddd,  
	    eeeeeeeeeeeeeeeeeeeeeee, fffffffffffffffffffffffffffffffff); 

    $ clang-format -output-replacements-xml foo.cpp
    <?xml version='1.0'?>
    <replacements xml:space='preserve' incomplete_format='false'>
    <replacement offset='18' length='1'></replacement>
    <replacement offset='28' length='1'></replacement>
    <replacement offset='42' length='1'>&#10;    </replacement>
    <replacement offset='116' length='7'>&#10;        </replacement>
    <replacement offset='190' length='7'>&#10;        </replacement>
    </replacements>

There's no grouping in the XML; it seems to be just an XML version of a plain, boring diff.

what the heck

that's super silly

I think you can use $this->getEngine()->getPathChangedLines($path) to get a list of changed lines, and pass that to -lines=... like git-clang-format does, but the API clang-format is exposing isn't very flexible so you're ultimately going to be limited in what you can accomplish if it can't emit a list of individual changes.

epriestley claimed this task.

Presuming resolved by getPathChangedLines($path) + -lines.

I don't plan to expose diffs or commit ranges to linters, because these may not exist (e.g., lint can operate on arbitrary files, including uncommitted changes) and I'm not aware of any other linters which have diff-oriented APIs. This seems like a bizarre atom of API to select.

Any linter using this kind of API will be fundamentally limited in how flexible it can be. For example, users can never selectively accept a subset of patches if there is no way for the linter to emit subsets of patches.

Modifying clang-format to expose individual messages with distinct patches is the best way forward here.

Modifying clang-format to expose individual messages with distinct patches is the best way forward here.

C++ is hard, I think that's likely infeasible to do in a sane manner.

Anyway my need is addressed; I wrote a wrapper around arc diff that just invokes git-clang-format. The PHP solution would probably be nicer, but this is good enough for my purposes. I call the wrapper "arc" and never have to think about it. Here it is, in all its hacky glory, in case others find it useful.

def main():  
  if len(sys.argv) != 3 or sys.argv[1] != 'diff':  
    # Not "arc diff blah"; fall through.  
    return  
  
  try:  
    # TODO: This loses us color and pagination, but oh well.  
    diff = subprocess.check_output(['git', 'clang-format', '--diff', sys.argv[2]])  
  except subprocess.CalledProcessError:  
    yorn = raw_input('Error running git-clang-format.  Continue running arc? [yN] ')  
    if yorn.lower() == 'y':  
      return  
    sys.exit(1)  
  
  if not diff:  
    return  
  
  print('clang-format suggested the following changes:\n')  
  print(diff)  
  
  mode = os.fstat(sys.stdin.fileno()).st_mode  
  if stat.S_ISFIFO(mode) or stat.S_ISREG(mode):  
    print('Detected non-interactive mode.  Proceeding to arc diff...',  
	  file=sys.stderr)  
    return  
  
  yorn = raw_input('Proceed with arc diff? [yN] ')  
  if yorn.lower() != 'y':  
    sys.exit(1)  
  
  
if __name__ == '__main__':  
  main()  
  os.execv(REAL_ARC, sys.argv)