Page MenuHomePhabricator

Make PHPAST parser tests stable and human-readable
AbandonedPublic

Authored by epriestley on May 3 2017, 5:07 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Apr 16, 5:11 AM
Unknown Object (File)
Thu, Apr 11, 8:16 AM
Unknown Object (File)
Thu, Apr 11, 3:56 AM
Unknown Object (File)
Fri, Apr 5, 6:08 PM
Unknown Object (File)
Sat, Mar 30, 4:36 AM
Unknown Object (File)
Wed, Mar 20, 7:04 PM
Unknown Object (File)
Feb 20 2024, 1:34 PM
Unknown Object (File)
Feb 20 2024, 10:26 AM
Tokens
"100" token, awarded by Firehed.

Details

Summary

Ref T4334. See D17802. Currently, PHPAST parser tests are huge blobs of JSON. This is a problem for several reasons:

  • The actual numeric constants aren't stable across PHP versions, so these tests can fail for literally no reason.
  • No human can hope to read or understand them.
NOTE: This just changes the way we build/run the tests. The next change is the ~15K lines of actual test changes.

Instead, write out the nodes in a nice little tree:

+ n_NODE_NAME
 + n_SUBNODE
  + n_ANOTHER_NODE

...and write the tokens in a list:

> T_HELLO         hi
> T_GOODBYE       bye

This isn't a completely exhaustive representation of the tree (parent nodes could still have the wrong token range and be missed by the tests) but I think it's a reasonable balance of readability and stability.

Test Plan

See next change.

Diff Detail

Repository
rPHU libphutil
Branch
xhp1
Lint
Lint Passed
Unit
Test Failures
Build Status
Buildable 16790
Build 22408: Run Core Tests
Build 22407: arc lint + arc unit

Unit TestsFailed

TimeTest
20 msPHPASTParserTestCase::Unknown Unit Message ("")
Assertion failed, expected values to be equal (at PHPASTParserTestCase.php:109): Parser output for "base-pass.php.test". Expected vs Actual Output Diff --- Old Value
35 msPHPASTParserTestCase::Unknown Unit Message ("")
Assertion failed, expected values to be equal (at PHPASTParserTestCase.php:109): Parser output for "base-pass.php.test". Expected vs Actual Output Diff --- Old Value
1 msAbstractDirectedGraphTestCase::Unknown Unit Message ("")
1 assertion passed.
0 msAbstractDirectedGraphTestCase::Unknown Unit Message ("")
1 assertion passed.
0 msAbstractDirectedGraphTestCase::Unknown Unit Message ("")
1 assertion passed.
View Full Test Results (2 Failed · 332 Passed · 2 Skipped)

Event Timeline

I'm doing a lot of fancy hand-waving with escaping and wrapping here, but it's basically just to keep things sane when we print token values that contain newlines, special characters, span multiple lines, etc.

The actual numeric constants aren't stable across PHP versions, so these tests can fail for literally no reason.

Don't the numeric constants come from XHPAST? Specifically, from xhpast_parser_token_constants and xhp_parser_node_constants.

src/parser/xhpast/__tests__/PHPASTParserTestCase.php
254–376

I wonder if all of this code should be moved to XHPASTTree? My concern is that this code is fairly complex and it would be nice to have some test coverage (i.e. some code that tests the tree -> readable string logic rather than testing the parser itself).

Oh, I misremembered slightly.

I think the node constants are stable (defined in generate_nodes.php) but the token constants are not.

The token constants are generated by Flex/Bison when they process parser.y. They automatically assign values (starting slightly after 256) as they encounter tokens.

In D17817, you added a new token in the middle, so Flex/Bison learned about it before some other token, assigned it the next number, and threw everything after it out of whack. That's basically fine -- there's no real value in keeping the constants stable, as long as we don't encode the constant values in unit tests.

It would be nice to have a PHP definition of the token constants, then a generation script which generated a .php file and a .hpp file. I'm not sure if this would actually work or how the preprocessing phases work. The n_NODE_CONSTANTS are nice normal C #define statements so they don't need any weird magic.

It looks like PHP itself deals with this by always adding new tokens at the bottom to keep things stable? But I think it's basically fine for things to not be stable if it makes it easier to edit or read parser.y, and any program which relies on T_WHATEVER being exactly 329 for all time is asking for trouble anyway.

I also think you have to edit parser_tokens.php by hand currently, which is silly and unintuitive, but we don't get any super convenient output from Bison about which numbers it decided to assign.

If you want to test the test logic, just write a normal test aimed at whatever feature you want to test (e.g., encoding of multiline tokens or utf8 strings or whatever)? You're testing a little more than purely the wrap/encode functions, but since the input can be very small I think you'll almost exclusively hit the code you actually care about.

The formatting overall feels very unit-test specific to me and I'd worry that putting it in XHPASTTree could encourage off-label use and end in teras. Like, we'd end up with getReadableTreeUseThisMethodONLYForUnitTests_OR_ELSE().

Oh, I misremembered slightly.

"Slightly" in the sense of "every part of my claim was technically wrong", but it was mostly right in spirit!

In D17817, you added a new token in the middle, so Flex/Bison learned about it before some other token, assigned it the next number, and threw everything after it out of whack. That's basically fine -- there's no real value in keeping the constants stable, as long as we don't encode the constant values in unit tests.

Makes sense

I also think you have to edit parser_tokens.php by hand currently, which is silly and unintuitive, but we don't get any super convenient output from Bison about which numbers it decided to assign.

Yeah, it's a bit of PITA.

If you want to test the test logic, just write a normal test aimed at whatever feature you want to test (e.g., encoding of multiline tokens or utf8 strings or whatever)? You're testing a little more than purely the wrap/encode functions, but since the input can be very small I think you'll almost exclusively hit the code you actually care about.

The formatting overall feels very unit-test specific to me and I'd worry that putting it in XHPASTTree could encourage off-label use and end in teras. Like, we'd end up with getReadableTreeUseThisMethodONLYForUnitTests_OR_ELSE().

Fair point.

src/parser/xhpast/__tests__/PHPASTParserTestCase.php
105–106

I'm wondering if we need this? It seems like there are three possible modes here:

  • pass, in which you assert that the output from xhpast is equivalent to some human-readable representation of the ree.
  • fail-parse, in which you expect a non-zero exit status from xhpast.
  • fail-syntax, in which you specify that the output from xhpast is not what you expected.

The fail-syntax mode feels a bit odd to me, because it's too easy to get wrong. Like, you might just have a single extra space in your human-readable tree?

Just submitting some comments whilst I go along, in case I don't get a chance to fully review this.

src/parser/xhpast/__tests__/PHPASTParserTestCase.php
266

It feels slightly odd to me that newReadableTree returns an array.

270–285

Maybe move this to a newReadableStream() method? Then we can just do this:

private function newReadableAST(array $data, $source) {
  $tree = new XHPASTTree($data['tree'], $data['stream'], $source);
  $root = $tree->getRootNode();
  
  $output = implode('', $this->newReadableTree($root, 0));
  $output .= str_repeat('-', 80)."\n";
  $output .= $this->newReadableStream($tree);

  return $output;
}
326–346

I understand escaping slashes and newlines, but why change whitespace to underscores?

I think "fail-parse" is actually "XHPAST thinks it parsed things right, but it didn't" and the goal was let you to write a test that says "this test represents a bug in XHPAST, and should be fixed/removed/whatever when the test passes".

But we also have "comment" which does the same thing, and only one test like this, plus one meta-test which tests that the mode works. I'll just remove the mode and the weird test, I can't really see us getting much use out of it. I also didn't actually update it here or in the next change because the test continues to pass when the output changes.

src/parser/xhpast/__tests__/PHPASTParserTestCase.php
351–376

I don't quite understand why we need this? Is this an optimization to avoid having really long lines in the test format?

(If we don't change spaces to underscores we get a bunch of result-affecting trailing whitespace which my editor strips and which is invisible in diffs.)

(If we don't change spaces to underscores we get a bunch of result-affecting trailing whitespace which my editor strips and which is invisible in diffs.)

How about quotes instead? So a single space becomes ' '?

The wrapping is just to make things more readable and diff-able for tests which include, e.g., large HEREDOCs, nodes with long token sequences, etc., which may span multiple lines. Letting the lines go out to a billion characters would be the same from the perspective of test correctness, but presumably less readable for humans, and 70,000 character lines or whatever tend to be a bit of a mess in Differential.

(If we don't change spaces to underscores we get a bunch of result-affecting trailing whitespace which my editor strips and which is invisible in diffs.)

Also, we have the problem of editors stripping whitespace in the unit tests for ArcanistTextLinter. We dealt with this by adding this to .editorconfig:

[src/lint/linter/**/__tests__/**.lint-test]
indent_style =
end_of_line =
max_line_length =
trim_trailing_whitespace = false

It might be reasonable to do something similar for *.php.test

I can't really see us getting much use out of it. I also didn't actually update it here or in the next change because the test continues to pass when the output changes.

Exactly my point :)

The wrapping is just to make things more readable and diff-able for tests which include, e.g., large HEREDOCs, nodes with long token sequences, etc., which may span multiple lines. Letting the lines go out to a billion characters would be the same from the perspective of test correctness, but presumably less readable for humans, and 70,000 character lines or whatever tend to be a bit of a mess in Differential.

Are there any examples of this wrapping in D17820? From a cursory glance, I couldn't see any.

We could do quotes, but then we should prrrrobably escape real quotes too -- to avoid questions about why they aren't escaped, if nothing else, although maybe the space replacement is already inviting those questions, and then a lot of actual stuff like string literals will get messier and become '\'quote\\\\'quote\'' which is technically correct but probably less human-readable.

I don't really feel too strongly. We could quote everything with or without quoting quotes, or put "$" at the end of every line or whatever else. I picked this rule just because it felt like a better balance of convenience and readability than other choices, without creating much risk that "_" and " " would collide in a weird way that created a wrong test result. Since all the token text is reconstructed from substrings of the input anyway, I think there's almost no chance that any mangling rule we apply will misfire -- we can't really end up with tokens that have arbitrary weird values, even if we write very creative bugs in the parser.

src/parser/xhpast/__tests__/PHPASTParserTestCase.php
279

This makes the output not technically stable because if someone adds a new token with a longer name then all of the tests will break? Is it important to pad the string here? I'm wondering if we could just do something like this instead:

>
  | T_OPEN_TAG
  | <?php

In cases where we can easily write a test to be trailing-whitespace-insensitive, I generally think it's worthwhile. In theory, everyone can configure their editors to respect .editorconfig and show invisibles and be careful and get this right and we can add appropriate config to cover these files, but in practice plenty of people don't do some of those steps, and even if they do it's not necessarily clear why the test is failing from, say, the CLI diff output unless you configure your terminal to show invisibles too, or in Harbormaster unless we change that to show invisibles. Feels like a lot of made up work to throw on an altar of purity when we can easily select a visible test format here.

That one string won't align properly but output will otherwise be stable. We could make a decision to break existing tests to add more padding if we wanted.

The padding as written felt more readable to me than breaking things onto separate lines, because many tokens are short so this requires about half as many lines, and it makes it easier to scan the token values by reading the right-hand column.

I suppose I can dynamically pad the column to length of the longest string it actually uses to future-proof this.

I just think it would be better to use real escaping, rather than people needing to know that we change spaces to underscores. Can we use addcslashes perhaps? Sorry if I am nitpicking here, feel free to just ignore me.

That one string won't align properly but output will otherwise be stable. We could make a decision to break existing tests to add more padding if we wanted.

The padding as written felt more readable to me than breaking things onto separate lines, because many tokens are short so this requires about half as many lines, and it makes it easier to scan the token values by reading the right-hand column.

I suppose I can dynamically pad the column to length of the longest string it actually uses to future-proof this.

Maybe in the future we can make the tests ensure that the output is equivalent rather than identical? Its probably a lot of work to normalise the human-readable format, hence we can just worry about it when we start adding lots of really long tokens.

addcslashes() on its own won't fix the trailing whitespace problem, even if we escape spaces, since \ is still trailing whitespace. addcslashes() with quoting will, but then we end up with a whole lot of extra quotes in the "content" column. Many tokens have length 1 so { becomes "{".

Not a big deal; I think any real escaping generally trades away readability, but quoted strings are readable enough and don't raise a question of why the test is using a weird non-reversible format instead of a standard format, since the answer "it's subjectively a little more readable and escaping/reversibility are irrelevant" isn't really obvious.

Yeah sorry, I meant addcslashes with quotes. I agree with you regarding readability, but I think developers are used to quotes and escacaping anyway.

Just CCing some people that I work with that will be interested in this change.

  • Refactor into smaller methods.
  • Remove "fail-parse" and associated test.
  • Include a self-test which explicitly covers wrapping of long tokens and token sequences.
  • Use quoted C-style strings instead of escaping spaces to underscores.
  • Select a padding size dynamically, based on the longest token which actually appears in the output rather than hard-coding 30.
  • Don't use Word curly quotes which I copy-pasta'd when copy-pasta'ing the tragedy of Darth Plaugeis.
joshuaspence added inline comments.
src/parser/xhpast/__tests__/PHPASTParserTestCase.php
102–104

Debugging code?

117–121

Good call.

137

Should you use "\n" as a separate instead of ''?

152

Should we make this ~~~~~~~~~~ to be more consistent with the *.php.test format?

160

Technically better might be PHP_INT_MIN, but that was only added in PHP 7.0.

193

When would this happen? Would it be better to simply throw the exception here?

216

Maybe instead of this:

> T_TOKEN <null>

We can just do this?

> T_TOKEN
src/parser/xhpast/__tests__/data/a-self-test.php.test
22

In the distant future, it might be nice to make quotes optional when the string is already unambiguous.

This revision is now accepted and ready to land.May 3 2017, 11:29 PM
  • If we use "\n" to implode, we'll get extra "\n" because the individual parts emit their own "\n". They could return values without "\n", but then the code to construct them would be more complicated and/or just end in rtrim() to remove a newline which is immediately added back by the caller.
  • If we use ~~~~, the code which splits the test case into pieces will split the node tree and token stream into separate pieces, and we'll need to compare them separately, or merge them back together with a ~~~~ before comparing them. Either would make the rendering and comparison code slightly more complicated, and there is no value to splitting them since the block is only aiming to be human-readable, not machine-parseable.
  • "INVALID TYPE" is likely a bug in the existing code; this happens today in php-traits.php.test. We could throw, but then the test would fail even after D17820. My intent here was just to make a quick change to unblock the other pending changes so I didn't dig into the actual bug. A future change to make this throw would be reasonable, but it should fix the bug first.
  • If we do > T_TOKEN , we have trailing whitespace (the alignment space after the token name) or more complex code (to selectively add the padding space). I think this would also make the nature of some test failures a bit less obvious, especially in the case of nodes with no tokens as children in the "tree" part of the output, which would look like this:
+ n_EMPTY
 >

That seems less obvious than this, especially if it's half of an output diff:

+ n_EMPTY
 > "<null>"

I'm not really happy with where this ended up -- putting quotes on everything has kind of turned it into a mess -- so let me see if I can come up with a better approach.

This looks a lot better already.

This isn't a completely exhaustive representation of the tree (parent nodes could still have the wrong token range and be missed by the tests) but I think it's a reasonable balance of readability and stability.

One possible solution would be to include all tokens (and their types) in the tree as well? E.g.:

+ n_PROGRAM
 + n_STATEMENT_LIST
  + n_OPEN_TAG
   > T_OPEN_TAG "<?php"
  > T_WHITSPACE "\n\n"
  + n_STATEMENT
   + n_BINARY_EXPRESSION
    + n_VARIABLE
     > T_VARIABLE "$the_senate"
    > T_WHITESPACE " "
    + n_OPERATOR
     > = "="
    > T_WHITESPACE " "
    + n_HEREDOC
     > "<<<EOTRAGEDY\nDid you ever hear the tragedy of Darth Plagueis The Wise?"
     . " I\nthought not. It's not a story the Jedi would tell you. It's a Sith "
     . "legend.\nDarth Plagueis was a Dark Lord of the Sith, so powerful and so"
     . " wise he could\nuse the Force to influence the midichlorians to create "
     . "life... He had such a\nknowledge of the dark side that he could even ke"
     . "ep the ones he cared about from\ndying. The dark side of the Force is a"
     . " pathway to many abilities some consider\nto be unnatural. He became so"
     . " powerful... the only thing he was afraid of was\nlosing his power, whi"
     . "ch eventually, of course, he did. Unfortunately, he taught\nhis apprent"
     . "ice everything he knew, then his apprentice killed him in his sleep.\nI"
     . "ronic. He could save others from death, but not himself.\nEOTRAGEDY"
  > ; ";"
 > T_WHITESPACE "\n\n"

Granted, this does not make it particularly more readable, but it does solve testing token ranges for nodes.

This looks a lot better already.

This isn't a completely exhaustive representation of the tree (parent nodes could still have the wrong token range and be missed by the tests) but I think it's a reasonable balance of readability and stability.

One possible solution would be to include all tokens (and their types) in the tree as well? E.g.:

<snip>

Granted, this does not make it particularly more readable, but it does solve testing token ranges for nodes.

See P2046 for a proof-of-concept that more-or-less does this. Right now this produces:

+ n_PROGRAM
 + n_STATEMENT_LIST
  + n_OPEN_TAG
   > "T_OPEN_TAG <?php"
  > "T_WHITESPACE \n\n"
  + n_STATEMENT
   + n_BINARY_EXPRESSION
    + n_VARIABLE
     > "T_VARIABLE $the_senate"
    > "T_WHITESPACE  "
    + n_OPERATOR
     > "= ="
    > "T_WHITESPACE  "
    + n_HEREDOC
     > "T_HEREDOC <<<EOTRAGEDY\nDid you ever hear the tragedy of Darth Plagueis"
     . " The Wise? I\nthought not. It's not a story the Jedi would tell you. It"
     . "'s a Sith legend.\nDarth Plagueis was a Dark Lord of the Sith, so power"
     . "ful and so wise he could\nuse the Force to influence the midichlorians "
     . "to create life... He had such a\nknowledge of the dark side that he cou"
     . "ld even keep the ones he cared about from\ndying. The dark side of the "
     . "Force is a pathway to many abilities some consider\nto be unnatural. He"
     . " became so powerful... the only thing he was afraid of was\nlosing his "
     . "power, which eventually, of course, he did. Unfortunately, he taught\nh"
     . "is apprentice everything he knew, then his apprentice killed him in his"
     . " sleep.\nIronic. He could save others from death, but not himself.\nEOT"
     . "RAGEDY"
   > "; ;"
 > "T_WHITESPACE \n\n"

Is there anything I can do to help this move forward?

Not at the moment. I was hoping to sneak this in quickly between other work, but it was a more involved change than I'd anticipated.

We are currently pursuing T12681 which will provide a more direct pathway, although maybe not a desirable one (since it's "pay us money").

(This was ultimately superseded by D21065.)