Parse remarkup tables with something like a real parser instead of regular…


Parse remarkup tables with something like a real parser instead of regular expressions

Ref T13160. See PHI1275. We have several use cases where Remarkup documents are used as more polished/permanent content (blog posts, "nice" documentation) and/or automatically generated.

In these cases, users would generally like more control and formatting options, since they're either expecting the document to be long-lived / frequently read (so the extra work to make it look nice is worthwhile), or they're generating the document and the cost of doing one-time formatting setup is small compared to how often stuff will be generated.

We do a bit of this ourselves, since I partially codegen the changelog and then manually edit from there, although I'm not in super urgent need of fancier formatting options.

A lot of these use cases involve some kind of display tables and wanting better control over layout/spacing options. Although I think T13158 might eventually serve this use case in some cases, it probably won't cover everything.

We already support a verbose, HTML-like <table> syntax in Remarkup aimed at these more-deliberate use cases: this syntax isn't very convenient if you're writing a comment, but fairly good if you're code-generating a unit test results table.

Move toward improving this syntax by using something more parser-flavored instead of a cobbled-together mess of regular expressions.

In future changes, I plan to support:

  • Making <colgroup /> do something.
  • Probably some formatting options like fill colors and borders.

PHP has some builtin HTML parsing support but: (a) I don't trust it; (b) we aren't really parsing HTML, but HTML-surrounding-remarkup; (c) I tried to make this parser handle any prefix of a valid input as approximately that input to make previews work better; and (d) we probably need more flexibility to get all the behavior right than any existing parser will give us.

Also note that this parser completely parses the input into an intermediate format and then emits that format through the standard HTML rendering pipeline, so this can't (directly) introduce XSS no matter how badly I screwed up the parser. I could have an infinite loop or something, but since the parser itself never marks anything as "safe" it can't violate escaping rules.

Test Plan:

  • Added unit tests, made them pass.
  • Existing unit tests of <table /> syntax continue to pass.
  • Here's an example output, note the input isn't entirely valid but the parser is reasonably forgiving and gets the right result:

Screen Shot 2019-06-04 at 9.29.29 AM.png (565×389 px, 25 KB)

Reviewers: amckinley

Reviewed By: amckinley

Maniphest Tasks: T13160

Differential Revision: https://secure.phabricator.com/D20568