Page MenuHomePhabricator

Parse remarkup tables with something like a real parser instead of regular expressions
Needs ReviewPublic

Authored by epriestley on Tue, Jun 4, 4:32 PM.



Ref T13160. See PHI1275. We have several use cases where Remarkup documents are used as more polished/permanent content (blog posts, "nice" documentation) and/or automatically generated.

In these cases, users would generally like more control and formatting options, since they're either expecting the document to be long-lived / frequently read (so the extra work to make it look nice is worthwhile), or they're generating the document and the cost of doing one-time formatting setup is small compared to how often stuff will be generated.

We do a bit of this ourselves, since I partially codegen the changelog and then manually edit from there, although I'm not in super urgent need of fancier formatting options.

A lot of these use cases involve some kind of display tables and wanting better control over layout/spacing options. Although I think T13158 might eventually serve this use case in some cases, it probably won't cover everything.

We already support a verbose, HTML-like <table> syntax in Remarkup aimed at these more-deliberate use cases: this syntax isn't very convenient if you're writing a comment, but fairly good if you're code-generating a unit test results table.

Move toward improving this syntax by using something more parser-flavored instead of a cobbled-together mess of regular expressions.

In future changes, I plan to support:

  • Making <colgroup /> do something.
  • Probably some formatting options like fill colors and borders.

PHP has some builtin HTML parsing support but: (a) I don't trust it; (b) we aren't really parsing HTML, but HTML-surrounding-remarkup; (c) I tried to make this parser handle any prefix of a valid input as approximately that input to make previews work better; and (d) we probably need more flexibility to get all the behavior right than any existing parser will give us.

Also note that this parser completely parses the input into an intermediate format and then emits that format through the standard HTML rendering pipeline, so this can't (directly) introduce XSS no matter how badly I screwed up the parser. I could have an infinite loop or something, but since the parser itself never marks anything as "safe" it can't violate escaping rules.

Test Plan
  • Added unit tests, made them pass.
  • Existing unit tests of <table /> syntax continue to pass.
  • Here's an example output, note the input isn't entirely valid but the parser is reasonably forgiving and gets the right result:

Diff Detail

rPHU libphutil
Lint OK
Unit Tests OK
Build Status
Buildable 22942
Build 31479: Run Core Tests
Build 31478: arc lint + arc unit

Event Timeline

epriestley created this revision.Tue, Jun 4, 4:32 PM
epriestley requested review of this revision.Tue, Jun 4, 4:32 PM
epriestley added inline comments.Tue, Jun 4, 4:36 PM

I'm picking <colgroup /> children out now, but not doing anything with them yet.

epriestley updated this revision to Diff 49067.Tue, Jun 4, 4:37 PM
  • Add a couple of extra non-semantic newlines to the test cases for prettiness.