Page MenuHomePhabricator

Parse remarkup tables with something like a real parser instead of regular expressions
ClosedPublic

Authored by epriestley on Jun 4 2019, 4:32 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 4, 5:42 PM
Unknown Object (File)
Mon, Dec 30, 12:01 AM
Unknown Object (File)
Sun, Dec 29, 10:30 AM
Unknown Object (File)
Fri, Dec 27, 11:02 AM
Unknown Object (File)
Fri, Dec 27, 8:50 AM
Unknown Object (File)
Tue, Dec 24, 3:56 PM
Unknown Object (File)
Dec 20 2024, 3:30 PM
Unknown Object (File)
Dec 19 2024, 1:03 AM
Subscribers
None

Details

Summary

Ref T13160. See PHI1275. We have several use cases where Remarkup documents are used as more polished/permanent content (blog posts, "nice" documentation) and/or automatically generated.

In these cases, users would generally like more control and formatting options, since they're either expecting the document to be long-lived / frequently read (so the extra work to make it look nice is worthwhile), or they're generating the document and the cost of doing one-time formatting setup is small compared to how often stuff will be generated.

We do a bit of this ourselves, since I partially codegen the changelog and then manually edit from there, although I'm not in super urgent need of fancier formatting options.

A lot of these use cases involve some kind of display tables and wanting better control over layout/spacing options. Although I think T13158 might eventually serve this use case in some cases, it probably won't cover everything.

We already support a verbose, HTML-like <table> syntax in Remarkup aimed at these more-deliberate use cases: this syntax isn't very convenient if you're writing a comment, but fairly good if you're code-generating a unit test results table.

Move toward improving this syntax by using something more parser-flavored instead of a cobbled-together mess of regular expressions.

In future changes, I plan to support:

  • Making <colgroup /> do something.
  • Probably some formatting options like fill colors and borders.

PHP has some builtin HTML parsing support but: (a) I don't trust it; (b) we aren't really parsing HTML, but HTML-surrounding-remarkup; (c) I tried to make this parser handle any prefix of a valid input as approximately that input to make previews work better; and (d) we probably need more flexibility to get all the behavior right than any existing parser will give us.

Also note that this parser completely parses the input into an intermediate format and then emits that format through the standard HTML rendering pipeline, so this can't (directly) introduce XSS no matter how badly I screwed up the parser. I could have an infinite loop or something, but since the parser itself never marks anything as "safe" it can't violate escaping rules.

Test Plan
  • Added unit tests, made them pass.
  • Existing unit tests of <table /> syntax continue to pass.
  • Here's an example output, note the input isn't entirely valid but the parser is reasonably forgiving and gets the right result:

Screen Shot 2019-06-04 at 9.29.29 AM.png (565×389 px, 25 KB)

Diff Detail

Repository
rPHU libphutil
Branch
table1
Lint
Lint Passed
Unit
Tests Passed
Build Status
Buildable 22941
Build 31477: Run Core Tests
Build 31476: arc lint + arc unit

Event Timeline

src/markup/engine/remarkup/blockrule/PhutilRemarkupTableBlockRule.php
96

I'm picking <colgroup /> children out now, but not doing anything with them yet.

  • Add a couple of extra non-semantic newlines to the test cases for prettiness.

I'm a little worried about a Postel's Law-style HTML parser. Later on I can envision getting more strict about what we accept in the interest of delivering more precise error messages, which might break existing pages that previously worked just by accident. I guess if it ever comes to that, we can write a migration that warns installs about suddenly-malformed wiki pages.

This revision is now accepted and ready to land.Jun 18 2019, 11:26 PM

Yeah, there's a lot of very ambiguous behavior here in the face of ambiguous inputs. I think we're probably not walking into too much of a minefield, but I'm not confident I picked the best behavior for all malformed/suspicious inputs.

One tool we do have to deal with increasing strictness is that we could separate "preview behavior" from "rendering behavior" -- so when you're typing a table it could complain that you're making everything up, but then we could render it more permissively.

I think many of the use cases for this today are auto-generating the tables anyway, so it's probably pretty moot.