Parse remarkup tables with something like a real parser instead of regular expressions
Closed · Public

Authored by epriestley on Jun 4 2019, 4:32 PM.

Details

Reviewers

amckinley

Maniphest Tasks

T13160: Support <colgroup> in the verbose remarkup <table> syntax for specifying column widths

Commits

rPHUb9f35642c4e0: Parse remarkup tables with something like a real parser instead of regular…

Summary

Ref T13160. See PHI1275. We have several use cases where Remarkup documents are used as more polished/permanent content (blog posts, "nice" documentation) and/or automatically generated.

In these cases, users would generally like more control and formatting options, since they're either expecting the document to be long-lived / frequently read (so the extra work to make it look nice is worthwhile), or they're generating the document and the cost of doing one-time formatting setup is small compared to how often stuff will be generated.

We do a bit of this ourselves, since I partially codegen the changelog and then manually edit from there, although I'm not in super urgent need of fancier formatting options.

A lot of these use cases involve some kind of display table and wanting better control over layout/spacing options. Although I think T13158 might eventually cover some of these cases, it probably won't cover everything.

We already support a verbose, HTML-like <table> syntax in Remarkup aimed at these more-deliberate use cases: this syntax isn't very convenient if you're writing a comment, but it's fairly good if you're code-generating a unit test results table.
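As an illustration of the code-generation use case, here is a minimal sketch of a script emitting the verbose <table> syntax for a unit test results table. The function name `remarkup_table` and the sample data are hypothetical and not part of this change. Cell contents are passed through untouched, on the assumption that they are themselves remarkup.

```python
def remarkup_table(headers, rows):
    """Build a verbose remarkup <table> block (hypothetical helper).

    Cell strings are emitted as-is, since cells hold remarkup text.
    """
    lines = ["<table>"]
    # One header row built from <th> cells.
    lines.append("<tr>" + "".join(f"<th>{h}</th>" for h in headers) + "</tr>")
    # One <tr> per data row, built from <td> cells.
    for row in rows:
        lines.append("<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(remarkup_table(["Test", "Result"], [["testParse", "pass"]]))
```

A generator like this pays the formatting setup cost once, which is the tradeoff described above.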

Move toward improving this syntax by using something more parser-flavored instead of a cobbled-together mess of regular expressions.

In future changes, I plan to support:

Making <colgroup /> do something.
Probably some formatting options like fill colors and borders.

PHP has some builtin HTML parsing support but: (a) I don't trust it; (b) we aren't really parsing HTML, but HTML-surrounding-remarkup; (c) I tried to make this parser handle any prefix of a valid input as approximately that input to make previews work better; and (d) we probably need more flexibility to get all the behavior right than any existing parser will give us.
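To make point (c) concrete, here is a rough sketch of one way a parser can treat any prefix of a valid input as approximately that input. This is not the code in this change (which is PHP); the function `parse_forgiving` and the dict-based tree are assumptions made for illustration. The idea: drop a tag that is cut off at end of input, ignore stray closing tags, and treat elements still open at end of input as implicitly closed.

```python
def parse_forgiving(source):
    """Parse tag soup into a nested tree, tolerating truncated input.

    Illustrative only: every "<" is assumed to start a tag, and
    attributes are ignored.
    """
    root = {"tag": None, "children": []}
    stack = [root]  # stack[-1] is the innermost open element
    pos = 0
    while pos < len(source):
        lt = source.find("<", pos)
        if lt == -1:
            # No more tags: the rest is text in the current element.
            stack[-1]["children"].append(source[pos:])
            break
        if lt > pos:
            stack[-1]["children"].append(source[pos:lt])
        gt = source.find(">", lt)
        if gt == -1:
            # Tag cut off mid-way (e.g. "</tr"): drop it and stop.
            break
        body = source[lt + 1:gt].strip()
        pos = gt + 1
        if body.startswith("/"):
            name = body[1:].strip().lower()
            # Close the nearest matching open element; ignore strays.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == name:
                    del stack[i:]
                    break
        else:
            self_closing = body.endswith("/")
            parts = body.rstrip("/").split()
            node = {"tag": parts[0].lower() if parts else "", "children": []}
            stack[-1]["children"].append(node)
            if not self_closing:
                stack.append(node)
    # Anything still on the stack is implicitly closed at end of input.
    return root
```

With this approach, `<table><tr><td>a` and `<table><tr><td>a</td></tr` both produce the same tree as the complete input, so a live preview does not jump around while the user is still typing.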

Also note that this parser completely parses the input into an intermediate format and then emits that format through the standard HTML rendering pipeline, so this can't (directly) introduce XSS no matter how badly I screwed up the parser. I could have an infinite loop or something, but since the parser itself never marks anything as "safe" it can't violate escaping rules.
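The safety argument can be sketched as follows. This is an illustrative Python model, not the actual rendering pipeline: the `render` function, the `ALLOWED` set, and the dict-based tree are all assumptions. The parser only produces data; a separate renderer decides which tags to emit and escapes every text node, so a parser bug can produce a wrong tree but not unescaped markup.

```python
from html import escape

# Hypothetical whitelist: the renderer, not the parser, decides
# which tags may appear in the output.
ALLOWED = {"table", "tr", "th", "td"}


def render(node):
    """Render a parsed tree to HTML, escaping all text nodes."""
    if isinstance(node, str):
        # Text is always escaped; the parser never marks anything safe.
        return escape(node)
    inner = "".join(render(child) for child in node["children"])
    tag = node["tag"]
    if tag in ALLOWED:
        return f"<{tag}>{inner}</{tag}>"
    # Unknown tags (and the root) contribute only their children.
    return inner


if __name__ == "__main__":
    tree = {"tag": "td", "children": ["<script>alert(1)</script>"]}
    print(render(tree))
```

In this model, text inside a cell is plain text; in Remarkup proper the cell contents are themselves remarkup and would go through the normal rule pipeline instead.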

Test Plan

Added unit tests, made them pass.
Existing unit tests of <table /> syntax continue to pass.
Here's an example output. Note that the input isn't entirely valid, but the parser is reasonably forgiving and gets the right result:

Screen Shot 2019-06-04 at 9.29.29 AM.png (565×389 px, 25 KB)

Diff Detail

Repository

rPHU libphutil

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

epriestley created this revision. Jun 4 2019, 4:32 PM

Harbormaster completed remote builds in B22941: Diff 49066. Jun 4 2019, 4:32 PM

epriestley requested review of this revision. Jun 4 2019, 4:32 PM

epriestley added inline comments. Jun 4 2019, 4:36 PM

src/markup/engine/remarkup/blockrule/PhutilRemarkupTableBlockRule.php
96	I'm picking `<colgroup />` children out now, but not doing anything with them yet.

Add a couple of extra non-semantic newlines to the test cases for prettiness.

Harbormaster completed remote builds in B22942: Diff 49067. Jun 4 2019, 4:37 PM

I'm a little worried about a Postel's Law-style HTML parser. Later on I can envision getting more strict about what we accept in the interest of delivering more precise error messages, which might break existing pages that previously worked just by accident. I guess if it ever comes to that, we can write a migration that warns installs about suddenly-malformed wiki pages.

This revision is now accepted and ready to land. Jun 18 2019, 11:26 PM

Yeah, there's a lot of very ambiguous behavior here in the face of ambiguous inputs. I think we're probably not walking into too much of a minefield, but I'm not confident I picked the best behavior for all malformed/suspicious inputs.

One tool we do have to deal with increasing strictness is that we could separate "preview behavior" from "rendering behavior" -- so when you're typing a table it could complain that you're making everything up, but then we could render it more permissively.

I think many of the use cases for this today are auto-generating the tables anyway, so it's probably pretty moot.
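One possible shape for the preview/rendering split mentioned above, as a hedged sketch only: the same repair logic runs in both modes, but warnings are surfaced only when a strict flag is set. Nothing like this exists in the change; the function `balance`, the event format, and the warning strings are hypothetical.

```python
def balance(events, strict=False):
    """Repair unbalanced tag events; report problems only when strict.

    events: list of ("open" | "close", tag_name) tuples.
    Returns (repaired_events, warnings).
    """
    stack, out, warnings = [], [], []
    for kind, name in events:
        if kind == "open":
            stack.append(name)
            out.append((kind, name))
        elif name in stack:
            # Close everything up to and including the matching tag.
            while stack:
                top = stack.pop()
                if top != name:
                    warnings.append(f"<{top}> was never closed")
                out.append(("close", top))
                if top == name:
                    break
        else:
            # Stray close tag: drop it from the output.
            warnings.append(f"stray </{name}>")
    # Implicitly close whatever is still open at end of input.
    while stack:
        top = stack.pop()
        warnings.append(f"<{top}> was never closed")
        out.append(("close", top))
    return out, (warnings if strict else [])
```

A preview could call `balance(events, strict=True)` and show the warnings while the user types, while final rendering calls `balance(events)` and accepts the repaired result silently.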

Closed by commit rPHUb9f35642c4e0: Parse remarkup tables with something like a real parser instead of regular… (authored by epriestley). Jun 24 2019, 5:51 PM

This revision was automatically updated to reflect the committed changes.

epriestley added a commit: rPHUb9f35642c4e0: Parse remarkup tables with something like a real parser instead of regular….

epriestley mentioned this in T5427: Force a line break in a table cell. Feb 6 2020, 10:50 PM

epriestley mentioned this in D20971: Respect linebreaks in full HTML tables in Remarkup. Feb 6 2020, 10:53 PM

epriestley mentioned this in rP2327578adc94: Respect linebreaks in full HTML tables in Remarkup. Feb 6 2020, 11:01 PM