Page MenuHomePhabricator

Inline Remarkup escaping
Closed, WontfixPublic

Description

Related to: https://github.com/phacility/libphutil/issues/38

Currently, it's hard to tell Remarkup to not parse something when it is being overzealous.

You can turn off markup on a block level, with

%%% Remarkup **won't** be parsed
%%%

Yielding:

Remarkup **won't** be parsed


But what if I wanted to have that line be on a header? Or part of a paragraph with other items that I wanted to be parsed? Or if I have something that Remarkup wants to parse as a Phabricator item, like F1?

{F164597}

I think allowing %%%Don't **parse** me bro!%%% inline should work. Are there any potential problems with this solution?

Event Timeline

bluehawk raised the priority of this task from to Needs Triage.
bluehawk updated the task description. (Show Details)
bluehawk added a project: Remarkup.
bluehawk added a subscriber: bluehawk.

Upon reflection, the easy and obvious solution to my examples is to just wrap those in ` ticks anyways, since they probably should be. But perhaps there are cases where wrapping them in code ticks is undesirable?

epriestley triaged this task as Wishlist priority.Jun 9 2014, 11:12 PM
epriestley added a subscriber: epriestley.

We probably should have some way to do this, but it's reasonable to use backticks in all the use cases I've actually encountered.

I filed the same issue on Github. Backticks worked for me this morning.

I just missed the backticks in the remarkup help article. Didn't realize that monospaced was more than just the font, that it escaped as well. Thanks.

joshuaspence renamed this task from Inline Remarkup escaping. to Inline Remarkup escaping.Jul 25 2014, 10:58 AM

I'm unable to escape the bar | to literally put it in a Remarkup table. Used both backticks and %%%|%%%.

I encountered the same when I tried to write "Amazon S3". I don't think using `backticks` is really appropriate here. Further: How to escape the backticks themselves?

T5397 is another copy of the pipes-in-tables issue.

You can use remarkup.ignored-object-names to prevent S3 from being marked up: S1 S2 S3 S4

Showing inline bash command snippets containing backticks seems like a pretty common usecase. The only workaround I found is to use the triple-backtick syntax but that takes up more space and breaks the flow of the text.

See the documentation: sh -c `some subcommand`

I'll take this and T6002. Where in Phabricator's source code can I find the relevant PHP code? Edit: Found it in libphutil.

Also, will I need to implement some form of versioning, so as to not change existing reMarkup?

I'd like to see real use cases for this before pursuing it. In the use cases above:


Remarkup examples can be wrapped in backticks (or literal blocks, or code blocks), which seems reasonable. To produce this:

**bold text**

**bold text**

...type this:

`**bold text**`
%%%**bold text**%%%

Backticks can be wrapped in double-hashes. To produce this:

The command is `ls -alh`

..type this:

The command is ##`ls -alh`##

Table contents can use the full HTML-like syntax. To produce this:

ls | more

...type this:

<table>
  <tr>
    <td>
      `ls | more`
    </td>
  </tr>
</table>

You might need to read the documentation to figure some of these out, but you'd have to read the documentation to figure out how to escape stuff, too.

Neither this task nor T6002 seem to have any examples of reasonable text that is impossible (or even unreasonably difficult) to stylize in Remarkup today.

@epriestley I hope to fix this in a very general way, which will also solve many of the other tickets related to Remarkup (T6276, T5951, T5427).

One approach is to allow certain HTML entities through unescaped. This would instantly solve most of the tickets, though in a very awkward way (ASCII entity codes are not easy to remember) and potentially break compatibility.

Another approach is to allow for backslash escaping: any character that follows a backslash is automatically considered literal text, except in code blocks. For code blocks (denoted by backticks), sequences of up to n backticks are ignored (considered part of the code block) if surrounded by n+1 backticks.

Unfortunately, I am not sure what the best way is to integrate this into Remarkup's rendering engine, which seems to operate by means of a series of regex replacements rather than a single parser. That is great for extensibility, but bad for handling of nesting, since Remarkup isn't a regular language.

I am not sure how to go from here. Creating a recursive-descent parser in PHP probably has too bad performance. Creating a hand-written parser in C is horrible for extensibility. Honestly, C and PHP are not the best languages for this task (that honor belongs to Haskell and friends, most likely followed by Rust (easily called from C, but still)), but they are what we are stuck with.

Probably the best approach that I can think of would be some like Marpa: a general parser library. But implementing such a library for PHP would be a massive amount of effort.

One last approach, which I think would be the best: have the various passes parse Remarkup into an AST of strings, which are then parsed in-order. But again I am not very sure how to do this. I also don't have much experience with PHP or Phabricator's internals; my sole experience with Phabricator is using it for GHC development and having watched it be used for HHVM development before that ceased.

All changes in Phabricator need to be driven by strong, real-world use-cases.

I don't see any use cases on this ticket.

I don't think escaping is the best solution to T6276 -- the examples there can be parsed unambiguously without escaping.

I don't actually understand the issue in T5951.

I don't think escaping is the best solution to T5427: instead, we could just respect linebreaks in <table> tables instead.

Backslash escaping breaks this sequence of characters:

🤷 ¯\_(ツ)_/¯

On the one hand, this is silly. On the other hand, I've probably seen 100x more users get tripped up by this on Reddit than intentionally use backslashes to escape anything. It would also badly mangle any legitimately escaped text which was copy-pasted into Phabricator.

We already have a pure-PHP parser generator, but the low barrier to extensibility offered by the regexp-based design is dramatically more valuable than being able to generate an AST.

Today, it's fairly easy for users to write their own rules and fairly hard for them to properly escape a handful of obscure sequences in edge cases. Trading that for a world where writing your own rules is hard and escaping obscure sequences is easier is not desirable.

T6143, while closed, seems to be a real-world use case to me.

What about triple backticks in code blocks? Code blocks can contain
anything, so I consider not being able to put three consecutive backticks
in them a bug.

It sounds counter-intuitive, be the upstream prefers not to spend time on theoretical issues. Basically we have thousands of open tasks, and we complete them based on overall impact. Something hypothetically an issue isn't going to receive much attention.

That said, if it's open, we do plan to address it eventually.

T5397 is another copy of the pipes-in-tables issue.

You can use remarkup.ignored-object-names to prevent S3 from being marked up: S1 S2 S3 S4

This doesn't work that well.
Commentary around code for DHCP clients for example will frequently reference T1 and T2 as timers and not tasks.
When dealing with other stuff, Z1 is frequently used as well and it's not a chat room either!
This plays havoc in commit messages, and I'm not sure escaping is the correct thing to do.

For starters, this regex fails to work ^((Q|V)\d)|(T[12])$, so I had to use this ^(Q|T|V)\d)$.
But that is still not satisfactory because that means I have a different workflow for tasks 1-9 .... and for Q and V objects as well.

One easy workaround is to start fresh databases off with a sensible seed for auto increment values.
By settings tasks and chats start at 100 for example, I am able to import an old git database that deals with T1 T2 and Z1 into a fresh phabricator database with initial tasks already setup.

For starters, this regex fails to work

| is stronger than ^ or $. The regular expression you've written evaluates like this:

  • Match ^((Q|V)\d)
  • OR
  • Match (T[12])$.

If you write the regular expression properly, it works fine:

Screen Shot 2017-02-09 at 2.31.01 PM.png (172×403 px, 18 KB)

You can check the PHP documentation for help with regexes, or there are a lot of tools online.

You're correct, using brackets makes the regex work, thanks.

But still, my point stands that the default workflow is different for Q and V objects 0 through 9 compared to say Q11 or V123.
It's highly probable that a one letter prefix and a single digit suffix will collide in commits, messages or discussion about the project itself but unrelated to any phabricator entity.
The more characters in play, the less probable this becomes, hence my suggestion to change the default auto increment value from 1 to say 100 or even 1000 for any entity that can link like so.
This means that the default regex can be removed and the workflow is consistent.

This doesn't solve the case of Q101, but it does solve it for DHCP clients and projects which use low value Z co-ordinates.
Commentary on this is welcome, even just to say no it won't fly.

Commentary on this is welcome, even just to say no it won't fly.

I think this adds a great deal of complexity to solve a very very small problem that almost no installs experience, so I'm not interested in bringing it upstream.

What about triple backticks in code blocks? Code blocks can contain
anything, so I consider not being able to put three consecutive backticks
in them a bug.

I'm happy to review these changes:

  • The "N > 3 backticks to start a codeblock, terminated by that many backticks" change.
    • If the codeblock change goes in, I'll review a similar change to %%% blocks for consistency (I think this is probably not of much use, but worth making consistent).
    • I'd also accept a similar change to PhutilRemarkupInterpreterBlockRule to make cowsay {{{ ... }}} use the same rule ("three or more opening marks, closed by the same number of closing marks"), again motivated by for consistency.
  • For T6276, I think parser changes are possible which improve parsing for URIs which contain no internal ]] sequences without weird side effects. That is, [[ http://example.com/?x[]=1 ]] should be parseable unambiguously without needing to build an AST or add escaping.
  • For T5951, I'm not exactly sure what the problem is. If someone can come up with a more concise problem description, there may be a way forward without escaping. Particularly, I think [[ x | y | z ]] could probably treat y | z as the link title safely. However, that task is also from an era before [[ x ]] created a link with the page title as the name -- you used to have to type [[ apples | Discussion of the Manufacture and Sale of Apples ]]. Fixing the behavior of [[ x ]] may have mooted that task in practice. I'll probably just go close it now and wait for a new modern report if a problem still exists. I'd also vaguely like to leave [[ x | y | z ]] open to allow for the possibility that z supports more exotic options in the future, like [[ x | y | icon=paw ]] or similar.
  • For T5427, I'll accept a change which respects linebreaks in the <table> longhand syntax. That is, this input should produce a properly formatted, emotionally captivating haiku:
<table>
  <tr>
    <td>
      Pasta is made of wheat
      Except weird gluten-free pasta
      I eat spaghetti
    </td>
  </tr>
</table>

It may be possible to accomplish this with only a CSS change, adding a rule like .phabricator-remarkup .remarkup-table td { whitespace: pre-wrap; }. Otherwise, starting with the preserve-linebreaks option may provide some clues on how to move forward.

In all cases, the change should be accompanied by test coverage using the existing format.

There is no need to version things (and no mechanism for doing so -- we have cache versioning, but not parser versioning). I think all of these changes are vanishingly unlikely to cause any confusion or difficulty since they all tend to make Remarkup do something reasonable/expected instead of something confusing/broken, not replace one useful behavior with a different useful behavior.

If you test things with the web UI, note that it's usually easier to use a comment preview for testing than a static block like a task description: the preview regenerates every time you type something, but most descriptions are cached and will not reflect your changes unless you bump the cache version or purge the cache.

Some of these could be accompanied by documentation updates (table whitespace, remarkup backticks). The documentation lives in phabricator/. Others (link parsing) probably aren't useful to document since they're basically just bug fixes that no one would be surprised by.

What are your thoughts on allowing valid HTML entities through to the
browser without further escaping (outside of code blocks)? There seems to
be no reason to have them in text, and they could allow escaping.

Also, what about allowing |, [, and ] in tables and links to be escaped
with a backslash?

I think the complexity of those rules exceeds their utility.

I'm hesitant to give magic behavior to sequences which reasonably may occur in source code even if they only apply outside of code blocks, because new users don't necessarily know how code blocks work. Remarkup tries fairly hard to prevent misfires on input like this, which is reasonable C code with an HTML entity:

char *amp_ptr = &amp;

I see your point.

I do think that there should be a way to put | in a table, without using
the longhand syntax. Also, I think there needs to be a way to link to any
valid URL, no matter how contrived. But I am not sure what that way is.

everytime I type D3 meaning the javascript visualization lib,
it references that Diff #. Similarly with common strings like
#servers looking up projects. I want the auto-linking behavior
most times, I just want a simple escape character for the few
times when I don't.

Late friday night brings good ideas

Simple escape character that won't be used in combinations causing problems is really hard to conceive. But! With current support for emoji suggestions it would probably be possible to combine emoji as escape? Currently least used emoji is pager (📟), so maaaybe 📟 before remarkup rule could disable rule? Regex could be (i think) prefixed with [^📟] or something like this to disable matching and viola?

Disclaimer: late night + bad jokes = good ideas. Do or do not take this seriously.

@epriestley, @chad, is the use-case in T5301#218589 clear enough?

I'm in the same situation, I need to reference the Renesas R-Car H3 board, and the autolinking to Herald is clearly wrong. I don't want to prevent others from linking to the H3 Herald rule, nor I can change the instance-wide configuration at will. I would rather avoid putting H3 between backticks, as it's not really a piece of code.

I understand that it's a very, very low priority issue that can worked around or even completely ignored without much impact, but I just wanted to be sure that the use case described by @rsmarples and @nharkins was clear. WONTFIX is fine as well.

Perfectly clear.

I do not currently plan to ever implement a simple escape character, particularly backslash.

You can implement this yourself locally if you believe that the cost of this rule (specifically: interpreting backslash as an escape character in contexts where it is not intended as an escape character) is worth the benefit. I believe the cost of this rule greatly outweighs the benefit, and that the overwhelming majority of backslashes appearing in Remarkup would not be intended as escape characters even if this rule was in place. Remarkup is very easy to extend if you disagree.

I didn't mention backslashes on purpose, as I'm not specifically interested in any kind of escape characters. I just would like to have a specific occurrence of H3 to not be hyperlinked: any mechanism to do so would be fine, be it an escape character, an escape sequence, an inline version of the literal block, or something else entirely like an emoji. :) Thanks!

Heh, zero-width spaces are an option, yes. A bit too subtle in my eyes, but fair enough.

Would it be possible for reMarkup to strip all zero-width non-breaking
spaces from its output, unless there was more than one in a row (in which
case the number would be decreased by 1)? That would solve the problem.

Another possible escape character is the ASCII escape character. Either
could be bound to a key combination in the UI.

Yes, it would be very easy! Feel free to implement such a rule locally.

epriestley claimed this task.
  • If you want to implement backslashes or thunderclouds as escape characters, feel free to do so locally. I don't plan to bring these changes upstream.
  • If you want to strip zero-width spaces or escape literals, feel free to do so locally. I don't plan to bring these changes upstream.
  • I do not plan to implement any sort of generalized inline escape character, sequence, or block in upstream Remarkup at this time.

I was trying to type the phrases "visible within O2's network" and "the O2 SIM card" but something is interpreting the company name as markup. I could type "visible within O2's network" but that seems a little odd. Is there a better solution for this off edge case?

Reading above I see that O2 with a zero width space copied and pasted seems to work OK. I will go with that for now. Thanks.

If you want to type sequences like O2 without triggering remarkup, you can:

  • (Recommended) Surround them with backticks: `O2` produces O2.
  • (Recommended) Add them to remarkup.ignored-object-names: V1, V2, Q1, Q2.
  • (Discouraged) Pursue creative solutions like embedding zero-width spaces.

Thank you for the helpful summary. The Discouraged creative solution worked for me. I expect this will be a one off, if it is not I'll try and setup some ignored phrases.