Page MenuHomePhabricator

Replace Differential ZWS "oncopy" behavior with a grand Javascript sorcery
Closed, ResolvedPublic


We currently use Zero Width Space characters to "cleverly" make side-by-side diffs somewhat-copyable (we want to copy only one side of the text, and do not want to copy line numbers). This implementation has some rough edges:

  • On the face of it, it's obviously a giant hack.
  • When ZWS characters appear in actual text, they break things (see below).
  • You can only copy the right-hand side of a diff (see PHI974 and PHI504 for users wanting to copy the left side of a diff).
  • We copy the displayed text. We'd like to always copy the exact original text, but we'd sometimes like to display different text. We currently have to display the original text to make copy behavior authentic. (See PHI973 for \r.)

We can fix all three cases with a great Javascript ritual:

  • When the user starts a selection, tag either the left or right hand side of the diff for visual selection based on where they clicked.
  • When the user copies, walk the whole DOM and build the text manually.
  • Allow nodes to have alternate copy-text which is used instead of the display text when the two differ.

I believe the ZWS implementation is an outgrowth of a previous, very ancient implementation was motivated partly by concerns that node.textValue was wildly unreliable prior to 2011, but modern (non-IE) browsers seem less crazy about this and ReviewBoard has a node.textValue-based Javascript ritual which suggests this probably isn't too awful anymore nowadays.


Zero width characters, e.g. <U+200B>, are usually accidental and unintended. They are also insidious in that they are hard to detect during code review or later on when they've caused issues in production. In particular, the differential will highlight the affected line but there's no individual character to highly in dark green.

The differential tool should make the diff painfully obvious to make sure it's not an accidental change. A good example of this is what's shown by git show b4409d8c95885b22277d30d031f405d7bfbc0618 (this is a value commit in Phacility):

-echo "Test complete"
+<U+200B><U+200B>echo "Test complete"

This has caught out other users, for example:

As an addition to this (possibly a separate feature), add simple character match check to the text linter. This could also be used to check for tabs or other undesirable characters. As a workaround, I configured a herald rule - this works but the checks should really be done by the arc linter.

Can a lint check be easily added without the complexity of using the script-and-regex lint tool?

Other reference found to zero width space:

Event Timeline

Anything programmatically detectable should be raised through lint. Asking humans to detect it is still likely to not solve the core problem since they'll always be less reliable. .

well script-and-regex is NOT anything complex... or you can try and add zero-widht-space as a rule to arcanist Text linter or make own linter extension... anything is good ;)

I think that ArcanistTextLinter actually checks for ASCII right now, so it should be easy to enable for this case. See also D6050, which discusses selecting encoding for it.

epriestley added a subscriber: epriestley.

Per above, these are already detected by the Text linter:

$ arc diff --conduit-uri=
>>> Lint for zws.txt:

   Error  (TXT5) Bad Charset
    Source code should contain only ASCII bytes with ordinal decimal values
    between 32 and 126 inclusive, plus linefeed. Do not use UTF-8 or other
    multibyte charsets.

    >>>        1 Here are some zero-width spaces: >>>​​<<<
 LINT ERRORS  Lint raised errors!

On the CLI on my system, at least, the colorization makes the location of the problem more clear than the raw text conveys:

There is also an "Invsible" syntax highlighter which visually shows invisible characters (like tabs and newlines). You can find this in View Options → Highlight As... → Invisible.

Currently, this only shows ASCII invisible characters (bytes \x00 through \x1F). We could reasonably expand this to include Unicode space characters as well, although there are a substantial number of them outside of ZWS (e.g., one result on Google:

This list excludes some marks which should probably be rendered by a unicode-aware invisibles mode, like the various RTL markers and Byte Order Marks.

I think rounding these up and improving the "Invisible" highlighter would strictly be an improvement with no costs or downsides.

But I think this is probably not a good fit for the described problem (users accidentally inserting ZWS characters into source code). I think this is a good tool if lint raises an error, a developer isn't sure how to fix it and asks for help, and you can use this mode to see what the problem is. But it's not a good tool to find these characters in the first place, since the implied workflow (switch every file to "Invisibles" and look at it just in case a user made a mistake) is entirely unreasonable.

To step back, how did ZWS characters get there in the first place? This seems like a mistake which is normally very difficult to make.

However, we use ZWS characters in Differential to improve copy/paste behavior (so that when a user selects a chunk of text on the right hand side of a diff and copies it, they get only the right-hand side on their clipboard, instead of a big mishmash of left, right, and line numbers). Layered on top of this, we intercept browser "copy" events and strip the ZWS characters.

It is possible that this code has a bug (or, for example, doesn't work properly anymore on a newer version of some browse), and that the ZWS characters came from copy/pasting code out of Differential in the first place. If so, we should either try to fix the bug or consider removing all the ZWS magic entirely (I think the utility is very marginal, and not worth even a small number of users actually ending up with ZWS characters on their clipboards).


  • The text linter we ship with enforces ASCII-only source code by default. It is what we use in this project, and it appears to completely solve this problem for us. If that sounds like a good approach, let us know if you need help configuring it. The most common reason this isn't a good approach for installs is that they prefer to write their source code in UTF8 (usually because non-English text appears in the source for some reason).
  • If you want a UTF8 mode for the Text linter, we can note your interest on T11150 and give it some more weight for prioritization because you're a customer, but note that this is probably a bit complicated (i.e., I suspect users are likely to disagree on which characters are and are not allowed in "UTF8 source code" and that this mode will be less of a distinct mode and more of a collection of filter lists) and largely entangled with T10038, a large task.
  • If the ZWS characters came from Differential in the first place, it would be helpful to know that (and as much information about how it happened as possible) so we can either prevent it or re-evaluate our behavior.

Finally, I don't think it's totally unreasonable to consider displaying ZWS and other characters in this class in Differential by default. I don't think it's the best solution to any problem here, but it shouldn't normally hurt anything.

However, it conflicts with the existing copy/paste behavior. As long as we're supporting copy/paste, I don't think we should transform the source for display: copy/paste should actually create a copy of the source that was selected. If we replace ZWS, etc., with visible text, this won't be true. (We could "highlight" these characters in a bright color instead of actually replacing them, but I worry this would be unclear. We could "highlight" + tooltip, maybe. Any approach in this vein will be complex because we can not integrate this directly into syntax highlighters in the general case, and run into HTML safety issues if we try to layer it on top.)

I don't think the copy/paste behavior is very important and I'm not opposed to removing it in favor of visible ZWS characters -- especially if it's causing ZWS characters to leak into source somehow -- but at least some users use it. This would be an easier call to make with evidence that the ZWS strategy is causing problems.

We could also look at a switch away from the ZWS strategy. For example, instead of ZWS, I believe we could use <img src="..." alt="🐐" /> or similar. If the oncopy magic failed, at least we'd get a super obvious failure: a herd of goats appear in your source code. However, this could break source that legitimately used goat characters (which are highly questionable, but I think less questionable than ZWS characters) and might cause other problems (at a minimum, we'd need to hide it from screenreaders).

FWIW I only code via copy/paste.

me too but not everyone is on our level

Two questions pick at me:

  1. Which editor did the user use to add the ZWS in? Any editor I remember does something to warn you about it.
  2. Why are there Unicode characters for "goat" (🐐) and for "cooked rice" (🍚)?
epriestley renamed this task from zero width space should be visible and emphasized in differential tool to Replace Differential ZWS "oncopy" behavior with a grand Javascript sorcery.Nov 21 2018, 5:28 PM
epriestley triaged this task as Normal priority.
epriestley updated the task description. (Show Details)
epriestley edited projects, added Differential; removed Customer Impact, Feature Request.

In Firefox, selecting a range which crosses an inline comment currently drops the line after the comment. Here, return arra does not copy:

The following line, after the problem line, copies fine:

This currently copies as:

rchAttachments() {
      id(new Differen

(This is possibly some kind of TextFragment / .textContent issue.)

I'm probably not going to fix this since it's pretty minor and moderately complicated to fix, but this interaction is not ideal:

  • Begin selecting text on either side of the diff.
  • With the mouse button held down, move the cursor over the line numbers.

Desired behavior: nothing special happens.
Actual behavior: the inline comment reticle is shown.

epriestley claimed this task.

There are probably still a few remaining kinks to work out here, but I believe this is substantially resolved:

  • Text may be copied from either side of a diff.
  • ZWS, NBSP, and 0x00-0x19 are now rendered in an attention-grabbing way, but copy to the clipboard as the raw source bytes.