
ArcanistTextLinter: Support UTF8
Open, Needs Triage, Public


The ArcanistTextLinter currently only permits ASCII characters, and this particular check can't be turned off. This makes it impossible to get the other benefits of the TextLinter while still being able to do things like put UTF-8-based diagrams in a text document. It would be nice to be able to use UTF-8 Letters, Marks, Numbers, Punctuation, and Symbols in documents linted with the ArcanistTextLinter.

An example of such a diagram would be a file-system diagram describing the layout of the repository or some other feature of it, e.g.:

├── foo
├── bar
└── baz

Would a diff for this simple feature be accepted? In case it would, I've submitted one here: D16085

Event Timeline

Another, very legitimate example would be diacritic characters in localized README (or other) files. Until UTF-8 support is present in the text linter, I simply have to disable the linter for files that contain words like "żółć". Plus, using ¯\_(ツ)_/¯ in docs would be beneficial to some 😉

eadler added a project: Restricted Project.Jun 16 2016, 6:26 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jun 16 2016, 6:28 PM
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.Jul 4 2016, 9:04 PM

You can disable just the LINT_BAD_CHARSET check (lint code 5) and still use the rest of TextLinter by adding the following to the linter's entry in the .arclint file:

"type": "text",

"severity": {
   "5": "disabled"
}
Obviously, if you've already added the ability to accept UTF-8 characters, this is a moot point, but just for reference.

Anyway, I agree that this functionality can be useful and is desirable.

See some related discussion in T12822. In particular:

  • Any UTF8 mode should still blacklist "zero width space" and likely most other invisible space marks (I think there are about 20 of these) by default, although some may be in normal use (is "IDEOGRAPHIC SPACE" common in CJK typesetting?).
  • Any UTF8 mode should still probably blacklist RTL/LTR marks (or "greylist" them and raise a warning rather than an error, since they're legitimate if you truly want to write inline Hebrew in source).
  • Any UTF8 mode should still probably blacklist BOM by default, with options to require a particular initial BOM in each file.
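The checks above could be sketched roughly as follows. This is an illustrative Python sketch, not Arcanist's actual (PHP) implementation, and the character sets shown are a small hand-picked subset, not a complete list:

```python
import unicodedata

# A small illustrative subset of invisible and bidi-control characters
# that a UTF-8-aware text linter might still flag by default.
INVISIBLE = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (the BOM when file-initial)
}
BIDI = {"\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}

def suspicious_chars(text):
    """Yield (offset, character name, kind) for characters that should
    probably still be rejected (or warned about) in a UTF-8 mode."""
    for i, ch in enumerate(text):
        if ch in INVISIBLE:
            yield i, unicodedata.name(ch, hex(ord(ch))), "invisible"
        elif ch in BIDI:
            yield i, unicodedata.name(ch, hex(ord(ch))), "bidi-control"

hits = list(suspicious_chars("foo\u200bbar\u202e"))
```

A real linter would also special-case a file-initial U+FEFF as a BOM rather than as an embedded invisible character.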

I'm less sure about what to do in these cases:

  • Should we make any effort to deal with Z̭̖̦͖̿̂͐͌ͅḀ̟͇͕̙̋͒ͤ̔ͧ̽ͩL̪̼͎̞͇̍ͤ͂̚G̮̹͓͗̋O͈͕̟̻̥̙̜͆́ (long sequences of combining marks)?
  • We should probably detect and fail on non-preferred forms (e.g., over-long representations).
  • Should we have modes to canonicalize "n + combining tilde" into "n with tilde"? In source, I think we probably should not, but maybe there's some argument or use case for this.
  • There are a large number of characters which look very similar to other characters and could hide bugs. For example, many Cyrillic characters are difficult (or, depending on the font face, impossible) to distinguish from Latin characters. Although there is no reason for a user to type "checkРermissions()" instead of "checkPermissions()" (the former uses Cyrillic "Er" instead of Latin "P"), it is possible that an attacker might.
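To make these cases concrete, here is a hedged Python sketch of the three detectable ones: long combining-mark runs, non-canonical (non-NFC) forms, and mixed Latin/Cyrillic text that could hide homoglyphs. The threshold and the script detection are deliberately simplistic and hypothetical, not a proposed Arcanist design:

```python
import unicodedata

MAX_COMBINING_RUN = 4  # arbitrary cutoff for Zalgo-style sequences

def lint_unicode_issues(text):
    """Return a list of issue tags found in `text` (illustrative only)."""
    issues = []

    # 1. Long runs of combining marks (Zalgo text).
    run = 0
    for ch in text:
        if unicodedata.combining(ch):
            run += 1
            if run == MAX_COMBINING_RUN + 1:
                issues.append("combining-run")
        else:
            run = 0

    # 2. Non-preferred forms: text that is not already NFC-normalized.
    if unicodedata.normalize("NFC", text) != text:
        issues.append("not-nfc")

    # 3. Crude homoglyph heuristic: Latin and Cyrillic letters mixed
    #    in the same chunk of text.
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                scripts.add("Cyrillic")
            elif name.startswith("LATIN"):
                scripts.add("Latin")
    if len(scripts) > 1:
        issues.append("mixed-script")

    return issues
```

For example, `lint_unicode_issues("checkРermissions")` (with the Cyrillic "Er") reports a mixed-script issue while the all-Latin spelling reports none.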

The actual use case presented here might actually be a better fit for a whitelist: ASCII + Unicode Box Drawing Character Block. That dodges all the other problems here while apparently solving the problem presented (wanting to put box-drawing UTF8 in documentation?).
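Such a whitelist is trivial to express, since the Box Drawing block is a single contiguous range (U+2500..U+257F). A minimal sketch of this hypothetical policy (other lint codes would still handle control characters within ASCII):

```python
def allowed_char(ch):
    """Allow-list check: ASCII plus the Unicode Box Drawing block."""
    cp = ord(ch)
    return cp < 0x80 or 0x2500 <= cp <= 0x257F

def first_violation(text):
    """Return the offset of the first disallowed character, or None."""
    for i, ch in enumerate(text):
        if not allowed_char(ch):
            return i
    return None
```

Under this policy the box-drawing tree above passes, while text like "żółć" would still be flagged.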

D6050 was another request but it seems like the root problem there was BOM handling only (?), not actually wanting to Zalgo source files or name functions ☃().

At Facebook, a use case was unit tests with inline Japanese Hiragana/Katakana to test some aspect of mail encoding. In this codebase, we either put these kinds of tests in separate data files or encode them in an ASCII-safe way. My experience at Facebook was that most encoding problems were the result of users with improperly configured editors opening and saving files and corrupting them by mistake. The state of the editor world is presumably better now, but the number of test cases in question here is fairly small and I think this remains a reasonable tradeoff for this project, at least, although perhaps not all projects. I don't immediately recall any requests directly in this vein ("we want to put a large range of Unicode characters -- like all of the glyphs in the CJK blocks -- into source code routinely").

On a purely personal/aesthetic level, the box-drawing character block is total garbage and master ASCII artisans would never use it in serious work. A true connoisseur of the craft should accept nothing but the single-byte ASCII range in their collection:

+-- foo
+-- bar
+-- baz