Page MenuHomePhabricator

Adds an option to ArcanistTextLinter to permit non-ASCII charsets.
AbandonedPublic

Authored by hach-que on May 27 2013, 5:16 AM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Apr 17, 3:09 PM
Unknown Object (File)
Fri, Mar 29, 2:13 AM
Unknown Object (File)
Mar 10 2024, 7:19 AM
Unknown Object (File)
Feb 4 2024, 2:24 AM
Unknown Object (File)
Jan 16 2024, 5:51 AM
Unknown Object (File)
Jan 13 2024, 1:27 AM
Unknown Object (File)
Dec 14 2023, 2:06 PM
Unknown Object (File)
Nov 26 2023, 1:05 AM

Details

Reviewers
epriestley
Summary

Some languages such as C# default to a UTF-8 charset for source code. The default ArcanistTextLinter prevents all non-ASCII character in a text file, so this adds an option which can be used to turn off bad charset detection.

Ideally there'd be an option to turn on UTF-8 source code mode instead of just a blanket ignore (since then it could also detect when the source code will trip up arc diff's unicode detection), but I don't know enough about character sets to implement UTF-8 validation.

Test Plan

Call setBadCharsetsAllowed from within a lint engine and save a file as UTF-8. The linter should not complain.

Diff Detail

Branch
master
Lint
Lint Passed
Unit
Tests Passed

Event Timeline

Hmm, so on second thoughts, this might not be such a swell idea. The repository I was using this on has a mix of files with UTF-8 BOM and ASCII files, so it's probably better if I just sanitize the repository back to pure ASCII and let the linter do it's thing. Generally I think it's a case of UNIX preferring ASCII text files and all the Windows editors (like Visual Studio) adding the BOM by default.

I'll re-open this if there's a case where the editors are causing too much friction with the ASCII-only policy.

A couple of notes here:

  • Part of the reason for this rule was that we did not handle unicode diffing correctly (it was OK, but not as nice as ASCII diffing). We recently finished T2379, and should now handle the vast majority of unicode stuff correctly (e.g., including combining characters). We'll probably still get some edge cases like RTL markers wrongish or not-quite-ideal, but all reasonable "text-like" unicode text should diff correctly now (and we'll never truncate characters midway through their bytes or anything actually bad).
  • Historically, the major reason for this rule is users with misconfigured editors. Pretty much all UTF8 text in Facebook's codebase got mangled by someone's editor at some point, and the policy that was easiest to communicate and easiest to get engineers to adhere to was "only use ASCII" -- everyone could understand that, but not everyone understood how character encodings work or how to configure their editor.
  • This policy is probably too opinionated to enable by default. Although I think it's probably still the best policy, and plan to retain it in Phabricator itself, a default policy "all files must match the configured repository encoding, defaulting to utf8" would probably make more sense. However, I'd rather wait until after T2039 to monkey with this stuff for reasons discussed there.
  • We don't have any special handling of BOM (I believe only some Windows editors produce it by default -- I haven't seen any files in the wild with it) but possibly should. If BOM causes issues other than triggering this, let me know.
  • You can already disable a warning like this, without touching the implementation:

    $linter->setSeverity( array( ArcanistTextLinter::LINT_BAD_CHARSET => ArcanistLintSeverity::SEVERITY_DISABLED, ));