Arcanist linters do not have convenient tools for converting between encoding-aware character offsets and raw byte offsets
Open, Wishlist, Public

Description

My external linter reports offsets taking into account UTF8 characters. That is, if there's a UTF8 byte order mark at the start of the file, the first real text character is reported as offset 1, line 1, character 1, while internally inside Arcanist this is actually offset 3, line 1, character 3. I initially tried just offsetting all reported lint errors by 3 if a BOM is detected, but this doesn't handle the scenario where there are other UTF8 characters in the source file.

The Arcanist linter API needs to offer methods that count offsets based on UTF8 characters instead of strict ASCII characters.

Event Timeline

hach-que assigned this task to epriestley.
hach-que raised the priority of this task from to Needs Triage.
hach-que updated the task description.
hach-que added a project: Arcanist.
hach-que added a subscriber: hach-que.

char in Arcanist means "byte". If your linter works in UTF8-character offsets, you should convert them to byte offsets before passing messages up to Arcanist.

If we used char to mean "UTF8 character" in the way your external linter does, I think it would be impossible to write a linter which removes BOMs?

Files aren't necessarily written in UTF8 or ASCII (or any single encoding, or a valid encoding, or an encoding whatsoever) so I think byte offsets are the only sensible unit we can use. It is otherwise potentially impossible to write rules that, e.g., detect or fix invalid encodings, remove BOMs, convert encodings, etc.

We could change all the APIs to say "byte" instead of "char" but this doesn't seem worthwhile.
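For illustration, the conversion suggested above (character offset to byte offset, done before a message is handed to Arcanist) could look roughly like the sketch below. It is written in Python for brevity and is not part of Arcanist. The function name `char_offset_to_byte_offset` is hypothetical, and the sketch assumes the file is valid UTF-8, that offsets are 0-based, and that the external tool counts Unicode code points with a leading BOM counted as one character. A tool that reports UTF-16 code units (as .NET strings do) would need a different count for characters outside the Basic Multilingual Plane.

```python
def char_offset_to_byte_offset(data: bytes, char_offset: int) -> int:
    """Convert a 0-based code point offset into a 0-based byte offset.

    Hypothetical helper: assumes `data` is valid UTF-8 and that the
    reporting tool counts a leading BOM as a single character.
    """
    # The plain "utf-8" codec keeps a leading BOM as the character U+FEFF,
    # so it still occupies one character position after decoding.
    text = data.decode("utf-8")
    if not 0 <= char_offset <= len(text):
        raise ValueError("character offset is outside the file")
    # Re-encode everything before the offset and count the bytes.
    return len(text[:char_offset].encode("utf-8"))


# A file that starts with a BOM and contains one non-ASCII character.
source = b"\xef\xbb\xbf" + "var é = 1;".encode("utf-8")

# Character 1 is the first real text character; the BOM is 3 bytes long.
print(char_offset_to_byte_offset(source, 1))
# Character 6 follows "é", which takes 2 bytes in UTF-8.
print(char_offset_to_byte_offset(source, 6))
```

Because the prefix is re-encoded rather than shifted by a fixed amount, this handles non-ASCII characters anywhere in the file, not just a BOM at the start.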

I don't believe there's a way to convert the offsets on the C# side of things; the offsets returned from the external linter are from the compiler API. When I read the UTF8 encoded file as ASCII in C# (so that everything is byte-based), and pass the resulting text into the compiler API, it then can't parse the file because the UTF8 bytes mangle the source code.

Even if I could convert the offsets, this doesn't help when addressing in a line / column format (because the line / column addressing in Arcanist is going to be based on ASCII characters and not UTF8 characters).
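The line / column case can be handled the same way, one line at a time. The sketch below is only an illustration in Python, not an Arcanist API: `char_column_to_byte_column` is a hypothetical name, and it assumes valid UTF-8, `\n` line endings, 1-based lines and columns on both sides, and columns counted in Unicode code points.

```python
def char_column_to_byte_column(data: bytes, line: int, char_column: int) -> int:
    """Convert a 1-based character column into a 1-based byte column.

    Hypothetical helper: assumes valid UTF-8, "\n" line endings, and
    1-based line and column numbering.
    """
    lines = data.split(b"\n")
    if not 1 <= line <= len(lines):
        raise ValueError("line number is outside the file")
    # Decode only the target line; a BOM on line 1 stays as U+FEFF.
    text = lines[line - 1].decode("utf-8")
    if not 1 <= char_column <= len(text) + 1:
        raise ValueError("column is outside the line")
    # Bytes occupied by the characters before the column, plus 1 for 1-based.
    return len(text[: char_column - 1].encode("utf-8")) + 1


source = b"\xef\xbb\xbf" + "a\né = 1\n".encode("utf-8")

# Line 1: "a" is character column 2 (after the BOM) but byte column 4.
print(char_column_to_byte_column(source, 1, 2))
# Line 2: "=" is character column 3 but byte column 4, since "é" is 2 bytes.
print(char_column_to_byte_column(source, 2, 3))
```

Line numbers are unaffected by the encoding as long as both sides split lines on the same byte sequence, so only the column needs converting.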

I also didn't mean replace the byte-based API with UTF8, but rather an option like $this->setUTF8Offsets(true) on ArcanistLinter.

I would also be fine with some way of running iconv on a file before it gets passed into Arcanist; this could be done as a separate linter but would require the multi-pass linter stuff so that after fixing it up with iconv it could then be processed by the C# linter.

epriestley renamed this task from Arcanist linters do not handle UTF8 well to Arcanist linters do not have convenient tools for converting between encoding-aware character offsets and raw byte offsets. May 15 2018, 6:37 PM
epriestley removed epriestley as the assignee of this task.
epriestley triaged this task as Wishlist priority.
epriestley added a subscriber: avivey.
epriestley added a subscriber: epriestley.