When we detect a file as binary, explain why
Open, Normal, Public

Assigned To

None

Authored By

	epriestley
	Apr 15 2013, 6:29 PM

Description

There are various reasons we detect files as binary now, and users are more-than-occasionally confused by the rules. When arc decides a file is binary, it should explain why.
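
As a rough illustration of what "explaining why" could look like, here is a minimal Python sketch. The two rules it checks (a NUL byte, or bytes that are not valid UTF-8) are assumptions chosen for illustration, not the rules arc actually applies, and `explain_binary` is a hypothetical name.

```python
from typing import Optional


def explain_binary(data: bytes) -> Optional[str]:
    """Return a human-readable reason if `data` looks binary, else None.

    Hypothetical helper: the rules below are illustrative assumptions,
    not arc's real detection rules.
    """
    # Assumed rule 1: a NUL byte almost never appears in a text file.
    nul_at = data.find(b"\x00")
    if nul_at != -1:
        return f"contains a NUL byte at offset {nul_at}"

    # Assumed rule 2: the content must decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        bad = data[err.start]
        return f"not valid UTF-8: byte 0x{bad:02x} at offset {err.start}"

    return None


print(explain_binary(b"hello\n"))
print(explain_binary("caf\u00e9\n".encode("iso-8859-1")))
```

Returning the reason as a string, rather than a bare true/false, is what would let a tool print something more helpful than "This is a binary file".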

Related Objects

Status	Assigned	Task
Resolved	epriestley	T5644 Add controls to prevent `pygmentize` from saturating system resources
Resolved	epriestley	T5179 Unearth Differential rendering technical debt
Open	None	T12664 Update diff/patch parsing to extract more metadata and parse a wider range of formats
Open	None	T2999 When we detect a file as binary, explain why

Event Timeline

epriestley triaged this task as Normal priority. Apr 15 2013, 6:29 PM

epriestley added a project: Arcanist.

epriestley added subscribers: epriestley, brennantaylor.

Also we should do this in Diffusion, apparently.

The actual issue here was a non-UTF-8 file being converted to UTF-8, and the diff in the commit being confusingly marked binary even though neither version of the file was apparently binary.

When the VCS produces a textual diff we should probably produce a textual diff full of replacement characters with a warning or something like that.
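
A minimal Python sketch of that idea, assuming the fallback is a lossy UTF-8 decode that substitutes U+FFFD for undecodable bytes and emits a warning. `decode_for_display` is a hypothetical helper, and nothing here reflects how Phabricator actually renders diffs.

```python
import difflib
import warnings


def decode_for_display(data: bytes) -> str:
    """Decode bytes for a textual diff, using U+FFFD for undecodable bytes."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed behavior: warn, then fall back to a lossy decode
        # instead of declaring the whole file binary.
        warnings.warn("content is not valid UTF-8; showing replacement characters")
        return data.decode("utf-8", errors="replace")


# Hypothetical file contents: the same line before and after re-encoding.
old = "a\u00f1o\n".encode("iso-8859-1")  # not valid UTF-8
new = "a\u00f1o\n".encode("utf-8")

diff = difflib.unified_diff(
    decode_for_display(old).splitlines(keepends=True),
    decode_for_display(new).splitlines(keepends=True),
    fromfile="old",
    tofile="new",
)
print("".join(diff))
```

The old side shows a replacement character where the original byte could not be decoded, so the reader still sees a line-by-line diff.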

mbishopim3 added a subscriber: mbishopim3. Jul 2 2013, 6:33 PM

hoffigk added a subscriber: hoffigk. Sep 30 2013, 6:36 AM

altendky added a subscriber: altendky. Jan 2 2014, 3:27 PM

We seem to be running into this issue as well. The file in question is a PL/SQL code file and it's a regular text file, except there are some non-UTF8 characters in it. Is that what's throwing it off and declaring it as a binary file?

Phabricator expects all text to be in UTF-8 (according to TFM), so I assumed that was the cause. I believe I used iconv but it appears recode may be better? I'm thinking I did it incorrectly and probably actually took it down to pure ASCII... :[

epriestley changed the visibility from "All Users" to "Public (No Login Required)". Mar 31 2014, 2:05 AM

epriestley edited this Maniphest Task. May 25 2014, 8:41 PM

dratwa added a subscriber: dratwa. Jul 18 2014, 1:37 PM

octera added a subscriber: octera. Aug 15 2014, 9:58 PM

◀ Merged tasks: T5935.

mormegil added a subscriber: mormegil. Sep 24 2014, 3:28 PM

We have the same problem with our assembly files (the working directory is on a Windows 7 disk): it is not possible to view them in Audit because they are seen as "binary" (while they are visible as diffs in Differential, probably because the Unix diff we use fixes the encoding). Checking them with the Cygwin 'file' tool, they are reported as:

C source, Non-ISO extended-ASCII text, with CRLF line terminators

Then I converted them using iconv -f ISO-8859-1 -t UTF-8 and re-tested the source files with 'file', with the expected result:

C source, UTF-8 Unicode text, with CRLF line terminators
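
For readers without iconv at hand, the conversion step corresponds roughly to the Python sketch below. The sample bytes and the `latin1_to_utf8` name are made up for illustration, and the source encoding is assumed to really be ISO-8859-1.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8, like `iconv -f ISO-8859-1 -t UTF-8`."""
    # ISO-8859-1 maps every byte value to a code point, so this decode never
    # fails. That also means a wrong guess about the source encoding is
    # accepted silently and yields wrong characters.
    return data.decode("iso-8859-1").encode("utf-8")


# Hypothetical source line with a native-language comment and CRLF ending.
src = b"/* perch\xe9 */\r\n"
out = latin1_to_utf8(src)
print(out)
```

Only the bytes above 0x7F change; ASCII content and the CRLF line terminators pass through untouched.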

I tried to commit the new files to the repository but, again, the audit tool sees the files as binary, with the laconic "This is a binary file" indication.

I cannot use pure ASCII in our source files because of all the comments we have in our native language.

The first time you commit a good (UTF-8) file it will still list as binary because the old side of the diff is binary. The second good commit will result in a good new and old version that can be text diff'ed. Also, pure ASCII is not needed, UTF-8 is fine so you should be able to use whatever characters you need for your native language.
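
The behavior described here can be modeled with a short Python sketch. It assumes the rule is simply "a diff renders as text only when both sides are valid UTF-8"; that rule is inferred from this comment rather than taken from the Phabricator source, and the file contents are hypothetical.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def diff_kind(old: bytes, new: bytes) -> str:
    """Assumed rule: the diff is textual only if both sides are valid UTF-8."""
    return "text" if is_utf8(old) and is_utf8(new) else "binary"


# Three successive versions of one file (hypothetical content).
v1 = "; a\u00f1adir\n".encode("iso-8859-1")   # original, not valid UTF-8
v2 = "; a\u00f1adir\n".encode("utf-8")        # first commit after conversion
v3 = "; a\u00f1adir valor\n".encode("utf-8")  # second UTF-8 commit

print(diff_kind(v1, v2))  # binary: the old side is still non-UTF-8
print(diff_kind(v2, v3))  # text: both sides are UTF-8
```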

Just for the record, this is the little Make/Bash 'script' I used (in Cygwin) to cleanse my files:

for f in $(addprefix $(DIR_TO_CLEAN)/*., ext1 ext2 c h etc); do \
    encoding=$$(file -i "$$f" | sed "s/.*charset=\(.*\)$$/\1/"); \
    recode -f $$encoding..utf-8 "$$f"; \
done

joshuaspence added a subscriber: joshuaspence. Oct 8 2014, 11:15 AM

In T2999#78548, @altendky wrote:

The first time you commit a good (UTF-8) file it will still list as binary because the old side of the diff is binary. The second good commit will result in a good new and old version that can be text diff'ed. Also, pure ASCII is not needed, UTF-8 is fine so you should be able to use whatever characters you need for your native language.

That's right: I tried today and commits are 'viewable' starting from the second UTF-8 commit, where both sides of the diff are UTF-8. Thank you for the clarification.

Thank you also for the script: I made a similar one but yours is better and I'll take it.

epriestley mentioned this in T6633: 'differential.creatediff' exception when repository paths contain SHIFT-JIS characters. Dec 1 2014, 9:46 PM

chad mentioned this in T3788: Improve encoding management in Diffusion. Jul 3 2015, 5:22 AM

fanis added a subscriber: fanis. Aug 18 2015, 3:06 PM

chad merged a task: T9517: phabricator is not supporting Spanish characters. Oct 6 2015, 1:53 PM

chad added subscribers: revi, tejasbhosale009.

Herald added a subscriber: cspeckmim. Oct 6 2015, 1:53 PM

revi removed a subscriber: revi. Oct 6 2015, 1:55 PM

cspeckmim removed a subscriber: cspeckmim. Oct 7 2015, 2:34 AM

spawnlt added a subscriber: spawnlt. Apr 26 2016, 12:26 PM

Herald added a subscriber: eadler. Apr 26 2016, 12:26 PM

dgruncha added a subscriber: dgruncha. May 3 2016, 10:29 PM