Description
There are various reasons we detect files as binary now, and users are more-than-occasionally confused by the rules. When arc decides a file is binary, it should explain why.
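As a rough illustration of what "explain why" could look like, here is a minimal Python sketch. It is not arc's actual detection logic: the function name `binary_reason` and the two heuristics (a NUL byte, or bytes that do not decode as UTF-8) are assumptions made for the example.

```python
from typing import Optional


def binary_reason(data: bytes) -> Optional[str]:
    """Return a human-readable reason if data looks binary, else None.

    Hypothetical helper: the two checks below are illustrative
    assumptions, not the rules arc really applies.
    """
    # Assumed heuristic 1: a NUL byte almost never appears in text files.
    nul_offset = data.find(b"\x00")
    if nul_offset != -1:
        return f"contains a NUL byte at offset {nul_offset}"

    # Assumed heuristic 2: the content must decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        bad = data[err.start]
        return f"is not valid UTF-8 (byte 0x{bad:02x} at offset {err.start})"

    return None


if __name__ == "__main__":
    sample = "caf\u00e9".encode("iso-8859-1")  # Latin-1 bytes, not UTF-8
    reason = binary_reason(sample)
    if reason is not None:
        print(f"Treating file as binary because it {reason}.")
```

Returning the reason as a string (rather than a bare true/false) is the point of the sketch: the caller can show it next to the "This is a binary file" notice.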
Status | Assigned | Task
---|---|---
Resolved | epriestley | T5644 Add controls to prevent `pygmentize` from saturating system resources
Resolved | epriestley | T5179 Unearth Differential rendering technical debt
Open | None | T12664 Update diff/patch parsing to extract more metadata and parse a wider range of formats
Open | None | T2999 When we detect a file as binary, explain why
Event Timeline
The actual issue here was a non-utf8 file being made utf8, and the diff in the commit being confusingly marked binary even though neither file was apparently binary.
When the VCS produces a textual diff we should probably produce a textual diff full of replacement characters with a warning or something like that.
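The "replacement characters with a warning" idea could be sketched roughly as follows. This is only an illustration in Python, not how Phabricator decodes content; `decode_for_diff` is a hypothetical name, and the warning text is made up for the example.

```python
from typing import Optional, Tuple


def decode_for_diff(data: bytes) -> Tuple[str, Optional[str]]:
    """Decode file content for a textual diff.

    Returns (text, warning). The warning is None when the bytes are
    valid UTF-8. Hypothetical helper for illustration only.
    """
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError:
        # Invalid sequences become U+FFFD instead of forcing the
        # whole file to be treated as binary.
        text = data.decode("utf-8", errors="replace")
        warning = (
            "This file is not valid UTF-8; undecodable bytes are "
            "shown as U+FFFD replacement characters."
        )
        return text, warning


if __name__ == "__main__":
    text, warning = decode_for_diff(b"caf\xe9 ok\n")
    if warning:
        print("WARNING:", warning)
    print(text, end="")
```

With this approach the reviewer still sees the surrounding text and the changed lines, and the warning explains why some characters look wrong.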
We seem to be running into this issue as well. The file in question is a PL/SQL code file and it's a regular text file, except there are some non-UTF8 characters in it. Is that what's throwing it off and declaring it as a binary file?
Phabricator expects all text to be in UTF-8 (according to TFM), so I assumed that was the cause. I believe I used iconv but it appears recode may be better? I'm thinking I did it incorrectly and probably actually took it down to pure ASCII... :[
Same problem with our assembly files (the working directory is in a folder on a Windows 7 disk): it is not possible to view them in Audit because they are seen as "binary" (while they are visible as diffs in Differential, probably because the Unix diff we use fixes the encoding). Checking them with the Cygwin 'file' tool, they are reported as:
C source, Non-ISO extended-ASCII text, with CRLF line terminators
Then I converted them using iconv -f ISO-8859-1 -t UTF-8 and re-tested the source files with 'file', with the expected result:
C source, UTF-8 Unicode text, with CRLF line terminators
I tried to commit the new files to the repository but, again, the audit tool sees the files as binary, with the laconic "This is a binary file" indication.
I cannot use pure ASCII on our source files because of all the comments we have in native language.
The first time you commit a good (UTF-8) file it will still list as binary because the old side of the diff is binary. The second good commit will result in a good new and old version that can be text diff'ed. Also, pure ASCII is not needed; UTF-8 is fine, so you should be able to use whatever characters you need for your native language.
Just for the record, this is the little Make/Bash 'script' I used (in Cygwin) to cleanse my files:
for f in $(addprefix $(DIR_TO_CLEAN)/*., ext1 ext2 c h etc); do \
    encoding=$$(file -i "$$f" | sed "s/.*charset=\(.*\)$$/\1/"); \
    recode -f $$encoding..utf-8 "$$f"; \
done
That's right: I tried today and commits are 'viewable' from the second UTF-8 commit where both sides of the diff are UTF-8, thank you for the clarification.
Thank you also for the script: I made a similar one but yours is better and I'll take it.
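For anyone else hitting this, the behaviour confirmed above (the first UTF-8 commit still shows as binary, the second one shows as text) can be modelled with a small Python sketch. It is a simplified model of what was observed in this thread, not Phabricator's implementation; the names `is_utf8`, `diff_kind` and `history_kinds` are hypothetical.

```python
from typing import List


def is_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def diff_kind(old: bytes, new: bytes) -> str:
    """Assumed rule: a diff is textual only if BOTH sides are UTF-8."""
    return "text" if is_utf8(old) and is_utf8(new) else "binary"


def history_kinds(versions: List[bytes]) -> List[str]:
    """Classify the diff between each pair of consecutive versions."""
    return [diff_kind(old, new) for old, new in zip(versions, versions[1:])]


if __name__ == "__main__":
    legacy = "commento: perch\u00e9".encode("iso-8859-1")   # original file
    fixed = "commento: perch\u00e9".encode("utf-8")         # after conversion
    edited = "commento: perch\u00e9 no".encode("utf-8")     # next real change
    print(history_kinds([legacy, fixed, edited]))  # ['binary', 'text']
```

The conversion commit itself compares a non-UTF-8 old side with a UTF-8 new side, so under this rule it is still "binary"; only the following commit has two UTF-8 sides.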