Force all mercurial commands to use UTF-8 encoding
Closed, Public

Authored by cspeckmim on Jun 28 2021, 12:42 AM.

Details

Summary

When non-ASCII characters appear in revision titles or summaries, the arc patch and arc diff (to update) commands will fail on Windows systems. This often occurs because “smart quotes” or "em—dash" characters are inserted into commit messages by editors on "user-friendly" operating systems like macOS.

This can be worked around by forcing all mercurial commands to use the global option --encoding utf-8, which applies to any mercurial command. This option was added around 2006, so this should work across all supported versions of mercurial.
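The idea of forcing the option onto every command can be sketched as follows. This is a minimal illustration in Python, not the change in this revision (Arcanist is written in PHP); the helper name `build_hg_argv` and its `hg_binary` parameter are hypothetical.

```python
from typing import List, Sequence


# Hypothetical helper for illustration only; this is not Arcanist's code.
def build_hg_argv(args: Sequence[str], hg_binary: str = "hg") -> List[str]:
    """Return an argv list with the global --encoding option placed first."""
    # The option goes before the subcommand and its arguments, so an
    # --encoding given later on the command line comes after it. The Test
    # Plan observes that the later specification appears to "win".
    return [hg_binary, "--encoding", "utf-8", *args]


print(build_hg_argv(["log", "-r", "tip"]))
# ['hg', '--encoding', 'utf-8', 'log', '-r', 'tip']
```

Building the argument list in one place means every invocation picks up the option without each call site having to remember it.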

Refs T13649

Test Plan

I created a diff on a mercurial repository using smart quotes in the "Title" and "Summary" fields as well as in the content of a file being changed. Then on macOS, Windows (PowerShell), and Windows (cmd.exe) I was able to patch down the revision, make a modification, diff the change back up to Phabricator, and land the change. I verified that the commit and content looked correct on macOS and on Windows using nvim, which seems to detect and render the encoding properly, whereas mercurial displays the smart quotes and em-dashes as odd characters.

I grepped through the Arcanist codebase for other places where --encoding might be specified for mercurial commands and could not find any. In case this argument is somehow added elsewhere, I verified that specifying --encoding more than once does not cause any issues, and that the later specification of --encoding appears to "win".

$ hg --encoding utf-8 --encoding utf-8 log -r tip
# prints out results in UTF-8 without issue

$ hg --encoding utf-8 log --encoding latin-1 -r tip
# prints out results in latin-1 without issue

Diff Detail

Repository
rARC Arcanist
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

cspeckmim retitled this revision from Force all mercurial commands to use `--encoding UTF-8` to Force all mercurial commands to use UTF-8 encoding. (Jun 28 2021, 12:56 AM)
cspeckmim edited the summary of this revision.

The results in T13649 -- where none of these results actually print a character that anyone would consider to be a "smart quote" (?) -- aren't exactly inspiring, but I think this is a step forward at least.

This revision is now accepted and ready to land. (Jun 28 2021, 1:17 AM)

I ran through another test just in case, and things do seem to function fine. Another interesting point: nvim on Windows properly renders the smart quotes and em-dash, but mercurial's output uses the weird characters. I made sure to test with smart quotes and an em-dash in the Title and Summary fields as well as in the content of the file change. After patching and landing the change from Windows, then pulling the commit onto the Mac, the end result on the Mac looks correct in all places.

As another test, I tried creating a new commit on Windows with smart quotes and an em-dash in the commit message, and mercurial bombed with:

$ hg ci # nvim opened an editor in UTF-8 encoding mode and let me write in a commit message: Test commit with “smart quotes” and "em—dash"
transaction abort!
rollback completed
note: commit message saved in .hg\last-message.txt
note: use 'hg commit --logfile .hg/last-message.txt --edit' to reuse it
abort: decoding near 't quotesΓÇ¥ and "emΓ': 'charmap' codec can't decode byte 0x9d in position 34: character maps to <undefined>!

Changing my commit invocation to

$ hg --encoding utf-8 ci --logfile .hg/last-message.txt

resulted in the commit succeeding, and hg log prints out the exact same odd characters in place of the real smart quotes and em-dash.
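The abort message and the odd characters are both consistent with UTF-8 bytes being read through legacy Windows code pages. The sketch below reproduces the byte-level behaviour in Python. It assumes the default (ANSI) encoding on the Windows machine was cp1252 and the console code page was cp437; neither is stated in this thread, and the helper `first_undecodable` is hypothetical.

```python
# The commit message from the failing `hg ci`, with the smart quotes and
# em-dash written as escapes so this file stays ASCII-only.
message = 'Test commit with \u201csmart quotes\u201d and "em\u2014dash"'
raw = message.encode("utf-8")


def first_undecodable(data: bytes, codec: str):
    """Return (offset, byte) of the first byte `codec` cannot decode, else None."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Assumption: cp1252 default encoding. Byte 0x9d is unassigned there, and it
# is the last byte of the UTF-8 encoding of the closing smart quote.
offset, byte = first_undecodable(raw, "cp1252")
print(offset, hex(byte))  # 34 0x9d

# Assumption: cp437 console. The three UTF-8 bytes of the closing smart
# quote come out as the three odd characters shown in the abort message.
mojibake = raw[32:35].decode("cp437")
print(mojibake == "\u0393\u00c7\u00a5")  # True

# Decoding as UTF-8, which is what --encoding utf-8 requests, round-trips.
print(first_undecodable(raw, "utf-8"))  # None
```

Under these assumptions the offset and byte match the "byte 0x9d in position 34" reported above. That supports reading the remaining odd characters in hg log as a console rendering issue, since the stored commit looks correct when viewed on the Mac.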