We sometimes parse output from svn/git/hg, and localization makes that parsing wrong.
Description
Revisions and Commits
rPHU libphutil | |||
D9938 Build $_ENV from CLI scripts | |||
D18990 | rPHUe144a9d7dd23 Add some sytem locale utilities to libphutil | ||
rARC Arcanist | |||
D9922 Fix `arc` issues with non-English systems | |||
D13993 | rARC4c3d75401f81 Use 'git blame --porcelain' for git blame info | ||
D13989 | rARC46009145f75d Minimize reliance on 'git branch' output format | ||
D13983 | rARC6ecb3fb87d2b Avoid parsing git "remote show" using "ls-remote" |
Related Objects
- Mentioned In
- 2018 Week 6 (Early February)
rPd5163d014345: Write patterns to "git grep" on stdin instead of passing them with "-e"
D18105: Write patterns to "git grep" on stdin instead of passing them with "-e"
T12807: Diffusion's "Grep File Content" doesn't work with UTF characters
D16596: Set default LANG env to english to avoid svn info parse error
T9956: Arcanist failed patch lint on windows (because "cp" does not exist)
T9455: Arc meaninglessly prompts users to amend changes in submodules into parent repositories
T8646: Provide more context for search results, particularly wiki documents
T9393: Importing Mercurial repo with blackbox extension enabled results in non-completing import and erroneneous revision
T9310: --no-verify option used for staging repositories not available in older gits
D13983: Avoid parsing git "remote show" using "ls-remote"
T7068: Diff Parse Exception: Expected a hunk header - Mentioned Here
- T13060: When LANG=POSIX, executing commands with UTF8 characters in the argument list fails
T9953: arc fails with `git:branch-unique(*)` base commit rule on CentOS 6
D14735: Don't use "-c" flag in "git:branch-unique()" revision range rule
D13998: Use 'remote.origin.url' fallback for git < 1.7.5
T5896: Mercurial fails to parse (A or B or ... or Z) revsets with more than about 300 items
D7614: Fix git "origin" remote in more circumstances
D12112: Fix Mercurial command injection vulnerability
T7339: Raise a setup warning when the "en_US.UTF-8" locale is unavailable
D9922: Fix `arc` issues with non-English systems
D9938: Build $_ENV from CLI scripts
Event Timeline
Does $ LANG=en_US.UTF-8 svn ... work on your system to force this?
If not, does $ LC_ALL=en_US.UTF-8 svn ... work?
I haven't been able to figure out how to install these binaries with support for non-English languages locally.
D9922 may fix this: it sets LC_ALL=en_US.UTF-8 in the environment for all VCS commands we execute.
Since I haven't had any luck getting non-English copies of this stuff installed locally, I could use some help verifying that this works. You can apply the patch with:
arcanist/ $ arc patch D9922
If it doesn't work, let me know what went wrong, and/or if you have a better approach.
- https://github.com/phacility/arcanist/pull/114 has original discussion of this issue.
- https://github.com/phacility/arcanist/issues/168 is a similar report.
- https://github.com/phacility/arcanist/pull/141 is a somewhat-similar report ("en_US.UTF-8" not available).
nope this one does not work:
[2014-07-13 22:32:46] EXCEPTION: (Exception) Unable to parse SVN info. at [/usr/local/lib/php/arcanist/src/repository/api/ArcanistSubversionAPI.php:359] #0 ArcanistSubversionAPI::getSVNInfo(usr.bin/Makefile) called at [/usr/local/lib/php/arcanist/src/parser/ArcanistDiffParser.php:93] #1 ArcanistDiffParser::parseSubversionDiff(Object ArcanistSubversionAPI, Array of size 5 starting with: { usr.bin/Makefile => 1 }) called at [/usr/local/lib/php/arcanist/src/workflow/ArcanistDiffWorkflow.php:962] #2 ArcanistDiffWorkflow::generateChanges() called at [/usr/local/lib/php/arcanist/src/workflow/ArcanistDiffWorkflow.php:526] #3 ArcanistDiffWorkflow::run() called at [/usr/local/lib/php/arcanist/scripts/arcanist.php:338]
Support Impact Significantly affects non-English users. Many workflows fail when we look for English text in command output.
High Support Impact We're likely to get more of them over time.
@bapt, @ite-klass - could someone please test D9922 + D9938? We have tons of trouble testing this stuff since we run all english otherwise.
tested with both applied and still fails with LC_ALL=fr_FR.UTF8
[2014-10-18 00:08:09] EXCEPTION: (Exception) Unable to parse SVN info. at [<arcanist>/src/repository/api/ArcanistSubversionAPI.php:362]
#0 ArcanistSubversionAPI::getSVNInfo(string) called at [<arcanist>/src/parser/ArcanistDiffParser.php:91] #1 ArcanistDiffParser::parseSubversionDiff(ArcanistSubversionAPI, array) called at [<arcanist>/src/workflow/ArcanistDiffWorkflow.php:911] #2 ArcanistDiffWorkflow::generateChanges() called at [<arcanist>/src/workflow/ArcanistDiffWorkflow.php:467] #3 ArcanistDiffWorkflow::run() called at [<arcanist>/scripts/arcanist.php:338]
Maybe the best for svn is to prefix all commands with --xml and parse xml (the xml is not localized)
I’m confused. Will what it parses depend on the languages installed/available and setup on the server?
I would expect the patch parser to recognize patch files of any language, without depending on runtime environment languages.
We can not parse every language. In the very narrow case of patch files, we probably can, but we depend on parsing the output from many other commands to detect specific errors in all of Git, Mercurial and Subversion. For example, sometimes a command exits with, say, exit code 1, but several different errors can actually cause that exit code, and we want to act differently depending on which error occurred. The only way we can tell which error occurred is to look at the program output. To parse error messages in every language we would need to hard-code every translation and keep that list up to date, which is unreasonable.
We need to find some set of environmental variables which forces SVN, Git and Mercurial to use only English. However, setting LC_ALL does not seem to work (per @bapt) and setting LANG does not seem to work (also per @bapt).
To move forward, we need to find some set of environmental variables we can set which forces English.
If you're experiencing this problem, you can help by finding a set of environmental variables which force English.
if I export myself LC_ALL=C it just works.
don't know why it was failing from the patches you have provided me
LANG=en_US.UTF-8 should do the trick, but it fails if the en_US.UTF-8 locale is not generated/available.
Another report of this over the weekend:
the test in https://secure.phabricator.com/diffusion/ARC/browse/master/src/repository/api/ArcanistGitAPI.php;0c9a03771961be79d09e566051a5a594866760d9$503-508
fails for non-English git, which makes repo detection fail (And shows up as "you can't close this revision" exception during arc land).
This one might be reasonably replaced with git config remote.origin.url?
I think the current Phabricator/Arcanist handling of locales, at least on Linux, is wrong. The LANG environment variable has lower priority than the family of LC_* variables, as per POSIX §8.2:
The values of locale categories shall be determined by a precedence order; the first condition met below determines the value:
- If the LC_ALL environment variable is defined and is not null, the value of LC_ALL shall be used.
- If the LC_* environment variable (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) is defined and is not null, the value of the environment variable shall be used to initialize the category that corresponds to the environment variable.
- If the LANG environment variable is defined and is not null, the value of the LANG environment variable shall be used.
- If the LANG environment variable is not set or is set to the empty string, the implementation-defined default locale shall be used.
Thus, any user-set LC_* variable could cause the output of external commands to differ from the LANG-set locale.
Furthermore, en_US.UTF-8 is not guaranteed to exist. The C or POSIX locales, though, should be always available and cause text output to use the more parseable program source default language (not necessarily English, but I would certainly expect it to be that for all VCS), as well as follow a fixed set of rules for time, collation, etc. (detailed in POSIX §7.3).
Sadly, those rules are ASCII-centric, so some localization shenanigans breaking parsing may remain when dealing with UTF-8 messages. Some systems have a C.UTF-8 locale that might solve these (not mine, though).
tl;dr what you most likely want is to force LC_ALL=C (or LC_ALL=C.UTF-8 where available) for any external binaries to ensure "Computer English", then dealing with any remaining localization issues on your own. Testing this with rARC on a Linux system with only the fr_FR.UTF-8 locale generated, and LANG=fr_FR.utf8 set in the user environment by systemd's locale configuration, I get:
beta|test:~/code/arcanist$ arc which | grep -A 1 'remote URI' Unable to determine the remote URI for this repository. beta|test:~/code/arcanist$ LC_ALL=C arc which | grep -A 1 'remote URI' The remote URI for this working copy is "https://github.com/phacility/arcanist.git".
With the underlying cause being:
beta|test:~/code/arcanist$ git remote show -n origin | head -n 3 * distante origin URL de rapatriement : https://github.com/phacility/arcanist.git URL push : https://github.com/phacility/arcanist.git beta|test:~/code/arcanist$ LC_ALL=C git remote show -n origin | head -n 3 * remote origin Fetch URL: https://github.com/phacility/arcanist.git Push URL: https://github.com/phacility/arcanist.git
I suppose this should work for OSX as well, although I can't test it now. As for Windows, I don't really know how locale stuff works over there.
See also T7339, which is somewhat related.
I think we have to use en_US.UTF-8 in the general case. There are two cases where we may need to read output:
- Error strings, as here: we need to distinguish between results which can only be differentiated by examining human-readable text messages the command emits when run. This is likely always ASCII so we don't really care if it's UTF8 or not, but it does need to be English.
- All other user data (for example, commit messages), which we need to be in UTF8. We don't care what language it's in but it needs to be UTF8.
In the general case, we may need both (1) and (2) from the same command, which forces us to use en_US.UTF-8. I'm not sure we have any cases of that today, but it seems plausible that we'll need to run something like git log --format=message eventually, and read the message (in UTF8) on success, and read the error (in English) on failure.
I'm not sure how onerous installing en_US.UTF-8 is for international users. If it's a huge problem, we could build an API like this:
$future = id(new ExecFuture('git ...')) ->setNeedUTF8Output(true) ->setWantStderrInEnglish(true); list($err) = $future->resolve(); if ($err) { try { $english_stderr = $future->getStderrInEnglish(); } catch (PhutilEnglishNotAvailableException $ex) { throw new PhutilProxyException( pht( 'Failed to do some git thing, then could not figure out '. 'what went wrong because this system does not have '. 'an English locale installed.'), $ex); }
The behavior of these methods would be:
- setNeedUTF8Output(): Throws unless some UTF8 locale is available on the system.
- setWantStderrInEnglish(): Attempts to run the command in an English locale.
- getStderrInEnglish(): If the command ran in an English locale, returns stderr. If it did not, throws an exception telling the user that we can't get English error output and they should install an English locale if they want more details.
I think this is the friendliest API we could build for users (i.e., it gives them the greatest chance of getting things done without installing any locales) but it's fairly complicated/hostile to use correctly.
At the other extreme, we could just require everyone to have en_US.UTF-8 to do anything. This is easier for us to get right but might be burdensome to users.
We examine English output in only a small number of cases, so it doesn't necessarily seem unreasonable to put a mess of an API on top of those. However, I think we need UTF8 output for a very large number of commands, so requiring callers to explicitly turn this on will likely result in a lot of API misuse errors.
I think it's probably too difficult for us to get UTF8 right if it's not default: users will be more burdened by us screwing stuff up than they would be by installing any UTF8 locale (the number of users with no UTF8 locale in 2015 is also presumably very small).
I'm less sure about English. That one feels more OK to make explicit to me, although it's possible we implicitly rely on English output more than I realize.
In this particular case we're also parsing English output on stdout, and should just be using svn info --xml anyway, which maybe moots most of this.
Also not mentioned here is cases where we phutil_passthru() some command like git push. These would be best run in the system default locale, since we echo the output to the user -- unless we also parse it. I think we never do this right now, but I'm not 100% sure.
system default locale
To be precise, I just mean whatever the user's default locale is. If they're using git in French and we want to show them the output from a git command, it should ideally also be in French.
(Sorry I hadn't seen T7339, certainly makes my last comment mostly redundant. Perhaps the tasks can be merged?)
I completely missed (2), indeed the POSIX locale is not sufficient. I believe the point about using LC_ALL instead of LANG still stands, though. That should weed out issues caused by user LC_* settings.
On to the harder problem:
It seems to me that preparing for this general case may be actually harmful, as you either invite fragility through reliance on English output (which is also more prone to breakage with upgrades), or push the burden of enabling en_US.UTF-8 to users (this seems less bad for Phabricator than Arcanist, though). Your worst-case seems rather unlikely for at least the VCS commands:
- you need to read UTF8 user data from a command (ruling out LC_ALL=C),
- which can fail in different ways without different exit codes,
- it is actually important to distinguish between these,
- en_US.UTF-8 is not present (but some other UTF8 locale is),
- there are no alternate locale-insensitive APIs, commands or strategies (such as observing some state after the failed command),
- the command has potential side effects, or the time of execution is important, and cannot be simply re-run under LC_ALL=C.
I doubt these will frequently happen all at once, but I may be missing something again here. If this happens often, I would agree that requiring en_US.UTF-8 and explicitly telling users is a much simpler solution.
The current reliance on English can still be decreased though, through Subversion's --xml, git plumbing commands, etc. I can try to send patches for some of these if you are interested.
I'm not sure how onerous installing en_US.UTF-8 is for international users. If it's a huge problem, we could build an API like this:
[...]
- setNeedUTF8Output(): Throws unless some UTF8 locale is available on the system.
- setWantStderrInEnglish(): Attempts to run the command in an English locale.
- getStderrInEnglish(): If the command ran in an English locale, returns stderr. If it did not, throws an exception telling the user that we can't get English error output and they should install an English locale if they want more details.
I agree that this API is easy to misuse. At the very least, some UTF8 locale should always be required.
As for Phabricator, I'd say @fkf's suggestion of a setup warning requiring en_US.UTF-8 is enough. I don't think it's that complex for platform admins to enable it as a once-off during installation if it's not there. Typically it's a matter of enabling it in some configuration (e.g. /etc/locale.gen) then regenerating the locale files (e.g. locale-gen), although it varies by distro.
It's a tougher call whether you need to build this or not for Arcanist, where requiring end-users perform locale surgery does add some friction, times number of users (it's also extra cumbersome to do it without root privileges).
I think it's probably too difficult for us to get UTF8 right if it's not default: users will be more burdened by us screwing stuff up than they would be by installing any UTF8 locale (the number of users with no UTF8 locale in 2015 is also presumably very small).
Sure, in the absence of any UTF8 locale, it seems appropriate to simply require one. The (admittedly philosophical) problem lies with users of other UTF8 locales.
It doesn't look like you can get to stderr through phutil_passthru(), no?
Anyway, in this case I imagine that the user getting to see the command output should mostly compensate for not having a precise error code.
the command has potential side effects, or the time of execution is important, and cannot be simply re-run under LC_ALL=C.
I think this tactic (re-run the command to figure out the error) is a really fragile one and would be very hesitant to pursue it over requiring users to install locales, but agree on this otherwise.
For context, I think there are approximately ~3 cases where we intentionally rely on English output and have thoughtfully tried to work around the issue and have likely exhausted reasonable alternatives, and I think ~2 (but possibly all) of these are in Phabricator, not arc. So this problem is vanishingly rare in practice, and it is conceivable that we'll never encounter the "english + utf8" case -- I just don't want to give it up casually and then run into it as soon as we've done all the legwork to implement something that doesn't consider it or makes it really hard.
Both of the specific examples in this report are almost certainly cases of reliance on English when we should be locale agnostic: we can use svn info --xml in the first case, and git config remote.origin.url seems promising in the second case. I'd guess there are perhaps ~5-10 more of these, but that all or almost all of them are casual and likely have language-agnostic resolutions if we dig a bit.
It doesn't look like you can get to stderr through phutil_passthru(), no?
You can't, and I'm nearly certain we never examine the output of these commands, but some code could conceivably invoke, say, phutil_passthru('cmd &2>1 | tee /tmp/something/errors') and then parse it. I'm quite confident we don't do this today and expect we'll never have cause to, it just doesn't seem wholly impossible for us to ever hit this case.
One silly thought is that we could force arc to use a non-English locale during development to make it easier to write correct code: if you tried to write something that relied on parsing localizable output, it would be more clear that your approach was wrong. But this would be confusing to new contributors and pretty cumbersome and probably not worthwhile, given the low incidence of issues in this vein and the high frequency of encountering VCS command output in some form.
Broadly, I lean toward this as the pathway forward:
- Correct language-dependency of all callsites identified here, and try to fix future problems by making them language agnostic. I think this is pretty clear-cut as the best approach whenever it's possible.
- On the client, require a UTF8 locale (but not necessarily English) to be available in arc before we'll run it. Specify this user-selected UTF8 locale for all commands using LC_ALL.
- On the server, raise a setup warning in Phabricator if en_US.UTF-8 is not available. Specify en_US.UTF-8 for all commands using LC_ALL.
- Figure out where we actually rely on parsing English text in arc before deciding on a solution for it. It's very possible that the number of callsites is 0 or can be made 0, and then we can just avoid solving the tricky part of this indefinitely. (Overall, I tend to favor reducing the specificity of the error message if the locale is not English if we do encounter these cases, though.)
Yes, it would be a desperate measure with a lot of problems of its own.
Broadly, I lean toward this as the pathway forward:
[...]
This all looks reasonable to me (unless you find the arrangement below preferable to completely giving up English output parsing).
I experimented for a bit with setting LC_* variables instead of LC_ALL, and it looks like there is some hope for getting parsable UTF8 output in Arcanist without actually requiring en_US.UTF-8. This should be achievable through an elaborate system of ropes and pulleys as follows:
- Ensure there is at least one available UTF8 locale.
- Ensure the user has LANG set in their environment to a preferred UTF8 locale (say, xx_XX.UTF-8).
- (This is not strictly needed and can be worked around, but otherwise many other things may break and life becomes generally harder for arc.)
- For non-userfacing commands where errors need to be parsed, enforce the following environment:
- LC_CTYPE deals with character sets, among other things. Set LC_CTYPE=xx_XX.UTF-8.
- LANG is the fallback value for all other locale categories. Set LANG=C to ensure parsable output (notably, the LC_MESSAGES category controls message text translations).
- Do not set LC_ALL, as it will completely override this lovingly-crafted environment.
- For userfacing commands
- where output does not need to be parsed: just keep the user's LANG and LC_* settings.
- where output does need to be parsed: weep inconsolably. This is the worst possible situation, but I can't come up with a plausible case.
Example (using rHGTEST):
beta:/tmp/hg-test$ locale -a C de_DE.utf8 fr_FR.utf8 POSIX tr_TR.utf8 beta:/tmp/hg-test$ export LANG=de_DE.UTF-8 beta:/tmp/hg-test$ locale LANG=de_DE.UTF-8 LC_CTYPE="de_DE.UTF-8" LC_NUMERIC="de_DE.UTF-8" LC_TIME="de_DE.UTF-8" LC_COLLATE="de_DE.UTF-8" LC_MONETARY="de_DE.UTF-8" LC_MESSAGES="de_DE.UTF-8" LC_PAPER="de_DE.UTF-8" LC_NAME="de_DE.UTF-8" LC_ADDRESS="de_DE.UTF-8" LC_TELEPHONE="de_DE.UTF-8" LC_MEASUREMENT="de_DE.UTF-8" LC_IDENTIFICATION="de_DE.UTF-8" LC_ALL=
beta:/tmp/hg-test$ hg log --rev 0e1f30ecdeca Änderung: 8:0e1f30ecdeca Lesezeichen: master Marke: tip Nutzer: [redacted] Datum: Mon May 04 16:36:57 2015 +0200 Zusammenfassung: Ümläüts äre nice
beta:/tmp/hg-test$ LANG=C hg log --rev 0e1f30ecdeca changeset: 8:0e1f30ecdeca bookmark: master tag: tip user: [redacted] date: Mon May 04 16:36:57 2015 +0200 summary: ?ml??ts ?re nice
beta:/tmp/hg-test$ LANG=C LC_CTYPE=de_DE.UTF-8 hg log --rev 0e1f30ecdeca changeset: 8:0e1f30ecdeca bookmark: master tag: tip user: [redacted] date: Mon May 04 16:36:57 2015 +0200 summary: Ümläüts äre nice
Tests with other platforms and binaries would still be needed, but this looks promising to me.
Those were all the cases of parsing git environment-dependent output I could find in arc (unless built-in formats like git log --format=medium are susceptible to git-config/localization/future change, which I doubt).
I'll pore over hg and svn when I can give this more time, if nobody gets to it before.
D13983 breaks on users running RHEL6. There git version is 1.7.1 and git 1.7.1 does not support --get-url option for ls-remote.
It looks like --get-url was added in early 2011:
https://github.com/git/git/commit/45781adb9a89c0c47f61ccf8062be26b86a38a54
I think it was an undocumented internal option for a while and doesn't seem to have ever made it to the release notes, but from digging around in tags/blame, I think 1.7.5-ish (circa May 2011) is the first version it appeared in.
Git 1.7.1 is very old, and appears to have been released around April, 2010.
It looks like the minimum Git version for Atlassian Stash is 1.7.6:
https://confluence.atlassian.com/display/STASHKB/Git+1.7.1+is+Not+Supported+by+Stash
We do not currently have a formal minimum version for Git (presumably one exists in practice, but no one has ever reported an issue with an old version which we could not trivially resolve).
In Phabricator, we're currently fairly liberal about running with old versions. We do workaround or fail in some specific circumstances (for example: git 1.7.1 has an unrelated problem with origin, see D7614, and we test for the hg clone command injection vulnerability fixed in Mercurial 3.2.4, see D12112). The minimum version for Mercurial is currently 1.9 (July 2011) and the minimum version for Subversion is 1.5 (June 2008).
These timelines largely reflect the rate of change of the underlying tools, though. Mercurial is totally different than it was in 2011, while Subversion has not changed much since 600 BC, when it was originally written.
Historically, we've tried to err on the side of accommodating old versions. For example, I wrote a patch to remove our reliance on the --debug flag to certain Mercurial commands that appears in Mercurial 2.4 (circa June 2012), but we do not yet require it and still use --debug. This parser goofiness is also fixed but I expect we'll be stuck with my elegant workaround (see T5896#108946) for many years to come.
That said, I think we could plausibly just require newer git here; 1.7.1 was released about a year before Phabricator was.
If this was on the server, I'd probably lean toward doing this, but since it's on the client I'm inclined to try to accommodate this.
@jbeta, how bad do you think falling back to git config remote.origin.url is likely to be in practice? The only problem with it was that it doesn't respect insteadOf?
It looks like insteadOf was introduced as early as 1.5.5, so we can't just fall back for free, but it seems reasonable for this to just not work (our semantics for getRemoteURI() aren't tightly defined in arc, and we don't clone from it or push to it; I believe use of insteadOf to be rare in the wild and ignoring it to be non-problematic in practice).
Alternatively, we could fall back to the English-language parsing of git remote show origin, although it's really nice not to have to do that anywhere.
no one has ever reported an issue with an old version which we could not trivially resolve
Well, maybe "which we could not reasonably resolve" is more accurate.
Thanks @epriestley. RHEL6 is still widely used, and even the latest update on that line carries only git 1.7.1. So if we need a new version of git, its installation becomes a messy business with either custom / rpmforge-extras repo or do compile/install. Would prefer if we can keep the 1.7.1 support around.
In those rare cases when I have to deal with "old version causes headaches, nev version is not avail everywhere", I put handlers for old and new in place, with handler of old way spilling out warnings that "old version causes headache and I don't care anymore about it, use new".
Whoops, sorry about that.
I don't think it's a bad fallback at all (certainly preferable than leaving RHEL6 users out at any rate), it's simply that when I found ls-remote --get-url to be the cleaner approach, I neglected to check how recent the option was. To make it still work out of the box with most potential users of insteadOf, we could use either command depending on git version if you think that's worth the hassle.
Alternatively, we could fall back to the English-language parsing of git remote show origin, although it's really nice not to have to do that anywhere.
gods help you @epriestley