When LANG=POSIX, executing commands with UTF8 characters in the argument list fails
Closed, Resolved (Public)


(Original Title: Commits may partially fail to import when affecting files with UTF8 characters)

See PHI343. An install reports an issue importing a commit which affects a file containing an en-dash.

The import stalls in the GitCommitMessageParserWorker step.

Event Timeline

epriestley triaged this task as Normal priority. (Feb 3 2018, 1:47 PM)
epriestley created this task.

To try to reproduce this, I did this:

$ nano 'en-dash:–.txt'
$ git add 'en-dash:–.txt'
$ git commit -am endash
[master c7ca66f] endash
 1 file changed, 1 insertion(+)
 create mode 100644 "en-dash:\342\200\223.txt"

Note that the filename contains an en-dash and the "create mode" output reports it with octal escapes.
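
For reference, here is a minimal sketch of turning that quoted form back into raw bytes. The helper name decode_git_quoted_path() is hypothetical and is not part of Phabricator or Git; it relies on PHP's stripcslashes(), which understands C-style escapes including \ooo octal sequences. Git can also be asked not to quote such paths at all (core.quotePath=false, or the -z flag on commands that support it).

```php
<?php

/**
 * Hypothetical helper: decode a path as Git prints it when it quotes
 * "unusual" bytes, i.e. wrapped in double quotes with C-style escapes.
 */
function decode_git_quoted_path($path) {
  // Paths without surrounding double quotes are printed verbatim.
  if (strlen($path) < 2 || $path[0] !== '"' || substr($path, -1) !== '"') {
    return $path;
  }

  // Drop the surrounding quotes, then expand \ooo, \", \\ and similar.
  return stripcslashes(substr($path, 1, -1));
}

// Single-quoted PHP strings keep the backslashes literal, as Git printed them.
$quoted = '"en-dash:\342\200\223.txt"';
$decoded = decode_git_quoted_path($quoted);

// Show the decoded bytes in hex so the result does not depend on the terminal.
echo bin2hex($decoded)."\n";
```

The decoded string should contain the three UTF-8 bytes of the en-dash (E2 80 93) in place of the octal escapes.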

This actually works better locally than it did in production. I get a full import, although the view of the commit in Diffusion is significantly broken:

  • The table of contents shows the filename with the surrounding quotes still attached.
  • The diff fails to load.

The original problem involved an en-dash surrounded by spaces, so I'm going to see if that reproduces anything.

Same results for spaces.

The original problem involved a binary file, so I'm going to see if I can reproduce that.

Same result.

The original problem occurred inside DifferentialDiffExtractionEngine so maybe the conditions are:

  • En dash.
  • Binary file.
  • Attached to a revision.

Trying that.

No luck. I don't even see any bugs in Differential locally with that change (the en-dash decodes and renders properly), so this might be in the realm of T7339.

I now strongly believe this is some kind of terminal environment problem. This script has different behavior in production and on my laptop:

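#!/usr/bin/env php
<?php

// Load the script environment, which provides execx().
$root = dirname(dirname(__FILE__));
require_once $root.'/scripts/init/init-script.php';

// U+2013 EN DASH, encoded as UTF-8.
$endash = "\xE2\x80\x93";
echo ' Input: <'.$endash.">\n";
// Pass the en-dash as a shell-escaped argument (%s) and capture stdout.
list($stdout) = execx('echo %s', $endash);
echo 'Output: <'.rtrim($stdout).">\n";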


In production:

 Input: <–>
Output: <>

(Zero bytes are echoed.)


On my laptop:

 Input: <–>
Output: <–>

The input is echoed faithfully.

We should already be setting LANG=en_US.UTF-8 in the subprocess environment. Explicitly setting LANG=en_US.UTF-8 for the subprocess has no effect, nor does adding putenv('LANG=en_US.UTF-8'); before the subprocess executes.

However, setting LANG=en_US.UTF-8 or LC_ALL=en_US.UTF-8 in the environment that launches the script (as in $ LANG=en_US.UTF-8 ./bin/test) does work.
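
A small diagnostic sketch may help narrow this down. It rests on two assumptions that this thread does not confirm: that PHP versions before 8.0 read LC_CTYPE from the environment once at startup, and that the argument escaping ultimately goes through escapeshellarg(), which consults the current LC_CTYPE. If both hold, putenv() arrives too late to change the locale already in effect, while LANG set outside the script is picked up.

```php
<?php

$endash = "\xE2\x80\x93";

// What the environment says.
echo 'LANG (env):   '.var_export(getenv('LANG'), true)."\n";

// What this PHP process is actually using. Passing '0' queries the
// current setting without changing it.
$before = setlocale(LC_CTYPE, '0');
echo 'LC_CTYPE:     '.$before."\n";

// putenv() edits the environment, not the locale already in effect.
putenv('LANG=en_US.UTF-8');
$after = setlocale(LC_CTYPE, '0');
echo 'After putenv: '.$after."\n";

// Hex dump of the escaped argument, so stripped bytes are easy to spot.
echo 'Escaped:      '.bin2hex(escapeshellarg($endash))."\n";
```

If the hex dump shows only the two quote characters (2727), the en-dash was dropped during escaping, before the subprocess ever started.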

I suspect this is a very, very deep hole. For now, I'm going to:

  • Manually change the /etc/environment on repo001 to include LANG=en_US.UTF-8.
  • See if that resolves PHI343.

I expect it will, and then we can figure out exactly how deep this goes at a later date.

Setting LANG in /etc/environment had no effect on anything.

The internet says I should edit /etc/default/locale, but this already contains LANG=en_US.UTF-8, and locale still reports POSIX.

sudo update-locale LC_ALL=en_US.UTF-8 LANG=en_US.UTF-8 also does not work, although this is apparently just a Perl script that rewrites /etc/default/locale for you.

Thankfully, it seems that setlocale(LC_ALL, 'en_US.UTF-8'); from within PHP does work, so we don't need to figure out what PAM initialization cache file Ubuntu is actually reading LANG settings from.
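
The workaround can be sketched in plain PHP, without libphutil. This is an illustration under assumptions rather than the actual change: it uses shell_exec() and escapeshellarg() as a stand-in for execx(), and it assumes the en_US.UTF-8 locale is installed on the host.

```php
<?php

$endash = "\xE2\x80\x93";

// Assumption: en_US.UTF-8 is installed (check with `locale -a`).
// setlocale() returns false if the locale cannot be selected.
$locale = setlocale(LC_ALL, 'en_US.UTF-8');
if ($locale === false) {
  error_log('en_US.UTF-8 is not available on this host.');
}

// With a UTF-8 LC_CTYPE in effect, the multibyte argument should
// survive escaping and reach the subprocess intact.
$command = 'echo '.escapeshellarg($endash);
$stdout = (string)shell_exec($command);

echo ' Input: <'.$endash.">\n";
echo 'Output: <'.rtrim($stdout).">\n";
```

Checking the return value of setlocale() matters here: if the locale is missing, the call fails quietly and the process keeps its previous locale.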

epriestley renamed this task from "Commits may partially fail to import when affecting files with UTF8 characters" to "When LANG=POSIX, executing commands with UTF8 characters in the argument list fails". (Feb 3 2018, 4:07 PM)
epriestley updated the task description.

setlocale(...) doesn't (?) appear to work from within Apache (?).

I think I didn't kick Apache hard enough since I eventually rammed this through with setlocale().