Page MenuHomePhabricator

Raise a setup warning when the "en_US.UTF-8" locale is unavailable
Open, NormalPublic

Description

I commit something to my local Mercurial repository and write the log message containing some umlauts using nano. I then push the changes to Phabricator. The resulting email contains ? in subject and body, even though it has following headers set:

MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="utf-8"

Within Phabricator, the commit message also shows ? instead of the umlauts. The generated site has <meta charset="UTF-8">.

Looking at the Mercurial repository on the server directly (using hg log), the umlauts are displayed properly.

I have set LANG=en_GB.UTF-8 on my local system. The server supports the same locale. PHP has default_charset = "UTF-8" set.

The issue appears with HHVM 3.5.0 as well as PHP-FPM 5.5.21. It also showed (on the web and in emails) with old commits that were created when we were still using mod_php. I am running Apache 2.4.10-9 and MariaDB 10.0.15-3 on Debian 8 Jessie. The Mercurial versions involved are 3.0.1 (client) and 3.1.2 (server).

Related Objects

Event Timeline

devurandom raised the priority of this task from to Needs Triage.
devurandom updated the task description. (Show Details)
devurandom added a subscriber: devurandom.

I don't think we support MariaDB. My guess is that datastore is truncating the characters somehow.

@btrahan Sorry for unsubscribing you. I edited the description while you replied – saving it must have reset the subscribers to what they were when I started editing.

MariaDB is just a fork of MySQL. It runs fine for me. I don't know that we support or test HHVM.

I can't reproduce this issue locally. Though that is just with a quick test in my normal environment. Phabricator supports utf8-mb4 by default, so mangled characters are almost always an environment issue outside of our control.

Can you reproduce the issue on this install?

In T7339#97735, @chad wrote:

I don't know that we support or test HHVM.

I also tried with PHP-FPM and before with mod_php we had the same issue.

Can you reproduce the issue on this install?

I do not think I have commit access to any Mercurial repository here.

This task is what we refer to as a "fishing expedition" which essentially means is a significant amount of time (asking questions back and forth since we can't reproduce the issue) trying to get to which component is causing the issue. We have a very wide character support range with utf8-mb4, so we're going to be naturally skeptical that it's a Phabricator issue if only one user is reporting difficulty. I would say most Phabricator users are not in an English character set at this point, given the support tickets here.

fda0b086 is an example commit in the utf8-mb4 range, which displays as expected.

There are maybe a few ideas I could toss out to try:

I assume you're not using Differential, you mention just committing directly. Can you try not using nano? Just a straight command line commit? What version is Mercurial? Does going through arcanist display the same behavior? Is there another computer with a different stack you can test locally?

We are going to set up test git, svn, and hg repositories on here for users to test diffs / commits against.

  • Not using nano results in the same problem.
$ touch test
$ hg add test
$ hg commit -m 'Ümläüt Test'
  • Mercurial version
    • locally:
$ hg --version
Mercurial Distributed SCM (version 3.0.1)
  • on the server:
# hg --version
Mercurial Distributed SCM (version 3.1.2)
  • I tried with arcanist:
$ hg rm test
$ arc diff
< enter title: Ümläüt Test 2, summary: Ümläüt testing try 2

The result shows up properly in Phabricator.

  • I did not yet try the same from a different computer.

So to summarize, hg and nano generated commits = bad. arc = good?

Almost. hg generates bad commits (using nano or not), arc generates good diffs (using nano).

I just used arc land to create an actual commit and arrive at the same problem.

  • In Diffusion the commit now looks like this:
?ml??t Test 2

Summary: ?ml??t testing try 2
  • While in Differential it looks like this (slightly reformatted for readability):
Ümläüt Test 2

Summary: Ümläüt testing try 2

The Land to Hosted Repository button in Phabricator bailed out with a permission problem on /var/repo/…, probably because that directory is setup to be managed by the phd user and not the webserver.

See T7342 for what I did to actually commit the revision and what @chad is referring to in T7339#97783.

arc land --help or https://secure.phabricator.com/book/phabricator/article/arcanist_diff/

(I've never used hg before, so I'm short on help for you there)

Yep, I think after my initial try with the commit right on the default (aka master) branch, which arc land apparently dispises, I think I mostly hit Mercurial specific issues (landing bookmarks and --merge not being supported). I created T7342 on the documentation issue to not derail this task here.

rHGTEST exists for you to test our server / repository for the issue.

In T7339#98164, @chad wrote:

rHGTEST exists for you to test our server / repository for the issue.

$ hg clone ssh://dweller@secure.phabricator.com/diffusion/HGTEST/hg-test/
remote: sudo: a password is required
abort: no suitable response from remote hg!
epriestley added a subscriber: epriestley.

Sorry, my fault -- I only gave the VCS user sudo access to the git binaries when I set things up. I'll update the config when I have a chance.

I can reproduce the issue on a Mercurial repository hosted on Bitbucket: commit imported by Phabricator shows correctly UTF-8 in the diff, but not in the commit message. UTF-8 characters are handled without any issue with Git repos.

I think the HGTEST repo works now.

Seems to work here: rHGTEST0e1f30ecdecaf66a3da45a62b7c6af3ad7c5515a

A solution/workaround for this problem appears to be to generate the en_US.UTF-8 locale on the Phabricator host. Restarting Phabricator does not appear to be necessary for the change to take effect, but already-imported repositories will have to be reparsed with bin/repository reparse --all $repository --message --owners.

The reason for this is that if the locale is not present, external commands (such as hg log) fall back to the C default locale, which is not a UTF-8 locale. Under these circumstances, the hg utility replaces non-ASCII7 characters with question marks.

That utilities are run with the en_US.UTF-8 locale appears to be desired; in the PhabricatorRepository class it says

$env = array(
  // NOTE: Force the language to "en_US.UTF-8", which overrides locale
  // settings. This makes stuff print in English instead of, e.g., French,
  // so we can parse the output of some commands, error messages, etc.
  'LANG' => 'en_US.UTF-8',

So this is arguably a documentation issue. I think it would be helpful to have an error message in the log file when the locale does not exist, though.

Alternative: Following patch (for the en_GB.UTF-8 locale).

commit 6d6534ebebe881a85382850c6b9817c9e34571fa
Author: Dennis Schridde <dennis.schridde@uni-heidelberg.de>
Date:   Thu Aug 6 15:00:38 2015 +0200

    Force LANG to en_GB.UTF-8 instead of en_US.UTF-8 when dealing with repositories

diff --git a/src/applications/repository/storage/PhabricatorRepository.php b/src/applications/repository/storage/PhabricatorRepository.php
index d5b1282..39f4618 100644
--- a/src/applications/repository/storage/PhabricatorRepository.php
+++ b/src/applications/repository/storage/PhabricatorRepository.php
@@ -395,10 +395,10 @@ final class PhabricatorRepository extends PhabricatorRepositoryDAO
 
   private function getCommonCommandEnvironment() {
     $env = array(
-      // NOTE: Force the language to "en_US.UTF-8", which overrides locale
+      // NOTE: Force the language to "en_GB.UTF-8", which overrides locale
       // settings. This makes stuff print in English instead of, e.g., French,
       // so we can parse the output of some commands, error messages, etc.
-      'LANG' => 'en_US.UTF-8',
+      'LANG' => 'en_GB.UTF-8',
 
       // Propagate PHABRICATOR_ENV explicitly. For discussion, see T4155.
       'PHABRICATOR_ENV' => PhabricatorEnv::getSelectedEnvironmentName(),

Reparsing now creates correct commit messages.

Can this be made configurable? If not, it is still a documentation issue as @fkf mentioned.

Configurability is probably not a good idea -- the comment in the code says that other code parses the output from external tools and expects them to give that output in a fixed manner (particularly in English). This looks like it is hard-coded for a reason.

It's probably fine to have it run with en_GB instead of en_US because very nearly nothing will make a difference between the two, but if it is made configurable people will expect it to support de_DE@euro.ISO8859-15 or tg_TJ.KOI8-T, and that is not reasonably implementable.

From a usability point of view, I think the best solution would be to have it pop up as "You have an unresolved setup issue...", because while the issue is difficult to track down, it is easy to fix once you know about it. Telling the user that the en_US.UTF-8 locale is missing from the system and possibly how to generate it on common systems should be sufficient.

How do you detect that a locale is missing? Parse locale -a?

How do you generate locales on common systems?

(Philosophically, I would naively expect commands to fail, rather than fall back to C, if an invalid locale was specified, but this is probably unreasonable of me.)

I think we could pick any available UTF8 locale here safely, as they should only differ in collation rules (I think?) and I don't think we ever rely on VCS commands to sort text. I'd also favor a setup warning if only a small number of user are affected and it's easy to install/generate locales, though.

  • locale -a seems fine. The binary belongs to glibc (libc-bin package on Ubuntu 14.04).
  • Editing /etc/locale.gen and then executing locale-gen is the usual way to generate locales. The binary belongs to the locales package on Ubuntu 14.04 and iirc to glibc on Gentoo/Linux.

A setup warning is fine with me.

Sure, parsing locale -a would be a possibility. It's also possible to stay in PHP, for example:

function locale_exists($locale_string) {
    // Remember the old locale, since the check below is mutating.
    $old_locale = setlocale(LC_ALL, '');
    $result = false;

    // returns false if the locale is unavailable, sets the locale otherwise.
    if(setlocale(LC_ALL, $locale_string)) {
        $result = true;
    }

    // restore the old locale so things don't change outside for no reason.
    setlocale(LC_ALL, $old_locale);

    return $result;
}

// ...

if(locale_exists('en_US.UTF-8')) {
  // ...
}

It's a bit of a toss-up which way one likes better; using external tools and messing with global state for no reason both have their drawbacks. I'm not sure if there's a non-mutating way to check if setlocale would succeed in pure PHP that doesn't involve forking (I don't usually write a lot of PHP, I'm afraid). Perhaps someone has a better idea.

Generating locales on common systems: localedef is the relevant POSIX command, e.g.

localedef -f UTF-8 -i /usr/share/i18n/locales/en_US

The path could vary, but I'm not sure that it does; it's the same for RHEL 6 and Debian 8, at least. The tricky bit is to do it so that locales are regenerated upon system updates. On Debian-based systems such as Ubuntu and Mint you'd use

dpkg-reconfigure locales

for this (it's just a pretty way to edit /etc/locale.gen and run locale-gen), and on RHEL all locales are generated by default, so there should be no problem on such systems. I suspect this extends to Fedora, CentOS and SuSE as well. (This is just as well because Red Hat-ish systems don't have locale-gen) That covers the most common Linux systems, and anyway a system administrator should be able to find out how to do this thing on the system he administrates, particularly if that system isn't mainstream.

I have no idea how it works on Mac OS, although the POSIX way should at least work there, as well. If Phabricator even supports Mac OS.

Well, if there is a PHP function setlocale() which reliably returns error codes for non-existing locales, then you could simply do something like this:

if (!setlocale(LC_ALL, "en_US.UTF-8") {
  raise_setup_warning(...);
}

In that case you don't have to pass a different env to the repository subprocesses, either...

If having the rest of the code run under en_US.UTF-8 is not a problem or even desirable, then that is quite true. It would have implications for eventual localization, though.

epriestley renamed this task from Umlauts in commit messages show up as ? in emails and on the web to Raise a setup warning when the "en_US.UTF-8" locale is unavailable.Feb 2 2017, 4:13 PM
epriestley triaged this task as Normal priority.
epriestley moved this task from Triage to v3 on the Diffusion board.
epriestley edited projects, added Diffusion (v3); removed Diffusion.

For hg, set HGPLAIN=1 will disable translations. See hg help scripting for details.

D18988 addresses the issue from D18105 (git grep 🐑 fails) more directly by calling @setlocale(LC_ALL, 'en_US.UTF-8') unconditionally during startup. The motivating case in T13060 is:

$ git cat-file blob 🐑.txt

...which we can't cheat our way around like we did in D18105 by passing the 🐑 over stdin instead of via the argument list.

This may fail if en_US.UTF-8 is not available, but many systems (e.g., all our stock Ubuntu production hosts) have en_US.UTF-8 available but have a POSIX default locale where UTF-8 is broken. When en_US.UTF-8 is unavailable my expectation is that this call will silently fail and the system will be no worse off than it was before.

This does not directly address the root issue here, and we should still interact with the user about this eventually. It does probably answer this question from a very practical viewpoint, though, as @devurandom and @fkf suggested above:

How do you detect that a locale is missing?

A: Test if setlocale(LC_ALL, ...) fails.

I think the ultimate version of this will probably look something like:

$has_en = @setlocale(LC_ALL, 'en_US.UTF-8');
if ($has_en) {
  return;
}

// If we reach this point, we set a flag saying "stop passing
// en_US.UTF-8 to subprocesses" and just use the system settings
// instead, since we know en_US.UTF-8 won't help -- and it may
// hurt if the system setting is some other UTF8 setting.

$actual = nonempty(getenv('LANG'), getenv('LC_ALL'), ..., 'POSIX'); // Sketchy?
$has_utf8 = preg_match('/utf-?8/i', $actual);

if ($has_utf8) {
  
  The 'en_US.UTF-8' locale is not available on this system. Phabricator
  works best when this locale is available (for example, it occasionally
  relies on parsing English-language command output to distinguish between
  similar error states).

  This system is using another UTF-8 locale ($actual) so this probably will
  not cause significant problems, but it is recommended that you install
  en_US.UTF-8 to make sure everything works properly.

  You do not need to make this locale the default for your system;
  Phabricator will select it automatically at runtime.

} else {

  The 'en_US.UTF-8' locale is not available on this system and the current
  locale setting ({$actual}) is not a UTF-8 locale. This will prevent
  Phabricator from passing UTF-8 characters to subprocesses in command
  line argument lists, sometimes cause UTF-8 output to be mangled, and may
  severely impact your ability to work with UTF-8 data.

  It is strongly recommended that you install en_US.UTF-8.

  You do not need to make this locale the default for your system;
  Phabricator will select it automatically at runtime.

}
epriestley mentioned this in Unknown Object (Phriction Wiki Document).Feb 10 2018, 1:02 AM

See https://discourse.phabricator-community.org/t/data-truncated-when-pushing-into-repository/3586/ for what is likely to be a related issue.

Under some locale settings, (string)1.23 emits the string "1,23". This is causing issues because inserting 1,23 into a MySQL "double" column doesn't work.

This comma-cast isn't reversible, so under certain locale settings the value of (double)(string)$x is nowhere close to the original value of $x. See some extremely productive conversations here:

In the first bug, someone who sounds like they know what they're talking about says:

Getting the locale is possible by passing null to setlocale.

This isn't what the documentation says, but maybe it's true most of the time:

The return value of setlocale() depends on the system that PHP is running. It returns exactly what the system setlocale function returns.
https://www.php.net/setlocale

A different "fix" for the "how do we figure out the locale?" issue might be to just perform a functional test. We need:

  • A (default?) locale which supports UTF8 so that echo 🐑 works.
  • A numeric locale so that 1.23 casts to "1.23".

We can possibly do something like:

setlocale(LC_ALL, 'en_US.UTF-8');
setlocale(LC_NUMERIC, 'C');

if ((string)1.23 !== "1.23") {
  raise a fatal error about numeric locales
}

if (`echo 🐑` !== 🐑) {
  raise a warning about UTF8 locales
}

In this particular case, we can also avoid the use of (string) inside qsprintf(), and special case phutil_string_cast() to always use "C" rules for casting doubles. There are other reasons to prefer phutil_string_cast() over (string), but this feels like a very subtle, very hard-to-root-out source of bugs for all time and refusing to run in an environment where (double)(string)$x is significantly destructive seems like a better fix.

(Note that we can already do proper locale-aware numeric formatting in pht() for actual user-facing display strings, so no changes to behavior here impact our ability to localize -- this is just about converting numeric values into strings for other programs to consume.)