
`utf8_decode` is provided by XML extension, which may not be installed
Closed, Resolved (Public)

Description

On an Ubuntu 16.04 machine with the php7.0 package installed, spelling correction errors out with:

arc dieff
[2016-09-29 18:59:12] EXCEPTION: (Error) Call to undefined function utf8_decode() at [<phutil>/src/utils/utf8.php:236]
arcanist(head=master, ref.master=89e8b4852384), dblib(head=bin, ref.master=a6f1cdafe4af, ref.bin=1537c1180274), phutil(head=master, ref.master=9c03af69571f)
  #0 phutil_utf8_strlen(string) called at [<phutil>/src/parser/argument/PhutilArgumentSpellingCorrector.php:120]
  #1 PhutilArgumentSpellingCorrector::correctSpelling(string, array) called at [<arcanist>/src/configuration/ArcanistConfiguration.php:148]
  #2 ArcanistConfiguration::selectWorkflow(string, array, ArcanistConfigurationManager, PhutilConsole) called at [<arcanist>/scripts/arcanist.php:193]

This is because the php7.0 package does not install the XML extension, which is what provides utf8_decode. Installing the php7.0-xml package provides the extension and fixes the error, but it would be good to either report the missing extension more clearly or have a workaround for such installs (which are likely commonplace).
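As a rough illustration of what clearer reporting could look like, here is a minimal sketch that checks for required functions up front and names the package that should provide them. The helper name example_missing_functions() is hypothetical and not part of libphutil or arcanist; the function-to-package mapping is taken from this report.

```php
<?php

// Hypothetical helper (not a libphutil API): given a map of
// function name => package name, return the entries whose function
// is not defined in the running PHP build.
function example_missing_functions(array $required) {
  $missing = array();
  foreach ($required as $function => $package) {
    if (!function_exists($function)) {
      $missing[$function] = $package;
    }
  }
  return $missing;
}

// Mapping assumed from this report: on Ubuntu 16.04, utf8_decode()
// comes from the php7.0-xml package.
$missing = example_missing_functions(array(
  'utf8_decode' => 'php7.0-xml',
));

foreach ($missing as $function => $package) {
  // STDERR is available because this is meant to run from the CLI.
  fwrite(
    STDERR,
    "Function {$function}() is unavailable; installing the ".
    "{$package} package should provide it.\n");
}
```

Running a check like this before dispatching a workflow would replace the bare "Call to undefined function" exception with an actionable message.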

Event Timeline

T9640 and T2383 are somewhat related.

We can use count(phutil_utf8v_combined($str)) as a fallback, but that's substantially slower on large inputs (it also has slightly different -- but probably more-correct -- behavior).

We also don't have any arc 🐗 --☃ commands and are unlikely to ever implement any, so strlen() is likely no worse in real situations. The exception is that PhutilArgumentSpellingCorrector might see third-party off-label use to correct things other than CLI arguments.

In the short term, I'll accept a patch to do this if you want to test it:

if (function_exists('utf8_decode')) {
  return strlen(utf8_decode($str));
}
return count(phutil_utf8v($str));

phutil_utf8v_combined() is probably more correct, but phutil_utf8v() should have more similar behavior to utf8_decode().
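For anyone who wants to try the fallback idea without libphutil loaded, here is a self-contained sketch. It does not use phutil_utf8v(); instead it swaps in a byte-level count of non-continuation bytes, which should give the number of code points for well-formed UTF-8. Both function names are hypothetical, and behavior on malformed input may differ from utf8_decode().

```php
<?php

// Hypothetical stand-in for the fallback branch: count every byte that
// is NOT a UTF-8 continuation byte (10xxxxxx). For well-formed UTF-8,
// each code point contributes exactly one such byte.
function example_count_codepoints_bytewise($str) {
  $count = 0;
  $len = strlen($str);
  for ($ii = 0; $ii < $len; $ii++) {
    if ((ord($str[$ii]) & 0xC0) !== 0x80) {
      $count++;
    }
  }
  return $count;
}

// Hypothetical wrapper mirroring the shape of the patch above: prefer
// utf8_decode() when it exists, otherwise fall back to the byte scan.
function example_utf8_codepoint_length($str) {
  if (function_exists('utf8_decode')) {
    return strlen(utf8_decode($str));
  }
  return example_count_codepoints_bytewise($str);
}
```

Like phutil_utf8v(), this counts a base character and a combining mark as two separate items, which is why it is closer to the utf8_decode() behavior than the combined variant.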

In the longer term, I'd rather rename this to be more clear anyway (phutil_utf8_glyph_length(...) vs phutil_utf8_codepoint_length() or somesuch).
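To make the glyph/codepoint distinction concrete, here is a small sketch using PCRE rather than the libphutil functions. It assumes PCRE was built with Unicode support (the usual case), treats a "glyph" as an extended grapheme cluster matched by \X, and uses hypothetical example_ names rather than the proposed phutil_ ones.

```php
<?php

// Hypothetical sketch: number of code points, assuming valid UTF-8.
// With the /u modifier "." matches one code point; /s lets it match
// newlines too.
function example_codepoint_length($str) {
  return preg_match_all('/./us', $str);
}

// Hypothetical sketch: number of user-visible glyphs, approximated as
// extended grapheme clusters (\X), so a base character plus its
// combining marks counts once.
function example_glyph_length($str) {
  return preg_match_all('/\X/u', $str);
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT.
$combined = "e\xCC\x81";
$codepoints = example_codepoint_length($combined);
$glyphs = example_glyph_length($combined);
```

The two counts agree for plain ASCII and diverge as soon as combining characters appear, which is the ambiguity the rename is meant to remove.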

Looks like we use the function to mean "codepoints" in some places in phabricator/.

@HarryOtto, your message is off topic in this thread and is so tenuously related to anything here that it looks like it might be SEO linkbuilding spam, so I've removed it.

If you're having a problem that you believe is because of a bug in Phabricator, please follow the instructions in Contributing Bug Reports.

Otherwise, see Support Resources.

T11744 has a vaguely-related special case where "split + count" was unreasonably slow for very large inputs.

I think this probably motivates trying to bring split/count behaviors under a single umbrella like PhutilUTF8StringTruncator, which is better able to handle needs and apply optimizations than the lighter-weight APIs which came before it.

alexmv claimed this task.

Fixed by D17188.