
EXCEPTION: (AphrontCharacterSetQueryException) Attempting to construct a query using a non-utf8 string when utf8 is expected
Open, Needs Triage, Public

Assigned To
None
Authored By
TiTi
Feb 13 2015, 1:20 PM
Tags
None
Referenced Files
F690786: nonutf8-hg-repo.tar.gz
Aug 5 2015, 3:04 PM
F382231: failures.png
Apr 24 2015, 3:27 PM

Description

Hi,

I'm opening a new issue for what I described in T6315, because that task is closed and I'm still having the issue after upgrading Phabricator.

While importing a Mercurial repository, some commits were not imported:

rd@ly-phabricator-test:~/phabricator$ ./bin/repository importing E
rEfa3b374825f1f99d53bad22a1d4c52834342db93 Change, Owners, Herald
rE85369cdf3eb47d88cb39d11079b9dbf8b925371d Change, Owners, Herald
rE1ffcd1eea76daf11ff5b25605f8bc630af646227 Change, Owners, Herald
rE65718876b5791a548e5a0c6f6c136d38616842c4 Change, Owners, Herald
rEb46a1d9527c67a7b64993775e46d92f1d0608f2c Change
rd@ly-phabricator-test:~/phabricator$

I had to mark the repository as imported to move on:

./bin/repository mark-imported <repository>

But I still have a huge number of failures in the daemon log:

PhabricatorRepositoryMercurialCommitChangeParserWorker - Expires 1m - Failures 20971

Phabricator is constantly retrying to parse those commits...
And sometimes it seems to impact the polling: the repository is no longer updated, as if polling had stopped.
I had to install a cron job to restart the daemons from time to time...


This is what I got when reparsing with the new syntax:

rd@ly-phabricator-test:~/phabricator$ ./bin/repository reparse rEfa3b374825f1f99d53bad22a1d4c52834342db93 --message --change --owners --herald
You are about to recreate the relationship entries between the commits and the packages they touch. This might delete some existing relationship entries for some old commits.


    Are you ready to continue? [y/N] y

[2015-02-13 12:54:52] EXCEPTION: (AphrontCharacterSetQueryException) Attempting to construct a query using a non-utf8 string when utf8 is expected. Use the `%B` conversion to escape binary strings data. at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:331]
  #0 AphrontBaseMySQLDatabaseConnection::validateUTF8String(string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:10]
  #1 AphrontMySQLiDatabaseConnection::escapeUTF8String(string) called at [<phutil>/src/xsprintf/qsprintf.php:170]
  #2 xsprintf_query(AphrontMySQLiDatabaseConnection, string, integer, string, integer) called at [<phutil>/src/xsprintf/xsprintf.php:70]
  #3 xsprintf(string, AphrontMySQLiDatabaseConnection, array) called at [<phutil>/src/xsprintf/qsprintf.php:64]
  #4 qsprintf(AphrontMySQLiDatabaseConnection, string, string, string) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryCommitChangeParserWorker.php:58]
  #5 PhabricatorRepositoryCommitChangeParserWorker::lookupOrCreatePaths(array) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryMercurialCommitChangeParserWorker.php:255]
  #6 PhabricatorRepositoryMercurialCommitChangeParserWorker::parseCommitChanges(PhabricatorRepository, PhabricatorRepositoryCommit) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryCommitChangeParserWorker.php:30]
  #7 PhabricatorRepositoryCommitChangeParserWorker::parseCommit(PhabricatorRepository, PhabricatorRepositoryCommit) called at [<phabricator>/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:44]
  #8 PhabricatorRepositoryCommitParserWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:91]
  #9 PhabricatorWorker::executeTask() called at [<phabricator>/src/applications/repository/management/PhabricatorRepositoryManagementReparseWorkflow.php:297]
  #10 PhabricatorRepositoryManagementReparseWorkflow::execute(PhutilArgumentParser) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:396]
  #11 PhutilArgumentParser::parseWorkflowsFull(array) called at [<phutil>/src/parser/argument/PhutilArgumentParser.php:292]
  #12 PhutilArgumentParser::parseWorkflows(array) called at [<phabricator>/scripts/repository/manage_repositories.php:22]
rd@ly-phabricator-test:~/phabricator$

In fact the issue comes from fa3b374825f1; the following commits are merges that include the same modification, so let's focus on that one.
This commit contains several added files, and one of them caught my attention:

diff -r cbb86f87066e -r fa3b374825f1 Delivery/Facture-Avoir N° 094169.PDF

I suspect the filename Facture-Avoir N° 094169.PDF is the root cause, even though it displays fine in Windows Explorer.
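
A quick way to check that suspicion, as a minimal Python sketch rather than anything Phabricator itself runs. It assumes the Windows client stored the name in Windows-1252, which this report does not confirm, and `is_valid_utf8` is a hypothetical helper:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "Facture-Avoir N° 094169.PDF"

# Assumption: a Windows client wrote the name using code page 1252.
as_cp1252 = name.encode("cp1252")
as_utf8 = name.encode("utf-8")

print(is_valid_utf8(as_cp1252))  # False: the lone 0xB0 byte is not valid UTF-8
print(is_valid_utf8(as_utf8))    # True: "°" becomes the two bytes 0xC2 0xB0
```

If the assumption holds, the name is perfectly displayable on Windows but is rejected by any strict UTF-8 check.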

  • Can I do something to stop Phabricator from parsing those commits?
  • Can this exception be fixed?

Thank you

Event Timeline


What version of MySQL are you running? Is your instance at HEAD?

Hello,
Thanks for the quick answer.

MySQL version + distribution:

rd@ly-phabricator-test:~/phabricator$ mysql --version
mysql  Ver 14.14 Distrib 5.5.41, for debian-linux-gnu (x86_64) using readline 6.3
rd@ly-phabricator-test:~/phabricator$ cat /etc/issue
Ubuntu 14.04.1 LTS \n \l
rd@ly-phabricator-test:~/phabricator$

Yes, I've upgraded Phabricator as mentioned.
(Using http://www.phabricator.com/rsrc/install/update_phabricator.sh from the installation guide)
I've retried just now.
Same exception.

Which git commit is your Phabricator instance at? Did you upgrade libphutil as well? Which commits are both of them at?

As for the filename in Windows Explorer: do not expect UTF-8 on Windows. It may be UCS-2 or, if you're lucky, UTF-16... So most probably your issue exposes some problem with filename handling :)

I was at the following version:

But I've just upgraded and I'm now at:

(which break the "Recent Tasks" panel by the way)

Still the same :-/

Indeed, my problem is related to a filename from Windows. However, it is a very old commit in my repository, and it causes a huge number of failures in the daemon Leased Tasks:

failures.png (244×1 px, 29 KB)

It also fills the task queue with useless work.

That wouldn't bother me much, but it seems to break the PhabricatorRepositoryPullLocalDaemon: sometimes the daemons get stuck. I have to restart them every few hours or so with a crontab so that new commits get pulled and parsed.
I'm not quite sure this is related, but with the tasks adding up in the queue and the huge number of failures...

I'm wondering why Phabricator keeps trying to parse those commits, even though I've run ./bin/repository mark-imported <repository> on it.

I believe I have the same problem. The cause appears to be a filename that is not valid UTF-8 -- in my case, it contains an umlaut. I have uploaded a tarball of a Mercurial repository that can be used to reproduce the problem:

The file it contains is called foo.txt - Verknüpfung.lnk. In an old repository of mine, such a file appears in two commits and was apparently added by accident -- creating such a file on German-locale Windows is unfortunately easy to do by mistake ("Verknüpfung" translates to "link," and you get it by clicking "Create Link" in the context menu). I mention this because it may have an impact on prioritisation; most coders will never put special characters into file names on purpose (except perhaps for test cases?), but this sort of thing makes it possible to one day find one in your repo anyway.

Anyway, when importing the Repo in that tarball, I get

[05-Aug-2015 16:12:37 Europe/Berlin] [2015-08-05 16:12:37] EXCEPTION: (PhutilProxyException) Error while executing Task ID 75618. {>} (AphrontCharacterSetQueryException) Attempting to construct a query using a non-utf8 string when utf8 is expected. Use the `%B` conversion to escape binary strings data. at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:334]
[05-Aug-2015 16:12:37 Europe/Berlin] arcanist(head=master, ref.master=fb5d5b86fadf), phabricator(head=master, ref.master=135d0c9ee7a6), phutil(head=master, ref.master=b8420e193467)
[05-Aug-2015 16:12:37 Europe/Berlin]   #0 <#2> AphrontBaseMySQLDatabaseConnection::validateUTF8String(string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:10]
[05-Aug-2015 16:12:37 Europe/Berlin]   #1 <#2> AphrontMySQLiDatabaseConnection::escapeUTF8String(string) called at [<phutil>/src/xsprintf/qsprintf.php:178]
[05-Aug-2015 16:12:37 Europe/Berlin]   #2 <#2> xsprintf_query(AphrontMySQLiDatabaseConnection, string, integer, string, integer) called at [<phutil>/src/xsprintf/xsprintf.php:70]
[05-Aug-2015 16:12:37 Europe/Berlin]   #3 <#2> xsprintf(string, AphrontMySQLiDatabaseConnection, array) called at [<phutil>/src/xsprintf/qsprintf.php:64]
[05-Aug-2015 16:12:37 Europe/Berlin]   #4 <#2> qsprintf(AphrontMySQLiDatabaseConnection, string, string, string) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryCommitChangeParserWorker.php:58]
[05-Aug-2015 16:12:37 Europe/Berlin]   #5 <#2> PhabricatorRepositoryCommitChangeParserWorker::lookupOrCreatePaths(array) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryMercurialCommitChangeParserWorker.php:255]
[05-Aug-2015 16:12:37 Europe/Berlin]   #6 <#2> PhabricatorRepositoryMercurialCommitChangeParserWorker::parseCommitChanges(PhabricatorRepository, PhabricatorRepositoryCommit) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryCommitChangeParserWorker.php:30]
[05-Aug-2015 16:12:37 Europe/Berlin]   #7 <#2> PhabricatorRepositoryCommitChangeParserWorker::parseCommit(PhabricatorRepository, PhabricatorRepositoryCommit) called at [<phabricator>/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:44]
[05-Aug-2015 16:12:37 Europe/Berlin]   #8 <#2> PhabricatorRepositoryCommitParserWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:91]
[05-Aug-2015 16:12:37 Europe/Berlin]   #9 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:162]
[05-Aug-2015 16:12:37 Europe/Berlin]   #10 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
[05-Aug-2015 16:12:37 Europe/Berlin]   #11 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:183]
[05-Aug-2015 16:12:37 Europe/Berlin]   #12 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:125]

Steps to reproduce:

  1. Download tarball, unpack it, cd into the repository
  2. hg serve
  3. Import from http://localhost:8000 into Diffusion

I have fiddled a little with the code, with limited success. Replacing (%s, %s) with (%B, %s) in PhabricatorRepositoryCommitChangeParserWorker::lookupOrCreatePaths makes this error go away, but another one pops up:

[05-Aug-2015 16:49:49 Europe/Berlin] [2015-08-05 16:49:49] ERROR 8: Undefined index: /foo.txt - Verknüpfung.lnk at [/opt/phabricator/phabricator/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryMercurialCommitChangeParserWorker.php:258]

(note that in the logfile, the "ü" in the error message is iso8859-1-encoded). This refers to a place where it says

$changes[$key]['pathID'] = $path_map[$change['path']];

The reason for this is that the path field in the phabricator_repository.repository_path table is utf8mb4-encoded, so the %B-formatted string is cut off before the umlaut:

MariaDB [(none)]> SELECT * FROM phabricator_repository.repository_path WHERE path LIKE '%Verkn' ORDER BY path;
+------+----------------------------+----------------------------------+
| id   | path                       | pathHash                         |
+------+----------------------------+----------------------------------+
| 8775 | /foo.txt - Verkn           | e794ba50efd359713f2903e32f0fc5bf |
+------+----------------------------+----------------------------------+

It appears that "/foo.txt - Verkn" becomes the key in $path_map, while $change['path'] still contains the unmangled name, so the lookup fails.
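
The mismatch can be reproduced outside the database. The following Python sketch is only an illustration under assumptions: it mimics the truncation seen in the table above by cutting at the first byte that is not valid UTF-8, it does not reproduce MySQL's actual code path, and `valid_utf8_prefix` and the map contents are hypothetical.

```python
def valid_utf8_prefix(raw: bytes) -> bytes:
    """Return the longest leading part of raw that is valid UTF-8."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte.
        return raw[: err.start]


# File name as stored in the repository (0xFC is "ü" in Windows-1252).
original = b"/foo.txt - Verkn\xfcpfung.lnk"

stored = valid_utf8_prefix(original)
print(stored)  # b'/foo.txt - Verkn'

# A map keyed by the truncated value, looked up with the unmangled name,
# as in the "Undefined index" error above.
path_map = {stored: 8775}
print(original in path_map)  # False
```

Any lookup that compares the stored value with the original bytes will miss in the same way.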

This is as far as I got right now. I think handling this would either require the file names to be saved as binary strings or converted before storage. I am not sure which approach is more sensible. I am reminded of the Makefile problem.

I am using the following revisions:

arcanist: fb5d5b86fadf909355d26741a54337a1e1ccb8ae
libphutil: b8420e193467db46748540cb0209aea2cf7e1d3c
phabricator: 135d0c9ee7a6d0552ba2f16bd36aef9b574de16c

Those are the current heads of the master branches.

Yesterday I updated to the latest version on master and I still have the same problem.

Still having the same issue, although I updated to the latest version.

[2016-03-14 08:56:39] EXCEPTION: (PhutilProxyException) Error while executing Task ID 37770. {>} (AphrontCharacterSetQueryException) Attempting to construct a query using a non-utf8 string when utf8 is expected. Use the `%B` conversion to escape binary strings data. at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:362] arcanist(head=stable, ref.master=57f6fb59d739, ref.stable=92a93ab8f475), phabricator(head=stable, ref.master=e4372e1276fd, ref.stable=71bda66870d8), phutil(head=stable, ref.master=f43291e99d36, ref.stable=c72172b9954a)
#0 <#2> AphrontBaseMySQLDatabaseConnection::validateUTF8String(string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:10]
#1 <#2> AphrontMySQLiDatabaseConnection::escapeUTF8String(string) called at [<phutil>/src/xsprintf/qsprintf.php:178]
#2 <#2> xsprintf_query(AphrontMySQLiDatabaseConnection, string, integer, string, integer) called at [<phutil>/src/xsprintf/xsprintf.php:70]
#3 <#2> xsprintf(string, AphrontMySQLiDatabaseConnection, array) called at [<phutil>/src/xsprintf/qsprintf.php:64]
#4 <#2> qsprintf(AphrontMySQLiDatabaseConnection, string, string, string) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryCommitChangeParserWorker.php:54]
#5 <#2> PhabricatorRepositoryCommitChangeParserWorker::lookupOrCreatePaths(array) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryMercurialCommitChangeParserWorker.php:255]
#6 <#2> PhabricatorRepositoryMercurialCommitChangeParserWorker::parseCommitChanges(PhabricatorRepository, PhabricatorRepositoryCommit) called at [<phabricator>/src/applications/repository/worker/commitchangeparser/PhabricatorRepositoryCommitChangeParserWorker.php:26]
#7 <#2> PhabricatorRepositoryCommitChangeParserWorker::parseCommit(PhabricatorRepository, PhabricatorRepositoryCommit) called at [<phabricator>/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:40]
#8 <#2> PhabricatorRepositoryCommitParserWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:122]
#9 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:171]
#10 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
#11 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:183]
#12 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:125]

Well, I have looked into it, and the problem is that Windows encodes file names in a non-UTF-8 encoding (usually Windows-1252) and those file names get committed to Mercurial as-is. The parser then extracts these file names assuming they are correctly encoded in UTF-8, and since they aren't, other classes (like the commit parsers) do the wrong thing (for example, trying to insert them with %s when they aren't UTF-8).

One workaround is to convert them when needed in the parser, as you can see in the following diff:

index 8317cc7..9047c69 100644
--- a/src/repository/parser/ArcanistMercurialParser.php
+++ b/src/repository/parser/ArcanistMercurialParser.php
@@ -41,6 +41,13 @@ final class ArcanistMercurialParser extends Phobject {
       }
       $code = $line[0];
       $path = substr($line, 2);
+
+      if (!preg_match('!!u', $path))
+      {
+          // this isn't utf8, convert it.
+          $path = utf8_encode($path);
+      }
+
       switch ($code) {
         case 'A':
           $flags |= ArcanistRepositoryAPI::FLAG_ADDED;

Since we don't really know the encoding, utf8_encode (which assumes ISO-8859-1 input) can produce the wrong characters, but it will usually convert the name correctly. If we do know the encoding used, we can use iconv to do the right conversion.

I hope this helps to solve this problem.
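
The same idea, sketched in Python for illustration only (the patch above is PHP, and `to_utf8` with its `fallback` parameter is a hypothetical helper): keep names that are already valid UTF-8, otherwise decode them with an assumed legacy encoding. Windows-1252 is an assumption here; if the guess is wrong, the conversion still succeeds but produces the wrong characters.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode a file name, falling back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 names come from Windows-1252 clients.
        # errors="replace" avoids a second exception on undefined bytes.
        return raw.decode(fallback, errors="replace")


print(to_utf8(b"foo.txt - Verkn\xfcpfung.lnk"))  # foo.txt - Verknüpfung.lnk
print(to_utf8("Verknüpfung".encode("utf-8")))    # Verknüpfung (left as-is)

# A wrong guess does not raise, it silently yields different text:
print(to_utf8(b"price \x80", fallback="cp1252"))         # price €
print(repr(to_utf8(b"price \x80", fallback="latin-1")))  # 'price \x80'
```

The last two lines show why knowing the real encoding matters: the same byte maps to a euro sign in one code page and to a control character in the other.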

@PhoneixS, thanks for the hint, that workaround worked for us for the time being.

It worked for me too, the trouble in my case was created by the character § used in a filename pushed into mercurial from a Windows client.


@PhoneixS thank you for the fix. It fixed a very annoying problem here with Windows-1252 encoding in a file name!

Is this workaround now the recommended way to deal with this issue / ever going to be merged into the code base?

I have not tried this out yet but mercurial claims to respect an HGENCODING environment variable (https://www.mercurial-scm.org/doc/hg.1.html#environment-variables). It's possible that setting this on the phabricator server might fix the issue by affecting any hg commands run by phabricator. Note that the environment variable would need to be set/exported for both the system user account which phabricator web server runs under (like httpd or nginx) and the system user account which the daemons run under, as they both run mercurial commands.

To clarify, I think having the environment variable set as HGENCODING=utf-8 would be a possible solution/workaround.
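
For anyone who wants to try this without touching the service accounts first, the variable can be set per process. This is a minimal Python sketch of that mechanism, not how Phabricator invokes Mercurial; `hg_environment` and `list_tracked_files` are hypothetical helpers, and whether the setting changes the reported file names is untested here.

```python
import os
import subprocess


def hg_environment(base=None, encoding="utf-8"):
    """Return a copy of the environment with HGENCODING set."""
    env = dict(os.environ if base is None else base)
    env["HGENCODING"] = encoding
    return env


def list_tracked_files(repo_path):
    """Run `hg files` in repo_path with HGENCODING applied; returns raw bytes."""
    result = subprocess.run(
        ["hg", "files"],
        cwd=repo_path,
        env=hg_environment(),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

Returning the raw bytes rather than decoded text makes it possible to see exactly which bytes Mercurial emits for each file name.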

I don't think HGENCODING applies to the content of changesets, since that would invalidate the changeset hashes and the whole revision graph with it.

With the small test repo I attached to an earlier comment in this issue, I get the following behavior:

$ HGENCODING=UTF-8 hg files
foo.txt
foo.txt - Verkn<FC>pfung.lnk

where <FC> is the Windows-1252 code for the character ü.

So that doesn't convert file names. I'm guessing it's even possible for a repo to contain two files that have the same file name in different encodings, so this behaviour doesn't seem surprising -- from Mercurial's point of view, it'd open a can of worms. I wonder if the same thing could happen in a git repo.
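
To illustrate that guess at the byte level (a Python sketch only; whether Mercurial would actually track both entries side by side is not verified here, and the `manifest` set is a hypothetical stand-in for a byte-keyed file list):

```python
legacy = "Verknüpfung.lnk".encode("cp1252")  # contains the single byte 0xFC
modern = "Verknüpfung.lnk".encode("utf-8")   # contains the bytes 0xC3 0xBC

print(legacy == modern)  # False: different byte sequences

# Decoded with the right encoding each, both give the same visible name.
print(legacy.decode("cp1252") == modern.decode("utf-8"))  # True

# A store keyed by raw bytes treats them as two distinct entries.
manifest = {legacy, modern}
print(len(manifest))  # 2
```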

Hmm, okay, that might only take effect at the time of making commits. That said, I've run into other issues where Phabricator/Arcanist fails to parse the contents of a commit message, and forcing Mercurial to interpret it as UTF-8 via --encoding utf-8 in the command has worked.