
Commit Message Parsing Fails With Unicode
Closed, Duplicate, Public

Description

The PhabricatorRepositoryGitCommitMessageParserWorker task seems to fail if the commit message contains certain Unicode characters.

Running ./bin/phd debug task produces the following output:

<VERB> PhabricatorTaskmasterDaemon Task 5298049 failed!
[2014-04-06 13:44:02] EXCEPTION: (PhutilProxyException) Error while executing task ID 5298049 from queue. {>} (AphrontQueryCharacterSetException) Attempting to construct a query containing characters outside of the Unicode Basic Multilingual Plane. MySQL will silently truncate this data if it is inserted into a `utf8` column. Use the `%B` conversion to escape binary strings data. at [/data/www/libphutil/src/aphront/storage/connection/mysql/AphrontMySQLDatabaseConnectionBase.php:334]
  #0 AphrontMySQLDatabaseConnectionBase::validateUTF8String(no longer including ???) called at [/data/www/libphutil/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:12]
  #1 AphrontMySQLiDatabaseConnection::escapeUTF8String(no longer including ???) called at [/data/www/libphutil/src/xsprintf/qsprintf.php:176]
  #2 xsprintf_query(Object AphrontMySQLiDatabaseConnection, %s = %s, 6, no longer including ???, 7) called at [/data/www/libphutil/src/xsprintf/xsprintf.php:63]
  #3 xsprintf(xsprintf_query, Object AphrontMySQLiDatabaseConnection, Array of size 3 starting with: { 0 => %C = %ns }) called at [/data/www/libphutil/src/xsprintf/qsprintf.php:66]
  #4 qsprintf(Object AphrontMySQLiDatabaseConnection, %C = %ns, summary, no longer including ???) called at [/data/www/phabricator/src/infrastructure/storage/lisk/LiskDAO.php:1163]
  #5 LiskDAO::update() called at [/data/www/phabricator/src/infrastructure/storage/lisk/LiskDAO.php:1100]
  #6 LiskDAO::save() called at [/data/www/phabricator/src/applications/repository/storage/PhabricatorRepositoryCommit.php:107]
  #7 PhabricatorRepositoryCommit::save() called at [/data/www/phabricator/src/applications/repository/worker/commitmessageparser/PhabricatorRepositoryCommitMessageParserWorker.php:65]
  #8 PhabricatorRepositoryCommitMessageParserWorker::updateCommitData(Object DiffusionCommitRef) called at [/data/www/phabricator/src/applications/repository/worker/commitmessageparser/PhabricatorRepositoryGitCommitMessageParserWorker.php:15]
  #9 PhabricatorRepositoryGitCommitMessageParserWorker::parseCommit(Object PhabricatorRepository, Object PhabricatorRepositoryCommit) called at [/data/www/phabricator/src/applications/repository/worker/PhabricatorRepositoryCommitParserWorker.php:43]
  #10 PhabricatorRepositoryCommitParserWorker::doWork() called at [/data/www/phabricator/src/infrastructure/daemon/workers/PhabricatorWorker.php:84]
  #11 PhabricatorWorker::executeTask() called at [/data/www/phabricator/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:122]
  #12 PhabricatorWorkerActiveTask::executeTask() called at [/data/www/phabricator/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:19]
  #13 PhabricatorTaskmasterDaemon::run() called at [/data/www/libphutil/src/daemon/PhutilDaemon.php:85]
  #14 PhutilDaemon::execute() called at [/data/www/libphutil/scripts/daemon/exec/exec_daemon.php:112]
>>> [16] <event> daemon.didLogMessage <listeners = 2>
<<< [16] <event> 3,568 us
[2014-04-06 13:44:02] EXCEPTION: (AphrontQueryCharacterSetException) Attempting to construct a query containing characters outside of the Unicode Basic Multilingual Plane. MySQL will silently truncate this data if it is inserted into a `utf8` column. Use the `%B` conversion to escape binary strings data. at [/data/www/libphutil/src/aphront/storage/connection/mysql/AphrontMySQLDatabaseConnectionBase.php:334]
  #0 AphrontMySQLDatabaseConnectionBase::validateUTF8String(>>> [14] <connect> phabricator_worker

Event Timeline

joshuaspence raised the priority of this task to High.
joshuaspence updated the task description.
joshuaspence added a subscriber: joshuaspence.
joshuaspence added projects: Daemons, Diffusion.

It seems that this issue also affects Maniphest, as I could not include the Unicode characters from the output of ./bin/phd debug task (instead, I replaced them with ???).

This is expected if the text contains characters that require 4 bytes to represent in UTF-8 (characters outside of the Basic Multilingual Plane). Is that the case? Often, these are emoji characters. If so, there's discussion in T1191.

If they're not emoji, we can look at some other possible issues.

Unicode is expected to work properly for all BMP characters in UTF-8: ☃
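As a rough illustration of the BMP distinction described above, the following Python sketch lists the code points in a string that fall outside the BMP. The function name `find_non_bmp` and the sample messages are hypothetical and are not part of Phabricator or libphutil; the sketch only mirrors the kind of check the exception message describes.

```python
def find_non_bmp(text):
    """Return (index, code point label) pairs for characters above U+FFFF."""
    return [
        (i, "U+{:04X}".format(ord(ch)))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF  # beyond the Basic Multilingual Plane
    ]


# U+2603 (snowman) is inside the BMP and takes 3 bytes in UTF-8.
bmp_message = "Fix build \u2603"
# U+1F600 (an emoji) is outside the BMP and takes 4 bytes in UTF-8.
non_bmp_message = "Fix build \U0001F600"

print(find_non_bmp(bmp_message))
print(find_non_bmp(non_bmp_message))
```

A commit message for which this returns a non-empty list is the kind of input that would be expected to trigger the exception in the trace.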

My knowledge of Unicode isn't up to scratch, but the symbol in question is U+1F4E7, which is presumably outside the BMP.

Yeah, the UTF-8 encoding of that character is \xF0\x9F\x93\xA7, which is four bytes long. I'm going to merge this into T1191, since it's the same root issue.
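The byte sequence quoted above can be checked with a short Python sketch. It is only an illustration, not code from Phabricator; `utf8_escaped` is a hypothetical helper name.

```python
def utf8_escaped(code_point):
    """Encode one code point as UTF-8; return (escaped bytes, byte count)."""
    encoded = chr(code_point).encode("utf-8")
    escaped = "".join("\\x{:02X}".format(b) for b in encoded)
    return escaped, len(encoded)


# U+1F4E7 is the symbol from the failing commit message.
escaped, length = utf8_escaped(0x1F4E7)
print(escaped, length)
```

Any code point above U+FFFF encodes to four bytes in UTF-8, which is what the validation in the trace rejects.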

We've been seeing more of these issues recently so it seems inevitable that we'll have to support utf8mb4 before we can drop support for older MySQL. I'll make an effort to move T1191 (and the vaguely-related T4045) forward soon.

✘ Merged into T1191.