
Rendering a diff on a giant line runs for minutes, publishing transaction fails
Closed, Resolved (Public)

Description

We have a lingering, repeatedly-failing task that burns a CPU for a long time each time it runs before eventually failing. This is the same task that starred in T11744; now that that's fixed (with D16683), it gets farther and seems to reach a different bug.

Running the task now under bin/worker execute --trace, it goes for a little over 5 minutes, continuously spinning a CPU core; gets as far as making a bunch of database INSERTs and UPDATEs; and after a bunch of those have apparently succeeded, there's this:

>>> [151] <query> INSERT INTO `metamta_mail` (`actorPHID`, `parameters`, `status`, `message`, `relatedPHID`, `phid`, `dateCreated`, `dateModified`) VALUES [redacted]
<<< [151] <query> 233,365 us
>>> [152] <query> UPDATE `worker_activetask` SET `failureTime` = '1475890723', `taskClass` = 'PhabricatorApplicationTransactionPublishWorker', `leaseOwner` = NULL, `leaseExpires` = '1475890693', `failureCount` = '1885', `dataID` = '29804072', `priority` = '1000', `objectPHID` = 'PHID-DREV-olkoii564vunfq2hdkmw', `id` = '29804024' WHERE `id` = '29804024'
<<< [152] <query> 5,285 us
[2016-10-08 01:38:43] EXCEPTION: (AphrontConnectionLostQueryException) #2006: MySQL server has gone away

This error may occur if your MySQL 'wait_timeout' or 'max_allowed_packet' configuration values are set too low. at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:304]
arcanist(head=master, ref.master=c8f70fab07d7), phabricator(head=master, ref.master=5973df176809), phutil(head=master, ref.master=6c65998b0b1c), pytest-phabricator(head=1550854ff89dca474176d307f6a485cd5eff0452, ref.master=1939f1e4f656)
  #0 AphrontBaseMySQLDatabaseConnection::throwCommonException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:348]
  #1 AphrontBaseMySQLDatabaseConnection::throwQueryCodeException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:289]
  #2 AphrontBaseMySQLDatabaseConnection::throwQueryException(mysqli) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:185]
  #3 AphrontBaseMySQLDatabaseConnection::executeRawQuery(string) called at [<phutil>/src/xsprintf/queryfx.php:8]
  #4 queryfx(AphrontMySQLiDatabaseConnection, string, string, string, array, string)
  #5 call_user_func_array(string, array) called at [<phutil>/src/aphront/storage/connection/AphrontDatabaseConnection.php:58]
  #6 AphrontDatabaseConnection::query(string, string, string, array, string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1261]
  #7 LiskDAO::insertRecordIntoDatabase(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1106]
  #8 LiskDAO::insert() called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1075]
  #9 LiskDAO::save() called at [<phabricator>/src/applications/metamta/storage/PhabricatorMetaMTAMail.php:355]
  #10 PhabricatorMetaMTAMail::save() called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:1178]
  #11 PhabricatorApplicationTransactionEditor::publishTransactions(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php:21]
[...]

For at least a couple of minutes of that time, it sits producing no trace output, occupying under 250MB of memory (a great improvement!), and a SIGHUP finds it here:

#0 /phab_path/libphutil/src/future/exec/PhutilSignalRouter.php(61): PhutilBacktraceSignalHandler->handleSignal(Object(PhutilSignalRouter), 1)
#1 /phab_path/arcanist/src/difference/ArcanistDiffUtils.php(137): PhutilSignalRouter->routeSignal(1)
#2 /phab_path/phabricator/src/applications/differential/parser/DifferentialChangesetParser.php(574): ArcanistDiffUtils::applyIntralineDiff(Object(PhutilSafeHTML), Array)
#3 /phab_path/phabricator/src/applications/differential/parser/DifferentialChangesetParser.php(812): DifferentialChangesetParser->applyIntraline(Array, Array, Array)
#4 /phab_path/phabricator/src/applications/differential/parser/DifferentialChangesetParser.php(645): DifferentialChangesetParser->process()
#5 /phab_path/phabricator/src/applications/differential/parser/DifferentialChangesetParser.php(859): DifferentialChangesetParser->tryCacheStuff()
[...]

Line 137, where it's interrupted, appears to be
$buf .= $str[$i];
so it's building up a big string one character at a time.
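For illustration, the accumulation there has roughly this shape (a simplified sketch; the real method presumably also weaves the intraline highlight markup into the output):

$buf = '';
$len = strlen($str);
for ($i = 0; $i < $len; $i++) {
  // Append one character at a time; on a line that is megabytes long,
  // this means millions of iterations for a single changed line.
  $buf .= $str[$i];
}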

I'm not entirely sure why that MySQL query fails after all the others succeeded -- perhaps the message's guess is correct and the issue is that we took so long to get here? Or perhaps the metamta_mail INSERT, despite apparently succeeding, is so giant that it causes the connection to drop.

In any case, we don't really want to be spending several minutes of CPU rendering the transaction. Probably there's another spot where we ought to apply a limit and bail on fancy diffing, in addition to the one in D16683.
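For example, a guard along these lines (purely illustrative; the threshold, variable names, and placement are assumptions, not the actual change) would let the parser skip the expensive per-character work on absurd lines:

// Hypothetical bail-out: give up on intraline ("fancy") diffing when
// either side of the line is enormous ($old/$new are the line bodies).
$intraline_byte_limit = 100000; // assumed threshold
if (strlen($old) > $intraline_byte_limit ||
    strlen($new) > $intraline_byte_limit) {
  // The caller would then render the lines without intraline highlights.
  return null;
}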

Event Timeline

Hrrm -- what is max_allowed_packet set to in MySQL?

mysql> select @@max_allowed_packet;
+----------------------+
| @@max_allowed_packet |
+----------------------+
|            134217728 |
+----------------------+
1 row in set (0.00 sec)

What is metamta.differential.inline-patches set to in your Phabricator config?

What is metamta.email-body-limit set to in your Phabricator config?
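One way to read both, assuming bin/config is available in your install:

phabricator/ $ ./bin/config get metamta.differential.inline-patches
phabricator/ $ ./bin/config get metamta.email-body-limit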

(Following up for Greg, as I'm now the oncall.)

mysql> select @@max_allowed_packet;
+----------------------+
| @@max_allowed_packet |
+----------------------+
|             67108864 |
+----------------------+
1 row in set (0.00 sec)

metamta.differential.inline-patches is 500.
metamta.email-body-limit is still the default of 524288.

I'm vaguely suspicious that the mail is just too big (1MB of text plus tons and tons of HTML for syntax highlighting it?), and that the error is really arising from max_allowed_packet being violated, particularly given that the INSERT took 233ms to execute.

That said, 67108864 bytes is 64MB, which seems kinda huge for us to be hitting even with a ton of HTML.

You could apply this change to libphutil/, then re-run the bin/worker execute --id <id> --trace test to get a size in the output:

diff --git a/src/serviceprofiler/PhutilServiceProfiler.php b/src/serviceprofiler/PhutilServiceProfiler.php
index 4b0ed94..4cdf7e9 100644
--- a/src/serviceprofiler/PhutilServiceProfiler.php
+++ b/src/serviceprofiler/PhutilServiceProfiler.php
@@ -88,7 +88,10 @@ final class PhutilServiceProfiler extends Phobject {
           $desc = $data['database'];
           break;
         case 'query':
+          $query_length = strlen($data['query']);
           $desc = substr($data['query'], 0, 512);
+          $desc .= ' ';
+          $desc .= pht('<bytes = %s>', new PhutilNumber($query_length));
           break;
         case 'multi-query':
           $desc = array();

If that's somewhere in the realm of 67108864, probably a max_allowed_packet issue. If we're way under that, we might be holding a transaction open improperly or something like that.

Assuming the mail is legitimately too gigantic, I'm inclined to derive an implied byte limit from metamta.differential.inline-patches of something like 128 * <whatever you set it to>, and then fold the similar Diffusion limits into a single option of the same flavor (both would then mean, e.g., "about the amount of text that 500 lines of code normally require", rather than being explicit line/byte limits).
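Roughly in this spirit (an illustrative sketch only; $patch and its surroundings are assumed names, and the real change would live in the mail-building code):

// Illustrative only: derive an implicit byte budget from the configured
// line limit, then drop the inline patch if it blows that budget.
$line_limit = PhabricatorEnv::getEnvConfig(
  'metamta.differential.inline-patches');
$byte_limit = 128 * $line_limit;
if (strlen($patch) > $byte_limit) {
  $patch = pht('(This patch is too large to mail inline.)');
}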

I'll give that a try. I think the associated diffs have enormous lines, so it seems likely that's what's going on.

Yep! It looks like that's the issue:

INSERT INTO `metamta_mail`  [...] <bytes = 87,272,583>

That byte limit change would stop the task from failing and retrying, right?

Ah, thanks! Yeah, that looks pretty conclusive: 87,272,583 bytes is well over the 67,108,864-byte max_allowed_packet.

The fix will just plop a "This diff is way too big to put in the email." message or whatever in place of the diff (I don't remember what the current behavior is, but something analogous), and everything should go smoothly after that: the next retry will work, send the mail, and then move the task to the archive.

Should be straightforward to fix -- I probably won't get to it tonight, but should be able to get a fix out tomorrow.

Sorry, "tomorrow" definitely didn't happen. Looking at this now.

D16714 should fix this by adding an implicit byte limit.

T11767 discusses further work.

After D16714, the problem tasks should complete cleanly.

We deployed D16714 and reran one of the failing tasks -- it completed cleanly as expected. Thanks!

Awesome, thanks for following up. Let us know if you run into anything else!