
Large diffs can still repeatedly fail to insert
Closed, ResolvedPublic

Description

This may be related to T11748 / D16714. We have a task for a large revision ("This diff is very large and affects 4,800 files") which still appears to try to generate mail that is too large and fails to insert into MySQL. Unlike T11748, it fails quickly and doesn't consume much CPU.

[19-Dec-2016 22:39:11 UTC] [2016-12-19 22:39:11] EXCEPTION: (PhutilProxyException) Error while executing Task ID 35362582. {>} (AphrontConnectionLostQueryException) #2006: MySQL server has gone away

This error may occur if your MySQL 'wait_timeout' or 'max_allowed_packet' configuration values are set too low. at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:304]
[19-Dec-2016 22:39:11 UTC] arcanist(head=master, ref.master=2484647734ee), phabricator(head=master, ref.master=b6d9c7deea60), phutil(head=master, ref.master=f1afeab618fa), pytest-phabricator(head=1550854ff89dca474176d307f6a485cd5eff0452, ref.master=1939f1e4f656)
[19-Dec-2016 22:39:11 UTC]   #0 <#2> AphrontBaseMySQLDatabaseConnection::throwCommonException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:348]
[19-Dec-2016 22:39:11 UTC]   #1 <#2> AphrontBaseMySQLDatabaseConnection::throwQueryCodeException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:289]
[19-Dec-2016 22:39:11 UTC]   #2 <#2> AphrontBaseMySQLDatabaseConnection::throwQueryException(mysqli) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:185]
[19-Dec-2016 22:39:11 UTC]   #3 <#2> AphrontBaseMySQLDatabaseConnection::executeRawQuery(string) called at [<phutil>/src/xsprintf/queryfx.php:8]
[19-Dec-2016 22:39:11 UTC]   #4 <#2> queryfx(AphrontMySQLiDatabaseConnection, string, string, string, array, string)
[19-Dec-2016 22:39:11 UTC]   #5 <#2> call_user_func_array(string, array) called at [<phutil>/src/aphront/storage/connection/AphrontDatabaseConnection.php:58]
[19-Dec-2016 22:39:11 UTC]   #6 <#2> AphrontDatabaseConnection::query(string, string, string, array, string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1261]
[19-Dec-2016 22:39:11 UTC]   #7 <#2> LiskDAO::insertRecordIntoDatabase(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1106]
[19-Dec-2016 22:39:11 UTC]   #8 <#2> LiskDAO::insert() called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1075]
[19-Dec-2016 22:39:11 UTC]   #9 <#2> LiskDAO::save() called at [<phabricator>/src/applications/metamta/storage/PhabricatorMetaMTAMail.php:355]
[19-Dec-2016 22:39:11 UTC]   #10 <#2> PhabricatorMetaMTAMail::save() called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:1178]
[19-Dec-2016 22:39:11 UTC]   #11 <#2> PhabricatorApplicationTransactionEditor::publishTransactions(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php:21]
[19-Dec-2016 22:39:11 UTC]   #12 <#2> PhabricatorApplicationTransactionPublishWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:123]
[19-Dec-2016 22:39:11 UTC]   #13 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:171]
[19-Dec-2016 22:39:11 UTC]   #14 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
[19-Dec-2016 22:39:11 UTC]   #15 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:181]
[19-Dec-2016 22:39:11 UTC]   #16 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:127]

Task details:

+----------+------------------------------------------------+----------------------------+--------------+--------------+----------+-------------+----------+--------------------------------+
| id       | taskClass                                      | leaseOwner                 | leaseExpires | failureCount | dataID   | failureTime | priority | objectPHID                     |
+----------+------------------------------------------------+----------------------------+--------------+--------------+----------+-------------+----------+--------------------------------+
| 35362582 | PhabricatorApplicationTransactionPublishWorker | 12760:1482187718:tails:648 |   1482188018 |            3 | 35362630 |  1482187754 |     1000 | PHID-DREV-wf7fxuctedq3wqaown4u |
+----------+------------------------------------------------+----------------------------+--------------+--------------+----------+-------------+----------+--------------------------------+
mysql> select @@max_allowed_packet;
+----------------------+
| @@max_allowed_packet |
+----------------------+
|             67108864 |
+----------------------+
1 row in set (0.00 sec)
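For reference, the setting the error message points at is normally raised in my.cnf. A hedged sketch, since in this case the 64MB already configured turned out to be smaller than the payload rather than misconfigured; the 128M value below is illustrative, not a recommendation:

```ini
# my.cnf -- illustrative only; pick a value appropriate for your payloads.
[mysqld]
max_allowed_packet = 128M
```

The same setting can be changed at runtime with `SET GLOBAL max_allowed_packet = 134217728;`, though that change does not survive a server restart.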

Using 2016 week 43 stable (rP is 5efbf4d, rPHU is 20841d2)

Event Timeline

Please update to HEAD before filing a bug report. We can't take reports against old versions of Phabricator. You can use Phacility if you'd like to test an up-to-date installation.

Hm; we weren't running HEAD when we reported T11748. We have an upgrade planned to latest-stable in a week or so -- will that be sufficient to re-test with? Or is there a repository that we can apply a 4,800-file diff to on Phacility?

We weren't running quite HEAD then, but we were running a version that was our patches on top of something about a week old -- see the end of the description of T11743, which was the first report in that chain. Our version today is a couple of months old, so I see why that'd feel different.

After our next upgrade to latest-stable, we'll be back to a similar level of HEAD-ness to what we were running then.

Note that in T11743 I ended up reproducing on Phacility anyway at Evan's request so that he could debug it live.

You can always create a test instance in Phacility, and upload the diff there - those are free and temporary - see Reproduction/Resources
There's also rGITTEST, although maybe it's not so good for a super huge diff...

It will likely be easiest to replicate again after upgrading to Week 52 stable this Friday, since I can't use the precise same diff and don't yet know the precise set of reproduction criteria.

epriestley triaged this task as Normal priority.

I strongly suspect this report has enough information to let us replicate the issue locally, so you don't need to take any further action yet. I'll follow up if I have difficulty replicating it and we can figure out what differs between my synthetic test case and your real environment.

(Because of churn in T11114 and other upstream changes I'm not exactly sure what's happening with Week 52 anyway.)

Alright, I'm actually having trouble reproducing this. Here's what I tried:

  • I reduced max_allowed_packet to 1MB.
  • I restarted mysqld.
  • I verified the change was in effect:
mysql> select @@max_allowed_packet;
+----------------------+
| @@max_allowed_packet |
+----------------------+
|              1048576 |
+----------------------+
1 row in set (0.00 sec)
  • I checked out a new branch:
$ git checkout -b pathtest
  • I created 10,000 moderately long files:
$ for i in `seq 1 10000`; do echo $i > moderately_long_directory_name/long_long_path_name_$i.txt; done;
  • I committed the change and created a revision from it:
$ git commit -a -m paths
$ arc diff HEAD^ --conduit-uri=http://local.phacility.com/
  • I received a prompt advising me not to do this, but entered "y" to continue anyway:
This diff has a very large number of changes (10,000). Differential works
best for changes which will receive detailed human review, and not as
well for large automated changes or bulk checkins. See
https://secure.phabricator.com/book/phabricator/article/differential_large_changes/
for information about reviewing big checkins. Continue anyway? [y/N] y
  • I ran bin/phd debug to generate the mail in the foreground.

Mail generation worked correctly. We generate these paths in the mail:

AFFECTED FILES
  moderately_long_directory_name/long_long_path_name_1.txt
  moderately_long_directory_name/long_long_path_name_10.txt
  moderately_long_directory_name/long_long_path_name_100.txt
  moderately_long_directory_name/long_long_path_name_1000.txt
<...snip...>
  moderately_long_directory_name/long_long_path_name_122.txt
  moderately_long_directory_name/long_long_path_name_1220.txt
  moderately_long_directory_name/long_long_path_name_1221.txt
  (9750 more files...)

Note that the "snip" is me snipping text, but the "9750 more files..." is Phabricator automatically snipping the file list and showing only 250 files.

I had inline diffs configured, but they were correctly dropped from this mail.

The mail message Phabricator generates as a result is only about 38KB, combining both text and HTML mail, and fits easily within the 1MB packet limit, so I'd expect half as many files to fit handily within the ~64MB packet limit described, unless their paths are incredibly long -- on the order of 256KB each (which isn't actually possible until D17062, which is ahead of you).
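The arithmetic behind that path-length estimate can be checked directly; a minimal sketch, using only the figures quoted above (the 250-entry cap comes from the truncated file list in the mail):

```python
# Back-of-the-envelope check of the sizes discussed above; nothing here
# is measured, all figures come from the surrounding comment.

packet_limit = 64 * 1024 * 1024  # the ~64MB max_allowed_packet reported
files_listed = 250               # mail truncates the path list to 250 entries

# For the truncated path list alone to overflow the packet limit, each
# path would need to average roughly this many bytes:
bytes_per_path = packet_limit // files_listed
print(bytes_per_path)  # 268435 -- on the order of 256KB per path
```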

Perhaps the most direct diagnostic is to add something like this near line 1178 of PhabricatorApplicationTransactionEditor.php:

var_dump($mail);
die("Aborting.");

Then run bin/worker execute --id 35362582.

That should dump >65MB of output to stdout, and the content of that output should tell us which field is growing unreasonably large. It is quite possible that there is some relevant configuration difference and I'm just not hitting it right, although I think all the relevant config is still set to values similar to the cases in T11748.

Alternatively, you can attempt to reproduce it on a Phacility instance and I'm happy to fish the root cause out of the environment there.

bin/worker execute --id 35362582 did not work because the task was archived. Instead I ran ./bin/phd stop; ./bin/worker retry --id 35362582; ./bin/phd debug PhabricatorTaskmasterDaemon. The key seems to be the production of the .patch file itself, not the number of files. Somehow these users managed to cram a >100M patchfile into Differential:

<VERB> PhabricatorTaskmasterDaemon Working on task 35362582 (PhabricatorApplicationTransactionPublishWorker)...
object(PhabricatorMetaMTAMail)#421 (14) {
  ["actorPHID":protected]=>
  string(30) "..."
  ["parameters":protected]=>
  array(18) {
    ["sensitive"]=>
    bool(false)
    ["subject"]=>
    string(83) "..."
    ["headers"]=>
    array(6) {
      [0]=>
      array(2) {
        [0]=>
        string(12) "Thread-Topic"
        [1]=>
        string(83) "..."
      }
      [1]=>
      array(2) {
        [0]=>
        string(14) "X-Herald-Rules"
        [1]=>
        string(26) "..."
      }
      [2]=>
      array(2) {
        [0]=>
        string(16) "X-Phabricator-To"
        [1]=>
        string(32) "..."
      }
      [3]=>
      array(2) {
        [0]=>
        string(16) "X-Phabricator-To"
        [1]=>
        string(32) "..."
      }
      [4]=>
      array(2) {
        [0]=>
        string(16) "X-Phabricator-To"
        [1]=>
        string(32) "..."
      }
      [5]=>
      array(2) {
        [0]=>
        string(16) "X-Phabricator-Cc"
        [1]=>
        string(32) "..."
      }
    }
    ["from"]=>
    string(30) "..."
    ["subject-prefix"]=>
    string(14) "[Differential]"
    ["vary-subject-prefix"]=>
    string(8) "[Closed]"
    ["thread-id"]=>
    string(51) "..."
    ["is-first-message"]=>
    bool(false)
    ["exclude"]=>
    array(0) {
    }
    ["herald-force-recipients"]=>
    array(0) {
    }
    ["mailtags"]=>
    array(2) {
      [0]=>
      string(20) "differential-updated"
      [1]=>
      string(22) "differential-committed"
    }
    ["is-bulk"]=>
    bool(true)
    ["body"]=>
    string(16820) "..."
    ["html-body"]=>
    string(18920) "..."
    ["attachments"]=>
    array(1) {
      [0]=>
      array(3) {
        ["filename"]=>
        string(20) "D253041.837558.patch"
        ["mimetype"]=>
        string(27) "text/x-patch; charset=utf-8"
        ["data"]=>
        string(112531411) "
Segmentation fault (core dumped)
Freeing active task leases...

Such a large diff is clearly not a reasonable thing to begin with, but ideally it wouldn't wedge the task. The most coherent fix might be to not attach patchfiles over a certain size?
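For what it's worth, that mitigation could look something like this sketch; `filter_attachments` and the 8MB cap are hypothetical illustrations, not Phabricator's actual API:

```python
# Hypothetical sketch of "don't attach patchfiles over a certain size":
# replace any oversized attachment with a short plain-text stub before
# the mail row is built. Names and the cap are illustrative.

MAX_ATTACHMENT_BYTES = 8 * 1024 * 1024  # illustrative 8MB cap

def filter_attachments(attachments):
    """Return attachments with oversized payloads replaced by stubs."""
    out = []
    for att in attachments:
        if len(att["data"]) <= MAX_ATTACHMENT_BYTES:
            out.append(att)
        else:
            out.append({
                "filename": att["filename"] + ".txt",
                "mimetype": "text/plain",
                "data": "(Patch of %d bytes omitted; exceeds attachment limit.)"
                        % len(att["data"]),
            })
    return out

# A payload just over the cap gets stubbed out:
big = {
    "filename": "D253041.837558.patch",
    "mimetype": "text/x-patch; charset=utf-8",
    "data": "x" * (MAX_ATTACHMENT_BYTES + 1),
}
print(filter_attachments([big])[0]["mimetype"])  # text/plain
```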

Aha -- the diff as posted for review was mostly-reasonably sized. The diff as committed was the huge one:

mysql> select differential_diff.id as diff_id, creationMethod, SUM(addLines), SUM(delLines)
    ->   from differential_changeset
    ->   join differential_diff on differential_changeset.diffID = differential_diff.id
    ->  where differential_diff.revisionID = 253041
    ->  group by differential_diff.id;
+---------+----------------+---------------+---------------+
| diff_id | creationMethod | SUM(addLines) | SUM(delLines) |
+---------+----------------+---------------+---------------+
|  837437 | arc            |         11845 |          8567 |
|  837558 | commit         |       1752986 |        396108 |
+---------+----------------+---------------+---------------+

Thanks, I'll take a more in-depth look at this.

This might be sort of a generalization of T11767.

What is metamta.diffusion.byte-limit set to?

Oh, never mind, that's the Differential patch responding to the commit, so metamta.diffusion.byte-limit isn't relevant.

Also, is metamta.email-body-limit set to something crazy? I think that only means main-body (not body + attachments) right now, but it should probably mean body + attachments + headers.
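The distinction being drawn here can be made concrete with a small sketch; the 16MB limit is hypothetical, and the sizes are the ones from the var_dump earlier in this task:

```python
# Body-only vs whole-message accounting, using sizes reported above.
body = 16820            # "body" string length from the dump
html_body = 18920       # "html-body" string length
attachment = 112531411  # the .patch attachment payload

limit = 16 * 1024 * 1024  # hypothetical email-body-limit of 16MB

# A limit that only counts the body passes this mail:
print(body + html_body <= limit)               # True
# A limit that counts body + attachments rejects it:
print(body + html_body + attachment <= limit)  # False
```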

metamta.email-body-limit is unset.

epriestley edited projects, added Diffusion (v3); removed Diffusion.

I'm going to presume this is resolved by D18598 -- I updated T11767 with more plans and context for smoothing over the behavior, but I think it will no longer be disastrously bad. I'll follow up in PHI36 PHI55.

Oh, yes, sorry, looked at the wrong tab.

This is resolved as far as I'm aware. See also T11767 (improvements to these limits), PHI125 (similar issue with non-UTF8 attachments failing to insert) and T13053 (planning).