Page MenuHomePhabricator

During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting
Open, LowPublic

Description

Loading a large (~8GB compressed) MySQL dump is currently failing during the Phacility export process. The underlying command we run is:

gzip -cd <dump.sql> | mysql ...

This is failing after a few minutes with this error:

ERROR 2006 (HY000) at line 13935: MySQL server has gone away

I initially thought this might be a max-allowed-packet issue, but mysql ... --max-allowed-packet 1G also failed in the same way.

This construction also suffers from the "bash pipes are incompatible with error handling" issue where gzip -cd ... failing does not cause the pipe construction to fail, so it's plausible that something extremely silly like "I copied the file wrong" is actually the root cause and that's just being obscured by the shell/pipe.

I'm going to try these steps to reduce the problem:

  • Unzip the dump before running it.
  • Look at the unzipped dump and see if line 13935 is bad in some obvious way.
  • Run it with source ...; instead of by piping it in, with the hope that MySQL will emit a more helpful error message.

Details

Differential Revisions
D20677: Provide import/restore guidance for "max_allowed_packet" and "innodb_log_file_size"
Restricted Differential Revision
Commits
Restricted Diffusion Commit
Restricted Differential Revision / Restricted Diffusion Commit
Restricted Differential Revision / Restricted Diffusion Commit
Restricted Differential Revision / Restricted Diffusion Commit
Restricted Differential Revision / Restricted Diffusion Commit
Restricted Differential Revision / Restricted Diffusion Commit

Event Timeline

epriestley triaged this task as Low priority.Jul 22 2019, 3:18 PM
epriestley created this task.

(Internally, see also PHI1329.)

(Whatever the resolution is here might also motivate tailoring our restore/import instructions, since this error is pretty opaque and the next steps aren't obvious.)

Look at the unzipped dump and see if line 13935 is bad in some obvious way.

This line is a ~105MB INSERT INTO differential_hunk ... statement. So it's large, but it should fit into --max-allowed-packet 1G comfortably.

Unzip the dump before running it.

This worked fine, dump is ~50GB but unzipped without issues.

Run it with source ...;

This appears to be working, but in a spooky way. The mysql client is routinely encountering the same issue, but it appears to have automatic retries:

...

ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    120
Current database: <instance>_differential

ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    121
Current database: <instance>_differential

ERROR 2006 (HY000): MySQL server has gone away
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    122
Current database: <instance>_differential

...

...so we're making progress despite MySQL randomly having seizures. The error here isn't any more helpful than it was with gzip ... | mysql.

I'll also double check wait_timeout and interactive_timeout, but the client is running queries regularly so it's hard for me to imagine they're involved here.

Aha! The MySQL error log actually appears to have something useful:

190722 15:51:32  InnoDB: ERROR: the age of the last checkpoint is 9441198,
InnoDB: which exceeds the log group capacity 9433498.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.

I'll also double check wait_timeout and interactive_timeout...

At this point I'm fairly sure the InnoDB error above is the real root cause, but for completeness, these are both set to the default value (28800 seconds, i.e. "8 hours") and definitely shouldn't be interacting with the workflow (which is taking minutes, not hours):

mysql> select @@wait_timeout;
+----------------+
| @@wait_timeout |
+----------------+
|          28800 |
+----------------+
1 row in set (0.00 sec)

mysql> select @@interactive_timeout;
+-----------------------+
| @@interactive_timeout |
+-----------------------+
|                 28800 |
+-----------------------+
1 row in set (0.00 sec)

The "age of the last checkpoint" error appears to primarily implicate innodb_log_file_size, which is currently set to the default value (5MB):

mysql> select @@innodb_log_file_size;
+------------------------+
| @@innodb_log_file_size |
+------------------------+
|                5242880 |
+------------------------+
1 row in set (0.00 sec)

In MySQL 5.6.8, the default value was increased from 5MB to 48MB.

This 2016 post on the Percona blog (which I've generally found trustworthy) discusses a large innodb_log_file_size (10-15GB) for benchmarking:

https://www.percona.com/blog/2016/05/31/what-is-a-big-innodb_log_file_size/

...and identifies the primary tradeoff as slower crash recovery. That post and this one (although this one is quite old, now):

https://www.percona.com/blog/2008/11/21/how-to-calculate-a-good-innodb-log-file-size/

...and the updated default in recent MySQL also suggest that the 5MB default is probably low for production workloads.

Like other MySQL settings the "ideal" way to set this one seems to be to carefully measure your totally constant workload, then calculate a value based on that, e.g. this post from the author on the first link:

To understand if you need to increase innodb_log_file_size – you need to see if Checkpoint Age is close to Async point.
You can check it with mysql ... or a better way to look at graphs like in this post – as it gives your historic information.
If Checkpoint Age is getting close to Async point – I would consider to increase innodb_log_file_size, say by 20%.
Then repeat again, but make sure you are happy with your crash recovery time.

Nobody has time for that and all of our workloads are dominated by users doing nonsense, so for now I'm going to look at making these changes:

  • On aux, increase innodb_log_file_size to 8GB. The derivation of this is:
    • The maximum value of max_allowed_packet is 1GB, so the largest statement we should ever expect to be able to execute from a dump is 1GB.
    • The total combined log file size is (innodb_log_file_size * innodb_log_files_in_group). The latter defaults to 2 and I anticipate keeping it at 2.
    • The error message wants us to choose a combined log file size at least 10x larger than the largest row (although I suspect this might not really be a strict technical requirement).
    • So selecting 8GB means the combined log file size will be 16GB, which is 10x larger than the 1GB packet size, with some room for overhead (which the blog posts above suggest might be signficant).
    • aux is only used for data wrangling, not production workloads, so we can trade off things like crash recovery and disk size at zero cost.
  • On other tiers, increase innodb_log_file_size to 128MB. This is arbitrary, the three sources above all just suggest that 5MB may realistically be way smaller than ideal.
  • Note this as a possible source of error in the "backup" documentation.
epriestley renamed this task from During MySQL import, server may "go away" on large dumps? to During MySQL import, server may "2006 MySQL server has gone away" when row data size is large relative to "innodb_log_file_size".Jul 22 2019, 4:27 PM
epriestley added a revision: Restricted Differential Revision.Jul 22 2019, 4:38 PM
epriestley renamed this task from During MySQL import, server may "2006 MySQL server has gone away" when row data size is large relative to "innodb_log_file_size" to During MySQL import, server may "2006 MySQL server has gone away" when row data size is large relative to "innodb_log_file_size" (?).Jul 22 2019, 6:13 PM

I adjusted innodb_log_file_size to 1GB and attempted the import again, but ran into the same issue.

There's no corresponding error in the log so I may have been on the wrong track. (I'm not sure if the gzip -cd ... | mysql ... flow ever produced a log error, I wasn't watching the log initially.)

I'm going to try source again under the larger innodb_log_file_size and see what happens.

Ugh. With log_warnings = 2:

190722 18:55:55 [Warning] Aborted connection 6 to db: '<instance>_differential' user: 'root' host: 'localhost' (Got a packet bigger than 'max_allowed_packet' bytes)

I think the client and server have separate max_allowed_packet settings, and when we exceed the server limit we get an opaque error on the client which is not normally sent to the server error log.

Bumping max_allowed_packet to 1G in the server config resolved things. The export process then spent a long time doing a bin/files migration (which could use a progress bar, maybe) and is now doing a dump (which could too, although I'm less sure of how we'd build one).

The innodb_log_file_size thing looks like it's probably a red herring, and either not a real error and/or perhaps specific to using source ...; for imports.

epriestley renamed this task from During MySQL import, server may "2006 MySQL server has gone away" when row data size is large relative to "innodb_log_file_size" (?) to During MySQL import, server may "2006 MySQL server has gone away" when "max_allowed_packet" server setting is too small, despite client setting.Tue, Jul 23, 12:59 PM

So actual actionable stuff here is:

  • Increase max_allowed_packet to 1G on the aux tier. It might be nice to tweak how MySQL configuration works before doing this in order to avoid the trap in D20676 where files in mysql/conf.d/ do not have defined load order so they can't follow a general-template.cnf + specific-overrides.cnf pattern.
  • Update the restore documentation to underscore that max_allowed_packet has both client and server components. This is explicit in the MySQL documentation, I just: missed/misunderstood it; was misled by the InnoDB logfile error; and was misled by the good reporting of the client limit compared to the poor reporting of the server limit.

Both the client and the server have their own max_allowed_packet variable, so if you want to handle big packets, you must increase this variable both in the client and in the server.
https://dev.mysql.com/doc/refman/8.0/en/packet-too-large.html

  • The ideal size for innodb_log_file_size is probably more than 5MB, but I'm not planning to touch that for now given how difficult it is to change and how little evidence there is that we get anything out of changing it (see D20676).

We could also consider these things:

  • Write bin/storage restore ... which just does everything for you (and, e.g., checks the server limit). However, this is less flexible than just having a raw MySQL dump (e.g., you must have a copy of Phabricator to run it) so I'm not terribly excited about it.
  • Add a statement at the beginning of our output SQL files which increases the max_allowed_packet.

Adjusting log_warnings = 2 in production (to get connection aborts into the error log) is also possibly desirable, although the background level of connection abortions (general network flakiness, server restarts during deploy, wait_timeout on very long-running demons?) may be high enough that this is more noise than signal.

epriestley added a revision: Restricted Differential Revision.Tue, Jul 30, 6:18 PM
epriestley added a revision: Restricted Differential Revision.Tue, Jul 30, 6:20 PM
epriestley added a commit: Restricted Diffusion Commit.Tue, Jul 30, 6:21 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a revision: Restricted Differential Revision.Sun, Aug 11, 4:10 PM
epriestley added a revision: Restricted Differential Revision.Sun, Aug 11, 4:19 PM
epriestley added a commit: Restricted Diffusion Commit.Sun, Aug 11, 4:22 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.Mon, Aug 12, 4:34 PM