MySQL may take several seconds after restart to begin listening on domain socket
Closed, Resolved (Public)

Description

In remote deploy, we currently restart MySQL and then try to connect to it shortly afterward.

Sometimes, probably when MySQL has a large amount of data (one affected host in this instance has 76GB), the socket is not yet listening when the restart command exits, leading to this error:

[db010] [2017-10-14 12:19:13] EXCEPTION: (CommandException) Command failed with error #1!
[db010] COMMAND
[db010] echo 'DELETE FROM mysql.user WHERE User = "root" AND Host != "localhost"' | mysql -uroot
[db010] 
[db010] STDOUT
[db010] (empty)
[db010] 
[db010] STDERR
[db010] ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)

We should probably make sure the socket is listening before continuing past the service mysqld restart.
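One way to wait for the socket is to poll it until a connection succeeds or a deadline passes. The following is only a minimal sketch of that idea, not the change that was actually deployed: the helper name `wait_for_unix_socket`, the 60-second timeout, and the polling interval are all assumptions made for illustration.

```python
import socket
import time


def wait_for_unix_socket(path, timeout=60.0, interval=0.5):
    """Poll until something accepts connections on the Unix socket at `path`.

    Hypothetical helper: returns True once a connection succeeds, or False
    if `timeout` seconds elapse first.
    """
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return True
        except OSError:
            # Typically ENOENT (the "(2)" in the error above) while the socket
            # file does not exist yet, or ECONNREFUSED while nothing is
            # listening on it.
            pass
        finally:
            sock.close()
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A deploy step could call `wait_for_unix_socket("/var/run/mysqld/mysqld.sock")` right after the restart and abort if it returns False. A successful connection only shows that the socket is listening; it does not check that MySQL is ready to serve queries.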

In this case, the two affected hosts (db010 and db014) haven't been purged in a while and have some large test instances, so I expect I can just reduce the data size to something manageable with the current workflow fairly easily (bin/host destroy --instance-kinds test --instance-statuses suspended,disabled). I'm running the destruction workflows now.

Revisions and Commits

Restricted Differential Revision

Event Timeline

> I expect I can just reduce the data size to something manageable with the current workflow fairly easily

This worked correctly.

epriestley added a revision: Restricted Differential Revision. (Oct 23 2017, 6:03 PM)

We hit one more of these (db024) last week, so D18725 should fix it.

epriestley added a commit: Restricted Diffusion Commit. (Oct 23 2017, 8:09 PM)
epriestley claimed this task.

I pushed secure004 with that patch and it worked fine, although MySQL came back up fast enough that we didn't have to wait. Since I don't have a way to actually trigger this condition, I'm going to assume this is resolved until we have evidence otherwise.