See PHI1180. Currently we retry connections only on error #2003 ("Unable to Connect"). Additionally, retry on #2002 ("Connection Timeout").
Also, make the error message a little more clear and useful.
Differential D20350
Retry connections on timeouts, and raise more readable connection failure messages epriestley on Mar 29 2019, 11:23 PM. Authored by Tags None Referenced Files
Subscribers None
Details See PHI1180. Currently we retry connections only on error #2003 ("Unable to Connect"). Additionally, retry on #2002 ("Connection Timeout"). Also, make the error message a little more clear and useful. Rigged my mysql.host to point nowhere, then: $ ./bin/mail list-outbound [2019-03-29 16:22:22] PHLOG: 'Retrying (attempt 1) after connection failure ("AphrontConnectionQueryException", #2002): Attempt to connect to root@127.0.0.2 failed with error #2002: Operation timed out.' at [/Users/epriestley/dev/core/lib/libphutil/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:124] [2019-03-29 16:22:32] PHLOG: 'Retrying (attempt 2) after connection failure ("AphrontConnectionQueryException", #2002): Attempt to connect to root@127.0.0.2 failed with error #2002: Operation timed out.' at [/Users/epriestley/dev/core/lib/libphutil/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:124] [2019-03-29 16:22:42] EXCEPTION: (PhabricatorClusterStrandedException) Unable to establish a connection to any database host (while trying "local_config"). All masters and replicas are completely unreachable. AphrontConnectionQueryException: Attempt to connect to root@127.0.0.2 failed with error #2002: Operation timed out. at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:177] arcanist(head=experimental, ref.master=9830c9316d38, ref.experimental=8f261fe4750e), corgi(head=master, ref.master=6371578c9d32), instances(head=master, ref.master=5e60851d5b29), libcore(), phabricator(head=write1, ref.master=02f94cd7d288, ref.write1=02f94cd7d288, custom=9), phutil(head=retry1, ref.master=524fcf465108, ref.retry1=c943ae1e44f3), services(head=stable, ref.master=1157b5fde68a, ref.stable=84560b653ef4) #0 PhabricatorLiskDAO::raiseUnreachable(string, AphrontConnectionQueryException) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:134] #1 PhabricatorLiskDAO::newClusterConnection(string, string, string) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:70] #2 PhabricatorLiskDAO::establishLiveConnection(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:841] #3 LiskDAO::establishConnection(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:518] #4 LiskDAO::loadRawDataWhere(string, string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:478] #5 LiskDAO::loadAllWhere(string, string) called at [<phabricator>/src/infrastructure/env/PhabricatorConfigDatabaseSource.php:18] #6 PhabricatorConfigDatabaseSource::loadConfig(string) called at [<phabricator>/src/infrastructure/env/PhabricatorConfigDatabaseSource.php:7] #7 PhabricatorConfigDatabaseSource::__construct(string) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:253] #8 PhabricatorEnv::buildConfigurationSourceStack(boolean) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:95] #9 PhabricatorEnv::initializeCommonEnvironment(boolean) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:75] #10 PhabricatorEnv::initializeScriptEnvironment(boolean) called at [<phabricator>/scripts/init/lib.php:22] #11 init_phabricator_script(array) called at [<phabricator>/scripts/init/init-script.php:9] #12 require_once(string) called at [<phabricator>/scripts/__init_script__.php:3] #13 require_once(string) called at [<phabricator>/scripts/mail/manage_mail.php:5]
Diff Detail
Event TimelineComment Actions The code looks pretty obviously correct to me, but maybe test the other case to make sure we aren't suddenly retrying for all error codes? Try, uh, error 2005, "Unknown MySQL server host"? Comment Actions I skimmed the other client-side error codes, and the only other condition that stood out as being another retry candidate was 2013, "lost connection to server". I'm happy to say "don't add it until someone asks for it", though. Comment Actions I think we can't get 2013 on connection. There's a separate piece of retry logic for failures on queries (vs connections), but it's a bit more complicated (it has to deal with transactions and stuff) and fails more directly when queries don't work (no failover to read-only if you got a connection to a master and then lost it later during a query, since we have no clue what kind of mixed-up state you're in if some master queries worked, and it's safer to just throw in the towel). I wasn't able to figure out how to get a 2005 (using a bogus host name is still a 2002), but I got a 1045 by providing a bad username and verified it doesn't retry. |