Page MenuHomePhabricator

Retry connections on timeouts, and raise more readable connection failure messages

Authored by epriestley on Mar 29 2019, 11:23 PM.



See PHI1180. Currently we retry connections only on error #2003 ("Unable to Connect"). Additionally, retry on #2002 ("Connection Timeout").

Also, make the error message a little more clear and useful.

Test Plan

Rigged my to point nowhere, then:

$ ./bin/mail list-outbound
[2019-03-29 16:22:22] PHLOG: 'Retrying (attempt 1) after connection failure ("AphrontConnectionQueryException", #2002): Attempt to connect to root@ failed with error #2002: Operation timed out.' at [/Users/epriestley/dev/core/lib/libphutil/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:124]
[2019-03-29 16:22:32] PHLOG: 'Retrying (attempt 2) after connection failure ("AphrontConnectionQueryException", #2002): Attempt to connect to root@ failed with error #2002: Operation timed out.' at [/Users/epriestley/dev/core/lib/libphutil/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:124]
[2019-03-29 16:22:42] EXCEPTION: (PhabricatorClusterStrandedException) Unable to establish a connection to any database host (while trying "local_config"). All masters and replicas are completely unreachable.

AphrontConnectionQueryException: Attempt to connect to root@ failed with error #2002: Operation timed out. at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:177]
arcanist(head=experimental, ref.master=9830c9316d38, ref.experimental=8f261fe4750e), corgi(head=master, ref.master=6371578c9d32), instances(head=master, ref.master=5e60851d5b29), libcore(), phabricator(head=write1, ref.master=02f94cd7d288, ref.write1=02f94cd7d288, custom=9), phutil(head=retry1, ref.master=524fcf465108, ref.retry1=c943ae1e44f3), services(head=stable, ref.master=1157b5fde68a, ref.stable=84560b653ef4)
  #0 PhabricatorLiskDAO::raiseUnreachable(string, AphrontConnectionQueryException) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:134]
  #1 PhabricatorLiskDAO::newClusterConnection(string, string, string) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:70]
  #2 PhabricatorLiskDAO::establishLiveConnection(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:841]
  #3 LiskDAO::establishConnection(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:518]
  #4 LiskDAO::loadRawDataWhere(string, string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:478]
  #5 LiskDAO::loadAllWhere(string, string) called at [<phabricator>/src/infrastructure/env/PhabricatorConfigDatabaseSource.php:18]
  #6 PhabricatorConfigDatabaseSource::loadConfig(string) called at [<phabricator>/src/infrastructure/env/PhabricatorConfigDatabaseSource.php:7]
  #7 PhabricatorConfigDatabaseSource::__construct(string) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:253]
  #8 PhabricatorEnv::buildConfigurationSourceStack(boolean) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:95]
  #9 PhabricatorEnv::initializeCommonEnvironment(boolean) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:75]
  #10 PhabricatorEnv::initializeScriptEnvironment(boolean) called at [<phabricator>/scripts/init/lib.php:22]
  #11 init_phabricator_script(array) called at [<phabricator>/scripts/init/init-script.php:9]
  #12 require_once(string) called at [<phabricator>/scripts/__init_script__.php:3]
  #13 require_once(string) called at [<phabricator>/scripts/mail/manage_mail.php:5]

Diff Detail

rPHU libphutil
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

epriestley created this revision.Mar 29 2019, 11:23 PM
epriestley requested review of this revision.Mar 29 2019, 11:24 PM
  • Be slightly more precise with error language.
amckinley accepted this revision.Apr 1 2019, 6:01 PM

The code looks pretty obviously correct to me, but maybe test the other case to make sure we aren't suddenly retrying for all error codes? Try, uh, error 2005, "Unknown MySQL server host"?

This revision is now accepted and ready to land.Apr 1 2019, 6:01 PM

I skimmed the other client-side error codes, and the only other condition that stood out as being another retry candidate was 2013, "lost connection to server". I'm happy to say "don't add it until someone asks for it", though.

I think we can't get 2013 on connection. There's a separate piece of retry logic for failures on queries (vs connections), but it's a bit more complicated (it has to deal with transactions and stuff) and fails more directly when queries don't work (no failover to read-only if you got a connection to a master and then lost it later during a query, since we have no clue what kind of mixed-up state you're in if some master queries worked, and it's safer to just throw in the towel).

I wasn't able to figure out how to get a 2005 (using a bogus host name is still a 2002), but I got a 1045 by providing a bad username and verified it doesn't retry.

This revision was automatically updated to reflect the committed changes.