Page MenuHomePhabricator

Retry connections on timeouts, and raise more readable connection failure messages
ClosedPublic

Authored by epriestley on Mar 29 2019, 11:23 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Apr 18, 3:14 AM
Unknown Object (File)
Wed, Apr 17, 3:07 PM
Unknown Object (File)
Thu, Apr 11, 7:16 AM
Unknown Object (File)
Fri, Apr 5, 10:07 PM
Unknown Object (File)
Sun, Mar 31, 12:55 AM
Unknown Object (File)
Sat, Mar 30, 1:30 PM
Unknown Object (File)
Thu, Mar 28, 9:25 PM
Unknown Object (File)
Thu, Mar 28, 4:55 AM
Subscribers
None

Details

Summary

See PHI1180. Currently we retry connections only on error #2003 ("Unable to Connect"). Additionally, retry on #2002 ("Connection Timeout").

Also, make the error message a little more clear and useful.

Test Plan

Rigged my mysql.host to point nowhere, then:

$ ./bin/mail list-outbound
[2019-03-29 16:22:22] PHLOG: 'Retrying (attempt 1) after connection failure ("AphrontConnectionQueryException", #2002): Attempt to connect to root@127.0.0.2 failed with error #2002: Operation timed out.' at [/Users/epriestley/dev/core/lib/libphutil/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:124]
[2019-03-29 16:22:32] PHLOG: 'Retrying (attempt 2) after connection failure ("AphrontConnectionQueryException", #2002): Attempt to connect to root@127.0.0.2 failed with error #2002: Operation timed out.' at [/Users/epriestley/dev/core/lib/libphutil/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:124]
[2019-03-29 16:22:42] EXCEPTION: (PhabricatorClusterStrandedException) Unable to establish a connection to any database host (while trying "local_config"). All masters and replicas are completely unreachable.

AphrontConnectionQueryException: Attempt to connect to root@127.0.0.2 failed with error #2002: Operation timed out. at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:177]
arcanist(head=experimental, ref.master=9830c9316d38, ref.experimental=8f261fe4750e), corgi(head=master, ref.master=6371578c9d32), instances(head=master, ref.master=5e60851d5b29), libcore(), phabricator(head=write1, ref.master=02f94cd7d288, ref.write1=02f94cd7d288, custom=9), phutil(head=retry1, ref.master=524fcf465108, ref.retry1=c943ae1e44f3), services(head=stable, ref.master=1157b5fde68a, ref.stable=84560b653ef4)
  #0 PhabricatorLiskDAO::raiseUnreachable(string, AphrontConnectionQueryException) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:134]
  #1 PhabricatorLiskDAO::newClusterConnection(string, string, string) called at [<phabricator>/src/infrastructure/storage/lisk/PhabricatorLiskDAO.php:70]
  #2 PhabricatorLiskDAO::establishLiveConnection(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:841]
  #3 LiskDAO::establishConnection(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:518]
  #4 LiskDAO::loadRawDataWhere(string, string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:478]
  #5 LiskDAO::loadAllWhere(string, string) called at [<phabricator>/src/infrastructure/env/PhabricatorConfigDatabaseSource.php:18]
  #6 PhabricatorConfigDatabaseSource::loadConfig(string) called at [<phabricator>/src/infrastructure/env/PhabricatorConfigDatabaseSource.php:7]
  #7 PhabricatorConfigDatabaseSource::__construct(string) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:253]
  #8 PhabricatorEnv::buildConfigurationSourceStack(boolean) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:95]
  #9 PhabricatorEnv::initializeCommonEnvironment(boolean) called at [<phabricator>/src/infrastructure/env/PhabricatorEnv.php:75]
  #10 PhabricatorEnv::initializeScriptEnvironment(boolean) called at [<phabricator>/scripts/init/lib.php:22]
  #11 init_phabricator_script(array) called at [<phabricator>/scripts/init/init-script.php:9]
  #12 require_once(string) called at [<phabricator>/scripts/__init_script__.php:3]
  #13 require_once(string) called at [<phabricator>/scripts/mail/manage_mail.php:5]

Diff Detail

Repository
rPHU libphutil
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

  • Be slightly more precise with error language.

The code looks pretty obviously correct to me, but maybe test the other case to make sure we aren't suddenly retrying for all error codes? Try, uh, error 2005, "Unknown MySQL server host"?

This revision is now accepted and ready to land.Apr 1 2019, 6:01 PM

I skimmed the other client-side error codes, and the only other condition that stood out as being another retry candidate was 2013, "lost connection to server". I'm happy to say "don't add it until someone asks for it", though.

I think we can't get 2013 on connection. There's a separate piece of retry logic for failures on queries (vs connections), but it's a bit more complicated (it has to deal with transactions and stuff) and fails more directly when queries don't work (no failover to read-only if you got a connection to a master and then lost it later during a query, since we have no clue what kind of mixed-up state you're in if some master queries worked, and it's safer to just throw in the towel).

I wasn't able to figure out how to get a 2005 (using a bogus host name is still a 2002), but I got a 1045 by providing a bad username and verified it doesn't retry.

This revision was automatically updated to reflect the committed changes.