Page MenuHomePhabricator

Daemons are having a rough time starting on `admin`
Closed, ResolvedPublic

Description

During this week's deployment, I'm observing some issues with starting daemons on admin001 after deployment.

  • When started with remote deploy, they appear to be exiting immediately and uncleanly.
  • When started with phd restart --gently --autoscale-reserve 0.20 (the command which remote deploy executes), they work great. Of course.

I'm going to try to clear this up before continuing the deployment.

Event Timeline

It looks like this might be an issue with MySQL not being available yet when the daemons start. The deployment script restarts MySQL, then immediately restarts the daemons, and a bunch of this stuff is ending up in the log in the 1-2 seconds after the restart:

[05-Mar-2017 11:29:14 UTC] [2017-03-05 11:29:14] EXCEPTION: (AphrontConnectionQueryException) Attempt to connect to cli-d033e22ae348@localhost failed with error #2002: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2). at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:343]
[05-Mar-2017 11:29:14 UTC] arcanist(head=stable, ref.master=7d15b85a1bc0, ref.stable=822bc53ca306), corgi(head=master, ref.master=cc9175a945da), instances(head=stable, ref.master=db56d5d6ad91, ref.stable=a3ae6a3f55dc), libcore(), phabricator(head=stable, ref.stable=9124bb416241), phutil(head=stable, ref.master=92882eb9404d, ref.stable=796cb1c2ee27), services(head=stable, ref.master=5b51b63027b7, ref.stable=198eb67bd822)
[05-Mar-2017 11:29:14 UTC]   #0 <#2> AphrontBaseMySQLDatabaseConnection::throwConnectionException(integer, string, string, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:76]
[05-Mar-2017 11:29:14 UTC]   #1 <#2> AphrontMySQLiDatabaseConnection::connect() called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:101]
[05-Mar-2017 11:29:14 UTC]   #2 <#2> AphrontBaseMySQLDatabaseConnection::establishConnection() called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:124]
[05-Mar-2017 11:29:14 UTC]   #3 <#2> AphrontBaseMySQLDatabaseConnection::requireConnection() called at [<phutil>/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:15]
[05-Mar-2017 11:29:14 UTC]   #4 <#2> AphrontMySQLiDatabaseConnection::escapeBinaryString(string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontMySQLiDatabaseConnection.php:11]
[05-Mar-2017 11:29:14 UTC]   #5 <#2> AphrontMySQLiDatabaseConnection::escapeUTF8String(string) called at [<phutil>/src/xsprintf/qsprintf.php:178]
[05-Mar-2017 11:29:14 UTC]   #6 <#2> xsprintf_query(AphrontMySQLiDatabaseConnection, string, integer, integer, integer) called at [<phutil>/src/xsprintf/xsprintf.php:70]
[05-Mar-2017 11:29:14 UTC]   #7 <#2> xsprintf(string, AphrontMySQLiDatabaseConnection, array) called at [<phutil>/src/xsprintf/qsprintf.php:64]
[05-Mar-2017 11:29:14 UTC]   #8 <#2> qsprintf(AphrontMySQLiDatabaseConnection, string, integer) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1243]
[05-Mar-2017 11:29:14 UTC]   #9 <#2> LiskDAO::insertRecordIntoDatabase(string) called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1106]
[05-Mar-2017 11:29:14 UTC]   #10 <#2> LiskDAO::insert() called at [<phabricator>/src/infrastructure/storage/lisk/LiskDAO.php:1075]
[05-Mar-2017 11:29:14 UTC]   #11 <#2> LiskDAO::save() called at [<phabricator>/src/applications/daemon/event/PhabricatorDaemonEventListener.php:80]
[05-Mar-2017 11:29:14 UTC]   #12 <#2> PhabricatorDaemonEventListener::handleLogEvent(PhutilEvent) called at [<phabricator>/src/applications/daemon/event/PhabricatorDaemonEventListener.php:24]
[05-Mar-2017 11:29:14 UTC]   #13 <#2> PhabricatorDaemonEventListener::handleEvent(PhutilEvent) called at [<phutil>/src/events/PhutilEventEngine.php:65]
[05-Mar-2017 11:29:14 UTC]   #14 <#2> PhutilEventEngine::dispatchEvent(PhutilEvent) called at [<phutil>/src/daemon/PhutilDaemonHandle.php:297]
[05-Mar-2017 11:29:14 UTC]   #15 phlog(AphrontConnectionQueryException) called at [<phutil>/src/daemon/PhutilDaemonHandle.php:299]
[05-Mar-2017 11:29:14 UTC]   #16 PhutilDaemonHandle::dispatchEvent(string, array) called at [<phutil>/src/daemon/PhutilDaemonHandle.php:466]
[05-Mar-2017 11:29:14 UTC]   #17 PhutilDaemonHandle::logMessage(string, string) called at [<phutil>/src/daemon/PhutilDaemonHandle.php:206]
[05-Mar-2017 11:29:14 UTC]   #18 PhutilDaemonHandle::scheduleRestart() called at [<phutil>/src/daemon/PhutilDaemonHandle.php:165]
[05-Mar-2017 11:29:14 UTC]   #19 PhutilDaemonHandle::update() called at [<phutil>/src/daemon/PhutilDaemonPool.php:178]
[05-Mar-2017 11:29:14 UTC]   #20 PhutilDaemonPool::updatePool() called at [<phutil>/src/daemon/PhutilDaemonOverseer.php:177]
[05-Mar-2017 11:29:14 UTC]   #21 PhutilDaemonOverseer::run() called at [<phabricator>/scripts/daemon/launch_daemon.php:24]

This could possibly be a result of recent changes affecting how exceptions get handled, or making us go down this codepath to do logging earlier than we used to.

I'm going to investigate that behavior in more detail locally and see if I can reproduce it.

An issue complicating this is that instance synchronization is currently failing because the schemata differ:

Daemon 533819 STDE [Sun, 05 Mar 2017 12:26:32 +0000] [2017-03-05 12:26:32] EXCEPTION: (AphrontSchemaQueryException) #1054: Unknown column 'defaultProfileImagePHID' in 'field list' at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:325]

Since I expect this issue to either not affect the repo tier or be resolvable through manual fiddling I'm going to continue the rest of the deploy normally (to resolve the schemata difference) and come back to this once it completes.

Specifically, I suspect it won't affect repo because db restarts separately, so the database won't be unavailable when daemons on repo restart.

In terms of application/code stability, it would be good to handle this properly.

In terms of your deployment (given you're not using an exotic/non-standard service manager), you might want to look into socket activation so you don't have to care about service ordering anymore and let MySQL simply start when needed while the request waits for the startup to complete. This works for both, local AF_UNIX and networked AF_INET sockets just fine and also allows you to restart services without loosing requests, as the socket handler will buffer incoming requests during service restarts.

I haven't been able to reproduce this locally by artificially making services unavailable or connections fail. Debugging this in production will require momentary service interruptions on admin until I can narrow things down so I'm going to wait until off-peak to poke at it.

The MySQL stuff seems to be in the old daemons (them trying to do logging while MySQL restarts), not the new daemons, and not the cause of the issue.

A bad restart looks like this:

[18-Mar-2017 11:52:09 UTC] 2017-03-18 11:52:09 AM [OVER] Started new daemon overseer (with PID "7913").
[18-Mar-2017 11:52:09 UTC] 2017-03-18 11:52:09 AM [SGNL] Overseer ("7913") received signal 1 ("SIGHUP").
[18-Mar-2017 11:52:09 UTC] 2017-03-18 11:52:09 AM [SGNL] Overseer ("7913") received signal 2 ("SIGINT").
[18-Mar-2017 11:52:09 UTC] 2017-03-18 11:52:09 AM [SGNL] Overseer ("7913") received signal 2 ("SIGINT").
[18-Mar-2017 11:52:09 UTC] 2017-03-18 11:52:09 AM [SGNL] Overseer ("7913") received signal 15 ("SIGTERM")

A good restart doesn't see any of those signals.

None of those four signals are expected on their own.

We interpret SIGINT + SIGINT to mean SIGTERM, so the last signal is likely a synthetic one and expected after receiving two SIGINT, leaving us with three unexplained signals.

I suspect the SIGHUP might be a genuine SIGHUP because ssh is hanging up, and the overseer is still in the same process group because overseers don't setsid. If so, the fix is the fix suggested by @eadler in D15812.

I don't currently understand where the first or second SIGINT are coming from. They're also the important signals.

I'm going to apply a patch similar to D15812 and see where that leaves us.

The setsid() appears to have resolved the issue, in the sense that daemons are now starting cleanly every time without receiving any inexplicable signals.

I remain unsure about where the stray SIGINT signals were coming from but since they no longer reproduce I'm not sure how to narrow them down. It's possible that they were also originating from unwinding the SSH + shell somehow, in which case this is the correct fix, but I'm not confident that I have a full understanding of the powers involved here.

But I'm satisfied that this appears to be working properly again.