
Make daemons less aggressive about cycling connections
Closed, Resolved, Public

Description

Currently, daemons potentially cycle connections very aggressively while under load. After a connection is closed, the outbound port usually can't be reused for about 60 seconds, so sufficiently aggressive cycling can exhaust outbound ports.
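As a rough back-of-the-envelope sketch of why this matters: if each closed connection holds its outbound port for about 60 seconds, the number of ports tied up at steady state is roughly the close rate multiplied by the hold time. The pool size below (28,232 ports, the common Linux default ephemeral range of 32768-60999) and the daemon count and cycle rate are assumptions for illustration, not measurements from this install. The model also treats every connection as going to a single destination, which is a simplification.

```python
# Back-of-the-envelope model of outbound port exhaustion.
# All concrete numbers here are illustrative assumptions, not measurements.

# Assumed pool size: the common Linux default ephemeral range 32768..60999.
ASSUMED_EPHEMERAL_PORTS = 60999 - 32768 + 1  # 28232

# From the description: a closed connection's port is unusable for ~60s.
HOLD_SECONDS = 60.0


def ports_held(closes_per_second: float, hold_seconds: float = HOLD_SECONDS) -> float:
    """Steady-state count of ports still waiting to become reusable.

    This is Little's law: items in the system = arrival rate * time in system.
    """
    return closes_per_second * hold_seconds


def max_sustainable_rate(available_ports: int, hold_seconds: float = HOLD_SECONDS) -> float:
    """Highest connection close rate that does not exhaust the port pool."""
    return available_ports / hold_seconds


def would_exhaust(closes_per_second: float,
                  available_ports: int = ASSUMED_EPHEMERAL_PORTS,
                  hold_seconds: float = HOLD_SECONDS) -> bool:
    """True if the given close rate ties up more ports than are available."""
    return ports_held(closes_per_second, hold_seconds) > available_ports


if __name__ == "__main__":
    # Hypothetical load: 32 daemons, each cycling 20 connections per second.
    rate = 32 * 20
    print("ports held at steady state:", ports_held(rate))
    print("max sustainable closes/sec:", max_sustainable_rate(ASSUMED_EPHEMERAL_PORTS))
    print("would exhaust pool:", would_exhaust(rate))
```

Under these assumptions the sustainable rate is only a few hundred closes per second across the whole host, so a handful of busy daemons that cycle every connection on every loop could plausibly hit it.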

A simple partial mitigation is to skip cycling while under load, that is, when willSleep() is called with a zero duration:

diff --git a/src/infrastructure/daemon/PhabricatorDaemon.php b/src/infrastructure/daemon/PhabricatorDaemon.php
index 349e0f0..aadef5a 100644
--- a/src/infrastructure/daemon/PhabricatorDaemon.php
+++ b/src/infrastructure/daemon/PhabricatorDaemon.php
@@ -11,7 +11,9 @@ abstract class PhabricatorDaemon extends PhutilDaemon {
   }
 
   protected function willSleep($duration) {
-    LiskDAO::closeAllConnections();
+    if ($duration) {
+      LiskDAO::closeAllConnections();
+    }
     return;
   }

There are probably a number of other similar adjustments we can make so behavior here is a bit more reasonable. We'll still open one connection per application (at least until after T11044), but should be able to keep things under ~100 connections per daemon until then.

Event Timeline

I deployed this stuff, and I'm actually seeing more used ports now on secure001 than I did in D16389. However, I think most of these are coming from web requests and the larger count is because we have more traffic during US daytime than we did earlier in the morning. It didn't occur to me earlier that I'd be accidentally counting connections from both sources.

I'll be able to measure repo001 in about 48 hours (which should give us a more definitive count). I could probably figure out some clever ways to distinguish between web-originated and daemon-originated connections before then, but I don't have a baseline number to compare against.

I still suspect this is in much better shape now, but I'll wait until I have some kind of plausible measurement to that effect to call it resolved.

After merging with upstream, we were able to increase our taskmasters up to 32, and still not come anywhere close to the number of connections that were previously used. So far, so good on our end.

Great! It looks like we're down from about 18K to 3K on repo001 too, so it appears that this pretty much works as expected.

There's significant room to refine this further, but since we aren't running out of headroom anywhere now we can look at this again after T11044.