Just noting this for myself; I'll deal with this in the morning.
Description
Event Timeline
Here's the smoking gun from /var/log/syslog:
Mar 30 06:39:31 secure001 kernel: [23966991.746872] Out of memory: Kill process 13793 (mysqld) score 402 or sacrifice child
Immediately before this, the log dumps a summary of the process table, showing 75 active apache2 subprocesses:
... Mar 30 06:39:31 secure001 kernel: [23966991.746870] [ 9306] 33 9306 245441 1937 122 0 0 apache2 ...
The size of an apache2 process varies, but per ps auxwww, the ones actively serving requests each use about 0.5-1% of host memory. The host has 7.5GB of RAM, so that corresponds to roughly 50-75MB per process, which is about what I'd expect.
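As a rough sanity check, total apache2 resident memory can be summed from ps output (a sketch only; RSS double-counts shared pages, so treat this as an upper bound):

```shell
# Sum the RSS column (KB) of all apache2 processes; the [a]pache2
# pattern avoids the pipeline matching itself. On a box with no
# apache2 running this just prints "0 MB".
apache_rss_mb=$(ps auxwww | awk '/[a]pache2/ {sum += $6} END {printf "%d", sum/1024}')
echo "${apache_rss_mb} MB resident across apache2 processes"
```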
This host is more memory-constrained than usual because it runs the daemons, MySQL, and Apache on the same box and does not have swap configured. The apache2 worker limit is the Debian default of 150, from /etc/apache2/mods-enabled/mpm_prefork.conf.
If we hit that limit, we'd conservatively expect 150 workers × 50-75MB = 7.5GB-11.25GB of memory dedicated to Apache alone, when the machine has at most ~5GB of RAM to give it after accounting for MySQL, etc. So it isn't surprising that this can produce an OOM condition.
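The worst-case arithmetic above, as a quick sketch (the 50-75MB per-worker figures are the ps estimates from earlier):

```shell
# Worst-case Apache memory at the default prefork worker limit,
# assuming 50-75MB RSS per apache2 worker as observed above.
workers=150
low_mb=$((workers * 50))
high_mb=$((workers * 75))
echo "worst case: ${low_mb}-${high_mb} MB (~7.5-11.25 GB)"
```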
The rest of my investigation is murkier:
The requests which pushed us OOM might be these: note that response codes jump from 200 to 500 midway through, and the timestamps line up. I think I misattributed this to macro thumbnailing earlier; it doesn't seem to be related (there was just some thumbnailing producing 500s earlier in the log):
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-i2kdhqa462ykf2wfyj63/profile HTTP/1.1" 200 5157 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-7hf6op6zvzwworrbelps/profile HTTP/1.1" 200 76219 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-axz2ybuu5rld4f2eoazo/profile HTTP/1.1" 200 8498 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-6an4zv6y6u4fhyjblaex/profile HTTP/1.1" 500 1700 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-nwmwqbwqwmsq52ifwkc3/profile HTTP/1.1" 500 1649 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
I'm not sure this is directly related. A single client did appear to send us 86 profile image requests at approximately the same time, which could plausibly have broken the camel's back in an already-loaded environment, since the theory of the failure is just Apache launching too many subprocesses.
A possible contributing factor is that node is currently huge (26% of host memory):
ubuntu 6984 1.1 26.5 3018372 2033840 ? Sl Mar28 29:39 nodejs /core/lib/phabricator/support/aphlict/server/aphlict_server.js --client-port=22280 --admin-port=22281 --admin-host=localhost --ssl-key=/core/conf/ssl/secure.phabricator.com.key --ssl-cert=/core/conf/ssl/secure.phabricator.com.crt --log=/var/log/aphlict.log
It appears node was also large (although not as large) at the time of the OOM condition, so there's probably an issue there, and it was likely a contributing factor.
To prevent this, I plan to:
- reduce MaxRequestWorkers from 150 to 24 on secure nodes (i.e., target approximately a 3GB ceiling) and to 48 on web nodes (i.e., target approximately a 6GB ceiling);
- follow up on the node memory use issue.
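For reference, on Debian-style layouts this directive lives in /etc/apache2/mods-available/mpm_prefork.conf. A sketch of the intended change for a secure node (the surrounding values are assumed Debian defaults):

```apache
# Assumed Debian defaults except MaxRequestWorkers, lowered per the plan above.
<IfModule mpm_prefork_module>
    StartServers              5
    MinSpareServers           5
    MaxSpareServers          10
    MaxRequestWorkers        24
    MaxConnectionsPerChild    0
</IfModule>
```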
I've deployed a MaxRequestWorkers 24 change to this server and wasn't able to get it to burst over 24 workers with ab, so that seems to be working. I'll keep an eye on it for the rest of the week, and the 48 limit will go out to the cluster later this week if no problems turn up.
MaxRequestWorkers 24 appears to be creating some issues. This host only serves about 5 requests/second, but I can hit delays with concurrency of 10 (e.g., ab -n100 -c10):
Percentage of the requests served within a certain time (ms)
  50%    120
  66%    125
  75%    134
  80%    146
  90%    228
  95%   4445
  98%   5351
  99%   5382
 100%   5382 (longest request)
I'm not immediately sure why 24 workers can't handle ~14 requests/second without blocking. One theory is that some slots may be consumed by, e.g., long git clone requests operating against repositories, but we don't have a great way to identify these right now.
We're seeing a lot of connections stuck in "R" ("Reading Request") state for a long time, but the client is the load balancer and the request hasn't been read yet, so it's not clear exactly what's going on with these. They may be crawlers, since pretty much every search engine loves indexing everything on this install hundreds of times an hour.
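If slow "Reading Request" clients are tying up worker slots, mod_reqtimeout can bound how long a connection may sit in that state. A sketch of the stock Debian configuration (values assumed, not verified on this host):

```apache
# /etc/apache2/mods-available/reqtimeout.conf (assumed Debian defaults)
# Drop connections that take more than 20-40s to send their headers,
# or that send the request body slower than 500 bytes/sec.
RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500
```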
I'm just going to push the worker limit back up a bit for now, since continuous performance degradation is worse than the OOM killer striking down MySQL once every five years, assuming this isn't the start of something bigger.
I'll also enable swap on this host.
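For the record, enabling swap is a few commands (the size and path here are placeholders, and this requires root):

```shell
# Create and enable a 2GB swap file (size is a placeholder; run as root).
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Persist across reboots:
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```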
We appear to be stable here now:
- We have a lower process limit for apache.
- We have a second node in the cluster.
- aphlict now uses less memory (256M, down from 1.5GB).
There's a remaining aphlict issue in T10696 related to memory usage in a disk full condition.
These numbers may need some more tweaking but we haven't seen more issues in about a month.
Similar numbers also went into production in the Phacility cluster, which also now has an additional web node.