
OOM killer nuked `mysql` on `secure.phabricator.com` after apache2 spawned 75+ workers
Closed, Resolved · Public

Description

Just noting this for myself; I'll deal with this in the morning.

Revisions and Commits

Event Timeline

Here's the smoking gun from /var/log/syslog:

Mar 30 06:39:31 secure001 kernel: [23966991.746872] Out of memory: Kill process 13793 (mysqld) score 402 or sacrifice child

Before this, the log dumps a summarized process table and shows 75 apache2 subprocesses active:

...
Mar 30 06:39:31 secure001 kernel: [23966991.746870] [ 9306]    33  9306   245441     1937     122        0             0 apache2
...
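As a sanity check, that worker count can be pulled straight out of the kernel's process dump. A rough sketch (the syslog path and timestamp are the ones from the excerpt above):

  # Count apache2 entries in the kernel's OOM process table.
  grep 'Mar 30 06:39:31' /var/log/syslog | grep -c ' apache2$'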

The size of an apache2 process varies, but the ones actually serving something seem to take up about 0.5-1% of host memory, according to ps auxwww. The host has 7.5GB of RAM, so that corresponds to roughly 50-75MB per worker, which is about what I'd expect.
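For reference, here's roughly how to get that estimate from ps directly (a sketch; RSS is field 6 of ps aux, reported in KB):

  # Per-worker and total resident memory for apache2, in MB.
  ps auxwww | awk '/[a]pache2/ { rss += $6; n++ }
    END { if (n) printf "%d procs, %.0f MB total, %.0f MB avg\n", n, rss / 1024, rss / (n * 1024) }'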

This host is more memory-constrained than usual because it runs the daemons, MySQL, and Apache on the same box, and does not have VM (swap) configured. The apache2 worker limit is the stock default from /etc/apache2/mods-enabled/mpm_prefork.conf, which sets MaxRequestWorkers to 150.

If we hit that limit, we'd conservatively expect 7.5GB-11.25GB of memory to be dedicated only to apache, when the machine has at most ~5GB of available RAM to give it after accounting for MySQL, etc. So it isn't surprising that this can run into an OOM condition.
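Spelled out as a back-of-the-envelope check (using 1GB ≈ 1000MB for round numbers):

  awk 'BEGIN { printf "low: %.2f GB  high: %.2f GB\n", 150 * 50 / 1000, 150 * 75 / 1000 }'
  # low: 7.50 GB  high: 11.25 GB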


The rest of my investigation is murkier:

The requests which forced us OOM might be these: note that the response codes jump from 200 to 500 midway through, and the timestamps line up. I think I misattributed this to macro thumbnailing earlier; it doesn't seem to be related (there was just some macro thumbnailing producing 500s earlier in the log):

secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-i2kdhqa462ykf2wfyj63/profile HTTP/1.1" 200 5157 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-7hf6op6zvzwworrbelps/profile HTTP/1.1" 200 76219 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-axz2ybuu5rld4f2eoazo/profile HTTP/1.1" 200 8498 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-6an4zv6y6u4fhyjblaex/profile HTTP/1.1" 500 1700 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
secure.phabricator.com:80 172.30.1.175 - - [30/Mar/2016:06:39:24 +0000] "GET /file/data/@secure/xx/PHID-FILE-nwmwqbwqwmsq52ifwkc3/profile HTTP/1.1" 500 1649 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"

I'm not sure this is directly related. A single client did appear to send us 86 profile image requests at approximately the same time, which could plausibly have been the straw that broke the camel's back on an already-loaded host, since the theory of the failure is simply that Apache launched too many subprocesses.
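For the record, the burst is easy to spot by counting requests per client around that timestamp. A sketch (the access log path is an assumption; the client IP is the second field in this vhost-prefixed log format):

  grep '30/Mar/2016:06:39' /var/log/apache2/access.log \
    | awk '{ print $2 }' | sort | uniq -c | sort -rn | head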

A possible contributing factor is that the aphlict node process is currently huge (26% of host memory):

ubuntu    6984  1.1 26.5 3018372 2033840 ?     Sl   Mar28  29:39 nodejs /core/lib/phabricator/support/aphlict/server/aphlict_server.js --client-port=22280 --admin-port=22281 --admin-host=localhost --ssl-key=/core/conf/ssl/secure.phabricator.com.key --ssl-cert=/core/conf/ssl/secure.phabricator.com.crt --log=/var/log/aphlict.log

It appears that it was also large (although not as large) at the time of the OOM condition, so there's probably an issue with aphlict's memory use, and it was likely a contributing factor.
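To keep an eye on it, something like this is enough (a sketch; -C nodejs matches the process name shown in the ps output above):

  # RSS (KB), elapsed time, and command line for the aphlict server.
  ps -o rss=,etime=,args= -C nodejs | grep aphlict_server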

To prevent this, I plan to:

  • reduce MaxRequestWorkers from 150 to 24 on secure nodes (i.e., target approximately a 3GB ceiling) and to 48 on web nodes (i.e., target approximately a 6GB ceiling); see the config sketch after this list;
  • follow up on the node memory use issue.
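Roughly, the change on a secure node looks like this. This is a sketch only: the surrounding prefork directives are the stock defaults and are assumptions, not values taken from this host.

  # /etc/apache2/mods-available/mpm_prefork.conf (sketch)
  <IfModule mpm_prefork_module>
      StartServers              5
      MinSpareServers           5
      MaxSpareServers          10
      MaxRequestWorkers        24
      MaxConnectionsPerChild    0
  </IfModule>

  # Then reload apache:
  sudo service apache2 restart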
epriestley renamed this task from "OOM killer nuked `mysql` on `secure.phabricator.com` after numerous simultaneous macro thumbnailing requests (?) forced into a low-memory condition" to "OOM killer nuked `mysql` on `secure.phabricator.com` after apache2 spawned 75+ workers". Mar 30 2016, 12:37 PM
epriestley added a commit: Restricted Diffusion Commit. Mar 30 2016, 12:48 PM

I've deployed a MaxRequestWorkers 24 change to this server and wasn't able to get it to burst over 24 workers with ab, so that seems to be working. I'll keep an eye on it for the rest of the week, and the 48 limit will go out to the cluster later this week if no problems turn up.
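For reference, the check was along these lines (a sketch; the URL and ab parameters here are illustrative, not the exact invocation):

  # In one terminal, generate a burst of concurrent requests:
  ab -n 1000 -c 50 https://secure.phabricator.com/

  # In another, confirm the worker count stays at or below the cap:
  watch -n1 'pgrep -c apache2'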

MaxRequestWorkers 24 appears to be creating some issues. This host only serves about 5 requests/second, but I can hit delays with concurrency of 10 (e.g., ab -n100 -c10):

Percentage of the requests served within a certain time (ms)
  50%    120
  66%    125
  75%    134
  80%    146
  90%    228
  95%   4445
  98%   5351
  99%   5382
 100%   5382 (longest request)

I'm not immediately sure why 24 workers can't handle ~14 requests/second without blocking. One theory is that some slots may be consumed by, e.g., long git clone requests operating against repositories, but I don't think we have a great way to find these right now.
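One way to see what each slot is actually doing, assuming mod_status with ExtendedStatus is enabled on localhost (which may not be the case here), is to pull the per-slot request list out of the status page:

  # Needs mod_status + ExtendedStatus; "fullstatus" also needs a text
  # browser like lynx installed.
  apache2ctl fullstatus | grep -E 'GET|POST' | head -n 30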

I'm fiddling with LB health checks, so it's my fault if the server vanishes.

We're seeing a lot of connections stick in "R" ("Reading Request") for a long time, but the client is the load balancer and the request hasn't been read yet, so it's not clear exactly what's going on with these. I think they may be crawlers, since pretty much every search engine loves indexing everything on this install hundreds of times an hour.
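A quick way to tally the scoreboard states (again assuming mod_status is reachable on localhost; a sketch):

  # Count workers per scoreboard state ("R" = reading request,
  # "W" = sending reply, "_" = waiting, "." = open slot).
  curl -s 'http://localhost/server-status?auto' \
    | awk -F': ' '/^Scoreboard/ { print $2 }' \
    | grep -o . | sort | uniq -c | sort -rn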

I'm just going to push the client limit back up a bit for now, since the continuous performance degradation is worse than the OOM killer striking down MySQL once every five years, assuming this isn't the start of something bigger.

I'll also enable VM (swap) on this host.
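For the swap piece, something like this is enough (a sketch; the 2GB size is an arbitrary choice, not a value from this task):

  sudo fallocate -l 2G /swapfile
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  # Persist across reboots:
  echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab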

epriestley added a commit: Restricted Diffusion Commit. Mar 30 2016, 6:38 PM

We appear to be stable here now:

  • We have a lower process limit for apache.
  • We have a second node in the cluster.
  • aphlict now uses less memory (256M, down from 1.5GB).

There's a remaining aphlict issue in T10696 related to memory usage in a disk full condition.

These numbers may need some more tweaking, but we haven't seen any more issues in about a month.

Similar numbers also went into production in the Phacility cluster, which also now has an additional web node.

epriestley mentioned this in Unknown Object (Maniphest Task). Apr 20 2017, 7:05 PM