
Phacility Cluster: Apache exited on repo001.phacility.net
Closed, Resolved · Public

Description

The httpd process vanished on repo001.phacility.net. I restarted it, but we should figure out what happened and stop it from happening again.

Revisions and Commits

rP Phabricator
D12395

Event Timeline

epriestley raised the priority of this task from to Normal.
epriestley updated the task description. (Show Details)
epriestley added a project: Phacility.
epriestley moved this task to Do After Launch on the Phacility board.
epriestley added a subscriber: epriestley.

It looks like logrotate restarted Apache (by default, it does this weekly to rotate logs) and it failed to come back up (which is catastrophically bad).

I think the chain of events was:

error.log.1
[Sun Apr 12 06:40:14.115398 2015] [mpm_prefork:notice] [pid 12709] AH00171: Graceful restart requested, doing restart

logrotate used apache2 reload to perform a graceful restart after rotating logfiles.

error.log
PHP Fatal error: PHP Startup: apc_mmap: mmap failed: in Unknown on line 0
[Sun Apr 12 06:40:14.299720 2015] [core:notice] [pid 12709] AH00060: seg fault or similar nasty error detected in the parent process

apc failed to mmap on startup, causing apache to fatal.


Root Causes

Logrotate restarts apache to tell it to reload its logfiles. If logrotate did not do this, apache would continue writing to the old logfile, since it doesn't know it needs to reopen the file.

  • It does this because it's explicitly configured to, by default (see the sketch below).
  • On most systems, this is a sensible default.
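
For reference, the stock Debian/Ubuntu logrotate stanza for Apache looks roughly like this (a sketch from memory, not our exact configuration; paths and directives vary a bit between releases). The postrotate block is the part that issues the restart:

/etc/logrotate.d/apache2 (approximate stock configuration)
/var/log/apache2/*.log {
        weekly
        rotate 52
        compress
        delaycompress
        missingok
        notifempty
        postrotate
                # After rotating, ask Apache to gracefully restart so it
                # reopens its logfiles; this is the restart that failed.
                if /etc/init.d/apache2 status > /dev/null ; then
                        /etc/init.d/apache2 reload > /dev/null
                fi
        endscript
}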

APC (technically APCu) probably failed to mmap because the machine didn't have enough free memory, although I can't find any evidence to indisputably prove that (our configuration doesn't currently collect RAM metrics). There were two related changes in the last week:

  • More instances.
  • Increase in APC shm_size (rCOREb7c8080) in connection with T5501.

Specifically, each instance has 4 daemon processes (overseer, pull, trigger, taskmaster) which consume about 25MB of RAM each, or roughly 100MB per instance. repo class hosts are m3.large instances which have 7.5GB of RAM, so after overhead for everything else on the host this works out to something in the realm of 50 instances per repo box, and repo001 had about 70 at the time this occurred. bin/host status currently reports memory as 93% full.
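
To make that arithmetic concrete, here's a rough way to spot-check how much resident memory the daemon processes are actually using (a sketch; it assumes the daemons show up as "php" processes, which may not match the exact process names on these hosts):

# Sum resident memory across all PHP processes (ps reports RSS in KB).
ps -C php -o rss= | awk '{ total += $1 } END { printf "%.0f MB across %d processes\n", total / 1024, NR }'

# Overall memory picture on the host.
free -m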


Next Steps

Stop logrotate From Restarting: I'm going to stop logrotate from restarting apache. We restart everything weekly at a minimum, and it's fine if log rotation is irregular. That is, it's not important that the x.log.1 / x.log.2 files line up exactly with the same window of time each week, since we don't rely on (or particularly expect) this to be true. On the other hand, automatic processes making host state changes can cause a wide variety of issues. Because we use APC, restarting also has other side effects (cache implications, picking up new code on disk). Some day we'll race with logrotate: it will issue a restart at the same moment we perform a deployment, and apache will pick up half-updated directories full of code. No good can ever come of this.

If this had happened during a deployment instead, it would have been caught and fixed immediately.

Prune Test Instances: This is silly in the long run, but since we recently launched, the proportion of obviously-unused instances (e.g., named "xdemo" or "xtest" with one user, or which I otherwise know to be test instances) is large compared to the total number of instances. This memory pressure is mostly created by test instances: an instance with hundreds of users has roughly the same daemon footprint as an unused test instance, so memory use scales with instance count rather than activity.


Other Approaches

Add More Hosts: We can directly resolve this by adding more hosts to the repository host pool, but since we're only running low on RAM and it's mostly consumed by test instances, I don't plan to pursue this for now. It will probably happen in the next 1-2 months, depending on growth. This scales the cluster out without reducing cost-per-instance.

Activate Swap: The hosts don't have any swap, but do have SSD instance storage which is ready to be allocated as swap. It's not clear how much swap can help: most of these processes are doing (trivial) things frequently, not just sitting idle, so not much of their memory may be pageable. Still, this is worth at least looking into, and it's possible that this could have a substantial effect and reduce cost-per-instance.

Use Hosts with More Memory: Moving from m3.large (7.5GB) hosts to r3.large (15GB) hosts increases the cost of each host by roughly 25% but gives us 100% more memory, so this may be worth considering. However, since we're basically just wasting this memory, I'd like to pursue other less-wasteful approaches first. For example, if swap works reasonably well, that has a much bigger impact on cost-per-instance.

Consolidate Daemons: Ideally, we would refactor the daemons to run multiple instances per process, and then lump all the single-user instances into one process and accept slightly reduced performance for them. This is complicated, although many of the required steps are things we're probably interested in pursuing anyway. I don't plan to do this anytime too soon, but do imagine possibly doing it eventually. This would have the greatest impact on cost-per-instance.

I also issued a 24 hour service credit to all instances, as this was a significant unplanned service disruption.

  • I manually removed the restart from the logrotate script. I'll make that part of deployment when I make other deployment changes.
  • I pruned a handful of obviously-not-in-service test/demo instances and restarted the daemons to terminate their associated daemon processes, which gave us some additional headroom.
  • I manually configured 16GB of swap on the ephemeral drive (the commands involved are sketched below). Depending on how that goes, I'll make that part of deployment. I'm not yet sure if we're better off or worse off with swap, but it's conceivable that swap may be a good fit for this workload.
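
For posterity, configuring swap on the ephemeral drive looks roughly like this (a sketch; it assumes the instance store is mounted at /mnt, and since instance storage is ephemeral this has to be reapplied if the volume is re-provisioned):

# Create a 16GB swapfile on the ephemeral drive and enable it.
dd if=/dev/zero of=/mnt/swap bs=1M count=16384
chmod 600 /mnt/swap
mkswap /mnt/swap
swapon /mnt/swap

# Verify that the kernel sees the new swap space.
swapon -s
free -m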

If you're running a system with systemd (but I don't think the version of Ubuntu you are using does?), you could configure the Apache service with:

Restart=always
RestartSec=0

(http://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=)

which would ensure the service always remains running until an explicit systemctl stop apache2 is executed (i.e., even if the restart logrotate issues fails, systemd will continuously attempt to start the service until it comes back online, making it resilient against temporary start failures).

Of course, if you're using systemd, you also don't need logrotate, since the journal is responsible for logging (and handles it in a sensible way so you don't need to rotate logs). You can also set the system-wide limit for the journal size so it won't result in out-of-disk-space issues: http://www.freedesktop.org/software/systemd/man/journald.conf.html.
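
For example, something like this in journald.conf would cap total journal disk usage (a sketch; the 1G value is just illustrative):

/etc/systemd/journald.conf
[Journal]
# Cap total disk space the journal may use so logging can't fill the disk.
SystemMaxUse=1G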

In addition, you can also constrain services with hard or soft memory limits. This would let you reserve enough of the machine's memory for the apache2 service while constraining the daemons so they don't consume too much RAM. Further documentation is at: http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html
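
A sketch of what that could look like as a drop-in for the Apache unit (MemoryLimit= is the relevant directive in systemd versions from this era; the unit name and the 2G value are illustrative, not a recommendation):

/etc/systemd/system/apache2.service.d/memory.conf
[Service]
# Hard cap on the memory the apache2 cgroup may consume.
MemoryLimit=2G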

My sysop philosophy values having the smallest possible number of moving parts, so I'm very hesitant to add systemd to the mix.

Generally, I don't want anything restarting things automatically. I am much more worried about cascading failures caused by unanticipated interactions between several systems than I am about services randomly failing in isolation. My ops experience isn't super extensive, but almost all of the badness I've seen has had interactions between systems as a root cause, while almost none of it has had a service that is simply flaky in isolation as a root cause (considering only badness attributable to one of those two causes, at least). If a service is flaky in isolation, it probably shouldn't be in production anyway.

(MetaMTA is an auto-retry solution, but I didn't have the power to fix or remove the flaky service.)

If we'd reserved memory for apache, some daemons would have failed to start. Apache is probably the best service to kill on this box if something has to die, since the failure is super obvious and easy to deal with.

More broadly:

  • No more issues this week.
  • Adding swap seems like a good fix: the box overflowed physical memory gracefully, without any apparent problems (notably, no performance degradation), although we don't have enough monitoring for me to confidently claim there was definitely no effect. Before I restarted services we were using about 1.5GB of swap (a quick way to spot-check this is sketched below).
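
For reference, a quick way to spot-check swap usage and paging activity without real monitoring (standard tools, nothing instance-specific):

# How much swap is in use right now.
free -m

# Watch the si/so columns; sustained nonzero values mean we're actively
# paging and the working set doesn't fit in RAM.
vmstat 5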

I'll plan to make the logrotate adjustment and swap configuration part of deployment next time I touch deployment, or before we launch another repo host.

epriestley renamed this task from Apache exited on repo001.phacility.net to Phacility Cluster: Apache exited on repo001.phacility.net. May 17 2015, 12:22 AM
epriestley added a commit: Restricted Diffusion Commit. May 17 2015, 12:32 AM
  • I added logrotate configuration as part of the deploy process for hosts running apache.
  • I added swap configuration as part of the deploy process and enabled 16GB of swap for repo-tier hosts.
  • Things have been consistently stable since making the original changes.
  • I also recently added more billing-status review tools and pruned a bunch more demo/test instances, which gave us more headroom here (host load is currently hovering comfortably around 0.60, with physical memory only about 90% consumed).