30GB Phacility instance caused a series of cascading failures which left web services unreachable
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	epriestley
	Apr 21 2016, 1:05 PM

Description

The Phacility cluster experienced a major service disruption which affected some instances for about 12 hours between 8:30 PM PDT April 20, 2016 and 8:30 AM PDT April 21, 2016. This task details our response to this issue.

Incident Summary
See below for a more detailed timeline.

Time	Since Previous	Event
April 20, 8:30 PM PDT	-	`db003` begins to falter. 20% of instances likely began to see disruptions shortly after this.
April 20, 11:50 PM PDT	3 hours, 20 minutes	`web001` hangs and is removed from the loadbalancer. All instances now experience a total web service disruption.
April 21, 5:55 AM PDT	6 hours, 5 minutes	Initial operational response.
April 21, 6:02 AM PDT	7 minutes	80% of instances restored to service.
April 21, 7:05 AM PDT	1 hour, 3 minutes	100% of instances restored to service.

Instance Credit

We've issued all instances a week of service credit for this disruption. This will be reflected on your next invoice, or over the course of the next two invoices if you receive an invoice in the next week.

Related Objects

Mentioned In: T10849: Support query timeouts via `max_statement_time` or MYSLQI_ASYNC
Mentioned Here: T7338: Build more status tools for monitoring Phacility cluster health
T9823: Expand Phacility cluster repo and db tiers (add nodes: 005)
T10655: Phacility cluster mail deliverability issue
T10849: Support query timeouts via `max_statement_time` or MYSLQI_ASYNC
D15775: Don't cache resources we can't generate properly

Event Timeline

epriestley created this task.Apr 21 2016, 1:05 PM

Herald added subscribers: chad, eadler. · View Herald TranscriptApr 21 2016, 1:05 PM

epriestley merged a task: T10846: Problem with phacility?.Apr 21 2016, 1:05 PM

epriestley added a subscriber: • wuup.

Thanks man, appreciate it.

allixsenos added a subscriber: allixsenos.Apr 21 2016, 1:18 PM

Events so far:

At about 11:50 PM PST April 20th, apache stopped responding to requests on web001 and the load balancer (lb001) removed it from service after it failed health checks.
Around 5:50 AM PST April 21st, I logged in to a number of notifications from users about the issue.
I connected to the server around 5:55 AM PST and rebooted it after inspecting things, service was restored around 6:02 AM PST.

I don't have any details about the technical cause yet, since it wasn't initially obvious and I resolved the problem by turning it off and on again. I'll look for more information and attempt to pin the cause down in more detail.

This failure is primarily an operational/administrative issue (single-node failure caused service failure, 6 hours until operational response) rather than a technical one (restoration was straightforward service was restored in ~7 minutes once response occurred). I'll discuss these issues after seeing what information I can find about the technical cause.

I connected to the server around 5:55 AM PST and rebooted it after inspecting things, service was restored around 6:02 AM PST.

https://plousio.phacility.com/ - is this working for you?

Looking into that. There are some additional issues with queued daemon tasks on admin as well that I'm looking into, unsure if those are symptoms of the web issue or not yet.

our stuff is still down

Looking up service metadata for some subset of instances is failing, I'm not sure why yet or how large that subset is.

Pretty sure I've figured this out but resolving it will take some finesse.

eadler removed a subscriber: eadler.Apr 21 2016, 2:02 PM

Pretty sure I've figured this out but resolving it will take some finesse.

No problem, take your time. Realize its pretty early there.

I think the issue is initially resolved now, is anyone with an affected instance still seeing an issue?

• wuup awarded a token.Apr 21 2016, 2:07 PM

As a very mild silver lining, I'm pretty sure this was the actual root cause of the webserver hang, and it's relatively tractable. Here's what appears to have happened:

One instance on db003 did something very creative and their database grew to approximately 30-40GB (?) very quickly. I'll look into this in more detail.
This largely exhausted disk space and caused MySQL to start hanging on some operations.
In turn, this hung web connections for instances on this host.

Restarting the webserver freed apache workers, but presumably would have hung again if the issue wasn't resolved. I forcefully created space on the affected volume and restarted MySQL. I'll assess what happened here in more detail and follow up once I have additional information.

confirm noomhq is operational again

epriestley renamed this task from Phacility web services were unreachable to 40GB Phacility instance caused a series of cascading failures which left web services unreachable.Apr 21 2016, 2:26 PM

epriestley renamed this task from 40GB Phacility instance caused a series of cascading failures which left web services unreachable to 30GB Phacility instance caused a series of cascading failures which left web services unreachable.Apr 21 2016, 2:38 PM

From here on, this analysis discusses some private instance data. I'm generally very hesitant to disclose information like this, but the data is non-identifying and the nature of the data is important to understanding the nature of the failure, so I think disclosure is reasonable in this case. I've limited the disclosure to only include information important in understanding the technical aspects of the issue.

I think a more complete timeline now looks like this:

Recently, an instance on the db003 shard enabled metamta.diffusion.inline-patches and imported several large repositories. In vague terms to limit disclosure, this was roughly four copies of the Linux kernel. (I don't have any evidence to suggest that there was malicious or abusive intent behind this.)
This caused the database size to quickly grow to about 30GB, about 20GB of which is queued patch email.
We currently allocate 64GB primary ddata data volumes on db hosts and err heavily on the side of caution when destroying instance data, so instances hold a long backlog of test data. Thus, this additional surge in data size was sufficient to substantially exhaust disk space on the volume.
Because MySQL couldn't write, connections started hanging. This backed up daemons on admin001 and webserver processes on web001. I don't have an exact timeline on this, but it looks like issues started around 8:30 PM PDT April 20th (a little more than 3 hours before web001 went down). Other instances on db003 may have been affected that early, or may have experienced service disruptions starting between that time and when web001 failed. Instances on the other shards (db001 through db005) were likely not affected until 11:50 PM PDT.
Once web001 backed up enough, it wan't able to respond to /status/ checks from the loadbalancer and was taken out of service.
At approximately 6:02 AM PDT I restarted web001, which cleared the hung connections. This restored service for about 80% of instances, but instances on db003 (approximately 20% of instances) were still unavailable. This was not immediately obvious -- I tested several instances and they all worked, but they happened to be on different database shards. The issue would also have reoccurred eventually, although this likely bought us 3-4 hours of uptime.
Around 6:30 AM PDT I restarted daemons on admin001 and it began flushing the task backlog (this had little impact on users; these tasks are mostly administrative/billing).
At approximately 7:05 AM PDT I cleared the issue on db003. This restored service to the remaining 20% of instances.

I've stopped the bleeding on db003 by destroying out-of-service test instances and we now have about 20GB of headroom on the volume, but the problematic instance is still active and continuing to grow -- it's about 3x larger than all other instances on the shard combined. I'd like to avoid disabling it since we don't actually say "don't import 4 copies of the linux kernel" anywhere and the action doesn't appear to be abusive in nature, although that may be the only reasonable course of action available.

My next steps are:

Figure out how to stop the problem instance before it kills again. I can suspend it, but I want to stop it non-violently if possible. I'll do this now.
Walk through each step of this cascade in more detail and outline why the issue occurred and what we'll do to prevent it in the future. I expect to follow up with this later today.

epriestley updated the task description. (Show Details)Apr 21 2016, 3:13 PM

epriestley updated the task description. (Show Details)Apr 21 2016, 3:22 PM

epriestley mentioned this in T10849: Support query timeouts via `max_statement_time` or MYSLQI_ASYNC.Apr 21 2016, 6:21 PM

Some additional response steps:

I couldn't come up with a non-violent approach so I suspended the instance in question and notified the administrator.
I've verified that no other shards are currently at risk of similar issues. They're generally <40% utilized, although db001 was a little above that. I pruned some instances there just to be on the safe side.

Broadly, here are some of the other issues that led to this failure and what we plan to do to resolve or prevent them.

Briefly:

MySQL client behavior in this situation was very bad, and made this far worse than it should have been (a partial issue affecting a subset of instances became a total issue affecting all instances). We'll pursue technical solutions to fix client behavior (T10849).
Most of the delay in resolving this was because there's no way to page us. We'll provide a way to page staff in an emergency.
The web service was not redundant, and now is. This probably would have delayed rather than prevented the problem, but we should be strictly more resilient now.

In detail:

Quotas: Instances don't currently have hard storage quotas, so nothing stops an instance from using an arbitrarily large amount of data. I have some ideas for improving usage monitoring and possibly implementing hard quotas, but need to prototype them before making a decision about moving forward.

I'm hesitant about hard quotas because the quota still needs to be relatively generous to avoid interfering with legitimate instances, and per-instance quotas won't stop several instances from exhausting resources by working together. Large instances with necessarily large quotas could also burst through available capacity quickly. So hard quotas on their own wouldn't really stop this problem, and would create other problems which could interfere with legitimate users.

I suspect that implementing better monitoring is worthwhile, but that hard quotas might create more problems than they solve in the context of other issues and fixes.

Storage Provisioning: The current volume size of ddata volumes (64GB) was a rough guess when we first launched. Although it isn't wildly unreasonable, it doesn't have as much headroom as I'd like and I think it should probably be made larger: we're doing a better job of fitting more instances per hardware node now than we were at launch, retain a lot of data for a long period if time, and it's common for test instances to perform an import of some relatively large repository.

I plan to increase these volumes to 256GB moving forward, and increase dbak volumes to 128GB (this data compresses well). I can't upgrade existing volumes without some downtime so I probably won't make this increase for existing nodes until I can schedule it concurrently with other required downtime (maybe when we start dumping binlogs for replication). When we configure binlogs, that will also tend to place somewhat more storage pressure on ddata volumes, further motivating this change.

The required downtime would not be particularly long (maybe 15 minutes per shard?), so there may be an opportunity to do this in a way that is minimally disruptive sooner.

I've updated Adding Hardware to reflect the new volume sizes.

Query Hangs: I think the biggest technical issue here was the poor behavior of MySQL clients on the web and admin tiers when MySQL stopped processing some queries. For some of these clients, like the backup processes, we must accept that queries may take a relatively long time to run. However, for web processes we could and should timeout queries after a short delay. This would have lead to dramatically better behavior and left only hosts on db003 materially affected, with a far more obvious indicator of the root cause.

However, it doesn't seem like there's a great way to do this until MySQL 5.7.4, which introduced max_statement_time (and apparently it was renamed to max_execution_time in 5.7.8). This is relatively recent as MySQL goes (about two years old, I think). We currently run MYSQL 5.5.47-0ubuntu0.14.04.1, which doesn't yet have this feature. If this was enabled, I believe it would have dramatically reduced the impact of this issue and made an appropriate response much more direct, and is the most effective technical mitigation step we can perform.

I plan to pursue support for this option and evaluate upgrading MySQL to take advantage of it.

It's also possible that we can implement this on the client side in some MySQL drivers using features like MYSQLI_ASYNC, but this also requires relatively recent versions of client software. Still, this is worth evaluating and may be easier to set up.

I've filed this as T10849.

Monitoring: We're behind the curve on cluster monitoring and have been for a while; T7338 discusses some particulars. We have a mixture of monitoring tools but they tend to provide ways to actively query cluster health rather than passively observe it. This is difficult to prioritize because unplanned production downtime is rare, and the amount of monitoring we'd need to build or configure to predict or quickly respond to all potential incidents is theoretically large.

This is something we plan to improve, but it's generally hard to evaluate the cost/benefit of building out monitoring infrastructure. In hindsight we can often pinpoint the thing that should have been better monitored, but I'm not sure how often we'd get it right in foresight, and incidents have generally had dissimilar causes. We likely could have resolved this particular issue more quickly with reasonable monitoring, but the last incident related to outbound mail (T10655) would have been difficult to build monitoring for, even in hindsight.

I think the inability to page us (see below) is a bigger issue than monitoring on its own, since providing that ability would trigger operational response regardless of the presence of appropriate monitors. I generally plan to pursue that first, then fill in monitoring to supplement it in the longer term as we have more confidence in our ability to foresee problems.

Web Node Failure: We suffered a total loss of web services after the failure of a single node, web001. We've had a redundant node in production for months (web002), I just never put it into the loadbalancer pool. I brought it up to have some capacity on hand after T9823 but didn't want to swap it in at the time since it was the middle of the day and I didn't want to disrupt anything if there was an issue. It then sat unused for about 5 months, for no particular reason.

I put this node into production about three minutes after restoring web001, without issue.

It likely wouldn't have actually saved us here since all of the processes on that node would have hung too, but it might have delayed the impact to other instances by another 3-4 hours. And there could have been an issue with putting it into production, or could still be in the future (issues like D15775 create some amount of peril, although I believe we are insulated from that here).

Paging and Staff Availability: The biggest issue here was one of response time: we have two full-time staff with similar waking hours and effective coverage from about 5AM PDT to 10PM PDT. This happened at the worst possible time -- shortly after we both went to sleep -- and there's no way to page us.

We've planned to move support into Nuance for a long time to improve this, and Nuance is next on our "natural" roadmap, although some prioritized work is in the pipeline before that. Once support moves to Nuance we'll provide a [X] Page someone, this is an emergency. button for situations like this.

In the long run we should be getting paged for monitoring alarms too, as discussed above, but having a red button that users could click is a more general solution to all problems in this class and would have reduced primary downtime from ~6 hours to ~15 minutes without relying on us to predict the correct things to monitor and alert on.

Not sure if its related but I get a message regarding "unresolved setup issue" on our instance[1] now:

Phabricator is configured to use a notification server, but is unable to connect to it. You should resolve this issue or disable the notification server. It may be helpful to double check your configuration or restart the server using the command below.
HTTPFutureCURLResponseStatus
[cURL/7] (http://notify001.phacility.net:22281/status/?instance=mindoktor) <CURLE_COULDNT_CONNECT> The cURL library raised an error while making a request. You may be able to find more information about this error (error code: 7) on the cURL site: http://curl.haxx.se/libcurl/c/libcurl-errors.html#CURLECOULDNTCONNECT

[1] mindoktor.phacility.com

Thanks, I think that should be fixed now. Not clear if it was related to everything else and I can't immediately come up with a way that it would have been -- I'll keep an eye on it. We've made some changes to the Aphlict server recently and it might have some issues that still need to be worked out, so it could be purely coincidental.

Though this downtime came at an unfortunate time for us (we just migrated from bitbucket and in the middle of the workday here in Sweden) I must say that you have dealt with it in a phrofessional way. Thanks.

Sorry to hear that @otto.bretz, but glad you're still happy with the service overall. I think this is the longest unplanned downtime we've had since launch and it's definitely not acceptable to us.

cspeckmim awarded a token.Apr 22 2016, 2:02 AM

cspeckmim added a subscriber: cspeckmim.Apr 22 2016, 2:06 AM

There's nothing particularly useful or actionable here now, so closing it out. (I believe this was the most severe incident Phacility ever experienced while actively maintained.)

30GB Phacility instance caused a series of cascading failures which left web services unreachableClosed, ResolvedPublicActions

Description

Related Objects

Event Timeline

30GB Phacility instance caused a series of cascading failures which left web services unreachable
Closed, ResolvedPublic
Actions