
Reduce impact of backups on running instances
Closed, Resolved · Public

Description

An instance reported degraded performance, which looks like it was just related to backups running for another instance. The backup in question took 29 minutes to run today, instead of the usual 3 minutes.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision

Event Timeline

epriestley raised the priority of this task to Normal.
epriestley updated the task description.
epriestley added projects: Ops, Phacility.
epriestley added a subscriber: epriestley.
epriestley added a commit: Restricted Diffusion Commit. (Jul 6 2015, 3:37 PM)

Some immediate observations:

  • The instance which was doing backups is a suspended test instance. This specific issue could be resolved immediately by adding the ability to destroy instances. The two largest instances on this hardware are both long-disabled, destroyable test instances.
  • innodb_buffer_pool_size got bumped down informally and then back up again formally, but may be too high. I think I'm going to drop it to ~50%-66% of the host's RAM (it's about 85% right now).
  • The root cause of the 3 -> 29 minute jump isn't clear to me. Obviously, 3 minutes is a lot better than 29 minutes. top / ps / free / CloudWatch didn't report the host running out of anything, and CPU utilization was low. The EBS volumes also didn't report an unusual amount of utilization. There was less memory in use during the backup, which makes me suspect this is some kind of complex buffer pool size interaction.
  • In the long term, we should use replicas and dump backups off them so main database performance is unaffected during a backup, but this requires significant work on T4209.
  • We could nice the backups, but they weren't actually limited by CPU, so I suspect this wouldn't help.

So my plans are:

  • Adjust the buffer pool size back down to something in the 50%-66% range.
  • Add the ability to destroy instances.
  • Destroy old instances.

I think the actual extent of "destroy" we need is just "stop backups" -- keeping the data itself around costs us roughly nothing -- so I'll probably just tailor that.

Possibly I can just stop backups on suspended/disabled instances if they've been offline for ~48 hours -- we won't destroy the old backups just because we stop the new ones. So that's probably simpler, and just requires a new...

While I was writing this up, the host ran out of RAM and stopped working, so this is probably a memory management issue.

I'm dropping the pool size and giving these hosts swap now.

Ubuntu also ships with some debian-sys-maint junk that runs huge, slow queries to check for problematic table data on every restart; we should likely disable it.

I've restarted the tier with a ~50% buffer pool configuration (4GB of 7.5GB) and 8GB swap (rCORE471d3a7acee1) and things seem to be holding stable. I'll pursue the other changes now.
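
For reference, the sizing arithmetic behind that 4GB figure, as a standalone sketch (the helper below is illustrative only, not part of our deploy tooling; the 50%-66% range and the 7.5GB host size are taken from the notes above):

```
# Illustrative only: compute an innodb_buffer_pool_size target as a fraction
# of host RAM, per the ~50%-66% guideline discussed above.

def buffer_pool_target_gb(host_ram_gb, fraction=0.5):
    """Return a suggested buffer pool size in GB."""
    if not 0.5 <= fraction <= 0.66:
        raise ValueError("fraction should stay in the 50%-66% range")
    return host_ram_gb * fraction


if __name__ == "__main__":
    host_ram_gb = 7.5  # RAM on these db hosts
    target_gb = buffer_pool_target_gb(host_ram_gb, fraction=0.5)
    # Prints "innodb_buffer_pool_size = 4G": 3.75GB computed, rounded to the
    # 4GB we actually deployed.
    print("innodb_buffer_pool_size = %dG" % round(target_gb))
```

Staying at the low end of that range leaves headroom for the backup processes (gzip, etc.) alongside mysqld.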

epriestley added a commit: Restricted Diffusion Commit. (Jul 6 2015, 4:08 PM)

I no-op'd the debian-start script and deployed db002 with the new changes (rCORE51d7570f5ddf) to verify it. I'm not going to restart db001 for the moment, since bringing it down is more disruptive and I only need confidence that the change is correct.

epriestley added a revision: Restricted Differential Revision. (Jul 6 2015, 4:31 PM)
epriestley added a revision: Restricted Differential Revision. (Jul 6 2015, 4:52 PM)
epriestley added a commit: Restricted Diffusion Commit. (Jul 6 2015, 4:59 PM)

I stopped storage upgrade from running on out-of-service instances (rCOREe9f998b109c1) and upgraded db002 and repo002 to verify this change. We don't need to keep out-of-service schemata up to date, and this reduces the ongoing cost of out-of-service instances.
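
For illustration, the deploy-time check amounts to filtering the instance list by status before running schema upgrades (a rough sketch only -- the real change is rCOREe9f998b109c1, and the status values and field names here are stand-ins):

```
# Rough sketch only: skip "storage upgrade" for out-of-service instances at
# deploy time. Status values and structure are stand-ins, not the real schema.

OUT_OF_SERVICE_STATUSES = {"suspended", "disabled"}

def instances_to_upgrade(instances):
    """Yield only instances whose schemata should be kept up to date."""
    for instance in instances:
        if instance["status"] not in OUT_OF_SERVICE_STATUSES:
            yield instance

instances = [
    {"name": "active-instance", "status": "up"},
    {"name": "old-test-instance", "status": "suspended"},
]
assert [i["name"] for i in instances_to_upgrade(instances)] == ["active-instance"]
```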

epriestley added a commit: Restricted Diffusion Commit. (Jul 6 2015, 6:21 PM)
epriestley added a commit: Restricted Diffusion Commit.
epriestley claimed this task.

I pushed a change to record instance out-of-service dates (D13561) and stop running backups for out-of-service instances after about 48 hours (D13563). I promoted these changes to rSAAS stable and deployed admin. I manually updated all "Suspended" and "Disabled" instances to have an out-of-service date of July 1, 2015.

I manually triggered a backup of the earlier problem instance, and it completed in slightly less than 2 minutes. (This backup was forced -- and thus ran despite the new out-of-service rule -- because it was a named backup from the web UI.)
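
For illustration, the backup gate now behaves roughly like the sketch below (the real implementation is D13561/D13563, not this code; the function and field names are stand-ins). Scheduled backups skip instances that have been out of service for more than ~48 hours, while forced backups -- like the named one above -- always run:

```
# Rough sketch of the backup gate: instances get an out-of-service date when
# suspended/disabled; scheduled backups skip them after a ~48 hour grace
# period, but forced (named, web UI) backups still run. Names are stand-ins.

import datetime

GRACE_PERIOD = datetime.timedelta(hours=48)

def should_run_backup(out_of_service_since, now, force=False):
    """Decide whether to back up an instance on this cycle."""
    if force:
        # Named backups requested from the web UI bypass the gate.
        return True
    if out_of_service_since is None:
        # In-service instances are always backed up on schedule.
        return True
    return (now - out_of_service_since) < GRACE_PERIOD

# Example: an instance marked out of service on July 1, 2015 no longer gets
# scheduled backups on July 6, but a forced backup still runs.
since = datetime.datetime(2015, 7, 1)
now = datetime.datetime(2015, 7, 6)
assert not should_run_backup(since, now)
assert should_run_backup(since, now, force=True)
```

Old backups stick around; only the creation of new ones stops, matching the "stop backups, keep data" approach above.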

So I think this is resolved, despite not finding a clear mechanism for the memory/load interaction. My thinking is basically:

  • 95% chance this was some flavor of memory issue causing gzip or mysql to mostly-silently degrade when unable to allocate memory.
  • 5% chance of spooky AWS ghosts?

The new settings should give us better behavior, and we should no longer pay any ongoing resource costs for out-of-service instances; in particular, their backup cycles will no longer affect in-service instances.

I've issued a 24-hour credit for all instances because there was a material disruption to service here.