
Replace cluster magnetic volumes with SSD volumes
Closed, Resolved | Public

Description

Many of the db and repo volumes are magnetic ("standard"), not SSD ("gp2").

There is no reason for these volumes to be magnetic, and magnetic volumes are undesirable. Their pricing and I/O characteristics aren't directly comparable to SSD, but generally they have less I/O, less flexible I/O, and aren't meaningfully cheaper.

We should swap these to become SSD volumes. (Compacting the tier may moot some of this by reducing the number of devices; we can migrate away from magnetic devices.)
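As a rough sketch of how the affected volumes could be enumerated, the snippet below lists every volume whose type is "standard". It assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) and that each volume's hostname lives in its `Name` tag. The tag convention is an assumption, not something this task confirms, and `find_magnetic_volumes` is a hypothetical helper name.

```python
def find_magnetic_volumes(ec2):
    """Return (volume_id, name, size_gib) for every "standard" volume.

    `ec2` is assumed to be a boto3 EC2 client; the Name tag lookup is an
    assumption about how the cluster labels its volumes.
    """
    paginator = ec2.get_paginator("describe_volumes")
    pages = paginator.paginate(
        Filters=[{"Name": "volume-type", "Values": ["standard"]}]
    )

    found = []
    for page in pages:
        for volume in page["Volumes"]:
            # Volumes without tags simply get an empty name.
            tags = {t["Key"]: t["Value"] for t in volume.get("Tags", [])}
            found.append(
                (volume["VolumeId"], tags.get("Name", ""), volume["Size"])
            )

    # Sort by name so db, repo, and bak volumes group together.
    return sorted(found, key=lambda row: row[1])
```

The filtering happens on the AWS side, so the result should only contain volumes that still need to be swapped.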

Event Timeline

I'm going to go through the volumes type-by-type instead of host-by-host, starting with the backup volumes (because those should be fine to detach as long as the backup isn't running). It looks like backups run daily at 2300, so that should give me plenty of time.

Test run (using AWS console)

  1. Unmount rbak009.phacility.com (oldest backup volume)
  2. Ensure volume is detached
  3. Create EBS snapshot of rbak009
  4. Create a new SSD volume of the same size from the snapshot, named rbak009-ssd.phacility.com
  5. Attach new volume
  6. Remount the volume, ensure mount point is correct, check filesystem
  7. Delete original volume
  8. Delete snapshot
  9. Rename rbak009-ssd.phacility.com to rbak009.phacility.com
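The API-side steps above might look roughly like the sketch below. It assumes `ec2` is a boto3 EC2 client, that the filesystem has already been unmounted on the host (step 1), that remounting and the filesystem check (step 6) happen on the host between the two functions, and that a volume's name is stored in its `Name` tag. The helper names `ssd_name`, `clone_volume_as_ssd`, and `retire_original` are hypothetical.

```python
def ssd_name(name):
    """rbak009.phacility.com -> rbak009-ssd.phacility.com"""
    host, dot, rest = name.partition(".")
    return f"{host}-ssd{dot}{rest}"


def clone_volume_as_ssd(ec2, volume_id, name):
    """Steps 2-5: detach, snapshot, create a gp2 copy, attach the copy.

    Returns (new_volume_id, snapshot_id).
    """
    old = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    attachment = old["Attachments"][0]  # assumes exactly one attachment

    # Step 2: detach and wait until the old volume reports "available".
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Step 3: snapshot the old volume and wait for the snapshot to complete.
    snapshot_id = ec2.create_snapshot(
        VolumeId=volume_id, Description=f"{name} magnetic-to-SSD swap"
    )["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Step 4: same size, same zone, SSD type, temporary "-ssd" name.
    new_id = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=old["AvailabilityZone"],
        Size=old["Size"],
        VolumeType="gp2",
        TagSpecifications=[{
            "ResourceType": "volume",
            "Tags": [{"Key": "Name", "Value": ssd_name(name)}],
        }],
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_id])

    # Step 5: attach to the same instance at the same device path.
    ec2.attach_volume(
        VolumeId=new_id,
        InstanceId=attachment["InstanceId"],
        Device=attachment["Device"],
    )
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[new_id])
    return new_id, snapshot_id


def retire_original(ec2, old_id, new_id, snapshot_id, name):
    """Steps 7-9: run only after the new volume has been checked by hand."""
    ec2.delete_volume(VolumeId=old_id)
    ec2.delete_snapshot(SnapshotId=snapshot_id)
    ec2.create_tags(Resources=[new_id], Tags=[{"Key": "Name", "Value": name}])
```

Splitting the work into two functions keeps the destructive steps (deleting the original volume and the snapshot) behind an explicit second call, so nothing is removed until the new volume has been mounted and inspected.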

Everything run (using API)

  1. One volume at a time, repeat the above steps using the AWS API, pausing after step 5 for hand-checking that the new volume came up correctly
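A minimal sketch of the one-at-a-time loop with the manual pause might look like this. The `replace` and `finish` callables are hypothetical stand-ins for "steps 1 through 5" and "the remaining steps" respectively; nothing here talks to AWS directly, so the driver can be exercised with stubs before pointing it at real volumes.

```python
def migrate_one_at_a_time(volume_ids, replace, finish, confirm=input):
    """Process volumes serially, pausing for a hand-check after step 5.

    replace(volume_id)        -> handle for the new volume (steps 1-5)
    finish(volume_id, handle) -> None (the remaining steps)
    confirm(prompt)           -> operator's answer; defaults to input()

    Stops at the first volume the operator does not approve and returns
    the list of volume IDs that were fully migrated.
    """
    migrated = []
    for volume_id in volume_ids:
        handle = replace(volume_id)

        answer = confirm(
            f"New volume for {volume_id} is attached. Looks correct? [y/N] "
        )
        if answer.strip().lower() != "y":
            # Leave the original volume and snapshot in place for recovery.
            break

        finish(volume_id, handle)
        migrated.append(volume_id)
    return migrated
```

Stopping on the first rejected volume, instead of skipping it and continuing, keeps the blast radius to a single host if something about the procedure turns out to be wrong.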

Backups should run continuously (starting 12 hours after the instance launches, then every 24 hours after that):

https://secure.phabricator.com/diffusion/SAAS/browse/master/src/applications/instances/editor/InstancesInstanceEditor.php$471-481

However, they'll retry if they fail, so this should be safe.
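To make the schedule concrete: if the first backup runs 12 hours after launch and then every 24 hours, the next run time for an instance can be estimated as below. This is only a sketch of the arithmetic described here, not the logic in InstancesInstanceEditor.php, and `next_backup` is a hypothetical helper.

```python
from datetime import datetime, timedelta

FIRST_DELAY = timedelta(hours=12)  # first backup: 12 hours after launch
INTERVAL = timedelta(hours=24)     # then every 24 hours


def next_backup(launched, now):
    """Return the first scheduled backup time at or after `now`."""
    first = launched + FIRST_DELAY
    if now <= first:
        return first
    elapsed = now - first
    # Ceiling division: number of whole intervals needed to reach `now`.
    runs = -(-elapsed // INTERVAL)
    return first + runs * INTERVAL
```

Because the schedule is relative to each instance's launch time, backup windows are spread across the day, so there may not be a single quiet period in which every bak volume is idle.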

Since I imagine we'll just rebuild everything "fairly soon" to rebalance/compact/NAT, and you can't use this strategy for non-bak volumes, I'm not sure it's worth doing by hand, though.

Oh, I'm happy to kick this down the road until we go through the Big Compaction. I just saw this task as a dependency for T13076, which is on this week's planning board, and figured I'd tackle it. Is there a different task you think I should work on instead to move T13076 forward?

I think all the T13076 stuff is blocked on me for now, unless you're feeling especially ambitious about trying to get bin/host download working on 2GB+ files (T12907 / D19011). That might be a bit of a mess of a task, though, since I think I have a lot of secret cURL knowledge from over the years that is only somewhat documented in HTTPSFuture.