
Enable `innodb_file_per_table` and migrate existing databases in the cluster
Closed, Resolved (Public)

Description

I thought I did this a year ago, but apparently not. Without innodb_file_per_table, InnoDB stores the data for every table in one big shared tablespace (ibdata1), which only ever grows. With innodb_file_per_table, each table gets its own file instead.
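
(Enabling it is a one-line change in the MySQL configuration. The fragment below is a sketch of what that looks like; the exact config file and section are assumptions about a typical setup, not a description of our hosts.)

# Hypothetical my.cnf fragment:
[mysqld]
innodb_file_per_table = 1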

Migrating this is a huge pain, though -- especially for the shards that are already >50% utilized.


Event Timeline

I'm going to do this tonight/tomorrow, but this is going to be fairly involved. Current utilization is:

Shard | Data Volume Utilization
db001 | 69%
db002 | 35%
db003 | 35%
db004 | 63%
db005 | 77%
The primary sources of storage pressure to date have been:

  • Storing file data in the database instead of S3 (resolved by T7163).
  • A bug which caused one instance to generate 4GB of diff data (resolved by T10210).
  • Test instances not having an automated destruction workflow (mostly resolved by T10166, which makes them very easy to clean up manually).

For db002 and db003, I can just enable innodb_file_per_table, optimize all the tables, and we're good to go.

For the other shards, there is probably more data in the databases than will fit in the free space on disk. I'm going to attempt to verify this. If, for example, db005 really has less than 22% of a volume of "actual" data, then we could safely optimize it in-place. That's more likely to be the case for db004 and db001 than for db005, though.
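
One way to get a rough estimate of "actual" data size is to sum table sizes from information_schema. These numbers are InnoDB statistics and don't account for free space inside ibdata1, so treat them as approximate:

mysql -e "SELECT table_schema, ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS approx_gb FROM information_schema.tables GROUP BY table_schema ORDER BY approx_gb DESC"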

For db005, I think I need to attach a larger swap volume, copy the data over, cycle the volumes, then do the optimization.
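
If it comes to that, the shuffle would look roughly like this. The device name, mount points, and service name are placeholders, not necessarily what the hosts actually use:

sudo service mysql stop
sudo mkfs.ext4 /dev/xvdf                        # hypothetical: the newly attached, larger volume
sudo mount /dev/xvdf /mnt/new-data
sudo rsync -a /var/lib/mysql/ /mnt/new-data/    # copy the MySQL datadir across
# ...then swap the mounts (or repoint datadir) and restart MySQL on the new volume.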

This may actually be not-too-terrible. I think there's a good chance that db005 really does have less than 22% of a volume of "actual" data, since the compressed size of backups for all active instances on the shard is only about 300MB. We had a bunch of heavy test instance churn on that shard for whatever reason.

db004 has heftier backups but the two largest instances there are "test" instances.

db003 and db002 are less than 50% utilized and certainly in the clear.

db001 has two large test instances that I can destroy and a relatively modest backup data size.

So I'm probably going to try to do everything in-place. Rough plan is:

  • Destroy all out-of-service test instances.
  • Destroy any huge out-of-service "test" instances.
  • Restart MySQL to pick up innodb_file_per_table.
  • Optimize all tables (mysqlcheck -o --all-databases).
  • Nuke ibdata1.

I'll start with db002 and db003, which don't run any risk of running out of disk space by surprise. If backup sizes turn out not to be good predictors of final data size on these simpler cases, I can do the fuller volumes with swap volumes; otherwise I can proceed in-place.

I am beginning this on admin001 now.


Prior to optimization, admin001 was 46% utilized (27GB of database data) with a 434M total backup size.

Optimization took 1m57s.

Utilization increased to 49% after optimization (an increase of roughly 2GB; the rebuilt tables now live in their own .ibd files, but ibdata1 never shrinks, so total usage goes up).

Removing ibdata1 after optimizing all the tables completely broke MySQL: the InnoDB data dictionary (among other things) still lives in the shared tablespace, so deleting it orphans every table even though their data is now in per-table files. Putting it back fixed things. It looks like I need to dump + restore instead.

I disabled network connections to MySQL and restarted it.

I dumped everything with:

mysqldump --hex-blob --single-transaction --default-character-set=utf8 --all-databases | gzip > everything.sql.gz

This took about 1m9s and produced a 435MB archive.

I stopped MySQL, removed the ib* files and restarted MySQL. It recreated them and came up cleanly.
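
Concretely, that step is roughly this (datadir path and service name are assumptions; the number of ib_logfile files depends on configuration):

sudo service mysql stop
sudo rm /var/lib/mysql/ibdata1 /var/lib/mysql/ib_logfile*    # shared tablespace + redo logs
sudo service mysql start                                     # recreates them, now empty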

I imported the full dump again:

zcat everything.sql.gz | mysql -u root

This took 2m46s.

I restored network connectivity and restarted MySQL. The server came up clean.

This left the volume 4% utilized (1.9GB).
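
As a quick sanity check (the database name below is just a placeholder), the setting should now be on and each table should have its own .ibd file:

mysql -e "SHOW VARIABLES LIKE 'innodb_file_per_table'"    # should report ON
ls /var/lib/mysql/<some_database>/                        # expect one .ibd file per table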

I'm beginning this on secure001 now (this host). Prior to maintenance, the volume is 52% utilized (16GB) with a 306M compressed backup. Service on this host will be briefly interrupted.

Er, I was reading the wrong backup filesize -- the actual size is 1.5GB. A big chunk of this is in parse caches, so I ran bin/cache purge --purge-all. I'm now doing the actual maintenance.

This host took 4m4s to dump and 9m36s to restore. Utilization is now 21% (6.1G).

I'm beginning this on db001 now. The high volume utilization shouldn't be an issue since the ultimate process doesn't involve doing anything in place.

On db001:

  • I destroyed all out-of-service test instances on the host.
  • I destroyed one unusually large out-of-service "test" instance.
  • This left us 67% utilized (40GB) with a 1.3GB compressed backup.
  • I disabled network traffic.
  • I dumped the databases, which took 35m44s (???).
  • I wiped the data and enabled per-file tables.
  • I'm currently reloading the databases, but I expect that to take a while.

I've started preparing db002 in the meantime:

  • The huge increase in dump/restore time for db001 vs secure001 seems to be primarily related to the number of databases/tables/rows, not the total data size. I'll attempt to mitigate this for the remaining shards.
  • On db002, I was significantly more aggressive in destroying suspended and disabled instances (primarily, I attempted to destroy "test" instances).
  • For "up" instances, I purged all caches.
  • db001's data compresses much better than admin001's or secure001's, so its backup size underestimated the restored data size and the restore ran out of free space on the volume. I've cleared space and restarted the restore.
  • db002 is starting at 31% utilized (19GB) with a 3GB compressed backup (a significant portion of this is unmigrated file data, though, which I expect to dump and restore relatively quickly).
  • I disabled network on db002 and am dumping the data.
  • The db002 dump completed in 7m31s, which is much more promising.
  • I wiped the data and enabled per-file tables.
  • I'm currently reloading the databases.
  • The db002 load completed in a relatively reasonable 17m50s.
  • Utilization is now 18% (11G). A sizable chunk of this is removable at some future date.
  • I'm in the process of preparing db003.
  • The db001 load is ongoing.
  • db003 has 31% utilization (19G) with a 250MB total compressed backup size.
  • Reassuringly, dump time was 1m49s.
  • Load time was 6m32s.
  • db003 is now 8% utilized (4.3G).
  • I'm prepping db004.
  • db001 is still loading.
  • I destroyed out-of-service instances on db004.
  • I disabled and destroyed two very large "test" instances which were still technically in service but had one user, no activity, and were long overdue on billing (just not >120 days overdue yet, so they hadn't triggered automatic suspension). One of them had some extra information which strongly suggested it was a "test" instance. If this was in error, they can be restored from backups later.
  • This leaves db004 59% utilized (35G) with a 1.1GB compressed backup size.
  • db004 dumped in 3m32s.
  • db004 is now loading.
  • I'm preparing db005.
  • db001 is still loading.
  • db004 loaded in 14m18s and is now 11% utilized (6.2G).
  • db005 is 77% utilized (46G) with a 250MB compressed backup.
  • db005 dumped in about 1m30s.
  • db005 loaded in 8m22s.
  • db005 is now 8% utilized (4.7GB).
  • db001 is still loading, and I think I pretty much just have to wait it out. The major issue is one enormous, disabled "test" instance, but there's no easy way to skip it.

db001 is still loading. It's through the particularly slow instance and I expect it to finish in the relatively near future.

Service on db001 has been restored, although there's a backlog of backups and daemon tasks slamming it, so it won't be very responsive until those drain.

I purged dead instances on db001 more aggressively, leaving us with this result:

Host      | Before | After | Reclaimed | Downtime
Support:
admin001  | 46%    | 4%    | 91%       | 15m
secure001 | 52%    | 21%   | 60%       | 15m
Shards:
db001     | 67%    | 19%   | 72%       | 9 HOURS
db002     | 31%    | 18%   | 42%       | 30m
db003     | 31%    | 8%    | 74%       | 15m
db004     | 59%    | 11%   | 82%       | 15m
db005     | 77%    | 8%    | 90%       | 15m

So everything basically went fine, except db001 was a horrible disaster.

The good news is that this is now fixed forever and we never have to deal with it again.

I've issued all instances a 72-hour credit for this disruption.