Friday, September 29
An instance reported slow performance around noon (see PHI96).
I connected to the shard (db009) and identified multiple simultaneous backup processes running in ps auxwww. I killed them and started a new clean one, under the theory that this might be mostly a load issue, perhaps related to the Ferret engine indexing backlog, and that reducing the load might restore things to normal operation. The new backup dumped quickly at first but slowed down dramatically as time went on, and ultimately did not complete in the next 16 hours.
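The cleanup was roughly the following (hedged: I'm assuming the stuck backups surface as mysqldump processes; the actual commands went through the normal backup tooling on the host):

```
# Find the overlapping backup processes. The [m] trick keeps the grep
# itself out of the output.
ps auxwww | grep '[m]ysqldump'

# Kill all of them, then kick off a single clean backup run.
pkill -f mysqldump
```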
I also observed that the instance had 70GB of MySQL data: 50GB in repository_commit_fngrams, 10GB in repository_commit_ffield, and 10GB of actual data in all other tables. The instance also had something in the realm of 600,000 SearchWorker tasks queued. The instance has approximately 800,000 indexable commits.
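The table sizes and queue depth came from queries along these lines (information_schema numbers are estimates for InnoDB, but plenty to spot a 50GB table; "myinstance" is a placeholder for the instance's database prefix):

```
# Largest tables on the shard, by approximate on-disk size.
mysql -e "
  SELECT table_schema, table_name,
         ROUND((data_length + index_length) / POW(1024, 3), 1) AS size_gb
    FROM information_schema.tables
   ORDER BY data_length + index_length DESC
   LIMIT 10;"

# Queued worker tasks for the instance.
mysql -e "SELECT COUNT(*) FROM myinstance_worker.worker_activetask;"
```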
Saturday, September 30
I observed that the backup was still ongoing after approximately 16 hours and killed it. db009 had a magnetic volume at this point (T12999) and I suspected this might be a storage I/O issue. I snapshotted the magnetic volume (this took maybe two hours), then I allocated a new SSD volume and performed a volume swap on the host with bin/swap (this also took maybe two hours).
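bin/swap wraps the mechanics, but the underlying EBS operations look roughly like this (the volume ID, availability zone, and size are placeholders, not the real values):

```
# Snapshot the existing magnetic ("standard") volume before touching it.
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "db009 pre-swap snapshot"

# Create a replacement gp2 (SSD) volume in the same availability zone;
# bin/swap then handles detaching, attaching, and remounting.
aws ec2 create-volume \
  --availability-zone us-west-1a \
  --size 256 \
  --volume-type gp2
```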
I also suspected (or, perhaps, hoped) this might be some flavor of table fragmentation issue, since the 50GB fngrams table seemed obviously wrong given that it appeared that fewer than 200,000 commits had been indexed. However, the fdocument table actually has all ~800,000 entries, so the queued tasks were more likely reindexes. The exact origin of those tasks remains unclear, but I think this table size is actually about what we would predict for that number of documents.
Locally, we have 13,000 tasks and a ~1GB fngrams table. This predicts an (800,000 / 13,000) * 1GB = 61GB fngrams table for 800,000 objects. So this is at least in the ballpark.
After swapping to an SSD, I began running OPTIMIZE TABLE under the "some sort of table fragmentation" theory. This moved forward, albeit very slowly (approximately 30x slower than optimizing similar tables on secure).
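For reference, the optimize pass was roughly this. On InnoDB, OPTIMIZE TABLE rebuilds the table into an intermediate file and swaps it in, which is why progress shows up as a growing tempfile on disk (the schema name and datadir path are assumptions):

```
# "myinstance" is a placeholder for the instance's database prefix.
mysql myinstance_repository -e "OPTIMIZE TABLE repository_commit_fngrams;"

# Rough progress check: watch the intermediate rebuild file grow.
watch -n 10 'ls -lh /var/lib/mysql/myinstance_repository/#sql-*'
```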
Somewhere around this time (mid-morning), @amckinley noticed that the SSD burst I/O budget had been completely consumed for the volume. I swapped the volume to io1 with a ~10x larger allocation of iops, which appeared to have some sort of effect, but not much of one.
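The conversion used EBS elastic volume modification, which changes the volume type in place without a detach (the IOPS figure here is illustrative):

```
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type io1 \
  --iops 10000

# Monday's revert is the same call with --volume-type gp2 and no --iops.
```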
We also observed (via iotop) a sustained 80MB/sec I/O write rate coming out of MySQL. This write rate was sustained even when the daemons were stopped (that is, no new indexing tasks were running), but very little of it appeared to be going to the OPTIMIZE TABLE (which was growing the tempfile on disk at less than 4MB/sec).
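This was visible with batch-mode iotop (and pidstat as a cross-check), with an invocation along these lines:

```
# Only show processes actually doing I/O, batch mode, 5-second samples.
iotop -o -b -d 5 -n 12

# Per-process disk I/O rates from sysstat.
pidstat -d 5 12
```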
I'm not familiar with tools for inspecting MySQL I/O activity, the actual load on the server didn't make sense as a source of that much I/O, and Googling didn't turn up much of use. I suspected some kind of memory/swap/paging issue with the size of the fngrams table, although I'm not sure this makes much sense (why would this cause writes?). Regardless, the fngrams table still seemed like the unusual element -- and the most likely culprit.
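Two things that might have helped attribute the writes, noted here for next time rather than used during the incident: sampling MySQL's own write counters, and (if performance_schema is enabled) per-file write totals.

```
# MySQL's view of bytes written; sample twice and diff to get a rate.
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_data_written';"
sleep 10
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_data_written';"

# Per-file write attribution, if performance_schema is on.
mysql -e "
  SELECT file_name,
         ROUND(SUM_NUMBER_OF_BYTES_WRITE / POW(1024, 2)) AS mb_written
    FROM performance_schema.file_summary_by_instance
   ORDER BY SUM_NUMBER_OF_BYTES_WRITE DESC
   LIMIT 10;"
```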
I truncated the fngrams table, the ffield table, and the worker_activetask table, then optimized them (roughly the SQL sketched after the list below), removing the task queue and reducing the data size to approximately 10GB. I deployed db025 and migrated the instance via bin/move. This ran into several issues:
- Because of the dump's large size, downloading it from admin.phacility.com hit HTTP errors. This is probably a config issue, but see also T12907.
- The dump was more than 2GB, so bin/file download would have failed in HTTPSFuture anyway, since it could not store the file data in a PHP string.
- We've previously only used bin/move to move repositories (T12817). When used on databases, it fails to create credentials on the target host before connecting to the database. This is probably an easy fix.
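The truncation mentioned above was roughly this (database names are placeholders for the instance's actual schemas):

```
# "myinstance" is a placeholder for the instance's database prefix.
mysql myinstance_repository -e "
  TRUNCATE TABLE repository_commit_fngrams;
  TRUNCATE TABLE repository_commit_ffield;
  OPTIMIZE TABLE repository_commit_fngrams, repository_commit_ffield;"

mysql myinstance_worker -e "
  TRUNCATE TABLE worker_activetask;
  OPTIMIZE TABLE worker_activetask;"
```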
I worked around these issues with scp, bin/host upgrade, and a little tweaking of the code. When the instance came up on db025 I ran a new backup, which completed promptly. Load and I/O had vanished from both db009 and db025 and the system appeared stable.
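The manual path was roughly: copy the dump host-to-host instead of routing it through admin.phacility.com, load it on the target, then let bin/host upgrade bring the instance up (paths and filenames are placeholders):

```
# Copy the dump directly between shards.
scp db009:/tmp/myinstance.sql.gz db025:/tmp/myinstance.sql.gz

# Load it on the target, then run "bin/host upgrade" there.
ssh db025 'gunzip -c /tmp/myinstance.sql.gz | mysql'
```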
Sunday, October 1
Backups ran normally without intervention.
Monday, October 2
I reverted the io1 volumes to gp2 and wrote this up.