The database backups on admin run around 6:30 PM PST and currently cause sustained load for roughly 10 minutes. During this window, instance performance can be negatively impacted.
Reduce Backup Size
The most direct lever we can pull is to reduce the size of this backup so it runs for a shorter period of time. From bin/storage probe, the largest databases and tables are:
```
...
file_storageblob                      477.5 MB     3.3%
admin_conduit                       1,050.5 MB     7.2%
  conduit_certificatetoken               0.0 MB     0.0%
  conduit_token                           0.3 MB     0.0%
  conduit_methodcalllog               1,050.2 MB     7.2%
admin_leylines                     12,355.2 MB    84.6%
  leylines_checkpoint                     0.0 MB     0.0%
  leylines_checkpointtransaction          0.1 MB     0.0%
  leylines_dimension                     26.0 MB     0.2%
  leylines_event                     12,329.0 MB    84.4%
TOTAL                              14,601.7 MB   100.0%
```
- More than 90% of this data is basically junk we don't need to be backing up.
- The Leylines table is just user tracking for ad clicks.
- Short term: archive the data and truncate the table (a rough sketch follows this list).
- Mid term: I think we can filter and throw away most of this data (a lot of it is, e.g., crawlers hitting one page and not setting cookies).
- Long term: Probably move "data warehousing" to separate hardware with more appropriate backup behavior.
- The conduit_methodcalllog table already uses a 7-day retention window, but instances make a large number of calls to pull service metadata, so it still grows large.
- Short term: This table has little value. We could reduce the window to 24 hours without really losing anything (also sketched after this list).
- The file blob table (file_storageblob) is also fairly large. It looks like S3 is configured, but there may be some files we can get rid of or migrate (e.g., old data from before S3 was set up), and/or the MySQL maximum storage size might be set a little high.
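As a rough sketch of the short-term Leylines step, assuming the tables live in the admin_leylines database on the local MySQL, that connection credentials come from the usual defaults file, and that /backup/archive is just a placeholder for wherever we park cold data:

```
# Archive only the oversized event table, compressed, before emptying it.
mysqldump --single-transaction admin_leylines leylines_event \
  | gzip > /backup/archive/leylines_event-$(date +%Y%m%d).sql.gz

# After verifying the archive is readable, reclaim the space. TRUNCATE resets
# the table immediately and avoids a huge row-by-row DELETE.
mysql admin_leylines -e 'TRUNCATE TABLE leylines_event;'
```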
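For the Conduit call log, the proper fix is presumably to drop the retention window in the garbage collector from 7 days to 24 hours. As a one-off stopgap, and assuming the table has the standard epoch dateCreated column, something like this would trim it by hand:

```
# Delete call log rows older than 24 hours. In practice, delete in batches
# (e.g., LIMIT in a loop) to avoid one very long transaction.
mysql admin_conduit -e '
  DELETE FROM conduit_methodcalllog
  WHERE dateCreated < UNIX_TIMESTAMP(NOW() - INTERVAL 1 DAY);'
```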
Reduce Backup Impact
Backups are written to a separate bak volume so the writes shouldn't cause I/O contention, in theory. The read from the database will cause some general load but I don't think there's a ton we can do about that.
I suspect gzip may actually be responsible for a big chunk of the load, since it is CPU-bound and single-threaded. We could verify that, then try the following (both sketched below):
- Using nice to reduce the CPU priority of gzip.
- Using faster but less-thorough compression (gzip --fast).
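Both are one-line changes if the backup pipes a dump through gzip, which is an assumption about how the script is currently wired; the /bak path and filename are placeholders:

```
# Run compression at the lowest CPU priority and the fastest level so the
# backup competes less with instance traffic. The dump command is a stand-in
# for whatever the backup script actually invokes.
mysqldump --single-transaction --all-databases \
  | nice -n 19 gzip --fast \
  > /bak/admin-$(date +%Y%m%d).sql.gz
```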
Reduce Load Impact
We could reduce the impact that load on admin has on instances, so that high admin load has less effect on other services.
- The almanac.service.search method is currently much slower than it needs to be; work on fixing it has started in T12297. I believe there are substantial improvements to be made here.
- We could put fancier things in place here (timeouts, a separate cache layer), but I think these add complexity and are overreactions to the immediate issue. When we eventually make this component more sophisticated, I think we'll be picking a solution against a different set of problems -- mostly, the ability to push configuration changes quickly so we can, e.g., blacklist a client address with a single configuration change, or quickly move instances into and out of read-only/maintenance mode during deployments.
Align Risk and Response
We can move this process from 6:30 PM PST (lower staff availability) to 11:00 AM PST (better staff availability) to improve our ability to respond to any future issues.
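Assuming this is driven by a cron entry and the host clock is UTC (and ignoring DST, so PST = UTC-8), the move is a one-line change; the script path here is a placeholder:

```
# Before: 6:30 PM PST is 02:30 UTC.
# 30 2 * * *  root  /core/bin/backup-admin
# After: 11:00 AM PST is 19:00 UTC.
0 19 * * *  root  /core/bin/backup-admin
```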
Long Term: Dump Snapshots from Replicas
The long-term fix here is to dump snapshots from replicas, not masters. There's currently no admin replica, but we could move toward a multi-node admin tier. This is not completely trivial since db + repo + web are all overlaid, but it's likely not very difficult given that we have run a similar setup on secure for quite a while without any real issues.
These things would also need to become replica-aware:
- The backup process needs to know that it should be acting on the replica, not the master.
- The backup process should probably sanity-check the replica (actively replicating, pointed at the right master, and not too far behind it) before running, so we don't get a green light from backups and later discover we actually made stale backups of the wrong host (a rough sketch follows this list).
- The provisioning process needs to know that bak volumes only mount on replicas when a master + replica is configured.
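A rough sketch of the replica sanity check described above, using the standard SHOW SLAVE STATUS fields; the expected master hostname, lag threshold, connection defaults, and output path are all placeholders:

```
#!/bin/sh
# Refuse to dump from a replica unless it is actively replicating, pointed at
# the expected master, and not too far behind it.
set -e

EXPECTED_MASTER="admin-db-master.example.com"  # placeholder
MAX_LAG_SECONDS=300                            # placeholder

STATUS=$(mysql -e 'SHOW SLAVE STATUS\G')

echo "$STATUS" | grep -q 'Slave_IO_Running: Yes'  || { echo 'IO thread not running';  exit 1; }
echo "$STATUS" | grep -q 'Slave_SQL_Running: Yes' || { echo 'SQL thread not running'; exit 1; }
echo "$STATUS" | grep -q "Master_Host: ${EXPECTED_MASTER}" || { echo 'wrong master'; exit 1; }

LAG=$(echo "$STATUS" | awk '/Seconds_Behind_Master/ {print $2}')
[ "$LAG" != "NULL" ] && [ "$LAG" -le "$MAX_LAG_SECONDS" ] \
  || { echo "replica lag too high: $LAG"; exit 1; }

# Only dump once every check passes.
mysqldump --single-transaction --all-databases \
  | gzip > /bak/admin-replica-$(date +%Y%m%d).sql.gz
```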