Recent changes to Drydock cause each lease to keep a list of the resources it has attempted to allocate. If the Drydock state never allows resources to come up in a good state, the lease may continue trying to allocate resources forever. Each allocation attempt adds an item to the list and executes SQL like this:
mysql> UPDATE ... SET attributes = '{"allocatedResourcePHIDs": [A, B, C, D, E, ...]}' WHERE ...;
This list grows progressively longer and these queries eventually become very large.
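To make the growth concrete, here is a rough back-of-the-envelope sketch. It assumes a single lease, that the k-th UPDATE carries k list entries, and that each write is logged with the full attribute value. The function name and the 32-byte entry size are illustrative assumptions, not figures from the incident.

```python
def total_bytes_written(attempts: int, bytes_per_entry: int) -> int:
    """Total payload bytes across all UPDATEs when the k-th query carries k entries."""
    # Sum of k * bytes_per_entry for k = 1..attempts (an arithmetic series).
    return bytes_per_entry * attempts * (attempts + 1) // 2


if __name__ == "__main__":
    # bytes_per_entry=32 is an assumed size for one list entry.
    for attempts in (1_000, 10_000, 100_000):
        total = total_bytes_written(attempts, bytes_per_entry=32)
        print(f"{attempts:>7} attempts -> {total / 1e9:.3f} GB")
```

Under this model the total written grows with the square of the number of attempts, so ten times as many failures produce roughly a hundred times as much binlog data.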
Phacility instances run with the default MySQL binlog retention policy (30 days). In PHI2199 and PHI2200, an instance filled up ~192GB of a ~256GB volume with binlogs of these queries by failing to build a resource 1.6M+ times.
Likely remedies:
- Phacility instances should not retain 30 days of binlogs. In most cases binlogs could probably be disabled entirely, but retaining 24h is probably reasonable.
- The binlog format should probably be MIXED. This is what services actually performing replication use, and is generally more efficient (although not necessarily for this particular query).
- Drydock should probably store this information in a separate table, not as object properties.
- Drydock should have some failsafes against infinite failure loops (e.g., fail a lease permanently if it encounters three failures from the same blueprint or something).
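As a minimal sketch of the last remedy, the logic could look something like the following. This is not Drydock code: the class name, method names, and blueprint identifiers are hypothetical, and the threshold of three simply mirrors the example above.

```python
from collections import Counter


class LeaseFailureGuard:
    """Hypothetical per-lease failsafe: give up after repeated failures from one blueprint."""

    def __init__(self, max_failures_per_blueprint: int = 3) -> None:
        self.max_failures = max_failures_per_blueprint
        # One counter per blueprint, so state does not grow with each attempt.
        self.failures: Counter = Counter()
        self.broken = False

    def record_failure(self, blueprint_id: str) -> bool:
        """Record one failed allocation; return True if the lease should now fail permanently."""
        self.failures[blueprint_id] += 1
        if self.failures[blueprint_id] >= self.max_failures:
            self.broken = True
        return self.broken

    def may_retry(self) -> bool:
        """Whether the lease is still allowed to attempt another allocation."""
        return not self.broken


if __name__ == "__main__":
    guard = LeaseFailureGuard()
    for attempt in range(1, 6):
        if not guard.may_retry():
            print(f"attempt {attempt}: lease failed permanently, not retrying")
            break
        guard.record_failure("blueprint-A")  # simulate a failed allocation
```

Storing a count per blueprint rather than a list of every attempted resource also keeps the stored state at a bounded size, which sidesteps the growing-query problem independently of the binlog settings.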