Drydock resource accounting may put significant stress on the MySQL binlog if a lease is unsatisfiable
Open, NormalPublic

Assigned To

Authored By

	epriestley
	Jun 7 2022, 3:04 AM

Description

See PHI2199. See PHI2200.

Recent changes to Drydock cause each lease to keep a list of resources it has attempted to allocate. If the Drydock state never allows resources to come up in a good state, the lease may continue trying to allocate resources forever. Each allocation attempt adds an item to the list and executes SQL like this:

mysql> UPDATE ... SET attributes = '{"allocatedResourcePHIDs": [A, B, C, D, E, ...]}' WHERE ...;

This list grows progressively longer and these queries eventually become very large.

Phacility instances run with the default MySQL binlog retention policy (30 days). In PHI2199 and PHI2200, an instance filled up ~192GB of a ~256GB volume with binlogs of these queries by failing to build a resource 1.6M+ times.
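To make the growth concrete, here is a rough back-of-the-envelope model. It is not Drydock code. It assumes a single lease, that each attempt rewrites the full list, and that each list entry costs about 33 bytes (a 30-character PHID plus quotes and a comma); the function names and the byte size are illustrative assumptions. Under those assumptions, the total volume written grows quadratically with the number of failed attempts.

```python
# Rough model, not Drydock code: every name and size here is an assumption.
BYTES_PER_PHID = 33  # assumed: 30-character PHID plus quotes and a comma


def bytes_for_attempt(k: int, bytes_per_phid: int = BYTES_PER_PHID) -> int:
    """Approximate size of the attributes value written on the k-th attempt."""
    return k * bytes_per_phid


def cumulative_bytes(attempts: int, bytes_per_phid: int = BYTES_PER_PHID) -> int:
    """Total bytes written across attempts 1..attempts.

    This is the closed form of sum(k * bytes_per_phid for k in 1..attempts).
    """
    return bytes_per_phid * attempts * (attempts + 1) // 2


if __name__ == "__main__":
    # Print (attempts, size of the latest write, total written so far).
    for n in (100, 1_000, 10_000):
        print(n, bytes_for_attempt(n), cumulative_bytes(n))
```

With these assumed sizes, 10,000 failed attempts on one lease would write roughly 1.65 GB in total. The real figures depend on the binlog format, on per-statement overhead, and on how the failures were spread across leases, none of which this model captures.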

Likely remedies:

Phacility instances should not retain 30 days of binlogs. In most cases binlogs could probably be disabled entirely, but retaining 24h is probably reasonable.
The binlog format should probably be MIXED. This is what services actually performing replication use, and is generally more efficient (although not necessarily for this particular query).
Drydock should probably store this information in a separate table, not as object properties.
Drydock should have some failsafes against infinite failure loops (e.g., fail a lease permanently if it encounters three failures from the same blueprint or something).
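The last remedy could look something like the sketch below. It is a minimal illustration in Python, not Drydock's actual implementation (Drydock is written in PHP). The class name, method name, and the threshold of three are hypothetical, with the threshold taken from the "three failures from the same blueprint" example above.

```python
from collections import Counter

# Threshold taken from the example in the task text; an assumption, not a tuned value.
MAX_FAILURES_PER_BLUEPRINT = 3


class LeaseFailureTracker:
    """Hypothetical helper: counts allocation failures per blueprint for one lease."""

    def __init__(self, limit: int = MAX_FAILURES_PER_BLUEPRINT) -> None:
        self.limit = limit
        self.failures = Counter()  # blueprint PHID -> number of failed allocations

    def record_failure(self, blueprint_phid: str) -> bool:
        """Record one failure; return True if the lease should now fail permanently."""
        self.failures[blueprint_phid] += 1
        return self.failures[blueprint_phid] >= self.limit


if __name__ == "__main__":
    tracker = LeaseFailureTracker()
    for attempt in range(1, 5):
        if tracker.record_failure("PHID-BLUE-example"):
            print(f"attempt {attempt}: giving up on this lease")
            break
        print(f"attempt {attempt}: will retry")
```

Keeping a small per-blueprint counter, rather than an ever-growing list of resource PHIDs, also bounds the size of whatever state is written on each failure.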

Revisions and Commits

Restricted Diffusion Commit

Event Timeline

epriestley triaged this task as Normal priority. Jun 7 2022, 3:04 AM

epriestley created this task.

epriestley added a commit: Restricted Diffusion Commit. Jun 13 2022, 12:56 PM

The drydock_resource table could use a (status, ...) key to satisfy common/default queries.

An earlier patch here (rCORE6d6170f76463) swapped binlogs to MIXED and set a 24-hour retention policy. This issue has not reoccurred in the cluster since that patch went out, but the root causes remain unresolved.
