Description

A couple of instances inadvertently conspired to outpace log compaction on repo001. More as it develops...
Revisions and Commits
- Restricted Diffusion Commit (rP Phabricator)
- D13939: Allow transaction publishers to pass binary data to workers (rP3ef270b2926e)
Related Objects
Mentioned In
- D13927: Handle application transactions with null data.

Mentioned Here
- T9283: Expand Phacility cluster repo and db tiers (add nodes: 003)
- D13939: Allow transaction publishers to pass binary data to workers
- T7338: Build more status tools for monitoring Phabricator cluster health
- T8672: PhabricatorTaskmasterDaemon fatals in PublishWorker
Event Timeline
I cycled the host to trigger compaction and bring everything back up, which seems to have resolved the initial issue.
Some contributing issues:
- We don't reserve a dedicated log volume, so logs end up on the root volume.
- This volume is relatively small.
- When this volume fills up, it can cause other sorts of errors (for example, bash completion wasn't working before I triggered compaction).
- About 1% of the instances on this host are using >50% of the log storage (a per-instance usage check like the sketch after this list makes these easy to spot).
- The heaviest log is full of errors from T8672.
- The second and third heaviest logs seem to be mostly related to bad repository configuration (one or more repositories retrying pulls over and over again).
- Our proactive monitoring of these kinds of issues is relatively weak (see T7338). They're easy to recover from, but we learn about them from users more often than we should.
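As a reference point, here's a minimal sketch of the kind of per-instance check that surfaces the heavy logs. The directory layout (`/var/log/instances/<instance>/`) and the script itself are illustrative assumptions, not the actual Phacility layout or tooling.

```python
#!/usr/bin/env python
# Minimal sketch: report which instances' logs dominate a shared log directory.
# The per-instance layout under LOG_ROOT is a hypothetical example.

import os

LOG_ROOT = '/var/log/instances'  # hypothetical per-instance log root


def directory_size(path):
    """Sum file sizes under `path`, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished (e.g., rotated away) between listing and stat
    return total


def main():
    sizes = {}
    for instance in os.listdir(LOG_ROOT):
        instance_path = os.path.join(LOG_ROOT, instance)
        if os.path.isdir(instance_path):
            sizes[instance] = directory_size(instance_path)

    grand_total = sum(sizes.values()) or 1
    # Print the heaviest instances first, with their share of total log usage.
    heaviest = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]
    for instance, size in heaviest:
        print('%-32s %8.1f MB  %5.1f%%' % (
            instance, size / 1e6, 100.0 * size / grand_total))


if __name__ == '__main__':
    main()
```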
I'm planning to take these corrective actions:
- Move logs to bak storage volumes. These are ubiquitous, suitable, and amply sized, and won't interact with normal system operation.
- Fix T8672. This is on my relatively-short-list anyway.
- Look at increasing backoff after pull failures. This should be a small adjustment that can prevent the main routine source of useless logs (see the backoff sketch after this list).
- Look at increasing normal worker backoff. We could back off more quickly and further in repeated failure cases like T8672 without any real cost.
- The real categorical fix is better proactive monitoring (per T7338), but that's involved and some ways away. In particular, both "free disk" and "queue backlog" alarms could have detected this.
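For the two backoff items above, here's a minimal sketch of the kind of adjustment being discussed: grow the retry delay faster after repeated failures and cap it, so a repository that keeps failing to pull (or a worker that keeps fataling, as in T8672) stops producing a steady stream of log noise. The constants are illustrative, not Phabricator's actual scheduling parameters.

```python
def retry_delay(failure_count,
                base_delay=15,       # seconds before the first retry (assumed)
                growth=2.0,          # multiply the delay after each failure (assumed)
                max_delay=6 * 3600   # never wait longer than six hours (assumed)
                ):
    """Return how long to wait before the next retry, in seconds."""
    delay = base_delay * (growth ** max(failure_count - 1, 0))
    return min(delay, max_delay)


if __name__ == '__main__':
    # With these constants, ten consecutive failures back off to ~2 hours,
    # and by the twelfth failure the delay sits at the six-hour cap.
    for failures in range(1, 13):
        print('failure %2d -> wait %6d seconds' % (failures, retry_delay(failures)))
```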
I think the current compaction rules are generally fine as-is. We only missed routine compaction by about 18 hours, and T8672 is exceptionally noisy. As currently allocated, the bak volume on this host has ~300x more free space than the root volume, so we could have run for ~6 years without filling it up. So this is mostly an issue of increasing headroom in the short term and improving proactive monitoring in the longer term.
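For rough context on the headroom math, here's a back-of-the-envelope version of that estimate; the root-volume fill time is an assumption chosen to make the arithmetic concrete, not a measured figure.

```python
# Assumed: the root volume's free space could absorb the observed log rate for
# roughly a week before filling. A volume with ~300x that free space would then
# last about 300 weeks at the same rate.
ROOT_FILL_DAYS = 7         # assumed days of logging the root volume could absorb
BAK_HEADROOM_FACTOR = 300  # bak volume has ~300x the root volume's free space

bak_fill_days = ROOT_FILL_DAYS * BAK_HEADROOM_FACTOR
print('~%.1f years of headroom' % (bak_fill_days / 365.0))  # ~5.8 years
```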
D13939 will fix T8672, the most immediate cause of high log volumes, but I'm going to hold it until after we cut the release since it also raises the urgency of potential existing errors.
Log volumes are currently all under 50% full, so I don't expect I'll need to perform any manual intervention this week. Ideally I'll get logs moved to bak (or maybe new dedicated log) volumes in the next couple of days, but things look stable, so that may slip depending on what else crops up.
I've added support for dedicated log volumes (see commit above). Only repo class hosts currently have much use for them, although I'd expect to deploy them more widely eventually.
I'm attaching and swapping logs over to these volumes now.
These things have been fixed:
- All repo hosts got rlog volumes. Haven't seen any issues here, and tier expansion in T9283 went cleanly.
- D13939 is now deployed, and log volume is back down to a trickle.
These haven't been fixed yet:
- Tweaking pull/worker backoff. This would be nice to have but the volume isn't high enough to cause concern.
- In the long term, more proactive monitoring of cluster health (T7338).
We haven't experienced further issues here since the initial compaction, and nothing has needed manual attention since attaching the rlog volumes.