Description

A couple of instances inadvertently conspired to outpace log compaction on repo001. More as it develops...
Revisions and Commits
- Restricted Diffusion Commit (rP Phabricator)
- D13939: Allow transaction publishers to pass binary data to workers (rP3ef270b2926e)
Related Objects
Mentioned In
- D13927: Handle application transactions with null data.

Mentioned Here
- T9283: Expand Phacility cluster repo and db tiers (add nodes: 003)
- D13939: Allow transaction publishers to pass binary data to workers
- T7338: Build more status tools for monitoring Phabricator cluster health
- T8672: PhabricatorTaskmasterDaemon fatals in PublishWorker
Event Timeline
I cycled the host to trigger compaction and bring everything back up, which seems to have resolved the initial issue.
Some contributing issues:
- We don't reserve a dedicated log volume, so logs end up on the root volume.
- This volume is relatively small.
- When this volume fills up, it can cause other sorts of errors (for example, bash completion wasn't working before I triggered compaction).
- About 1% of the instances on this host are using >50% of the log storage (a per-instance usage check like the sketch after this list makes these easy to spot).
- The heaviest log is full of errors from T8672.
- The second and third heaviest logs seem to be mostly related to bad repository configuration (one or more repositories retrying pulls over and over again).
- Our proactive monitoring of these kinds of issues is relatively weak (see T7338). They're easy to recover from, but we learn about them from users more often than we should.
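As a reference point, here's a minimal sketch of the kind of per-instance check that surfaces the heavy logs. The directory layout (`/var/log/instances/<instance>/`) and the script itself are illustrative assumptions, not the actual Phacility layout or tooling.

```python
#!/usr/bin/env python
# Minimal sketch: report which instances' logs dominate a shared log directory.
# The per-instance layout under LOG_ROOT is a hypothetical example.

import os

LOG_ROOT = '/var/log/instances'  # hypothetical per-instance log root


def directory_size(path):
    """Sum file sizes under `path`, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # file vanished (e.g., rotated away) between listing and stat
    return total


def main():
    sizes = {}
    for instance in os.listdir(LOG_ROOT):
        instance_path = os.path.join(LOG_ROOT, instance)
        if os.path.isdir(instance_path):
            sizes[instance] = directory_size(instance_path)

    grand_total = sum(sizes.values()) or 1
    # Print the heaviest instances first, with their share of total log usage.
    heaviest = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]
    for instance, size in heaviest:
        print('%-32s %8.1f MB  %5.1f%%' % (
            instance, size / 1e6, 100.0 * size / grand_total))


if __name__ == '__main__':
    main()
```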
I'm planning to take these corrective actions:
- Move logs to bak storage volumes. These are ubiquitous, suitable, and amply sized, and won't interact with normal system operation.
- Fix T8672. This is on my relatively-short-list anyway.
- Look at increasing backoff after pull failures. This should be a small adjustment that can prevent the main routine source of useless logs (see the backoff sketch after this list).
- Look at increasing normal worker backoff. We could back off more quickly and further in repeated failure cases like T8672 without any real cost.
- The real categorical fix is better proactive monitoring (per T7338), but that's involved and some ways away. In particular, both "free disk" and "queue backlog" alarms could have detected this.
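For the two backoff items above, here's a minimal sketch of the kind of adjustment being discussed: grow the retry delay faster after repeated failures and cap it, so a repository that keeps failing to pull (or a worker that keeps fataling, as in T8672) stops producing a steady stream of log noise. The constants are illustrative, not Phabricator's actual scheduling parameters.

```python
def retry_delay(failure_count,
                base_delay=15,       # seconds before the first retry (assumed)
                growth=2.0,          # multiply the delay after each failure (assumed)
                max_delay=6 * 3600   # never wait longer than six hours (assumed)
                ):
    """Return how long to wait before the next retry, in seconds."""
    delay = base_delay * (growth ** max(failure_count - 1, 0))
    return min(delay, max_delay)


if __name__ == '__main__':
    # With these constants, ten consecutive failures back off to ~2 hours,
    # and by the twelfth failure the delay sits at the six-hour cap.
    for failures in range(1, 13):
        print('failure %2d -> wait %6d seconds' % (failures, retry_delay(failures)))
```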
I think the current compaction rules are generally fine as-is. We only missed routine compaction by about 18 hours, and T8672 is exceptionally noisy. As currently allocated, the bak volume on this host has ~300x more free space than the root volume, so we could have run for ~6 years without filling it up. So this is mostly an issue of increasing headroom in the short term and improving proactive monitoring in the longer term.
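For rough context on the headroom math, here's a back-of-the-envelope version of that estimate; the root-volume fill time is an assumption chosen to make the arithmetic concrete, not a measured figure.

```python
# Assumed: the root volume's free space could absorb the observed log rate for
# roughly a week before filling. A volume with ~300x that free space would then
# last about 300 weeks at the same rate.
ROOT_FILL_DAYS = 7         # assumed days of logging the root volume could absorb
BAK_HEADROOM_FACTOR = 300  # bak volume has ~300x the root volume's free space

bak_fill_days = ROOT_FILL_DAYS * BAK_HEADROOM_FACTOR
print('~%.1f years of headroom' % (bak_fill_days / 365.0))  # ~5.8 years
```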
D13939 will fix T8672, the most immediate cause of high log volumes, but I'm going to hold it until after we cut the release since it also raises the urgency of potential existing errors.
Log volumes are currently all under 50% full, so I don't expect I'll need to perform any manual intervention this week. Ideally I'll get logs moved to bak (or maybe new dedicated log) volumes in the next couple of days, but things look stable, so that may slip depending on what else crops up.
I've added support for dedicated log volumes (see commit above). Only repo class hosts currently have much use for them, although I'd expect to deploy them more widely eventually.
I'm attaching and swapping logs over to these volumes now.
These things have been fixed:
- All repo hosts got rlog volumes. Haven't seen any issues here, and tier expansion in T9283 went cleanly.
- D13939 is now deployed, and log volume is back down to a trickle.
These haven't been fixed yet:
- Tweaking pull/worker backoff. This would be nice to have but the volume isn't high enough to cause concern.
- In the long term, more proactive monitoring of cluster health (T7338).
We haven't experienced further issues here since the initial compaction, and nothing has needed manual attention since attaching the rlog volumes.