
Phacility Cluster: Daemons outpaced log compaction
Closed, ResolvedPublic

Description

A couple of instances inadvertently conspired to outpace log compaction on repo001. More as it develops...

Event Timeline

epriestley raised the priority of this task from to Normal.
epriestley updated the task description.
epriestley added projects: Phacility, Ops.
epriestley added a subscriber: epriestley.

I cycled the host to trigger compaction and bring everything back up, which seems to have resolved the initial issue.

Some contributing issues:

  • We don't reserve a dedicated log volume, so logs end up on the root volume.
    • This volume is relatively small.
    • When this volume fills up, it can cause other sorts of errors (for example, bash completion wasn't working before I triggered compaction).
  • About 1% of the instances on this host are using >50% of the log storage.
  • The heaviest log is full of errors from T8672.
  • The second and third heaviest logs seem to be mostly related to bad repository configuration (one or more repositories retrying pulls over and over again).
  • Our proactive monitoring of these kinds of issues is relatively weak (see T7338). They're easy to recover from, but we learn about them from users more often than we should.
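The skew described above (roughly 1% of instances consuming more than half of the log storage) is easy to quantify by summing file sizes per instance log directory. A minimal sketch, assuming a layout of one subdirectory per instance under a common log root (the layout and names are assumptions, not the cluster's actual structure):

```python
import os

def dir_size(path):
    """Total size in bytes of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def heaviest_instances(log_root, top=3):
    """Instance log directories under `log_root`, largest first."""
    sizes = {
        entry: dir_size(os.path.join(log_root, entry))
        for entry in os.listdir(log_root)
        if os.path.isdir(os.path.join(log_root, entry))
    }
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Running something like this is how you'd pick out the T8672-dominated log and the misconfigured-repository logs described above.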

I'm planning to take these corrective actions:

  • Move logs to bak storage volumes. These are ubiquitous, suitable, and amply sized, and won't interact with normal system operation.
  • Fix T8672. This is on my relatively-short-list anyway.
  • Look at increasing backoff after pull failures. This should be a small adjustment that can prevent the major routine source of useless logs.
  • Look at increasing normal worker backoff. We could back off more quickly and further in repeated failure cases like T8672 without any real costs.
  • The real categorical fix is better proactive monitoring (per T7338), but that's involved and some ways off. In particular, either a "free disk" alarm or a "queue backlog" alarm could have detected this.
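The two backoff actions above both amount to growing the retry delay on repeated failure. A minimal sketch of capped exponential backoff (the function name and constants are illustrative, not the daemons' actual tuning):

```python
# Capped exponential backoff for repeated pull/worker failures.
# Constants are illustrative, not the daemons' real values.
BASE_DELAY = 15        # seconds before the first retry
GROWTH = 2             # multiply the delay by this after each failure
MAX_DELAY = 6 * 3600   # never wait longer than 6 hours between retries

def retry_delay(failure_count):
    """Seconds to wait before the next attempt, given consecutive failures."""
    delay = BASE_DELAY * (GROWTH ** (failure_count - 1))
    return min(delay, MAX_DELAY)

# A repository that keeps failing to pull backs off quickly, so it stops
# emitting one log entry per retry:
print([retry_delay(n) for n in (1, 2, 3, 10)])  # [15, 30, 60, 7680]
```

The point is that a permanently-broken repository converges to a few retries per day instead of thousands, which directly caps its log volume.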

I think the current compaction rules are generally fine as-is. We only missed routine compaction by about 18 hours, and T8672 is exceptionally noisy. As currently allocated, the bak volume on this host has ~300x more free space than the root volume, so we could have run for ~6 years without filling it up. So this is mostly an issue of increasing headroom in the short term, and improving proactive monitoring in the longer term.
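As a back-of-envelope check on the headroom figures above (the per-week rate is inferred from the ~300x and ~6-year claims, not a measured value):

```python
# If the bak volume has ~300x the root volume's free space, and the root
# volume's free space covers roughly a week of logs at the observed rate,
# the bak volume covers roughly 300 weeks ~= 6 years. The rate here is an
# assumption used only to show the two figures are consistent.
root_headroom_days = 7.3   # assumed: root free space lasts ~a week
bak_multiplier = 300       # bak free space relative to root free space
bak_headroom_years = root_headroom_days * bak_multiplier / 365.25
print(round(bak_headroom_years, 1))  # 6.0
```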

D13939 will fix T8672, the most immediate cause of high log volumes, but I'm going to hold it until after we cut the release since it also raises the urgency of potential existing errors.

Log volumes are currently all under 50% full, so I don't expect to need any manual intervention this week. Ideally I'll get logs moved to bak (or maybe new dedicated log) volumes in the next couple of days, but things look stable, so that may slip depending on what else crops up.

epriestley added a commit: Restricted Diffusion Commit. Aug 20 2015, 2:47 PM

I've added support for dedicated log volumes (see the commit above). Only repo-class hosts currently have much use for them, although I'd expect to deploy them more widely eventually.

I'm attaching and swapping logs over to these volumes now.
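The swap itself can be as simple as moving the existing logs onto the newly mounted volume and leaving a symlink at the old path. A hypothetical sketch (the paths and the symlink approach are assumptions, not the cluster's actual mechanism; daemons should be stopped before running anything like this):

```python
import os
import shutil

def swap_log_dir(old_dir, new_dir):
    """Move logs from `old_dir` onto the dedicated volume mounted at
    `new_dir`, then leave a symlink so the old path keeps working."""
    os.makedirs(new_dir, exist_ok=True)
    for name in os.listdir(old_dir):
        shutil.move(os.path.join(old_dir, name), os.path.join(new_dir, name))
    os.rmdir(old_dir)             # now empty
    os.symlink(new_dir, old_dir)  # old path resolves to the new volume
```

Anything still writing to the old path keeps working through the symlink, so nothing else on the host needs reconfiguration.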

This seems to have completed without incident.

epriestley claimed this task.

These things have been fixed:

  • All repo hosts got rlog volumes. We haven't seen any issues here, and the tier expansion in T9283 went cleanly.
  • D13939 is now deployed, and log volume is back down to a trickle.

These haven't been fixed yet:

  • Tweaking pull/worker backoff. This would be nice to have, but the volume isn't high enough to cause concern.
  • In the long term, more proactive monitoring of cluster health (T7338).
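Of the two alarms mentioned earlier, the "free disk" check is cheap to sketch with only the standard library (the 10% threshold is an illustrative value, not a tuned one):

```python
import shutil

def low_disk(path="/", min_free_fraction=0.10):
    """True if the volume holding `path` has less than `min_free_fraction`
    of its total space free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < min_free_fraction
```

A periodic job running a check like this against each mounted volume, and alerting when it returns True, would have caught the root volume filling up here well before users noticed.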

We haven't experienced further issues here after the initial compaction, and haven't needed manual attention since attaching rlog volumes.