Build more status tools for monitoring Phacility cluster health
Open, Normal, Public

Assigned To

Authored By

	epriestley
	Feb 20 2015, 3:55 PM

Description

We have limited insight into cluster health right now. We will probably want to be able to do these things soon, at a minimum:

Monitor free disk space; I've allocated this fairly conservatively and may be way off about what we actually need. Running out of this also degrades performance instantly instead of gradually.
(T8781) Monitor daemon queue length.
(T11559) Monitor outstanding repository errors.
Get paged/alerted when stuff is down.

The full-power version of this involves expanding Almanac, building Facts, and adding SMS support, but we probably want to build some lead-up solutions on the way.
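As an illustration of what a lead-up solution for the free disk space item might look like, here is a minimal sketch using only the Python standard library. It is not part of Phabricator or the Phacility tooling; the function names, the default path, and the 10% threshold are all assumptions chosen for the example.

```python
import shutil


def free_fraction(path="."):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Defensive: some pseudo-filesystems report a total size of zero.
        return 0.0
    return usage.free / usage.total


def disk_alert(path=".", min_free=0.10, fraction=None):
    """Return an alert message if free space is below `min_free`, else None.

    `min_free` (10% here) is an assumed threshold, not a recommendation.
    `fraction` lets a caller pass an already-measured value instead of
    measuring `path` again.
    """
    if fraction is None:
        fraction = free_fraction(path)
    if fraction < min_free:
        return f"LOW DISK: {path} has {fraction:.1%} free (minimum {min_free:.0%})"
    return None


if __name__ == "__main__":
    message = disk_alert("/")
    print(message or "disk space OK")
```

Because running out of disk degrades performance instantly rather than gradually, a check like this would need a threshold with enough headroom to act on before the disk is actually full.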

Revisions and Commits

Restricted Diffusion Commit

Related Objects

Status	Assigned	Task
Open	None	T7346 Anticipate scaling challenges in the Phacility cluster
Open	epriestley	T7338 Build more status tools for monitoring Phacility cluster health
Resolved	epriestley	T920 Provide SMS Support

Event Timeline

epriestley created this task. Feb 20 2015, 3:55 PM

epriestley raised the priority of this task to Normal.

epriestley updated the task description.

epriestley added projects: Phacility, Almanac.

epriestley moved this task to Do After Launch on the Phacility board.

epriestley added a subtask: T920: Provide SMS Support.

epriestley added a subscriber: epriestley.

• chasemp added a subscriber: • chasemp. Feb 20 2015, 4:31 PM

Mnkras added a subscriber: Mnkras. Feb 21 2015, 4:58 AM

epriestley mentioned this in T7346: Anticipate scaling challenges in the Phacility cluster. Feb 21 2015, 1:04 PM

epriestley added a commit: Restricted Diffusion Commit. Feb 21 2015, 10:40 PM

epriestley added a parent task: T7346: Anticipate scaling challenges in the Phacility cluster. Feb 26 2015, 2:56 PM

andrewkumar added a mock: Restricted Pholio Mock. Mar 20 2015, 3:22 AM

chad removed a mock: Restricted Pholio Mock. Mar 20 2015, 3:23 AM

• bmosinski renamed this task from Build more status tools for monitoring Phacility cluster health to BES EOL determine strategy to sunset in place. Apr 23 2015, 8:33 PM

• bmosinski claimed this task.

• bmosinski updated the task description.

• bmosinski changed the visibility from "Public (No Login Required)" to "bmosinski (Bob Mosinski)".

• bmosinski changed the edit policy from "All Users" to "bmosinski (Bob Mosinski)".

• bmosinski removed subscribers: Mnkras, • chasemp, epriestley.

epriestley renamed this task from BES EOL determine strategy to sunset in place to Build more status tools for monitoring Phacility cluster health. May 15 2015, 1:18 PM

epriestley claimed this task.

epriestley updated the task description.

epriestley added subscribers: Mnkras, • chasemp, epriestley.

epriestley mentioned this in T8210: Phacility Cluster: Bastion host stopped responding. May 15 2015, 2:51 PM

epriestley changed the visibility from "All Users" to "Public (No Login Required)". Jul 7 2015, 6:32 PM

epriestley mentioned this in T8781: Not receiving emails. Jul 7 2015, 6:47 PM

From T8781, monitoring instance daemon queue lengths would be particularly helpful in identifying at least 1-2 issues we've hit.

epriestley mentioned this in T9187: Phacility Cluster: Daemons outpaced log compaction. Aug 14 2015, 7:12 PM

In T7338#125216, @epriestley wrote:

From T8781, monitoring instance daemon queue lengths would be particularly helpful in identifying at least 1-2 issues we've hit.

Queue length isn't currently exposed anywhere for $YOUR_FAVORITE_ALERTING_TOOL to consume, short of scraping the HTML on /daemon, right? (You need to expose it and set up monitoring on it for the Phacility Cluster.)

Yeah, there's no way to connect a third-party tool to application-level metrics like queue length right now.

My tentative plan is to sink a day or two into building a first-party tool connected to Almanac (since it's already a reliable, authoritative list of all services and devices we want to monitor) and see how promising that is. If it looks like a reasonable path forward I'd probably put some conduit.monitor endpoint in the upstream and let that emit application-level metrics.

You could expose this stuff today by dumping your own method into src/extensions/, but knowing how to properly pull things like queue length is probably the hard part. I'm not sure how many other application-level metrics we really have. We could pull some very vague request rate and response time metrics out of Multimeter, but I'm not sure they'd be useful (in particular, I'm not sure they'd correlate well with actual errors). But maybe once we had the skeleton in place some other stuff would be a good fit (e.g., logging and reporting certain kinds of service failures).
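To make the idea of consuming application-level metrics concrete, here is a minimal sketch of the alerting side only. It assumes some endpoint (such as the proposed conduit.monitor, which does not exist yet) returns a flat mapping of metric names to numbers. The metric name `daemon.queue.length`, the `Threshold` type, and the threshold values are all hypothetical and are not part of Phabricator.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str   # hypothetical metric name, e.g. "daemon.queue.length"
    warn: float   # value at or above which a warning is raised
    page: float   # value at or above which someone is paged


def evaluate(metrics, thresholds):
    """Compare a metrics mapping against thresholds.

    Returns a list of (level, metric, value) tuples, where level is
    "page", "warn", or "unknown" (the metric was not reported at all).
    Metrics below their warn threshold produce no entry.
    """
    alerts = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None:
            # A missing metric is itself worth surfacing: the service may be down.
            alerts.append(("unknown", t.metric, None))
        elif value >= t.page:
            alerts.append(("page", t.metric, value))
        elif value >= t.warn:
            alerts.append(("warn", t.metric, value))
    return alerts


if __name__ == "__main__":
    # Example payload shape is an assumption, not a real API response.
    sample = {"daemon.queue.length": 1200}
    rules = [Threshold("daemon.queue.length", warn=500, page=1000)]
    for level, metric, value in evaluate(sample, rules):
        print(level, metric, value)
```

Treating a missing metric as its own alert level is a design choice in this sketch: a host that stops reporting queue length is at least as interesting as one reporting a long queue.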

• ricky.liddard added a subscriber: • ricky.liddard. Oct 17 2015, 6:36 AM

This comment was removed by epriestley.

• jshirley added a subscriber: • jshirley. Nov 17 2015, 6:16 PM

epriestley mentioned this in T8668: The web interface should alert (show a config issue?) when there are fatal errors or exceptions in the daemon log. Dec 26 2015, 11:04 PM

epriestley mentioned this in T10246: Deploy Drydock in the Phacility cluster. Feb 26 2016, 2:56 PM

joshuaspence added a subscriber: joshuaspence. Feb 29 2016, 8:22 AM

• psycrym added a subscriber: • psycrym. Mar 15 2016, 8:28 AM

epriestley mentioned this in T10847: 30GB Phacility instance caused a series of cascading failures which left web services unreachable. Apr 21 2016, 6:53 PM

epriestley mentioned this in T11559: Add a list of individual repositories with update errors to "Repository Servers" page. Aug 30 2016, 2:24 PM

epriestley updated the task description. Aug 30 2016, 4:25 PM

Herald added a subscriber: eadler. Aug 30 2016, 4:25 PM

epriestley mentioned this in T11665: repo002.phacility.net is heavily loaded. Sep 19 2016, 10:29 PM