Build more status tools for monitoring Phacility cluster health
Open, Normal, Public

Description

We have limited insight into cluster health right now. We will probably want to be able to do these things soon, at a minimum:

  • Monitor free disk space; I've allocated this fairly conservatively and may be way off about what we actually need. Running out of disk space also degrades performance abruptly rather than gradually, so there is little warning before it becomes a problem.
  • (T8781) Monitor daemon queue length.
  • (T11559) Monitor outstanding repository errors.
  • Get paged/alerted when stuff is down.
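
As a lead-up solution for the disk space item, a minimal sketch of a free-space check in Python is below. The 15% threshold and the function names are assumptions for illustration, not anything the cluster currently uses; `shutil.disk_usage` is from the standard library.

```python
import shutil


def classify_free_space(total_bytes, free_bytes, min_free_fraction=0.15):
    """Return a warning string if the free fraction is below the threshold, else None.

    The 0.15 default is an assumed placeholder, not a measured requirement.
    """
    if total_bytes <= 0:
        return "UNKNOWN: filesystem reports no capacity"
    free_fraction = free_bytes / total_bytes
    if free_fraction < min_free_fraction:
        return "LOW DISK: %.1f%% free" % (free_fraction * 100)
    return None


def free_space_alert(path, min_free_fraction=0.15):
    """Check the filesystem that contains `path` and return a warning or None."""
    usage = shutil.disk_usage(path)  # named tuple of (total, used, free) in bytes
    return classify_free_space(usage.total, usage.free, min_free_fraction)
```

Something like `free_space_alert("/")` could run from cron on each host and feed whatever paging mechanism we end up with. Because exhaustion is abrupt, the threshold should leave generous headroom.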


The full-power version of this involves expanding Almanac, building Facts, and adding SMS support, but we probably want to build some lead-up solutions on the way.

Event Timeline

epriestley raised the priority of this task from to Normal.
epriestley updated the task description.
epriestley added projects: Phacility, Almanac.
epriestley moved this task to Do After Launch on the Phacility board.
epriestley added a subscriber: epriestley.
epriestley added a commit: Restricted Diffusion Commit. Feb 21 2015, 10:40 PM
chad removed a mock: Restricted Pholio Mock. Mar 20 2015, 3:23 AM
bmosinski renamed this task from Build more status tools for monitoring Phacility cluster health to BES EOL determine strategy to sunset in place. Apr 23 2015, 8:33 PM
bmosinski claimed this task.
bmosinski updated the task description.
bmosinski changed the visibility from "Public (No Login Required)" to "bmosinski (Bob Mosinski)".
bmosinski changed the edit policy from "All Users" to "bmosinski (Bob Mosinski)".
bmosinski removed subscribers: Mnkras, chasemp, epriestley.
epriestley renamed this task from BES EOL determine strategy to sunset in place to Build more status tools for monitoring Phacility cluster health. May 15 2015, 1:18 PM
epriestley claimed this task.
epriestley updated the task description.
epriestley added subscribers: Mnkras, chasemp, epriestley.
epriestley changed the visibility from "All Users" to "Public (No Login Required)". Jul 7 2015, 6:32 PM

From T8781, monitoring instance daemon queue lengths would be particularly helpful in identifying at least 1-2 issues we've hit.

Queue length isn't currently exposed anywhere for $YOUR_FAVORITE_ALERTING_TOOL to consume, short of scraping the HTML on /daemon, right? (You need to expose it and set up monitoring on it for the Phacility cluster.)

Yeah, there's no way to connect a third-party tool to application-level metrics like queue length right now.

My tentative plan is to sink a day or two into building a first-party tool connected to Almanac (since it's already a reliable, authoritative list of all services and devices we want to monitor) and see how promising that is. If it looks like a reasonable path forward I'd probably put some conduit.monitor endpoint in the upstream and let that emit application-level metrics.

You could expose this stuff today by dumping your own method into src/extensions/, but knowing how to properly pull things like queue length is probably the hard part. I'm not sure how many other application-level metrics we really have. We could pull some very vague request rate and response time metrics out of Multimeter, but I'm not sure they'd be particularly useful (in particular, I doubt they'd correlate well with actual errors). But maybe once we had the skeleton in place, some other stuff would be a good fit (e.g., logging and reporting certain kinds of service failures).
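
To make the consumer side of that concrete, here is a rough sketch in Python of how an alerting tool might evaluate metrics if some endpoint emitted them as JSON. Everything about the payload is an assumption: no `conduit.monitor` method exists yet, and the metric names (`daemon.queue.length`, `repository.errors`) and thresholds are hypothetical placeholders.

```python
import json

# Hypothetical alert thresholds: metric name -> maximum acceptable value.
THRESHOLDS = {
    "daemon.queue.length": 1000,
    "repository.errors": 0,
}


def evaluate_metrics(payload, thresholds=THRESHOLDS):
    """Parse a JSON object of metric values and return a list of alert strings.

    `payload` is assumed to be a flat JSON object such as
    {"daemon.queue.length": 12, "repository.errors": 0}.
    """
    metrics = json.loads(payload)
    alerts = []
    for name, limit in sorted(thresholds.items()):
        value = metrics.get(name)
        if value is None:
            # A missing metric is itself worth an alert: the emitter may be down.
            alerts.append("%s: metric missing" % name)
        elif value > limit:
            alerts.append("%s: %s exceeds %s" % (name, value, limit))
    return alerts
```

Treating a missing metric as an alert is a deliberate choice in this sketch, since a silent emitter looks the same as a healthy one otherwise.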