We have limited insight into cluster health right now. We will probably want to be able to do these things soon, at a minimum:
- Monitor free disk space. I've allocated disk fairly conservatively and may be way off about what we actually need. Running out of disk also degrades performance abruptly rather than gradually, so there is little warning before it becomes a problem.
- (T8781) Monitor daemon queue length.
- (T11559) Monitor outstanding repository errors.
- Get paged/alerted when stuff is down.
The full-power version of this involves expanding Almanac, building Facts, and adding SMS support, but we probably want to build some lead-up solutions on the way.
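As one possible lead-up solution for the disk space item, a small script run from cron on each host could flag low free space. This is only a sketch: the function names, the 10% threshold, and the idea of reporting through the exit code are assumptions for illustration, not anything that exists in the cluster today.

```python
import shutil


def is_low_on_space(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return True when the free fraction is below the threshold.

    The 10% default is an arbitrary placeholder, not a measured value.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (free_bytes / total_bytes) < min_free_fraction


def check_path(path, min_free_fraction=0.10):
    """Inspect the filesystem holding `path`; return (low, free, total)."""
    usage = shutil.disk_usage(path)
    low = is_low_on_space(usage.total, usage.free, min_free_fraction)
    return low, usage.free, usage.total


def main(argv):
    """Print a one-line report and return a cron-friendly exit code."""
    path = argv[1] if len(argv) > 1 else "/"
    low, free, total = check_path(path)
    status = "LOW" if low else "OK"
    print(f"{status}: {path} has {free} of {total} bytes free")
    # Nonzero exit lets cron (or any wrapper) treat low space as a failure.
    return 1 if low else 0
```

Calling `sys.exit(main(sys.argv))` from a wrapper would let cron mail on a nonzero exit, which is a crude stand-in for real paging until the Almanac/Facts/SMS pieces exist.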