Change Details

We have limited insight into cluster health right now. We will probably want to be able to do these things soon, at a minimum:

  - Monitor free disk space; I've allocated this fairly conservatively and may be way off about what we actually need. Running out of this also degrades performance instantly instead of gradually.
  - (T8781) Monitor daemon queue length.
  - (T11559) Monitor outstanding repository errors.
  - Get paged/alerted when stuff is down.

The full-power version of this involves expanding Almanac, building Facts, and adding SMS support, but we probably want to build some lead-up solutions on the way.
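As one possible lead-up solution for the free disk space item, here is a minimal sketch of a host-local check. It is not part of Almanac or Facts and assumes Python 3 is available on the host. The function names (`free_fraction`, `check_disk`) and the 10% default threshold are hypothetical, chosen only for illustration.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as having no free space.
        return 0.0
    return usage.free / usage.total


def check_disk(path="/", min_free_fraction=0.10):
    """Return (ok, fraction). `ok` is False when free space is below the threshold.

    The 10% default is an assumed placeholder, not a measured requirement.
    """
    fraction = free_fraction(path)
    return fraction >= min_free_fraction, fraction
```

A cron job or daemon could call `check_disk` on each host and raise an alert whenever `ok` is false. Since running out of disk hurts performance abruptly, the threshold should probably sit well above zero so the alert fires with room to react.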