
Write about "add more logging / monitoring / tests"
Open, Wishlist, Public

Description

This is just an idea I've been kicking around a little bit. It's a variation of every other Good Idea in Software Development, which basically all boil down to "understand why things happen before you create a plan to respond to them", but I haven't specifically seen it developed much elsewhere.

With some frequency, I'll see suggestions to "add more logging", "add more monitoring", "add more tests", etc., or related questions ("how can I find the logs?"). I think these are often the wrong questions to ask, because they skip the "understand the problem" step and assume a solution (the least surgical, most broad-spectrum, one-size-fits-all solution), but logging/monitoring/tests are very poor solutions to some problems in these domains.

| Narrow Focus | Broader Focus |
| --- | --- |
| Logging as an active diagnostic tool. | Observability of the system. |
| Where are the logs? | How can I observe/diagnose the behavior? |
| Should we log this [to help future operator-at-keyboard active diagnostics]? | How can we make this system more observable? |
| Monitoring as a reliability layer. | Reliability of the system. |
| Is the system monitored? | Is the system reliable? |
| Should we add monitoring? | How can we make the system more reliable? |
| Unit tests as regression protection. | Robustness of the system [to change]. |
| Do we have test coverage? | Is the system robust to change? |
| Should we add tests? | How can we make the system more robust to change? |

This is really just a set of special cases of "describe the problem, not the solution", but they're kind of a weird flavor of that?

Event Timeline

epriestley triaged this task as Wishlist priority. Aug 15 2019, 4:42 PM
epriestley created this task.

Another variation of this is "add more documentation", although I think the pattern around this one is more rarely a "problem domain / solution domain mismatch" issue and more often a "human communication" issue, usually with one of these two templates:

User A: When I click button X, the UI explodes, literally killing me. Maybe this could be documented?

(The recent thread https://discourse.phabricator-community.org/t/how-to-remove-users-from-projects/3042/ is an example of this.)

...or:

User B: When I repeatedly click button "Delete All Data Forever, Permanently" and all the confirmation screens behind it, then enter my MFA credentials, then type my middle name backwards, my data is deleted! I just wanted to hide it from the UI! The documentation should be more clear about this!

(I have a recent example of this in my inbox, although the consequences were fortunately less severe.)

Neither user actually read the documentation before clicking the button, so these templates are both based on an imagined scenario where other users do. They don't.

In both cases, the UI is almost certainly the actual problem, although User B is also a problem. We should correct the UI for User A, and consider ways we can save User B from themselves.

I think this whole topic is usually more of a "how to support users better" issue, not a "how to develop software better" issue. There are some cases where it becomes more of an operational/system issue ("documentation as a service recovery tool"), but the other questions above are about systems 90% of the time and this one is about user support 90% of the time.