We have pretty good tools for understanding and resolving a performance problem once we can reproduce it (like the Service profiler, XHProf, and debug.time-limit). We have few tools for searching for performance problems, or for attributing them to a root cause when users complain that an entire install is slow but don't have a specific repro. Even on first-party installs like this one, it's easy for something to consume substantial resources and escape notice. For example, Slack continuously makes 4 API calls per second, and I only caught that by seeing it in the access log.
Recording performance data would allow us to answer questions like:
- What unreported performance problems are users experiencing?
- How fast is Phabricator?
- Is Phabricator getting faster or slower?
- Which service calls have high variance and/or very poor worst-case performance?
- What are the highest-impact places to focus performance work?
- Where are resources being spent?
- Are there workload causes for poor performance (e.g., a bot making a trillion calls)?
- In the Phacility use case, which instances are consuming disproportionate levels of resources?
- In consulting use cases, where should we focus our efforts?
I think none of these are particularly pressing questions today, but I imagine that at least a first cut of this tool will be worth building sometime in 2015.
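As a rough illustration of how recorded data could answer a few of the questions above (variance and worst case per call, where time is being spent, and which clients generate the most load), here is a minimal sketch in Python. It is not a proposed design: the sample format `(endpoint, client, duration_ms)`, the function name `summarize`, and the endpoint and client names are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(samples):
    """Aggregate hypothetical (endpoint, client, duration_ms) samples.

    Returns a per-endpoint report and a per-client call count.
    """
    durations_by_endpoint = defaultdict(list)
    calls_by_client = defaultdict(int)

    for endpoint, client, duration_ms in samples:
        durations_by_endpoint[endpoint].append(duration_ms)
        calls_by_client[client] += 1

    # Total time across all endpoints, used to compute each endpoint's share.
    total_ms = sum(sum(d) for d in durations_by_endpoint.values())

    report = {}
    for endpoint, durations in durations_by_endpoint.items():
        report[endpoint] = {
            "calls": len(durations),
            "mean_ms": mean(durations),
            # Population standard deviation as a simple measure of variance.
            "stdev_ms": pstdev(durations),
            "worst_ms": max(durations),
            "share_of_time": sum(durations) / total_ms if total_ms else 0.0,
        }

    return report, dict(calls_by_client)


# Made-up samples; the endpoint and client names are illustrative only.
example_samples = [
    ("/api/example.method", "chat-bot", 10),
    ("/api/example.method", "chat-bot", 30),
    ("/page/example", "alice", 60),
]
example_report, example_clients = summarize(example_samples)
```

Sorting the report by `share_of_time` would point at the highest-impact places to focus, and sorting the client counts would surface a bot making an unusual number of calls.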