D14173.diff

diff --git a/src/docs/user/field/exit_codes.diviner b/src/docs/user/field/exit_codes.diviner
new file mode 100644
--- /dev/null
+++ b/src/docs/user/field/exit_codes.diviner
@@ -0,0 +1,243 @@
+@title Command Line Exit Codes
+@group fieldmanual
+
+Explains the use of exit codes in Phabricator command line scripts.
+
+Overview
+========
+
+When you run a command from the command line, it exits with an //exit code//.
+This code is normally not shown on the CLI, but you can examine the exit code
+of the last command you ran by looking at `$?` in your shell:
+
+ $ ls
+ ...
+ $ echo $?
+ 0
+
+Programs which run commands can act on exit codes. Shell constructs do too:
+`cmdx && cmdy` runs `cmdy` only if `cmdx` exits with code `0`.
+
+The code `0` means success. Other codes signal some sort of error or status
+condition, depending on the system and command.
+
+With rare exception, Phabricator uses //all other codes// to signal
+**catastrophic failure**.
+
+This is an explicit architectural decision and one we are unlikely to deviate
+from: generally, we will not accept patches which give a command a nonzero exit
+code to indicate an expected state, an application status, or a minor abnormal
+condition.
+
+Generally, this decision reflects a philosophical belief that attaching
+application semantics to exit codes is a relic of a simpler time, and that
+they are not appropriate for communicating application state in a modern
+operational environment. This document explains the reasoning behind our use of
+exit codes in more detail.
+
+In particular, this approach is informed by a focus on operating Phabricator
+clusters at scale. This is not a common deployment scenario, but we consider it
+the most important one. Our use of exit codes makes it easier to deploy and
+operate a Phabricator cluster at larger scales. It makes it slightly harder to
+deploy and operate a small cluster or single host by gluing together `bash`
+scripts. We are willingly trading the small scale away for advantages at larger
+scales.
+
+
+Problems With Exit Codes
+========================
+
+We do not use exit codes to communicate application state because doing so
+makes it harder to write correct scripts, and the primary benefit is that it
+makes it easier to write incorrect ones.
+
+This is somewhat at odds with the philosophy of "worse is better", but a modern
+operations environment faces different forces than the interactive shell did
+in the 1970s, particularly at scale.
+
+We consider correctness to be very important to modern operations environments.
+In particular, we manage a Phabricator cluster (Phacility) and believe that
+having reliable, repeatable processes for provisioning, configuration and
+deployment is critical to maintaining and scaling our operations. Our use of
+exit codes makes it easier to implement processes that are correct and reliable
+on top of Phabricator management scripts.
+
+Exit codes as signals for application state are problematic because they are
+ambiguous: you can't use them to distinguish between dissimilar failure states
+which should prompt very different operational responses.
+
+Exit codes primarily make writing things like `bash` scripts easier, but we
+think you shouldn't be writing `bash` scripts in a modern operational
+environment if you care very much about your software working.
+
+Software environments which are powerful enough to handle errors properly are
+also powerful enough to parse command output to unambiguously read and react to
+complex state. Communicating application state through exit codes almost
+exclusively makes it easier to handle errors in a haphazard way which is often
+incorrect.
+
+
+Exit Codes are Ambiguous
+========================
+
+In many cases, exit codes carry very little information and many different
+conditions can produce the same exit code, including conditions which should
+prompt very different responses.
+
+The command line tool `grep` searches for text. For example, you might run
+a command like this:
+
+ $ grep zebra corpus.txt
+
+This searches for the text `zebra` in the file `corpus.txt`. If the text is
+not found, `grep` exits with a nonzero exit code (specifically, `1`).
+
+Suppose you run `grep zebra corpus.txt` and observe a nonzero exit code. What
+does that mean? These are //some// of the possible conditions which are
+consistent with your observation:
+
+ - The text `zebra` was not found in `corpus.txt`.
+ - `corpus.txt` does not exist.
+ - You do not have permission to read `corpus.txt`.
+ - `grep` is not installed.
+ - You do not have permission to run `grep`.
+ - There is a bug in `grep`.
+ - Your `grep` binary is corrupt.
+ - `grep` was killed by a signal.
+
+If you're running this command interactively on a single machine, it's probably
+OK for all of these conditions to be conflated. You aren't going to examine the
+exit code anyway (it isn't even visible to you by default), and `grep` likely
+printed useful information to `stderr` if you hit one of the less common issues.
+
+If you're running this command from operational software (like deployment,
+configuration or monitoring scripts) and you care about the correctness and
+repeatability of your process, we believe conflating these conditions is not
+OK. The operational response to text not being present in a file should almost
+always differ substantially from the response to the file not being present or
+`grep` being broken.
+
+In a particularly bad case, a broken `grep` might cause a careless deployment
+script to continue down an inappropriate path and cascade into a more serious
+failure.
+
+Even in a less severe case, unexpected conditions should be detected and raised
+to operations staff. A broken `grep` and a missing file that is expected to
+exist are both detectable, unexpected, and likely severe conditions, but they
+cannot be differentiated and handled by examining the exit code of `grep`.
+It is much better to detect and raise these problems immediately than to
+discover them after a lengthy root cause analysis.
+
+Some of these conditions can be differentiated by examining the specific exit
+code of the command instead of acting on all nonzero exit codes. However, many
+failure conditions produce the same exit codes (particularly code `1`) and
+there is no way to guarantee that a particular code signals a particular
+condition, especially across systems.
+
+Realistically, it is also relatively rare for scripts to even make an effort to
+distinguish between exit codes, and all nonzero exit codes are often treated
+the same way.
+
+
+Bash Scripts are not Robust
+===========================
+
+Exit codes that indicate application status make writing `bash` scripts (or
+scripts in other tools which provide a thin layer on top of what is essentially
+`bash`) a lot easier and more convenient.
+
+For example, it is pretty tricky to parse JSON in `bash` or with standard
+command-line tools, and much easier to react to exit codes. This is sometimes
+used as an argument for communicating application status in exit codes.
+
+We reject this because we don't think you should be writing `bash` scripts if
+you're doing real operations. Fundamentally, `bash` shell scripts are not a
+robust building block for creating correct, reliable operational processes.
+
+Here is one problem with using `bash` scripts to perform operational tasks.
+Consider this command:
+
+ $ mysqldump | gzip > backup.sql.gz
+
+Now, consider this command:
+
+ $ mysqldermp | gzip > backup.sql.gz
+
+These commands represent a fairly standard way to accomplish a task (dumping
+a compressed database backup to disk) in a `bash` script.
+
+Note that the second command contains a typo (`dermp` instead of `dump`), so
+the shell cannot find the program and that half of the pipe fails with a
+nonzero exit code.
+
+However, both statements exit with code `0` (indicating success): by default,
+a pipeline reports the exit code of its last command, here `gzip`. Both create
+a `backup.sql.gz` file. One backs up your data; the other never will, yet it
+appears successful under casual inspection.
+
+These behaviors are the same under `set -e`.
+
+This fragile attitude toward error handling is endemic to `bash` scripts. The
+default behavior is to continue on errors, and it isn't easy to change this
+default. Options like `set -e` are unreliable and it is difficult to detect and
+react to errors in fundamental constructs like pipes. The tools that `bash`
+scripts employ (like `grep`) emit ambiguous exit codes. Scripts cannot help
+but propagate this ambiguity no matter how careful they are with error handling.
+
+It is likely //possible// to implement these things safely and correctly in
+`bash`, but it is not easy or straightforward. More importantly, it is not the
+default: the default behavior of `bash` is to ignore errors and continue.
+
+Gluing commands together in `bash` or something that sits on top of `bash`
+makes it easy and convenient to get a process that works fairly well most of
+the time at small scales, but we are not satisfied that it represents a robust
+foundation for operations at larger scales.
+
+
+Reacting to State
+=================
+
+Instead of communicating application state through exit codes, we generally
+communicate application state through machine-parseable output with a success
+(`0`) exit code. All nonzero exit codes indicate catastrophic failure which
+requires operational intervention.
+
+Callers are expected to request machine-parseable output if necessary (for
+example, by passing a `--json` flag or other similar flags), verify the command
+exits with a `0` exit code, parse the output, then react to the state it
+communicates as appropriate.
+
+In a sufficiently powerful scripting environment (e.g., one with data
+structures and a JSON parser), this is straightforward and makes it easy to
+react precisely and correctly. It also allows scripts to communicate
+arbitrarily complex state. Provided your environment gives you an appropriate
+toolset, it is much more powerful and not significantly more complex than using
+exit codes.
+
+Most importantly, it allows the calling environment to treat nonzero exit
+statuses as catastrophic failure by default.
+
+
+Moving Forward
+==============
+
+Given these concerns, we are generally unwilling to bring changes which use
+exit codes to communicate application state (other than catastrophic failure)
+into the upstream. There are some exceptions, but these are rare. In
+particular, ease of use in a `bash` environment is not a compelling motivation.
+
+We are broadly willing to make output machine-parseable or provide an explicit
+machine output mode (often a `--json` flag) if there is a reasonable use case
+for it. However, we operate a large production cluster of Phabricator instances
+with the tools available in the upstream, so the lack of machine-parseable
+output is not sufficient to motivate adding such output on its own: we also
+need to understand the problem you're facing, and why it isn't a problem we
+face. A simpler or cleaner approach to the problem may already exist.
+
+If you just want to write `bash` scripts on top of Phabricator scripts and you
+are unswayed by these concerns, you can often just build a composite command to
+get roughly the same effect that you'd get out of an exit code.
+
+For example, you can pipe things to `grep` to convert output into exit codes.
+This should generally have failure rates that are comparable to the background
+failure level of relying on `bash` as a scripting environment.
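
The calling convention described under "Reacting to State" in the document above (run the command, treat any nonzero exit code as catastrophic, parse the output, then react to the state it reports) might look roughly like the Python sketch below. This is an illustration under assumptions, not part of the patch: `CatastrophicFailure`, `run_json_command`, `fake_status_command`, `describe`, and the `running` key are all hypothetical names, and the stand-in command is a small Python one-liner rather than any real Phabricator script or `--json` output format.

```python
import json
import subprocess
import sys


class CatastrophicFailure(Exception):
    """Raised when a command exits nonzero; never used for expected states."""


def run_json_command(argv):
    """Run a command, require exit code 0, and return its parsed JSON output."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        # Any nonzero code means catastrophic failure: stop and escalate.
        raise CatastrophicFailure(
            f"{argv[0]} exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return json.loads(result.stdout)


def fake_status_command(payload):
    """Build a stand-in command (hypothetical) that prints payload as JSON."""
    script = "import json, sys; print(json.dumps(json.loads(sys.argv[1])))"
    return [sys.executable, "-c", script, json.dumps(payload)]


def describe(state):
    """React to application state read from the output, not the exit code."""
    if state["running"]:
        return "daemons are running"
    return "daemons are stopped"


if __name__ == "__main__":
    state = run_json_command(fake_status_command({"running": False}))
    print(describe(state))
```

In this sketch, "stopped" is an ordinary state carried in the output with exit code `0`, while a nonzero exit code is never interpreted and always raises.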
