Page MenuHomePhabricator

Cut the middle, not bottom, of a big traceback

Authored by brhodes on Feb 13 2018, 6:45 PM.
Referenced Files
Unknown Object (File)
Tue, Jun 28, 9:45 PM
Unknown Object (File)
Fri, Jun 24, 1:03 PM
Unknown Object (File)
Mon, Jun 20, 9:51 AM
Unknown Object (File)
Mon, Jun 13, 3:02 PM
Unknown Object (File)
Thu, Jun 9, 7:25 AM
Unknown Object (File)
May 23 2022, 7:55 AM


Group Reviewers
Blessed Reviewers
Required Signatures
L28 Phacility Individual Contributor License Agreement

Users have been unable to see several linter failure messages because
the error is so often located at the bottom of the large traceback, not
at the top. So let's excise the middle, rather than the end.

Test Plan

Tested by hand.

Diff Detail

rPHU libphutil
cut-middle (branched from master)
Lint Passed
Tests Passed
Build Status
Buildable 19526
Build 26426: Run Core Tests
Build 26425: arc lint + arc unit

Event Timeline

I'm generally not excited about this. See also T12742. I think any behavior we select here is ultimately fairly arbitrary.

In what context are commands throwing CommandExceptions such that you can't control their behavior or arc's handling to produce a more concise message for users?

(This was affected by a mail stamp bug which I fixed in D19082, which is why mail didn't go out for it initially and I didn't see it for a bit -- sorry about that!)

I'm not completely opposed to changing the behavior here, but I'd like the motivation to be like "of the last 25 actual random command failures I hit where various things broke for unexpected reasons, the current rule did pretty well on 21 of them but the proposed rule does pretty well on 24 of them, so it's probably a slightly better rule overall". At least anecdotally, I've rarely encountered a command which the current rule doesn't handle well. The one exception I can think of is that when database fixture construction fails during unit test execution, the relevant piece is at the bottom of the output. The proposed rule would do better on that.

But here, and in T12742, it sounds like the problem is more along the lines of a specific, predictable command which you have complete control over not being handled well by this rule. In these cases, my first instinct is that the command or the handling of the command should change, not the default catch-all error handling behavior.

My first response should be to step back and ask for a bit of context. Why is any truncation occurring here? My real preference would be to display the entire error output of the commands run, and if users dislike the verbosity, then they can dial back how much information is returned on an error or in the case of a crash. If a tool does decide that it needs, say, 4k of text to explain to the user what happened, then my preference would be that arc doesn't try to second-guess the tool, and simply prints all 4k of text.

But if arc is going to make the executive decision to remove output that a tool wanted to print for the user, then I can state that in 100% of cases where a Python-written linter hits an unexpected exception and has to exit, the current code chops off the actual exception, leaving the user with no idea what has actually gone wrong. Here is one (somewhat redacted) example from earlier this week:

Traceback (most recent call last):
  File "/home/romanov/.cache/bazel/.../", line 28, in <module>
    runpy.run_path(os.path.join(script_dir, ""), run_name='__main__')
  File "", line 252, in run_path
  File "", line 82, in _run_module_code
  File "", line 72, in _run_code
  File "/home/romanov/.cache/bazel/.../", line 6, in <module>
    from dropbox.i18n.archive.archive_id_generator import ArchiveIDGenerator
  File "/home/romanov/.cache/bazel/.../translation_linter.runfile... (1,852 more bytes) ... at [<arcanist>/src/lint/engine/ArcanistLintEngine.php:275]

Because the Python runtime always puts the exception name and exception text at the bottom (quite logically, since that's the text that winds up sitting right above the prompt to which it expects the user to be returned), the current truncation logic hides it in, from what I've observed, 100% of cases.

Truncation is occurring because some commands output very large volumes of text (many megabytes or gigabytes). CommandException is a low-level mechanism, and the message from CommandException is often sent to a logfile or a database log.

From blame, this was originally added in D4280 as a response to T2193. I believe the specific issue there is that svn ls --xml ... was emitting several megabytes of text and then failing, and the message was being sent to the daemon logs, which eventually exhausted storage space on disk. (In some cases at Facebook, this command produced gigabytes of output.)

In a separate case (similar to T11402, although I think the particulars were via private communication), an install hit an issue where a build got stuck in a loop producing gigabytes of output on multiple builds until it was killed.

In the general case, it's unsafe to trust that subprocesses know what they're doing and will emit only a reasonable amount of output, and trusting subprocesses to produce reasonable amounts of output can have far-reaching implications because we often store, process, or display that output.

CommandException is not normally intended to be shown to users directly. For example, the first thing it prints is the command: this almost always useless if you're a user (you very rarely care about what command was run or which flags were passed), but often the most useful part of the message if you're a Phabricator developer trying to troubleshoot or implement something, since it lets you repeat the command and figure out what's wrong.

Upstream workflows rarely show users CommandExceptions directly. If you want to show a user the entire stderr that a command emits, you can do something like this:

list($err, $stdout, $stderr) = exec_manual(...);
if ($err) {
  echo $stderr;
} else {

It's not clear to me why you are showing users CommandExceptions in any context, but particularly unclear in the context of linters. I believe upstream linters do not normally use CommandException to interact with users.

To step back further, the onus is on you to provide context when submitting upstream changes. By accepting them, I'm agreeing to maintain them indefinitely. If I don't feel like I fully understand the problem a change solves, I'll usually ask for more context until I do understand the problem, and agree that the change is the best available solution (or, at least, a reasonable solution) to the problem.

This is why Contributing Code suggests you file a task first -- before submitting code -- and Contributing Feature Requests spends so many words harping on "please describe problems, don't request features".

I don't yet understand what problem this is actually solving, but I suspect I can suggest a better solution once I do. Is the scenario something like this?

  • You have some kind of linter written in Python (is this something well-known, or something internal?).
  • You have Arcanist bindings for it (using ArcanistExternalLinter? Using fully custom code? Using some other bindings?).
  • The external linter encounters runtime fatal errors frequently, or at least frequently enough that users are affected by it. (How often? Is this just one user hitting this who got kind of annoyed and filed a task about it, or does this linter fatal like 10 times a day for everyone? If the latter, why does it fatal that often? Do you anticipate it continuing to fatal that often for the foreseeable future?)
  • When the linter fatals, it dumps a default Python stack trace with an exception message, which is >1000 bytes long with the most interesting part at the bottom. (Is the interesting part of a Python stack trace always fewer than 333 characters long?)
  • Users currently get a generic CommandException when this happens, which truncates the interesting part.
  • How do you expect users to react to this? Report the error to some team that maintains the linter? Fix the error themselves?

many megabytes or gigabytes … In some cases at Facebook, this command produced gigabytes of output.


That's so epic!

And, yes, makes sense of the truncation.

It's not clear to me why you are showing users CommandExceptions in any context…

From the interest of a co-worker in seeing this upstreamed, I'd assumed that we were using a CommandException roughly within bounds of its design by showing it to the user in this case. I'm now abandoning that assumption and will step back and re-trace the steps by which this particular exception is both being generated and then being put in front of our users.

Thanks for the very detailed response! I'll gather more context in the future before assuming that a diff that got our own users moving again is applicable to other arc users' use cases.

(We had an SVN monorepo at Facebook and someone deleted a root-level path -- I think /tfb/ -- which contained all branches of the website code in late 2010/early 2011. It was easy to revert in Subversion, but Phabricator had to parse a change affecting 97,000,000 paths or something like that.)

// This method is quite horrible. The underlying challenge is that some
// commits in the Facebook repository are enormous, taking multiple hours
// to 'ls -R' out of the repository and producing XML files >1GB in size.

// If we try to SimpleXML them, the object exhausts available memory on a
// 64G machine. Instead, cache the XML output and then parse it line by line
// to limit space requirements.

$cache_loc = sys_get_temp_dir().'/diffusion.'.$hashkey.'.svnls';
if (!Filesystem::pathExists($cache_loc)) {
  $tmp = new TempFile();
    '--xml ls -R %s > %s',
    'mv %s %s',

$map = $this->parseRecursiveListFileData($cache_loc);

And, of course, parseRecursiveListFileData() uses regular expressions to parse XML.