Page MenuHomePhabricator

Restore background linting and unit tests for performance
Closed, ResolvedPublic

Description

One of our engineers just tried to submit a diff with lint errors, but it just hung. When arc diff runs arc --recon diff --no-diff --ansi (ArcanistDiffWorkflow.php:443-ish), it just hangs rather than returning. If I add --no-lint it goes through fine.

I've isolated it to $lint_workflow->run(); on ~ArcanistDiffWorkflow.php:1240 - it never makes it to the switch on the next line. As best I can tell, it has started hanging if there's a prompt for an autofix patch. I tried throwing a $argv[] = '--never-apply-patches into ArcanistDiffWorkflow's runLint() function and it appears to fix the issue, but that seems like the wrong way to fix the problem. I also didn't see a recent change that looks like it would have introduced the error, but it's admittedly been a while since I did arc upgrade.

Event Timeline

Firehed raised the priority of this task from to Needs Triage.
Firehed updated the task description. (Show Details)
Firehed added a project: Arcanist.
Firehed added a subscriber: Firehed.

If it's easy to check, does --background 0 also fix it?

(There's a good chance I can hunt this down regardless, so if you have to bother three people to repro it don't worry about it, but if it's easy having that couldn't hurt.)

It does not (I assume you mean on the arc diff command, as arc lint --background 0 just errors out with an unrecognized option)

Hmm, peculiar. Thanks, I'll dig into this a bit.

Hmm, I can't immediately reproduce this. I adjusted the severity of an existing warning to "Autofix" but arc diff ran fine:

$ arc diff --conduit-uri=http://local.aphront.com:8080/
Linting...
>>> Lint for scripts/arcanist.php:


   Auto-Fix  (XHP34) Comment Spaces
    Put space after comment start.

               3 
               4 sanity_check_environment();
               5 
    >>> -      6 //asdf
        +        // asdf
               7 
               8 require_once dirname(__FILE__).'/__init_script__.php';
               9 
--- /INSECURE/devtools/arcanist/scripts/arcanist.php	2014-01-02 15:19:01.000000000 -0800
+++ /private/var/folders/8k/c3vkmjy5335gcxdzxkhwq82w0000gn/T/r35rt3awidc0sgkk/2359-Zuh2y3	2014-01-02 15:24:28.000000000 -0800
@@ -3,7 +3,7 @@
 
 sanity_check_environment();
 
-//asdf
+// asdf
 
 require_once dirname(__FILE__).'/__init_script__.php';
 


    Apply this patch to scripts/arcanist.php? [y/N]

A slim possibility is that this is related to T2463 -- does the machine this is occurring on have XDebug listeners enabled?

I never explicitly set it up on my machine which exhibits the problem, but I thought I saw something for eclipse running on the machine we first spotted the problem from.

This is mine:

eric@eric-dev ~/wepay: php -i | grep -i xdebug
/etc/php.d/xdebug.ini,
    with Xdebug v2.2.3, Copyright (c) 2002-2013, by Derick Rethans
xdebug
xdebug support => enabled
xdebug.auto_trace => Off => Off
xdebug.cli_color => 0 => 0
xdebug.collect_assignments => Off => Off
xdebug.collect_includes => On => On
xdebug.collect_params => 0 => 0
xdebug.collect_return => Off => Off
xdebug.collect_vars => Off => Off
xdebug.coverage_enable => On => On
xdebug.default_enable => On => On
xdebug.dump.COOKIE => no value => no value
xdebug.dump.ENV => no value => no value
xdebug.dump.FILES => no value => no value
xdebug.dump.GET => no value => no value
xdebug.dump.POST => no value => no value
xdebug.dump.REQUEST => no value => no value
xdebug.dump.SERVER => no value => no value
xdebug.dump.SESSION => no value => no value
xdebug.dump_globals => On => On
xdebug.dump_once => On => On
xdebug.dump_undefined => Off => Off
xdebug.extended_info => On => On
xdebug.file_link_format => no value => no value
xdebug.idekey => no value => no value
xdebug.max_nesting_level => 100 => 100
xdebug.overload_var_dump => On => On
xdebug.profiler_aggregate => Off => Off
xdebug.profiler_append => Off => Off
xdebug.profiler_enable => Off => Off
xdebug.profiler_enable_trigger => Off => Off
xdebug.profiler_output_dir => /tmp => /tmp
xdebug.profiler_output_name => cachegrind.out.%p => cachegrind.out.%p
xdebug.remote_autostart => Off => Off
xdebug.remote_connect_back => Off => Off
xdebug.remote_cookie_expire_time => 3600 => 3600
xdebug.remote_enable => Off => Off
xdebug.remote_handler => dbgp => dbgp
xdebug.remote_host => localhost => localhost
xdebug.remote_log => no value => no value
xdebug.remote_mode => req => req
xdebug.remote_port => 9000 => 9000
xdebug.scream => Off => Off
xdebug.show_exception_trace => Off => Off
xdebug.show_local_vars => Off => Off
xdebug.show_mem_delta => Off => Off
xdebug.trace_enable_trigger => Off => Off
xdebug.trace_format => 0 => 0
xdebug.trace_options => 0 => 0
xdebug.trace_output_dir => /tmp => /tmp
xdebug.trace_output_name => trace.%c => trace.%c
xdebug.var_display_max_children => 128 => 128
xdebug.var_display_max_data => 512 => 512
xdebug.var_display_max_depth => 3 => 3

A side channel approach would be to add:

declare(ticks = 1);

...near the top of scripts/arcanist.php, then kill -HUP <pid> the process once it hangs. That should make it write a stack trace to sys_get_temp_dir() , called phabricator_backtrace_<pid>. It may or may not be in a state where it can receive the signal, but if it is that could give us a clue about what's happening.

I think xdebug.remote_autostart is the relevant bit, but I never actually looked into T2463 since we've only seen one report of it to date.

eric@eric-dev ~/wepay: cat /tmp/phabricator_backtrace_511
#0 /home/sites/dev/libphutil/src/error/PhutilErrorHandler.php(188): __phutil_signal_handler__(1)
#1 [internal function]: PhutilErrorHandler::handleError(2, 'stream_select()...', '/home/sites/dev...', 196, Array)
#2 /home/sites/dev/libphutil/src/channel/PhutilChannel.php(196): stream_select(Array, Array, Array, 1, 0)
#3 /home/sites/dev/libphutil/src/channel/PhutilChannel.php(99): PhutilChannel::waitForActivity(Array, Array, Array)
#4 /home/sites/dev/libphutil/src/channel/PhutilProtocolChannel.php(135): PhutilChannel::waitForAny(Array)
#5 /home/sites/dev/libphutil/src/console/PhutilConsole.php(215): PhutilProtocolChannel->waitForMessage()
#6 /home/sites/dev/libphutil/src/console/PhutilConsole.php(117): PhutilConsole->waitForMessage()
#7 /home/sites/dev/arcanist/src/workflow/ArcanistLintWorkflow.php(530): PhutilConsole->confirm('Apply this patc...', false)
#8 /home/sites/dev/arcanist/src/workflow/ArcanistDiffWorkflow.php(1236): ArcanistLintWorkflow->run()
#9 /home/sites/dev/arcanist/src/workflow/ArcanistDiffWorkflow.php(1197): ArcanistDiffWorkflow->runLint()
#10 /home/sites/dev/arcanist/src/workflow/ArcanistDiffWorkflow.php(418): ArcanistDiffWorkflow->runLintUnit()
#11 /home/sites/dev/arcanist/scripts/arcanist.php(322): ArcanistDiffWorkflow->run()
#12 {main}

Oh, do that for the child process? That looks like the parent. It should have a --recon child.

(It's expected that the parent will be sitting in stream_select() most of the time.)

Super weird. On a new branch to reproduce the issue it submitted the diff just fine (although still hangs if I arc diff --recon --no-diff manually)

However that trace was on the --recon process I was running manually.

Oh, sorry -- it's expected that arc diff --recon --no-diff will hang indefinitely when run on its own -- it's waiting for input. Specifically, it should output something like this:

O:20:"PhutilConsoleMessage":2:{s:26:"PhutilConsoleMessagetype";s:14:"phutil:confirm";s:26:"PhutilConsoleMessagedata";a:2:{s:6:"prompt";s:48:"Apply this patch to scripts/arcanist.php?";s:7:"default";b:0;}}

That's a message from the child (--recon) process to the parent process saying "show the user a prompt, then send back what they type in". The child will wait for a response at this point, which looks something vaguely like:

O:nn:some serializd junk:{"blah blah", "the user typed Y"}

You can't reasonably fake the protocol manually unless you're good a typing binary data.

Running arc diff until the whole thing hangs and then HUP'ing the child (and possibly the parent) may be informative, but that looks like expected behavior (waiting for a response once it gets to a confirm prompt).

Ah, gotcha. I just confirmed that the developer that reported the issue to me doesn't even have xdebug installed, so it's definitely not the listener. However I told her to grab me the next time it happens (it *was* originally being run completely normally), so I'll upload a new trace with the SIGHUP trick when I have one.

I think it was actually after the lint and unit tests had run on her system, so the autofix thing is probably total red herring. Sorry about that :/

Anything else that might be useful?

I can't think of any other ways to attack this, but I'll keep an ear open for similar reports.

Ok, --background 0 DID allow it to go through. I don't have the stack from SIGHUP as apparently she doesn't have the pcntl extension installed either (I can get it installed if necessary, although given that your first instinct was right I'm guessing it probably isn't...)

Installing pcntl and digging the traces out would be helpful, I don't have a specific theory on what's breaking. The backgrounding is just complex and somewhat hang-prone (both processes spend some time waiting on messages from the other) and interacts with other stuff in weird ways (like the xdebug issue from earlier in this thread).

epriestley triaged this task as Wishlist priority.Jan 15 2014, 8:06 PM

(Just triaging until we get some more information.)

Sounds good. We're trying to push everyone through the arc workflows for everything so if it's still a legitimate issue I should have an update pretty soon.

A couple of coworkers have been complaining about arc diff hanging at 'Linting ...', I have a feeling it might be the same bug as this one since --background 0 allows the diffs to go through.
I will try to get a helpful stack trace, but meanwhile I just wanted to ping this thread as a heads up.

(Sadly I can't seem to repro on my machine, so I have to wait until someone else has a repro before I can get a trace).

Note: This reproed even with no lint errors.

@btrahan and @lpriestley have been hitting this recently too. Unfortunately, I'm not hitting it myself. However, I'm going to steal @lpriestley's laptop in a little bit and see if I can repro it and figure things out.

I sort of hunted this down a bit, but it ends up in the crawling horror of D7748.

The root issue is that PHP doesn't let you distinguish between EAGAIN or EPIPE on some pipes, so you can't tell if you need to wait to write or throw the pipe away.

D7748 appeared to provide a method to detect this. However, it looks like this method is not working correctly in some cases: it is incorrectly detecting the pipe as EPIPE'd, based on a 0-length write followed by a "writes will not block" select, followed by another 0-length write. This result is completely absurd and as far as I can tell PHP is in the wrong here, but that doesn't help much.

I think this has become more of an issue recently because arc lint started writing a lot more information to the channel (it now prints more diagnostic information about lint analysis). This is sort of bad anyway because it's bad to fill up the channel buffer, but resolving that won't resolve the underlying issue.

I also can't reproduce this on my machine at all.

Some possible ways forward:

  • Nail this down to specific versions of PHP, and force them into --background 0.
  • Switch to unix domain sockets for --recon.
  • Figure out what's going on inside PHP and find some alternate way to distinguish between EAGAIN and EPIPE.
  • Disable --background 0 completely.
  • We don't react appropriately to closed write sockets in flush(). Fixing that would reduce this from a hang to an error, at least.

This is also hitting me on latest arcanist.

A colleague is hitting this issue as well:

after upgrading to the latest version of MacOS (yeah I know I should use Linux, I'll switch soon) my arc diff command is hanging on "Linting...". It used to work fine before.

Arc lint works fine, and if I run 'arc diff --trace' it doesn't print out any useful info (see below):

17:52 16-1-168-192-dyn:~/redacted (andrea-unit-tests)$ arc diff --trace
libphutil loaded from '/Users/freelancer2/phabricator/libphutil/src'.
arcanist loaded from '/Users/freelancer2/phabricator/arcanist/src'.
Config: Reading user configuration file "/Users/freelancer2/.arcrc"...
Config: Did not find system configuration at "/etc/arcconfig".
Working Copy: Reading .arcconfig from "/Users/freelancer2/redacted/.arcconfig".
Working Copy: Path "/Users/freelancer2/redacted" is part of `git` working copy "/Users/freelancer2/redacted".
Working Copy: Project root is at "/Users/freelancer2/redacted".
Config: Did not find local configuration at "/Users/freelancer2/redacted/.git/arc/config".
>>> [0] <conduit> conduit.connect() <bytes = 486>
>>> [1] <http> https://phabricator.freelancer.com/api/conduit.connect
<<< [1] <http> 1,413,690 us
<<< [0] <conduit> 1,413,913 us
>>> [2] <exec> $ git diff --no-ext-diff --no-textconv --raw 'HEAD' --
>>> [3] <exec> $ git ls-files --others --exclude-standard
>>> [4] <exec> $ git diff-files --name-only
<<< [3] <exec> 7,639 us
<<< [4] <exec> 7,328 us
<<< [2] <exec> 10,672 us
>>> [5] <event> diff.didCollectChanges <listeners = 0>
<<< [5] <event> 60 us
>>> [6] <exec> $ git rev-parse --verify HEAD^
<<< [6] <exec> 5,897 us
>>> [7] <exec> $ git merge-base 'origin/master' HEAD
<<< [7] <exec> 10,201 us
>>> [8] <exec> $ php '/Users/freelancer2/phabricator/arcanist/src/../scripts/arcanist.php' --recon diff --no-diff '--ansi'
>>> [9] <exec> $ git log --first-parent --format=medium '041f13e53dfdfafeb3a126b5b6e27dd6df0e8ba1'..HEAD
<<< [9] <exec> 14,371 us
>>> [10] <conduit> differential.query() <bytes = 280>
>>> [11] <http> https://phabricator.freelancer.com/api/differential.query
<<< [11] <http> 409,318 us
<<< [10] <conduit> 409,486 us
>>> [12] <conduit> differential.query() <bytes = 190>
>>> [13] <http> https://phabricator.freelancer.com/api/differential.query
<<< [13] <http> 375,102 us
<<< [12] <conduit> 375,271 us
>>> [14] <conduit> differential.getcommitmessage() <bytes = 213>
>>> [15] <http> https://phabricator.freelancer.com/api/differential.getcommitmessage
<<< [15] <http> 396,851 us
<<< [14] <conduit> 397,019 us
>>> [16] <conduit> differential.parsecommitmessage() <bytes = 427>
>>> [17] <http> https://phabricator.freelancer.com/api/differential.parsecommitmessage
<<< [17] <http> 352,822 us
<<< [16] <conduit> 352,984 us
>>> [18] <event> diff.didBuildMessage <listeners = 0>
<<< [18] <event> 76 us
>>> [19] <exec> $ git rev-parse --git-dir
<<< [19] <exec> 6,531 us
>>> [20] <exec> $ which 'editor'
<<< [20] <exec> 3,458 us
>>> [21] <exec> $ which 'nano'
<<< [21] <exec> 3,277 us
>>> [22] <exec> $ nano  '/var/folders/dp/f2dprjbn2tx6tbvmb0bdjp3h0000gp/T/edit.cqxoys6d66g4o0kw/differential-update-comments'
<<< [22] <exec> 1,627,431 us
Linting...

My plan here is:

  1. Get the "throw instead of hang" patch written and get someone who is experiencing this to confirm that it works for them.
  2. Switch --recon to use a unix domain socket instead of normal pipes, probably.
  3. Maybe try to upstream a fix for this to the PHP mainline -- this is the only real fundamental PHP language issue we've hit to date that I can recall -- but this is hard since it doesn't reproduce for me.

I ran into this issue out of the blue today, so interesting notes if helpful.

  • One day, all "arc diff" calls worked, next day maybe 10% of calls worked.
  • --background 0 fixes the issue for me.
  • My general method for "managing" my macbook pro is to always sleep it, and I would say I used "arc diff" for weeks without doing a full restart of my laptop. Following a restart, "arc diff" calls are back to 100% success rate.

Without knowing much about arcanist's inner working (or it's use of php), perhaps something ends up scarce, that then leads to this condition? Recycling of some dirty resources?

This started happening to me again today on D10113.

  • --background 0 fixes it
  • rebooting does not
  • D10113 (my dev environment in general) is having a lint issue around ZipArchive class not existing incorrectly.

Happens to me and many other colleagues @ Flatiron on OSX with:
arcanist 4b903357c232dd2e413a871ef4d50681b0297ea1 (6 Aug 2014)
libphutil 4b903357c232dd2e413a871ef4d50681b0297ea1 (6 Aug 2014)
and:
PHP 5.4.24 (cli) (built: Jan 19 2014 21:32:15 (OSX 10.9.4)

--background 0 fixes it

If you can reproduce this, can you try applying this patch to libphutil/?

https://secure.phabricator.com/differential/diff/24536/

You should be able to apply it with:

libphutil/ $ arc patch --diff 24536

...or, here's a text version:

diff --git a/src/channel/PhutilChannel.php b/src/channel/PhutilChannel.php
index 5770548..8a125ac 100644
--- a/src/channel/PhutilChannel.php
+++ b/src/channel/PhutilChannel.php
@@ -411,12 +411,21 @@ abstract class PhutilChannel {
    * @task impl
    */
   public function flush() {
+
     while (!$this->isWriteBufferEmpty()) {
+      if (!$this->isOpenForWriting()) {
+        throw new Exception(
+          pht('Trying to flush a channel which is not writable!'));
+      }
+
       self::waitForAny(array($this));
+
       if (!$this->update()) {
-        throw new Exception('Channel closed while flushing output!');
+        throw new Exception(
+          pht('Channel closed while flushing output!'));
       }
     }
+
     return $this;
   }

The expected behavior is that the hang is replaced with an explicit exception. If that's what actually happens, I can move forward on making --recon use unix domain sockets or something like that.

@epriestley we've gotten the expected result:

(env)~/code/flatiron[arcpatch-D1813_1]$ arc diff
[2014-08-08 18:59:42] EXCEPTION: (Exception) Trying to flush a channel which is not writable! at [<phutil>/src/channel/PhutilChannel.php:416]
  #0 PhutilChannel::flush() called at [<phutil>/src/channel/PhutilChannelChannel.php:88]
  #1 PhutilChannelChannel::flush() called at [<phutil>/src/console/PhutilConsole.php:202]
  #2 PhutilConsole::writeMessage(PhutilConsoleMessage) called at [<phutil>/src/console/PhutilConsole.php:187]
  #3 PhutilConsole::writeTextMessage(string, array) called at [<phutil>/src/console/PhutilConsole.php:143]
  #4 PhutilConsole::writeOut(string, string) called at [<phutil>/src/console/PhutilConsole.php:175]
  #5 PhutilConsole::redirectOutCallback(string, integer) called at [<arcanist>/scripts/arcanist.php:368]
Linting...
Exception
Unable to read results from background linting and unit testing. You can try running arc diff again with --background 0
(Run with --trace for a full exception trace.)

Awesome, thanks! Let me look into developing a domain sockets patch to fix the underlying issue.

State of the world:

  • D10198 should fix the major issue (hangs) at the cost of losing the performance gains created by the --background feature.
  • I'm going to rewrite those gains in a simpler way, but since this normally saves something on the order of a couple of seconds, this might not happen for a bit.
btrahan added a subscriber: Unknown Object (MLST).Aug 8 2014, 11:17 PM
epriestley renamed this task from `arc diff` stalls when autofix lint errors found to Restore background linting and unit tests for performance.Aug 13 2014, 5:27 PM

Not sure if this is the right place to comment, but I have this issue on Windows (git-bash shell) still (or again). This is the last couple of lines of arc diff --trace:

>>> [34] <lint> ArcanistCppcheckLinter <paths = 1>
>>> [35] <exec> $ where "D:/Program Files (x86)/Cppcheck/cppcheck.exe"
<<< [35] <exec> 31,002 us
>>> [36] <exec> $ "D:/Program Files (x86)/Cppcheck/cppcheck.exe" "--quiet" "--inline-suppr" "--xml" "--xml-version=2" "-j2" "--enable=performance,style,portability,information" "C:\Code\Honey\cvietest2\src\cvietestapp.cpp"
<<< [34] <lint> 40,003 us
>>> [37] <lint> ArcanistCppcheckLinter

And then, when I manually kill the cppcheck process (it finishes in around 1.5 seconds when the above command is run manually):

<<< [36] <exec> 20,749,074 us
<<< [37] <lint> 20,743,073 us

[2015-02-17 07:00:07] EXCEPTION: (PhutilAggregateException) Some linters failed:
    - ArcanistCppcheckLinter: CommandException: Command failed with error #1!
      COMMAND
      "D:/Program Files (x86)/Cppcheck/cppcheck.exe" "--quiet" "--inline-suppr" "--xml" "--xml-version=2" "-j2" "--enable=performance,style,portability,information" "C:\Code\Honey\cvietest2\src\cvietestapp.cpp"

      STDOUT
      (empty)

      STDERR
      <?xml version="1.0" encoding="UTF-8"?>
      <results version="2">
          <cppcheck version="1.68"/>
          <errors>
              <error id="nullPointer" severity="warning" msg="Possible null pointer dereference: args - otherwise it is redundant to check it against null." verbose="Possible null pointer dereference: args - otherwise it is redundant to check it against null.">
                  <location file="C:\Code\Honey\cvietest2\src\optionparser.h" line="1504"/>
                  <location file="C:\Code\Honey\cvietest2\src\optionparser.h" line="1668"/>
              </error>
              <error id="variableScope" severity="style" msg="The scope of the variable &apos;fullLoop&apos; can be reduced." verbose="The scope of the variable &apos;fullLoop&apos; can be reduced. Warning: Be careful when fixing this message, especially when there are inner loops. Here is an example where cppcheck will write that the scope for &apos;i&apos; can be reduced:\012void f(int x)\012{\012    int i = 0;\012    if (x) {\012        // i... (3,047 more bytes) ... at [<arcanist>\src\lint\engine\ArcanistLintEngine.php:374]
  #0 ArcanistLintEngine::run() called at [<arcanist>\src\workflow\ArcanistLintWorkflow.php:320]
  #1 ArcanistLintWorkflow::run() called at [<arcanist>\src\workflow\ArcanistDiffWorkflow.php:1229]
  #2 ArcanistDiffWorkflow::runLint() called at [<arcanist>\src\workflow\ArcanistDiffWorkflow.php:1189]
  #3 ArcanistDiffWorkflow::runLintUnit() called at [<arcanist>\src\workflow\ArcanistDiffWorkflow.php:447]
  #4 ArcanistDiffWorkflow::run() called at [<arcanist>\scripts\arcanist.php:378]

This is the size of the output, for reference:

$ "D:/Program Files (x86)/Cppcheck/cppcheck.exe" "--quiet" "--inline-suppr" "--xml" "--xml-version=2" "-j2" "--enable=performance,style,portability,information" "C:\Code\Honey\cvietest2\src\cvietestapp.cpp" 2>&1 |wc
     38     586    5850

ETA on restoring support for background unit tests?

I currently expect to never pursue this.

This saved a very small amount of time (roughly, a few seconds) and we haven't received any related requests (e.g., to improve performance here) in the last year. The complexity of saving this time is high and the benefit is very low, so it is hard to imagine we will ever reach a state where we don't have dozens of more important/valuable things to work on.

epriestley claimed this task.

See T12996 for spiritual followup.