
"ssh-exec" sometimes hangs when git cloning larger repos
Closed, ResolvedPublic

Description

Problem

Recently our company migrated to Phabricator and everything went well, BUT for some reason a git submodule update --init occasionally gets stuck on our build slaves.

Our environment

  • Latest Phabricator master running on a four-core Ubuntu 14.04 server (nginx, php5).
  • Some big Git repos with submodules hosted in diffusion.
  • Jenkins build server with four build slaves.

How to reproduce

I've written a minimal C# command-line app which concurrently executes the following two commands locally:

  • git clone <big_repo>
  • git submodule update --init

Client side

After the second iteration "ssh.exe" gets stuck with the following command line args (from ProcessHacker):
ssh git@<our_server> "git-upload-pack 'repo.git'"

Server side

The php script "ssh-exec" hangs forever:

I'm no php guru so I hope this strace can help you a bit:

Process 28854 attached
11:21:11.326317 select(6, [5], [], [5], {0, 258527}) = 0 (Timeout)
11:21:11.586045 read(5, 0x1ecc9e0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
11:21:11.586648 select(6, [5], [], [5], {1, 0}) = 0 (Timeout)
11:21:12.588346 read(5, 0x1ecc9e0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
11:21:12.588962 select(6, [5], [], [5], {1, 0}) = 0 (Timeout)
11:21:13.590614 read(5, 0x1ecc9e0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
11:21:13.591170 select(6, [5], [], [5], {1, 0}) = 0 (Timeout)
11:21:14.592899 read(5, 0x1ecc9e0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
11:21:14.593405 select(6, [5], [], [5], {1, 0}) = 0 (Timeout)
11:21:15.594946 read(5, 0x1ecc9e0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
11:21:15.595386 select(6, [5], [], [5], {1, 0}) = 0 (Timeout)
11:21:16.597088 read(5, 0x1ecc9e0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
11:21:16.597549 select(6, [5], [], [5], {1, 0}) = 0 (Timeout)
11:21:17.599138 read(5, 0x1ecc9e0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
11:21:17.599624 select(6, [5], [], [5], {1, 0}) = 0 (Timeout)
11:21:18.601216 read(5, 0x1ecc9e0, 8192) = -1 EAGAIN (Resource temporarily unavailable)
11:21:18.601686 select(6, [5], [], [5], {1, 0}^C
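For readers unfamiliar with these syscalls, here is a minimal Python sketch (my own illustration, not Phabricator code) of how pipe EOF would look to a select()/read() loop like the one above. Notably, select() returning 0 (Timeout) forever suggests the kernel does not consider the pipe at EOF, i.e. some process may still hold the write end open:

```python
import os
import select

# Create a pipe and make the read end non-blocking, like the fd in the trace.
r, w = os.pipe()
os.set_blocking(r, False)

# Writer open but idle: select() times out and read() raises EAGAIN,
# which is exactly the pattern repeating in the strace above.
readable, _, _ = select.select([r], [], [], 0.1)
assert readable == []              # select() = 0 (Timeout)
try:
    os.read(r, 8192)
except BlockingIOError:
    pass                           # read() = -1 EAGAIN

# Once the LAST write end closes, select() reports the fd readable and
# read() returns b'' (EOF). The hung loop never reaches this state.
os.close(w)
readable, _, _ = select.select([r], [], [], 0.1)
eof_chunk = os.read(r, 8192)
assert readable == [r]
assert eof_chunk == b''
```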

If I simply kill this php process on the server, then I get the following message locally:

Terminated
fatal: early EOF
fatal: The remote end hung up unexpectedly
fatal: index-pack failed

I've also checked that my sshd daemon is working correctly by simply cloning over port 222 (the normal sshd port), and everything works fine.

Any ideas?
Thank you very much for your effort in advance!

Event Timeline

VitaminCpp updated the task description. (Show Details)
VitaminCpp added projects: Git, libphutil.
VitaminCpp added a subscriber: VitaminCpp.
avivey added a subscriber: avivey.Nov 7 2015, 7:14 PM

ssh git@<our_server> "git-upload-pack 'repo.git'"

  • git-upload-pack is the server-side part of the git fetch/git pull transfer protocol.
  • 'repo.git' should look like diffusion/CALLSIGN/name.git; Is that right?

You can get better information by running the sshd in debug mode - see https://secure.phabricator.com/book/phabricator/article/diffusion_hosting/#troubleshooting-ssh.

Other than that, there's not enough information here to help us help you; See https://secure.phabricator.com/book/phabcontrib/article/bug_reports/ for what kind of information we need.

Yes, the path to 'repo.git' looks like this: plink.exe git@<our_server> "git-upload-pack '/diffusion/TPTB/3RDPARTY/tbb.git'".
I've tried running sshd in debug mode, but it terminates on every new connection attempt because it isn't running as a daemon.
The only logging output I got when my clone attempt gets stuck is the following on the server side:

Nov 16 11:03:02 ei-srv-l-141 sshd[64152]: Set /proc/self/oom_score_adj to 0
Nov 16 11:03:02 ei-srv-l-141 sshd[64152]: Connection from 192.168.192.179 port 28269 on 192.168.192.141 port 22
Nov 16 11:03:03 ei-srv-l-141 sshd[64152]: Postponed publickey for git from 192.168.192.179 port 28269 ssh2 [preauth]
Nov 16 11:03:04 ei-srv-l-141 sshd[64152]: Accepted publickey for git from 192.168.192.179 port 28269 ssh2: RSA c1:1d:81:e5:81:e7:0e:b3:cd:92:3b:fa:b9:22:73:ae
Nov 16 11:03:04 ei-srv-l-141 sshd[64152]: User child is on pid 64160
Nov 16 11:03:04 ei-srv-l-141 sshd[64160]: Starting session: forced-command (key-option) ''/home/build/phabricator/phabricator/bin/ssh-exec' '--phabricator-ssh-user' 'm.herzog' '--phabricator-ssh-key' '32'' for git from 192.168.192.179 port 28269
Nov 16 11:03:04 ei-srv-l-141 sudo:      git : TTY=unknown ; PWD=/home/git ; USER=build ; COMMAND=/usr/bin/git-upload-pack -- /var/repo/TPL/
Nov 16 11:03:04 ei-srv-l-141 sudo: pam_unix(sudo:session): session opened for user build by (uid=0)
Nov 16 11:03:07 ei-srv-l-141 sudo: pam_unix(sudo:session): session closed for user build

Any ideas? Thanks.

After playing around for a while, I found that it seems the hang occurs only if another user concurrently does a git clone:

Maybe a git submodule update --init --recursive doesn't play well with the phabricator SSHD redirection?!

Are the submodules all hosted in the same phabricator instance?

cole added a subscriber: cole.Nov 16 2015, 7:12 PM

We have been experiencing the same issue, though we can reliably reproduce on larger repositories with no submodules.

During the clone, the process tree looks like the following, with git-upload-pack running:

sshd
  `-sh -c '/opt/phabricator/phabricator/bin/ssh-exec' '--phabricator-ssh-user' 'jenkins' '--phabricator-ssh-key' '31'
      `-php /opt/phabricator/phabricator/bin/ssh-exec --phabricator-ssh-user jenkins --phabricator-ssh-key 31
          `-sh -c '/usr/bin/sudo' -E -n -u 'www-data' -- git-upload-pack -- '/var/lib/phabricator/repos/tempest'
              `-sudo -E -n -u www-data -- git-upload-pack -- /var/lib/phabricator/repos/tempest
                  `-git-upload-pack -- /var/lib/phabricator/repos/tempest
                      `-git pack-objects --revs --all --stdout --progress --delta-base-offset

When a clone hangs, it appears that on the server side the git-upload-pack completes:

sshd
  `-sh -c '/opt/phabricator/phabricator/bin/ssh-exec' '--phabricator-ssh-user' 'jenkins' '--phabricator-ssh-key' '31'
      `-php /opt/phabricator/phabricator/bin/ssh-exec --phabricator-ssh-user jenkins --phabricator-ssh-key 31

The git client performing the clone remains hung, with the SSH connection open, waiting for the remaining data.

When this occurs, the process is in an infinite loop spinning in: https://github.com/phacility/phabricator/blob/master/src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php#L16 until the client is killed or the process is killed on the server.

As @VitaminCpp reported, an strace of the hung ssh-exec process shows that it is trying to read from a pipe, but failing because the resource is unavailable. Presumably this was the pipe from the output of git-upload-pack, which is no longer available because the process has terminated:

$ sudo strace -p 16791
Process 16791 attached - interrupt to quit
select(7, [6], [], [6], {0, 813239})    = 0 (Timeout)
read(6, 0x332b9f0, 8192)                = -1 EAGAIN (Resource temporarily unavailable)
select(7, [6], [], [6], {1, 0})         = 0 (Timeout)
read(6, 0x332b9f0, 8192)                = -1 EAGAIN (Resource temporarily unavailable)

We can fairly reliably reproduce the issue, and I am more than happy to gather further information. Are there any other details that will be useful in narrowing down this issue?

From both of your installs, can we have the versions of:

  • OS
  • php
  • git
  • openssh
  • phabricator

?

cole added a comment.Nov 16 2015, 7:34 PM
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 12.04.4 LTS
Release:	12.04
Codename:	precise
$ php --version
PHP 5.3.10-1ubuntu3.9 with Suhosin-Patch (cli) (built: Dec 12 2013 04:27:25)
Copyright (c) 1997-2012 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2012 Zend Technologies
$ git --version
git version 1.7.9.5
$ dpkg -s openssh-server | grep Version
Version: 1:6.2p2-6ubuntu0.1

Phabricator: 1a84a2fe4bc485fb7aa6e03241684bc506ccb191 (updated this morning to test if the issue had possibly been fixed in master)

Not sure if this helps, but:

In T9724#144776, @cole wrote:
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 12.04.4 LTS
Release:	12.04
Codename:	precise

That's ancient. 14.04 is also LTS...

In T9724#144776, @cole wrote:
$ git --version
git version 1.7.9.5

This might be the reason... what's the harm in doing:

sudo apt-get remove git
sudo add-apt-repository ppa:git-core/ppa
sudo apt-get update
sudo apt-get install git
git --version
cole added a comment.Nov 16 2015, 7:56 PM

Not sure if this helps, but:

In T9724#144776, @cole wrote:
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 12.04.4 LTS
Release:	12.04
Codename:	precise

That's ancient. 14.04 is also LTS...

Indeed. It is certainly something we have planned, but other infrastructure projects have taken precedence of late.

In T9724#144776, @cole wrote:
$ git --version
git version 1.7.9.5

This might be the reason... what's the harm in doing:

sudo apt-get remove git
sudo add-apt-repository ppa:git-core/ppa
sudo apt-get update
sudo apt-get install git
git --version

I will give that a shot. FWIW, I cannot reproduce this issue outside of Phabricator: cloning directly over the non-Phabricator sshd instance never fails. I will upgrade git now and follow up with my results.

cole added a comment.Nov 16 2015, 8:38 PM

I have upgraded git to the latest version.

$ git --version
git version 2.6.3

Unfortunately we are still seeing random hangs when cloning.

Thanks for checking, @cole.

tail added a subscriber: tail.Nov 16 2015, 9:31 PM

Same as @cole... I've reported this issue with an already up-to-date OS and Git:

build@ei-srv-l-141:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.3 LTS
Release:        14.04
Codename:       trusty
build@ei-srv-l-141:~$ php --version
PHP 5.5.9-1ubuntu4.14 (cli) (built: Oct 28 2015 01:34:46)
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies
build@ei-srv-l-141:~$ git --version
git version 2.6.3
build@ei-srv-l-141:~$ dpkg -s openssh-server | grep Version
Version: 1:6.6p1-2ubuntu2.3
BYK added a subscriber: BYK.Nov 17 2015, 8:16 PM

Any news on this?

I can't reproduce this. Here's what I did:

  • I opened three terminal windows.
  • I ran git clone ssh://dweller@secure.phabricator.com/diffusion/P/phabricator.git phabricator-X in each at the same time, where X is one of 1, 2 or 3.

All clones completed successfully, and I got three clones in the working directory after they finished.

The software versions on this server are very similar to yours:

ubuntu@secure001:/core$ php --version
PHP 5.5.9-1ubuntu4.14 (cli) (built: Oct 28 2015 01:34:46) 
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies
ubuntu@secure001:/core$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.3 LTS
Release:	14.04
Codename:	trusty

Our git is a little older:

ubuntu@secure001:/core$ git --version
git version 1.9.1

Our sshd is slightly different because we have a patch against it:

$ ./lib/sshd/bin/sshd-phabricator --invalid-flag-to-show-version
unknown option -- -
OpenSSH_6.7p1, OpenSSL 1.0.1f 6 Jan 2014
usage: sshd [-46DdeiqTt] [-b bits] [-C connection_spec] [-c host_cert_file]
            [-E log_file] [-f config_file] [-g login_grace_time]
            [-h host_key_file] [-k key_gen_time] [-o option] [-p port]
            [-u len]

...but it sounds like older git was also problematic for you.

Can you describe your reproduction process in more detail? From this:

...hang occurs only if another user concurrently does a git clone.

...it sounds like concurrency of 2 is sufficient to trigger the issue in your environment, so we would expect my test case (with concurrency of 3) to also trigger the issue? Or am I misunderstanding?

Does this issue reproduce in your environments under a simple approach like "run git clone in several windows", or only in a more complex case (like "run git submodule update --init in one window, and git clone in a separate window, but only some of the time, also depends on the phase of the moon")?

It would be most helpful if you can find a sequence of concurrent commands which are sufficient to reproduce the issue against this server.

cole added a comment.Dec 1 2015, 3:42 PM

I can't reproduce this. Here's what I did:

  • I opened three terminal windows.
  • I ran git clone ssh://dweller@secure.phabricator.com/diffusion/P/phabricator.git phabricator-X in each at the same time, where X is one of 1, 2 or 3.

All clones completed successfully, and I got three clones in the working directory after they finished.

In our case, I haven't been able to verify that it is a concurrency issue; we experience it without concurrent clones of the repository. The issue does not manifest on every clone, but I can reliably reproduce it within a few minutes of repeatedly cloning the repository.

The simple test case I have been using is:

$ while true; do git clone <repo>; rm -rf <repo>; done
[...]
Cloning into 'tempest-ad-server'...
remote: Counting objects: 24923, done.
remote: Compressing objects: 100% (14838/14838), done.
Receiving objects:  99% (24674/24923), 23.66 MiB | 4.54 MiB/s
**HUNG!**

Usually I can trigger the hang within about 5 minutes of running this. Our repository has no submodules, so in our case I don't think any submodule actions play into the issue. It always hangs at 99%, but not always on the same object.

We have quite a few repositories hosted on Phabricator, but can only reproduce with a few of our largest (100 MB+, ~35k commits) and with a medium-sized repository (~40 MB, ~3,500 commits).

In an attempt to narrow this down further and verify whether it was an issue with repository size or number of objects, I ran the same while loop doing a shallow clone of the same repository, which ran for over 12 hours without triggering a hang.

$ while true; do git clone --depth=1 <repo>; rm -rf <repo>; done

Not sure if this will help reproducing on your end, or narrowing down the cause, but I am more than happy to gather more data if necessary.

The same here...

I've tried the same test case as @cole by executing $ while true; do git clone <repo>; rm -rf <repo>; done, and after the 6th try it gets stuck at 36%:


So no submodules were involved.

For our company this is currently a real show stopper, because every day our nightly builds are broken. :(

IMHO it must have something to do with the custom sshd config, because a git clone triggered over the standard ssh daemon works fine!

This is our sshd_config.phabricator:

# NOTE: You must have OpenSSHD 6.2 or newer; support for AuthorizedKeysCommand
# was added in this version.

# NOTE: Edit these to the correct values for your setup.

AuthorizedKeysCommand /usr/libexec/phabricator-ssh-hook.sh
AuthorizedKeysCommandUser git
AllowUsers git

# You may need to tweak these options, but mostly they just turn off everything
# dangerous.

Port 22
Protocol 2
PermitRootLogin no
AllowAgentForwarding no
AllowTcpForwarding no
PrintMotd no
PrintLastLog no
PasswordAuthentication no
AuthorizedKeysFile none

# Logging
SyslogFacility AUTH
LogLevel INFO

TCPKeepAlive yes

ClientAliveInterval 60

PidFile /var/run/sshd-phabricator.pid

Does this make sense to someone?

VitaminCpp added a comment.EditedDec 4 2015, 1:02 PM

Again I attached strace to narrow down the problem further.

  • No submodules were involved.
  • Concurrently launched 4 clone tasks from my workstation against a single repository.

After the 9th concurrent clone attempt it was stuck again.
This time I also captured lsof output to get more info about the FDs involved in the select()-read() loop described in my initial post:

root@server:~# strace -fp 53778
Process 53778 attached
select(6, [5], [], [5], {0, 749545})    = 0 (Timeout)
read(5, 0x3c8ea00, 8192)                = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {1, 0})         = 0 (Timeout)
read(5, 0x3c8ea00, 8192)                = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {1, 0})         = 0 (Timeout)
read(5, 0x3c8ea00, 8192)                = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {1, 0})         = 0 (Timeout)
read(5, 0x3c8ea00, 8192)                = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {1, 0})         = 0 (Timeout)
read(5, 0x3c8ea00, 8192)                = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {1, 0})         = 0 (Timeout)
read(5, 0x3c8ea00, 8192)                = -1 EAGAIN (Resource temporarily unavailable)
select(6, [5], [], [5], {1, 0}^CProcess 53778 detached
 <detached ...>

root@server:~# lsof -n -p 53778 | egrep -v "(DIR|REG)"
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
php     53778  git    0r  FIFO    0,8      0t0 176194 pipe
php     53778  git    1w  FIFO    0,8      0t0 176195 pipe
php     53778  git    2w  FIFO    0,8      0t0 176196 pipe
php     53778  git    3u  IPv4 173661      0t0    TCP 192.168.192.141:31356->192.168.192.145:mysql (ESTABLISHED)
php     53778  git    4u  IPv4 173665      0t0    TCP 192.168.192.141:31358->192.168.192.145:mysql (ESTABLISHED)
php     53778  git    5r  FIFO    0,8      0t0 176194 pipe <= STDIN pipe?
php     53778  git    6w  FIFO    0,8      0t0 176195 pipe
php     53778  git    7w  FIFO    0,8      0t0 176196 pipe
php     53778  git    8u  IPv4 173667      0t0    TCP 192.168.192.141:31359->192.168.192.145:mysql (ESTABLISHED)
php     53778  git    9u  IPv4 173669      0t0    TCP 192.168.192.141:31360->192.168.192.145:mysql (ESTABLISHED)
php     53778  git   10u  IPv4 173672      0t0    TCP 192.168.192.141:31361->192.168.192.145:mysql (ESTABLISHED)
php     53778  git   11u  IPv4 174572      0t0    TCP 192.168.192.141:31362->192.168.192.145:mysql (ESTABLISHED)

I also turned on SSH logging by setting log.ssh.path, and this is the tail output:

[Fri, 04 Dec 2015 13:44:59 +0100]       44192   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       3381024 904     9238810
[Fri, 04 Dec 2015 13:44:59 +0100]       44202   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       3858381 904     9238805
[Fri, 04 Dec 2015 13:44:59 +0100]       44212   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       4011832 904     9238810
[Fri, 04 Dec 2015 13:45:05 +0100]       44299   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       2497779 904     9238805
[Fri, 04 Dec 2015 13:45:05 +0100]       44308   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       3676087 904     9238805
[Fri, 04 Dec 2015 13:45:05 +0100]       44330   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       4202265 904     9238805
[Fri, 04 Dec 2015 13:45:05 +0100]       44312   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       4288784 904     9238805
[Fri, 04 Dec 2015 13:45:10 +0100]       44460   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       2879070 904     9238810
[Fri, 04 Dec 2015 13:45:11 +0100]       44472   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       3998569 904     9238810
[Fri, 04 Dec 2015 13:45:11 +0100]       44482   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       4238504 904     9238810
[Fri, 04 Dec 2015 13:45:11 +0100]       44466   our_server  192.168.192.179 git     build   m.herzog        git-upload-pack git-upload-pack /diffusion/AS/example.git 0       1687314301      904     9222954

IMHO not very informative in this case... :)

As you can see, it's definitely not a network-related problem.
I think the script is waiting forever on some input at ssh-exec.php:269, in $metrics_channel->flush();.

This issue is getting really weird...
Any ideas?

I can reproduce this on my dev install (Ubuntu 14.04.3 LTS, git 2.6, OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3, OpenSSL 1.0.1f 6 Jan 2014, PHP 5.5.9-1ubuntu4.14):

$ git clone --depth 5  ssh://git@localhost:7422/diffusion/MNG
Cloning into 'MNG'...
remote: Counting objects: 3560, done.
remote: Compressing objects: 100% (2960/2960), done.
Receiving objects:  99% (3525/3560), 153.83 MiB | 102.55 MiB/s
<hang>

As a temporary workaround, our company currently uses direct checkouts through the standard SSH server until a stable Phabricator SSH daemon is available.
I hope this will be fixed soon...
Thanks anyway!

epriestley added a comment.EditedDec 16 2015, 9:10 PM

There is a small possibility that D14801 fixed this. That's highly speculative, but it is the kind of low-level bug which could theoretically cause intermittent, difficult-to-reproduce hangs of this nature.

avivey renamed this task from "ssh-exec" hangs when git cloning larger repos with submodules to "ssh-exec" sometimes hangs when git cloning larger repos .Dec 18 2015, 9:49 PM

Updating to HEAD and running the clone in a loop, it still hangs eventually :(

Same here... Sadly D14801 doesn't fix this issue.

Today I had some more time to investigate this issue a bit further.
Debugging with gdb, I figured out that after a while the method DiffusionGitSSHWorkflow->waitForGitClient() in DiffusionGitUploadPackSSHWorkflow.php:35 loops forever:

[0x7f595b9d4208] stream_select(array(1)[0x3efe808], array(0)[0x3efe718], array(1)[0x40fb4c8], 1, 0) /home/build/phabricator/libphutil/src/channel/PhutilChannel.php:197
[0x7f595b9d3848] PhutilChannel::waitForActivity(array(1)[0x40fb588], array(1)[0x40fb588], array(0)[0x40fb5e8]) /home/build/phabricator/libphutil/src/channel/PhutilChannel.php:100
[0x7f595b9d3748] PhutilChannel::waitForAny(array(1)[0x40fb588]) /home/build/phabricator/phabricator/src/applications/diffusion/ssh/DiffusionGitSSHWorkflow.php:25
[0x7f595b9d34a0] DiffusionGitSSHWorkflow->waitForGitClient() /home/build/phabricator/phabricator/src/applications/diffusion/ssh/DiffusionGitUploadPackSSHWorkflow.php:35
[0x7f595b9d2ec8] DiffusionGitUploadPackSSHWorkflow->executeRepositoryOperations() /home/build/phabricator/phabricator/src/applications/diffusion/ssh/DiffusionSSHWorkflow.php:142
[0x7f595b9d28a0] DiffusionSSHWorkflow->execute(object[0x373c5c8]) /home/build/phabricator/phabricator/scripts/ssh/ssh-exec.php:267

By narrowing this down further, it seems that PhutilSocketChannel->readBytes() doesn't detect an EOF of the "git-upload-pack" STDOUT pipe correctly.
I'm no PHP expert, but maybe it has something to do with the issue pointed out here?

I also tried commenting out this code in DiffusionGitUploadPackSSHWorkflow.php:34:

if (!$err) {
  $this->waitForGitClient();
}

This improved the situation a bit and delayed the freeze for a while, but then it hangs at another location:

[0x7fe13b16c878] stream_select(array(1)[0x35547d0], array(0)[0x31e08e0], array(1)[0x31eb638], 1, 0) /home/build/phabricator/libphutil/src/channel/PhutilChannel.php:197
[0x7fe13b16beb8] PhutilChannel::waitForActivity(array(1)[0x3593138], array(1)[0x3593138], array(0)[0x31cc040]) /home/build/phabricator/libphutil/src/channel/PhutilChannel.php:100
[0x7fe13b16bd98] PhutilChannel::waitForAny(array(1)[0x3593138]) /home/build/phabricator/libphutil/src/channel/PhutilChannel.php:418
[0x7fe13b16bb00] PhutilChannel->flush() /home/build/phabricator/libphutil/src/channel/PhutilChannelChannel.php:88
[0x7fe13b16b8a0] PhutilChannelChannel->flush() /home/build/phabricator/phabricator/scripts/ssh/ssh-exec.php:269

This time it loops forever in PhutilChannel->flush()...

So IMHO these polling loops don't detect EOF situations on STDOUT pipes well?
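To make the suspected failure mode concrete, here is a minimal Python sketch (my own illustration, not libphutil code) of a polling loop that handles all three read-side cases correctly: select() timeout means "no data yet", EAGAIN means a spurious wakeup, and a readable fd that yields zero bytes means true EOF:

```python
import os
import select

def drain(fd, timeout=1.0):
    """Read from fd until EOF, distinguishing three cases:
    select() timeout (no data yet), EAGAIN (spurious wakeup), and a
    readable fd that returns zero bytes (true EOF)."""
    os.set_blocking(fd, False)
    chunks = []
    while True:
        readable, _, _ = select.select([fd], [], [], timeout)
        if not readable:
            continue                 # no data yet; writer still open
        try:
            chunk = os.read(fd, 8192)
        except BlockingIOError:      # EAGAIN: select woke us spuriously
            continue
        if chunk == b'':             # readable + zero bytes => EOF
            return b''.join(chunks)
        chunks.append(chunk)

# Simulate git-upload-pack writing its output and then exiting.
r, w = os.pipe()
os.write(w, b'pack data')
os.close(w)
print(drain(r))                      # -> b'pack data'
```

A loop that treats EAGAIN and EOF the same way (or never checks for the zero-byte read) would spin forever exactly as in the straces above.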

I'm fairly confident that D14920 will fix this. I was able to reproduce it with the "run in a loop" method, and narrow it down from there.

When you have a chance, can anyone else who was hitting this update libphutil/ to rPHU5afd76 and see if it still reproduces?

I think the fread() thing from PHP.net is just the user being clueless. He's opening a connection to an HTTP server, writing this "HTTP request":

Data sent by socket

...with no newline, and then expecting a response and an EOF. Here's a simplified script which hangs:

<?php

$fp = fsockopen("example.com", 80);

fwrite($fp, "Data sent by socket");
while (!feof($fp)) {
  echo fread($fp, 1024);
}

fclose($fp);

Here's a "fix" which does not hang:

 $fp = fsockopen("example.com", 80);
 
-fwrite($fp, "Data sent by socket");
+fwrite($fp, "Data sent by socket\n");
 while (!feof($fp)) {
   echo fread($fp, 1024);
 }

Here's the output:

$ php -f example2.php 
HTTP/1.0 400 Bad Request
Content-Type: text/html
Content-Length: 349
Connection: close
Date: Thu, 31 Dec 2015 22:16:58 GMT
Server: ECSF (rhv/8198)

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
	<head>
		<title>400 - Bad Request</title>
	</head>
	<body>
		<h1>400 - Bad Request</h1>
	</body>
</html>

Here's another "fix" which is probably what he wanted:

 $fp = fsockopen("example.com", 80);
 
-fwrite($fp, "Data sent by socket");
+fwrite($fp, "GET / HTTP/1.0\nHost: example.com\n\n");
 while (!feof($fp)) {
   echo fread($fp, 1024);
 }

Which does this:

epriestley@orbital ~ $ php -f example2.php 
HTTP/1.0 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Thu, 31 Dec 2015 22:18:34 GMT
Etag: "359670651"
Expires: Thu, 07 Jan 2016 22:18:34 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (rhv/818F)
Vary: Accept-Encoding
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 1270
Connection: close

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

I'm not aware of any issues with PHP and EOF on reads. I am aware of one issue with EOF on writes (see D7748), but I think this one is just "every comment on php.net is really dumb".
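For completeness, the write-side EOF issue is a different failure mode: writing to a pipe whose read end has gone away produces EPIPE (or SIGPIPE), rather than a hung read. A quick Python illustration (mine, not the D7748 change itself):

```python
import os
import signal

# Ignore SIGPIPE so the failed write raises EPIPE instead of killing
# the process (Python does this by default, but we make it explicit).
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

r, w = os.pipe()
os.close(r)                    # the reader goes away

got_epipe = False
try:
    os.write(w, b"data")
except BrokenPipeError:        # errno EPIPE
    got_epipe = True

assert got_epipe
```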

BYK added a comment.EditedDec 31 2015, 10:46 PM

This is another reason for me to dislike PHP, but I really appreciate the time you've taken to explain the issue, in addition to a promising fix for this task. Couldn't have asked for a better New Year's gift :)

@epriestley Thank you very much for this fix and the explanation!
It seems that this issue is now finally solved and the task can be closed.
Great start of a new year! ;)

epriestley closed this task as Resolved.Jan 4 2016, 1:18 PM
epriestley claimed this task.

Thanks for the report! Let us know if you run into anything else.