Page MenuHomePhabricator

Writing to streams can run into issues with EINTR
Closed, ResolvedPublic

Description

See https://discourse.phabricator-community.org/t/sporadic-git-cloning-hang-over-ssh/2233/. This is probably a real issue but the reproduction case is "clone 9000 times in a loop under a full moon" and the proposed patch involves "preg_match()" against English language error messages. Hopefully we can do better on both dimensions.

See also https://bugs.php.net/bug.php?id=39598 in the PHP upstream.

Event Timeline

epriestley triaged this task as Low priority.

Here's a script which can hit these behaviors without cloning 9000 working copies in a loop:

#!/usr/bin/env php
<?php

pcntl_async_signals(true);
error_reporting(E_ALL | E_STRICT);
set_error_handler('onerror');
pcntl_signal(SIGHUP, 'onsignal');

$stdout = fopen('php://stdout', 'w');
if (!$stdout) {
  throw new Exception(pht('Unable to open stdout.'));
}

stream_set_blocking($stdout, 0);
$one_megabyte = str_repeat('.', 1024 * 1024);

xlog('This PID', getmypid());

while (true) {
  xlog('Trying a write', '...');

  $result = fwrite($stdout, $one_megabyte);
  xlog('fwrite()', $result);

  $read = array();
  $write = array($stdout);
  $except = array();

  xlog('Starting a', 'stream_select()');
  $result = stream_select($read, $write, $except, 5);
  xlog('stream_select()', $result);
  xlog('error_get_last()', error_get_last());
  xlog('posix_get_last_error()', posix_get_last_error());

  sleep(1);
}


function xlog($label, $value) {
  error_log($label.': '.json_encode($value));
}

function onerror($errno, $errstr) {
  xlog($errno, $errstr);
}

function onsignal($signo) {
  xlog('Got signal', $signo);
}

The general idea here is that you run it like:

$ ./select_errors.php | sleep 15

...and then maybe send it some signals with:

$ kill -HUP <pid>

The general behavior here is:

  • The first fwrite() will write some bytes. These bytes will fill up the buffer on our stdout / sleep's stdin.
  • Since sleep does not read stdin, the buffer will remain full. Subsequent fwrite() calls will write 0 bytes. This corresponds to EAGAIN.
  • After 5 seconds (whatever you pass to sleep X), the sleep process will exit. This will break the pipe, and put us in an EPIPE condition.
  • If you HUP the process in another terminal at the right time, you can interrupt the stream_select() call. This is EINTR.

My observations are:

  • In EAGAIN, the write writes 0 bytes and the stream_select() says we're not writable. This is reasonable.
  • In EPIPE, the write writes 0 bytes but the stream_select() says we're writable. This is super not great and the original problem we're working around.
  • In EINTR, the write writes 0 bytes and the stream_select() returns an error. This is the new condition we aren't handling properly.

That is:

fwrite(): errorfwrite(): 0fwrite(): 1+
stream_select(): errorTrust fwrite() error.NEW! Assume EINTR.Trust fwrite() success.
stream_select(): 0Trust fwrite() error.Legitimate EAGAIN.Trust fwrite() success.
stream_select(): 1Trust fwrite() error.Old. Heuristically guess whether this is a real EAGAIN that we raced against or an EPIPE.Trust fwrite() success.

That table isn't very helpful, I guess.

The patch is straightforward, at least, although we have to assume that the only condition which causes fwrite() to return 0 and then stream_select() to return an error is EINTR. That seems reasonable, but this is a huge mess so who knows.

None of error_get_last() or posix_get_last_error() do anything. I don't have socket_last_error() and we don't depend on socket support, and the Discourse thread suggests this doesn't matter anyway.

jbeck added a subscriber: jbeck.Feb 3 2019, 10:50 PM

I wouldn't exactly say it's a huge mess. It's just an interrupted system call, which is fairly well documented. See for example http://poincare.matf.bg.ac.rs/~ivana/courses/ps/sistemi_knjige/pomocno/apue/APUE/0201433079/ch10lev1sec5.html. It doesn't exactly help that PHP gives you a broken abstraction over the underlying C implementation.

The "mess" part is the broken PHP. In C, where you can tell what happened, the behavior is all reasonable.

jbeck added a comment.Feb 4 2019, 1:04 AM

Ok gotcha.

That upstream bug report is really something else.

It's true that it has been 12 years, but I'm sure fwrite() will start returning an error code when it encounters a permanent, fatal EPIPE error condition soon.