Page MenuHomePhabricator

Recover from rare EINTR signal interruptions during "stream_select(..., 0)"
ClosedPublic

Authored by epriestley on Feb 1 2019, 6:36 PM.

Details

Summary

Fixes T13243. This is generally a mess, but this patch seems like it's a step forward and refines our handling of possible conditions here.

I think this can only hurt us if there's some error condition where all of these are true:

  • fwrite() returns 0.
  • stream_select() errors.
  • The stream is legitimately in a dead/error/failures state and should not be retried.

Ideally, this shouldn't exist. But it's hard to be terribly confident that this is actually true.

(I'll hold this until after the release cut.)

Test Plan

In T13243, blocked a pipe to sleep 15, then interrupted a stream_select() by sending SIGHUP to the writing process. Saw 0 from fwrite() plus false from stream_select().

Diff Detail

Repository
rPHU libphutil
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

epriestley created this revision.Feb 1 2019, 6:36 PM
epriestley requested review of this revision.Feb 1 2019, 6:37 PM
epriestley retitled this revision from Recover from rare EINTR signal interruptioons during "stream_select(..., 0)" to Recover from rare EINTR signal interruptions during "stream_select(..., 0)".Feb 1 2019, 6:37 PM

The Discourse thread has a possible way to distinguish between failures because of EINTR and other failures, but it involves mangling the error handler and using preg_match() against English language error strings. Until we know of a condition where this simpler behavior is wrong, I'm hesitant to dip into that mess.

epriestley updated this revision to Diff 47952.Feb 1 2019, 6:42 PM
  • For consistency/clarity, call the result $result rather than $error.
amckinley requested changes to this revision.Feb 5 2019, 7:22 PM
  • fwrite() returns 0.
  • stream_select() errors.
  • The stream is legitimately in a dead/error/failures state and should not be retried.

We could race ourselves into this state by having a stream that changes state between the fwrite and stream_select calls, right? Like going from buffer full to EPIPE? In that case, the caller will just retry the write and get an error the next time, which doesn't seem like that bad of an outcome.

Otherwise looks good; holding for what I think are important comment corrections.

src/utils/utils.php
1009

"between EAGAIN, EPIPE, and EINTR."

1020–1032

Maybe it's worthwhile to setup an error handler and stop using the error suppression operator so we can log the actual messages? That at least would make it easier to create a repro the next time some combination of moon phase/ambient humidity bites us again.

1029

"can cause 'fwrite()' to return '0'"

This revision now requires changes to proceed.Feb 5 2019, 7:22 PM
epriestley added a comment.EditedFeb 5 2019, 7:41 PM

We could race ourselves into this state by having a stream that changes state between the fwrite and stream_select calls, right? Like going from buffer full to EPIPE? In that case, the caller will just retry the write and get an error the next time, which doesn't seem like that bad of an outcome.

Yes, but I think in all cases where the state changes during this sequence of operations we either: do the right thing anyway; or retry and get it right next time.

The assumption this hinges upon is that once we go into EPIPE we never go back (pipes never magically fix themselves) so once we positively detect that we went to EPIPE somehow that detection remains valid.

src/utils/utils.php
1009

Conceivably, we could get interrupted during either stream_select() or fwrite(). We're only dealing with EINTR when it interrupts stream_select() right now. Let me see if I can get fwrite() to EINTR.

1020–1032

We now expect this callsite to "regularly" emit EINTR errors as part of normal operation -- what should we do about that? Generate a log full of known-meaningless errors? Add a preg_match() filter to exclude these specific English-language error strings from this specific callsite?

amckinley added inline comments.Feb 5 2019, 7:50 PM
src/utils/utils.php
1020–1032

Good point. I'm just thinking it's pretty lucky that the Discourse reporter had the time, inclination, and ability to track this issue down to the extent that they did, and I don't know what we'd do the next time we got a "run this script in a loop for 8 hours to repro" issue.

I can't get signals during fwrite() to do anything. I did something like this:

<?php

pcntl_async_signals(true);
error_reporting(E_ALL | E_STRICT);
set_error_handler('onerror');
pcntl_signal(SIGHUP, 'onsignal');

xlog('This process PID: %s.', getmypid());

$stdout = fopen('php://stdout', 'w');
if (!$stdout) {
  throw new Exception(pht('Unable to open stdout.'));
}

stream_set_blocking($stdout, 0);

$start = time();
$total = 0;
while (time() < $start + 10) {
  $buf = str_repeat('a', 1024 * 1024);
  $result = fwrite($stdout, $buf);
  if ($result !== strlen($buf)) {
    xlog('! %s', json_encode($result));
  } else {
    $total += $result;
  }
}

xlog('Wrote %d bytes.', $total);

function onerror($errno, $errstr) {
  xlog('%d: %s', $errno, $errstr);
}

function onsignal($signo) {
  xlog('Got signal %d', $signo);
}

function xlog() {
  $args = func_get_args();
  fwrite(STDERR, call_user_func_array('sprintf', $args)."\n");
}

Then ran that as:

$ php -f test.php > /dev/null

...and HUP'd it a bunch of times in another window. But fwrite() never faltered:

$ php -f test.php > /dev/null
This process PID: 69251.
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Got signal 1
Wrote 77456211968 bytes.
epriestley updated this revision to Diff 47991.Feb 5 2019, 8:02 PM
  • Fix "fread()" typo.
  • Explicitly note that signals during fwrite() don't have any apparent effect.
amckinley accepted this revision.Feb 5 2019, 8:07 PM

I can't get signals during fwrite() to do anything.

I think doing a shell redirect to /dev/null will result in fwrites that return instantly (or even if they don't, maybe 1MB is too tiny of a window to interrupt. If you really wanted to run this to ground, you could probably put pv in the shell pipeline with a rate limit, but I'm satisfied with our current level of understanding of this issue.

This revision is now accepted and ready to land.Feb 5 2019, 8:07 PM

Ah, good call. php -f test.php | tee /dev/null > /dev/null is good enough to block the pipe as long as the buffer is at least ~512 bytes on my machine, and larger buffer sizes hit a lot of 0-length EAGAIN writes and not-the-full-size-of-the-buffer writes.

But, I still can't get it to do anything interesting when I SIGHUP it.

But, I still can't get it to do anything interesting when I SIGHUP it.

Looking at the docs for pcntl_signal, the last argument is restart_syscalls, which is implicitly true if not specified. Possibly the culprit?

I also traced fwrite() down through the PHP source to possibly end up here (at least, when we're writing to stdout):

static size_t php_output_stdout(const char *str, size_t str_len)
{
	fwrite(str, 1, str_len, stdout);
	return str_len;
}

...wherein the return value of fwrite() (in C) is ignored and str_len is returned unconditionally. This is a very PHP function. ¯\_(ツ)_/¯

(I think that's possibly an "emergency" implementation used during startup/shutdown before everything else is actually active, I'm not terribly familiar with the PHP source and there are several layers of indirection and function pointers around stream handling.)

...wherein the return value of fwrite() (in C) is ignored and str_len is returned unconditionally.

ohwellok

I think that's not really where we end up, more likely php_stdiop_write() instead? That checks the return value and replaces it with 0 if it looks like it might prevent us from continuing. Very helpful!

		int bytes_written = write(data->fd, buf, count);
		if (bytes_written < 0) return 0;
This revision was automatically updated to reflect the committed changes.