
Compress Conduit client requests
Closed, ResolvedPublic

Description

See PHI1679. Some Conduit requests are relatively large (at least several megabytes, especially uploading diffs until T13338), and essentially all Conduit requests should benefit from compression (bodies are always JSON and no Conduit requests contain binary data, see T13337).

Currently, ConduitClient does not compress requests. Phabricator has supported decoding compressed requests forever, so this should be a free performance improvement.
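The shape of such a change is straightforward (a sketch under assumed names, not the actual ConduitClient implementation): compress the encoded body with gzencode() and mark it with "Content-Encoding: gzip" so the server's existing decoding path can inflate it.

```php
<?php
// Sketch (hypothetical variable names, not ConduitClient itself):
// compress a JSON request body and label it for the server.

$params = array(
  'constraints' => array('ids' => array(123)),
);

$json_body = json_encode(array('params' => json_encode($params)));

// gzencode() produces a gzip-format (RFC 1952) bytestream, which is
// what "Content-Encoding: gzip" promises.
$compressed_body = gzencode($json_body);

$headers = array(
  'Content-Encoding' => 'gzip',
  'Content-Length' => strlen($compressed_body),
);

// The server recovers the original body with gzdecode().
assert(gzdecode($compressed_body) === $json_body);
```

Since bodies are always JSON, compression ratios should be consistently good; there is no need to probe the payload for compressibility first.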

Event Timeline

epriestley triaged this task as Normal priority. Apr 8 2020, 3:58 PM
epriestley created this task.

PHP has a setting called post_max_size. If a POST request is larger than this size, PHP does not populate $_POST or $_FILES.

So if we receive a request which has an empty $_POST value, it may be because:

  • the user sent no data; or
  • the user sent more than post_max_size bytes of data.

Today, we try to detect this in PhabricatorStartup and issue a warning that data may have been silently discarded because it exceeds post_max_size. The rough shape of this test is that if Content-Length is positive but $_POST is empty, we assume $_POST is missing because PHP dropped the data.

However, there is really a third condition which also causes an empty $_POST value:

  • the request has "Content-Encoding", and some unknown set of version or content conditions are also satisfied.

The "unknown conditions" are because byte-for-byte identical requests result in populating $_POST with array("" => "") (which is not empty) and array() (which is empty) on two different test hosts.

Currently, this check sometimes misfires and detects "body was dropped by post_max_size" when the real condition is "body is compressed".

This check is quite old, and our processing of request bodies has changed significantly during work on T4369. In general, we now always rebuild $_POST from the raw body, and the raw body appears to not be affected by post_max_size. So, on the normal pathway, the value of post_max_size does not matter.
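The rebuild pathway looks roughly like this (an assumed shape for illustration, not the exact T4369 code):

```php
<?php
// Sketch: rebuild $_POST from the raw request body instead of relying
// on PHP's built-in parsing, which is subject to post_max_size.

$raw_body = file_get_contents('php://input');

// If the client sent "Content-Encoding: gzip", inflate first.
if (isset($_SERVER['HTTP_CONTENT_ENCODING']) &&
    $_SERVER['HTTP_CONTENT_ENCODING'] === 'gzip') {
  $raw_body = gzdecode($raw_body);
}

// For "application/x-www-form-urlencoded" bodies, parse_str() rebuilds
// the same structure PHP would normally have placed in $_POST.
parse_str($raw_body, $post);
$_POST = $post;
```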

It can still matter in one (?) other case:

  • If you use a bare <input type="file"> to upload a large file (so the request is multipart/form-data), and enable_post_data_reading is enabled (this is common), $_FILES won't be populated and things will fail.

However, this is at least somewhat-clearly communicated and this workflow is very rare.

I think this check can simply be removed.

A grand adventure in PHI1679 led to identifying the Apache "SetInputFilter DEFLATE" directive as a problem with implementing compression.

The behavior of this directive is that it decompresses "Content-Encoding: gzip" requests inline, without changing the "Content-Encoding" or "Content-Length" headers. For example, if the client sends a 1000-byte compressed request body, we normally expect the server to receive a 1000-byte compressed request:

HTTP ...
Content-Length: 1000
Content-Encoding: gzip
<1,000 bytes of gzip-encoded compressed data...>

What we'll actually receive is something like this, with "Content-Encoding" and "Content-Length" headers that do not correspond to the request body:

HTTP ...
Content-Length: 1000
Content-Encoding: gzip
<3,281 bytes of uncompressed data>

This is pretty ridiculous, and Phabricator can't easily detect that it's happening. PHP also trusts the Content-Length header, so if we read file_get_contents('php://input'), we'll get the first 1,000 uncompressed bytes and then EOF. The other 2,281 bytes are lost and unrecoverable.

But wait, there's more!

<?php

require_once 'scripts/init/init-script.php';

$stream = fopen('LICENSE', 'rb');

if (!$stream) {
  throw new Exception('fopen() failed!');
}

// 15-bit window; adding 32 tells zlib to auto-detect gzip or zlib headers.
$zlib_window = 15 + 32;

$ok = stream_filter_append(
  $stream,
  'zlib.inflate',
  STREAM_FILTER_READ,
  array(
    'window' => $zlib_window,
  ));

if (!$ok) {
  throw new Exception('stream_filter_append() failed!');
}

var_dump(
  array(
    'result' => stream_get_contents($stream),
  ));

LICENSE contains plain text (which is definitely not a valid gzip-compressed bytestream), so when we try to read it through a "zlib.inflate" stream filter we'll get an error, right? Like, if I cp LICENSE LICENSE.gz and gunzip LICENSE.gz, I get a sensible error:

$ gunzip LICENSE.gz 
gunzip: LICENSE.gz: not in gzip format

...so we might guess that PHP has the same behavior, right? Ah, no, of course not. You get a successful, zero-length read:

$ php -f test.php 
array(1) {
  ["result"]=>
  string(0) ""
}

In any case, D21116 adds a self-test for a server configuration which mangles gzip bodies.
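One way such a self-test can work (a sketch; D21116's actual implementation may differ): send a known gzip-compressed payload to ourselves and verify the body still begins with the gzip magic bytes when it arrives. If the header claims "gzip" but the magic bytes are gone, something inflated the request in transit.

```php
<?php
// Sketch: a gzip bytestream always begins with the magic bytes
// 0x1f 0x8b. If a body labeled "Content-Encoding: gzip" does not
// start with them, a filter like "SetInputFilter DEFLATE" has
// probably already inflated it.

function body_looks_gzip_encoded($raw_body) {
  return strncmp($raw_body, "\x1f\x8b", 2) === 0;
}

$intact = gzencode('self-test payload');
var_dump(body_looks_gzip_encoded($intact));   // bool(true)

$mangled = 'self-test payload'; // inflated in transit
var_dump(body_looks_gzip_encoded($mangled));  // bool(false)
```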

In fairness, "SetInputFilter DEFLATE" is documented as uniquely opinionated, if you catch this box and read between the lines a little bit:

Screen Shot 2020-04-14 at 2.41.45 PM.png (98×763 px, 23 KB) [inline screenshot of the warning box in the Apache documentation]

PHP's error handling behavior is also documented if you know what to look for:

... [A] stream may fail ... [a] stream ... may also ... not ... list ... most ... error[s].
https://www.php.net/manual/en/stream.errors.php

There's also one minor related concern in PHI1638 that I want to take care of here, since it's adjacent.

One possible improvement remains here: in streaming mode, HTTPSFuture will not send "Accept-Encoding: gzip" and cannot inflate responses. Today, this only applies to arc download.

I think the PHP APIs available for doing this are stream_filter_append() and inflate_init(). The error handling behavior of stream filters isn't great (per above) and inflate_init() is PHP7+ only and seems possibly-immature. For example, inflate_get_status(), which is important in a correct implementation, is documented as:

Usually returns either ZLIB_OK or ZLIB_STREAM_END.
https://www.php.net/manual/en/function.inflate-get-status.php

I suspect this is possible to get right, but it looks pretty fiddly. Since there isn't a ton of motivation to get this working right now, I'm going to leave it for whenever an "I regularly download enormous, uncompressed XML files with arc download" use case arises.
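For reference, the fiddly PHP 7+ pathway looks roughly like this (a sketch of the inflate_init()/inflate_add() API, not an implementation this task commits to):

```php
<?php
// Sketch of incremental inflation with inflate_init()/inflate_add()
// (PHP 7+). In a real streaming client each $chunk would arrive off
// the wire; here we just split a compressed buffer.

$compressed = gzencode(str_repeat('some response data; ', 1000));

$context = inflate_init(ZLIB_ENCODING_GZIP);

$output = '';
foreach (str_split($compressed, 256) as $chunk) {
  $result = inflate_add($context, $chunk);
  if ($result === false) {
    throw new Exception('inflate_add() failed!');
  }
  $output .= $result;
}

// Per the documentation caveat above, inflate_get_status() "usually"
// returns ZLIB_STREAM_END once a complete stream has been consumed;
// anything else suggests truncation.
if (inflate_get_status($context) !== ZLIB_STREAM_END) {
  throw new Exception('Compressed stream was truncated!');
}

echo strlen($output), "\n";
```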