
Resolve structural problems with Conduit API methods related to large input/output sizes and binary data

Description

Previously, see T5955.

Some API methods need to read or write data which is not a good match for JSON. In particular:

  • JSON is not naturally stream-oriented, and isn't an ideal format for transmitting very large blocks of data (for example, a 1GB git diff output).
  • JSON cannot naturally encode binary data, and isn't an ideal format for transmitting it (for example, repositories may include path names which cannot be represented naturally in JSON; see the sketch below).
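
A minimal illustration of the binary problem (Python here; the path is a made-up example of a non-UTF-8 byte string):

```python
import json

# Repository paths are arbitrary byte strings. This one is Latin-1
# encoded and is not valid UTF-8.
path = b"docs/r\xe9sum\xe9.txt"

# JSON strings must be Unicode text, so the bytes have to be decoded
# first -- and decoding fails for non-UTF-8 data.
try:
    json.dumps({"path": path.decode("utf-8")})
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xe9 in position 6
```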

Earlier, I imagined navigating the binary issue by having the client say things like Content-Type: application/bson and Accept: application/json, application/bson and having the server fall back to BSON/protobuf/messagepack/whatever. However, I'm now generally less excited about this approach (see T5955#247571 for more details). It also doesn't help with the "large data size" issue at all.
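
For concreteness, that rejected negotiation would have looked roughly like this (a sketch only: the host is a placeholder, diffusion.rawdiffquery is just an example method, and Conduit does not actually accept BSON):

```python
import requests  # assumes the third-party 'requests' package

# The client would POST BSON-encoded parameters and advertise that it
# accepts a BSON response, letting the server pick the encoding.
response = requests.post(
    "https://phabricator.example.com/api/diffusion.rawdiffquery",
    headers={
        "Content-Type": "application/bson",
        "Accept": "application/json, application/bson",
    },
    data=b"...",  # BSON-encoded call parameters would go here
)
```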

Instead, I'm inclined to pursue these approaches:

  • Uploading large blocks of data: the client uploads the data as a file, then submits the file PHID (see the sketch after this list). Here, "large" generally means anything bigger than the File chunking block size (4MB).
  • Downloading large blocks of data: the server stores a temporary file and gives the client a file PHID / URI.
  • Uploading binary data: case-by-case? I'd ideally like the answer to be "not allowed", but that would mean the API does not support certain operations, like using hg grep or git grep to search for binary sequences in files. Good riddance?
  • Downloading binary data: we provide a "readable" encoding and a lossless base64 raw encoding in some kind of standard type-format (also sketched below).
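
A sketch of what the first and last bullets might look like from a client, assuming Python with the requests package. The method name differential.hypothetical.creatediff and the field names in the type-format are invented, and a real client would use the chunked-upload flow (file.allocate / file.uploadchunk) for payloads over 4MB rather than a single file.upload call:

```python
import base64
import json
import requests

HOST = "https://phabricator.example.com"    # placeholder install
TOKEN = "api-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # placeholder API token

def conduit(method, params):
    """Minimal Conduit call: POST form-encoded parameters to /api/<method>."""
    fields = dict(params)
    fields["api.token"] = TOKEN
    body = requests.post("%s/api/%s" % (HOST, method), data=fields).json()
    if body["error_code"]:
        raise RuntimeError(body["error_info"])
    return body["result"]

# Uploading a large block of data: push it as a file first, then hand
# the resulting PHID to the actual method instead of inlining the data.
raw_diff = b"diff --git a/x b/x\n..."  # pretend this is a 1GB diff
file_phid = conduit("file.upload", {
    "name": "huge.diff",
    "data_base64": base64.b64encode(raw_diff),
})
conduit("differential.hypothetical.creatediff", {  # invented method name
    "diffFilePHID": file_phid,
})

# Downloading binary data: one possible "standard type-format" pairs a
# lossy human-readable decode with a lossless base64 raw form.
binary_path = b"docs/r\xe9sum\xe9.txt"
print(json.dumps({
    "readable": binary_path.decode("utf-8", errors="replace"),
    "raw.base64": base64.b64encode(binary_path).decode("ascii"),
}))
```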

This isn't entirely exhaustive. Some open questions:

  • What do we do about arbitrarily long data in unusual places? An example is 2GB path names (see T10832). There are likely many weird/abusive variants of this where things like author names, branch names, tag names, etc., might be specifiable or corruptible to be arbitrarily long. (Offhand, hg bookmark happily accepts bookmark names up to the point where xargs fails.) Ideally we just reject these use cases, but inevitably someone wants arbitrarily long unit test names or whatever.
  • What do we do about framing calls over SSH?
  • In cases where a call may return an arbitrary amount of data, we'd generally like to apply time and/or byte limits and communicate them to callers. How do we do this? The existing tooHuge / tooSlow support feels a little silly as a general pattern (a strawman is sketched below).
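
For the last question, one strawman is a response envelope that reports the limits that were applied, so callers can detect truncation and fetch the remainder via the temporary-file mechanism above. All field names here are invented:

```python
import json

# Hypothetical envelope: explicit limit metadata instead of ad-hoc
# tooHuge / tooSlow flags. Nothing here is an existing Conduit format.
response = {
    "result": "...first 8MB of diff text...",
    "limits": {
        "byteLimit": 8388608,   # bytes of output the server will inline
        "timeLimit": 30,        # seconds the server will spend
        "truncated": True,      # caller should fetch the full result
        "fullResultFilePHID": "PHID-FILE-placeholder",
    },
}
print(json.dumps(response, indent=2))
```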