
Implement Phragment
Closed, Resolved · Public

Description

Currently there is no real way to version files. Because of this, even though you can upload a file as an artifact from a build, there's no way to specify that this file is a particular version of a particular package (e.g. the result of the build is "MyApp 1.0").

This also means there's no way to relate files with regard to making queries like "Retrieve the latest version of MyApp for Windows" or "Give me specific changes between my version and the latest version".

I'd like to explore the idea of a "Bundle" application (the alternative name was Phormation, but I don't particularly like it). This is intended to be a lightweight package and distribution system, although there are no initial plans to represent dependencies between packages (because the packages in this case are delivered directly to users as installable or extractable packages).

The following concepts exist in Bundle:

  • BundlePackage: A package such as "MyApp". This encompasses different versions of a package.
  • BundleVersion: A version of a package. This has an associated filePHID which is intended to be a ZIP file. I'm a bit torn between a 1:1 mapping, where the target file is a ZIP package, and a 1:* mapping where each individual file is mapped (see the sketch below). The latter has the distinct disadvantage that it makes uploading results from a build painful, since each file would need to be transferred individually.
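
As a rough sketch of how the two options might be modeled (Python used purely for illustration; the class names, fields, and PHID formats are assumptions, not Phabricator's actual schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BundlePackage:
    phid: str   # e.g. "PHID-BPKG-..."
    name: str   # e.g. "MyApp"; encompasses all versions of the package

@dataclass
class BundleVersion:
    phid: str
    package_phid: str   # points at the owning BundlePackage
    version: str        # e.g. "1.0"
    # Option A (1:1): a single ZIP artifact per version, one upload per build.
    zip_file_phid: str = ""
    # Option B (1:*): one filePHID per individual file; finer-grained, but
    # a build would have to transfer every file separately.
    file_phids: List[str] = field(default_factory=list)
```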

There's the intention of supporting a delta-based update mechanism, where a client can ask "I currently have version 15 of MyApp; what are the delta updates to the files between 15 and the latest version?" and the server responds with the appropriate delta updates. This is similar to the way https://github.com/hach-que/Pivot.Update provides incremental updates for software, although having this service integrated with Phabricator's build system makes it much more powerful.
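
To make the shape of that request concrete, here is a minimal client-side sketch assuming a hypothetical HTTP endpoint and response format (the URL, parameters, and fields below are invented for illustration, not an existing API):

```python
import requests  # any HTTP client would do; endpoint names are assumptions

def fetch_delta_updates(base_url, package, have_version):
    """Ask the server which delta updates take us from the version we
    already have to the latest version of the package."""
    resp = requests.get(
        f"{base_url}/bundle/{package}/updates",
        params={"from": have_version},  # e.g. from=15
    )
    resp.raise_for_status()
    # Hypothetical response shape: the latest version number plus a list
    # of per-file deltas the client should download and apply.
    return resp.json()  # {"latest": ..., "deltas": [{"path": ..., "patch_uri": ...}, ...]}
```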

Event Timeline

hach-que claimed this task.
hach-que raised the priority of this task from to Needs Triage.
hach-que updated the task description.
hach-que added a project: Phabricator.
hach-que added subscribers: hach-que, epriestley.

+@btrahan
+@staticshock

There's some possible discussion in T4056 (some similar features, for other types of binary assets) and maybe T3742 (some similar features, but more user-focused). Another tool sort of in this space, Artifactory, came up recently as well.

In T4056, it sounded like the use case for versioned asset management (e.g., the ability to build an "Asset Library" in Phabricator) might be less strong than I imagined, but I still think this is at least worth thinking about. I believe it's very similar to real build artifacts; the curation would just be a lot more manual/human.

For T3742, my big issue was that it doesn't seem to enable us to do much once we build it, but Asset Libraries and Bundles provide some plausible motivators for at least some of the featureset. (Overlaying these on Files seems like the cleanest fit, though, rather than trying to push all of Files into a tree structure.)

A possible theme in these tasks is to make the unique key a "BundlePackage" has be a virtual path, like "MyApp/Releases/MyApp.exe", and have the browse view be directory-like instead of list-like. This might make a lot of sense for some of the non-release use cases, like caching random .jar files which represent intermediate build stages, as building a typical Java project generates over 100 trillion intermediate .jar files. This could also resolve the 1:1 / 1:* issue by making each "file" in the "filesystem" a 1:1, and then having the API let you download either MyApp/x/y/* (to get a bunch of stuff) or MyApp/x/y/z.exe (to get a specific file).
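
A minimal sketch of how path-based resolution could work, assuming a flat index from virtual paths to filePHIDs (the index contents, paths, and PHIDs are made up for illustration; a real implementation would key a database table on the path):

```python
# Hypothetical in-memory index mapping virtual paths to filePHIDs.
PATH_INDEX = {
    "MyApp/Releases/MyApp.exe": "PHID-FILE-aaaa",
    "MyApp/Intermediate/core.jar": "PHID-FILE-bbbb",
    "MyApp/Intermediate/util.jar": "PHID-FILE-cccc",
}

def resolve(path_or_pattern):
    """Return the filePHIDs under a path: an exact path gives one file
    (the 1:1 case), a trailing wildcard gives everything beneath it."""
    if path_or_pattern.endswith("/*"):
        prefix = path_or_pattern[:-1]  # drop the "*", keep the trailing slash
        return {p: phid for p, phid in PATH_INDEX.items() if p.startswith(prefix)}
    return {path_or_pattern: PATH_INDEX[path_or_pattern]}

# resolve("MyApp/Intermediate/*")      -> both intermediate .jar files
# resolve("MyApp/Releases/MyApp.exe")  -> just the release binary
```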

One drawback to this is that we're basically writing Git at that point, sort of? Maybe we should actually consider Git as the storage engine, but instead of writing binary data into it, write metadata? Pretty much filePHIDs. So the repository would be a directory structure with a bunch of files that have one filePHID in them (and maybe a little bit of other data), and the actual data would live in Files, and that would let us avoid issues with Git not handling binaries especially well and let us clean up old binaries easily by deleting them. On the other hand, this is a massive, massive amount of effort, and we should never need to fork a file and could easily do updates transactionally, which I think kills about 95% of the value out of the gate. So this is probably a terrible idea.

If we pursue the paths thing, one thing I think we should consider is making the first element of every path a first-class "Namespace" sort of object which has all of the policy rules for the namespace. I worry that we made a mistake in Phriction by not doing this, and that the way forward there is to let you create multiple "wikis" but give each "wiki" the same policies for all of its pages. I think dealing with policies which can be set at any node in a directory structure may be really complex to implement and difficult for users to understand. On the other hand, the filesystem works like this and no one seems too freaked out about it, so maybe my concerns about complexity aren't reasonable ones. We could try doing a pathwise implementation in one system and see how bad it is.
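
A tiny sketch of what first-class namespaces might look like, under the assumption that policy lookup only ever consults the leading path component (the namespace names and policy values are hypothetical):

```python
# Hypothetical: policies hang off a "Namespace" object named by the first
# path component, rather than off arbitrary nodes in the directory tree.
NAMESPACES = {
    "MyApp": {"view": "public", "edit": "engineering"},
    "SecretApp": {"view": "security-team", "edit": "security-team"},
}

def policy_for(path):
    """Resolve the policy for any path from its leading path component."""
    namespace = path.split("/", 1)[0]
    return NAMESPACES[namespace]

# policy_for("MyApp/Releases/MyApp.exe") -> {"view": "public", "edit": "engineering"}
```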

The major technical challenge I see on the horizon here is dealing with large files and high I/O rates. In particular:

  • Uploads over HTTP currently read the entire file to disk, then read the entire file into RAM. It would be much better to stream the file directly to permanent storage (like S3) when we have a non-disk storage engine. However, we cannot do this with PHP over normal HTTP: we don't get to run any code until PHP has already done everything; in general, there is no way for us to run code as a request arrives in PHP. We can do streaming uploads over SSH easily, or we could send them through the Node server, but both of these are less than ideal for some use cases and have relatively high complexity.
  • Large upload limits are a giant pain to configure in PHP, too.
  • Downloads over HTTP currently read the entire file into memory, then write the entire file to the network. I believe we can stream them instead in PHP, but we may encounter some difficulty dealing with webserver output buffering. I'm 95% confident this is tractable, but may take some massaging (for example, we may need to disable proxy_buffering in nginx, which might disable compression for other types of responses by default).
  • The cost we pay in non-network resources (RAM, CPU) to hold an IO-bound PHP process open is very high compared to alternatives. This probably is not a major issue: each php-fpm takes about 1% of the available RAM on this box, and hopefully we'll max out the pipes way before we get to 50-100 simultaneous downloads per box. But if you have, say, 32 web workers, and 32 people try to download artifacts over dialup at the same time, we'll DOS the box for other traffic, basically having the application "slow loris" itself (and we can't just have 10,000 workers, since they won't fit in RAM). I'm not especially concerned about this, and I think we can mitigate it when we get there, but it's something to watch out for. This is another possible argument for bringing Node into the pipeline.
  • For normal Files use cases, CDN integration is desirable at some point (see T2382). This probably doesn't impact things much, but is tangentially related.
  • Conduit needs some protocol work to support binary transmission efficiently. We currently put file content in JSON, which means we have to base64 encode it, so files are larger on the wire than they are on disk (illustrated below). This was very easy to implement but is goofy. It shouldn't be hard to fix, though.
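
To illustrate the overhead in that last point, here is a small Python snippet showing how base64-in-JSON inflates a payload by roughly a third (the sizes are illustrative and this is not Conduit's actual wire format):

```python
import base64
import json
import os

# Embedding raw file content in JSON forces a base64 encode, which
# inflates the payload by roughly 4/3 before any JSON framing overhead.
payload = os.urandom(3 * 1024 * 1024)           # stand-in for a 3 MiB file
encoded = base64.b64encode(payload).decode()    # ~4 MiB of base64 text
message = json.dumps({"data": encoded})

print(len(payload), len(encoded), len(message))
# The JSON body ends up about a third larger than the file on disk.
```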

I think we can navigate all of this as long as we're careful about looking ahead when designing how the protocols work. Of these, uploading is the murkiest part by far to me. My initial thought is that at the high end we might end up with an io.phabricator.* domain which you configure to go to Node, and then it handles streaming, events, and holding connections open cheaply, and talks to Phabricator over HTTP for all the metadata stuff. But this may be very high-end -- it looks like we can't even stream requests through nginx. (We could also examine replacing Node with Tornado or something like that if we get this far; there's no technical reason either the upload or notification stuff needs to be in Node, and the notification server is basically a toy at this point.) On the low end, you'd just use the same domain for both www and I/O, and that would work until you hit file size/process limits or want better performance.

We can probably build incremental updates into Files today -- basically, the client would send the sha1 of the file it has and the sha1 of the file it wants, and Phabricator can compute the diff and send it if it knows both sha1s. This would let us gently bump into some of the issues above, although it probably doesn't have many practical applications until we have more structure.
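
As a rough illustration of the idea (not a proposal for the actual wire format), here is a tiny rsync-flavored delta in Python: the server would look up both blobs by their sha1s and compute something like this, though a real implementation would more likely use an existing binary delta tool such as xdelta or bsdiff:

```python
import hashlib

BLOCK = 4096  # illustrative block size

def block_hashes(data: bytes):
    """Hash each fixed-size block of the old file, keyed by digest."""
    return {hashlib.sha1(data[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(data), BLOCK)}

def compute_delta(old: bytes, new: bytes):
    """For each block of the new file, either reference an identical
    block in the old file or ship the raw bytes."""
    known = block_hashes(old)
    delta = []
    for i in range(0, len(new), BLOCK):
        chunk = new[i:i + BLOCK]
        digest = hashlib.sha1(chunk).hexdigest()
        if digest in known:
            delta.append(("copy", known[digest], len(chunk)))
        else:
            delta.append(("data", chunk))
    return delta

def apply_delta(old: bytes, delta):
    """Reconstruct the new file from the old file plus the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```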

hach-que renamed this task from Implement Bundle to Implement Phragment.Dec 5 2013, 9:29 PM