
Can Arcanist plausibly be compiled into a binary?
Open, NormalPublic

Assigned To
Authored By
Apr 21 2022, 8:24 PM
Referenced Files
F10635414: IMG_1119.jpeg
Apr 23 2022, 10:08 PM
F10635365: 67243942457__DA576CC8-D560-4CA7-AFC1-DB6452349F04 2.jpeg
Apr 23 2022, 10:08 PM
F10635302: 66226204266__9237AE74-4AEB-45CF-BC35-AD5EBA259AF3.jpeg
Apr 23 2022, 10:08 PM
F10635423: IMG_1041.jpeg
Apr 23 2022, 10:08 PM
F10635290: vanity.jpg
Apr 23 2022, 10:08 PM
"Love" token, awarded by mormegil."Y So Serious" token, awarded by cburroughs."Manufacturing Defect?" token, awarded by cspeckmim.


Although PHP does a lot of good for Phabricator, it also creates some tough problems. These are documented elsewhere, but for completeness:

  • It's hard for users to install PHP, and they don't want to install PHP -- and macOS will stop shipping with PHP soon, and Windows doesn't ship with PHP (and never has).
  • Phabricator would benefit from having access to certain services (full-text search, full-codebase search, repository graph storage) that very likely aren't practical to write in PHP because they are too sensitive to performance, control over data in memory, or both.
  • Likewise, Phabricator would probably benefit from having a native webserver (no Apache dependency) and notification server (no Node dependency).
  • Various extensions and runtime changes (see T2312) could benefit Phabricator.

Hypothetically, Arcanist can be compiled into a native binary which has a statically linked PHP runtime and is hard-coded to run Arcanist.

A simple version of this is to replace bin/arc with a copy of php which just hard-codes the runtime arguments -f path/to/arcanist.php -- $@. This is obviously kind of goofy, but then we get this pathway forward:

  • build the compiler toolchain required to produce a static bin/arc, which is just "PHP in an Arcanist costume" and make it work on macOS and Windows;
  • decorate the PHP/C FFI stuff into easy C extension support;
  • precompile all the native code into a single binary to sweep PHP under the rug.

A personal motivation here is that I want to make a robot that has a blinking light, and that might be simpler if I could just build an MQTT server on top of Phabricator. But I want my robot and Phabricator to call some of the same MQTT code, and PHP is bad for robots.

A general challenge is that I have no idea how building things works. Here are things I generally believe to be true or true-ish:

  • A ".c" file can be compiled into a "static library" (sometimes .so?) or a "dynamic library" (sometimes .o?), maybe? What's the difference? How does this work on Windows (.dll ~= .o)? What is .dylib? What are the differences between Linux and macOS?
    • Which symbols in a ".c" file are present in the library? How can you control which symbols are emitted?
    • Can you enumerate symbols in an object file? How? Can you easily do this at runtime?
    • How much information about symbol names is preserved? Can you meaningfully enumerate types, e.g., subclasses of X, at runtime?
    • Can a binary enumerate its own symbols?
    • Why does the linker (or compiler?) need ".h" files? What happens if the definition in the ".h" file isn't the same as the definition in the object file?
  • Binaries generally load symbols automatically at startup time by loading dynamic libraries, I think?
    • The arguments for dynamic libraries over static libraries are mostly: security and memory usage? Do these really matter in 2022, at least in desktop environments? Doesn't a single Electron app take 85GB of RAM? Why isn't more stuff compiled statically?
    • Can binaries load objects at runtime? Is this rare? Why?
    • How can you tell what symbols a binary depends on? How can you tell what libraries it will try to load at startup?
    • What happens if a binary depends on f(int x) and loads f(float x)? Or, what prevents this?
    • What happens if a binary loads x.o and y.o and they each define a symbol with the same name?
  • Can we build a single binary with a bunch of data in it (e.g., a picture of a cat) without breaking anything?
    • Does the system always load an entire binary into memory at startup, motivating separation of large chunks of data?
  • If I compile a binary (or a .o, or a .so) on one system, how can I tell which systems it will work correctly on?
    • What happens if I try to use it on the "wrong" system?
  • What are the practical limits of multi-system or multi-architecture binaries?
    • Can a binary built on Ubuntu14 run on Ubuntu20 on the same hardware? Can it run on Debian? In what cases will it be unable to run?
  • How can PHP be built statically? How hard is this?
  • Why does ./configure spend 15 minutes compiling 800 programs to figure out if my system supports integers in 2022?

Event Timeline

epriestley triaged this task as Normal priority.Apr 21 2022, 8:24 PM
epriestley created this task.


$ cat random.c
int get_random_number() {
  return 5;
}
$ gcc -c -o random.o random.c
$ gcc -shared -o random.so random.o

Can you enumerate symbols in an object file?

Yes, with nm:

$ nm random.so
0000000000003fb0 T _get_random_number
                 U dyld_stub_binder

Which symbols in a ".c" file are present in the library? How can you control which symbols are emitted?

I think "almost all of them", at least by default. There's a complicated dance you can do with gcc -ffunction-sections -fdata-sections -Wl,--gc-sections ... to strip dead code, but the compiler is willing to build a library out of a bare .c file so it has no idea if any global symbol is going to be called by something that loads it or not.

For binaries with a single entry point GCC can do some analysis, but it seems like this process is fundamentally complicated because you only need a .c and you can get a symbol.

The strip binary can strip symbol names but this seems (?) pretty coarse and I think it is just throwing away symbols, not eliminating unreachable code.

GCC will drop static functions with no callers. However, static functions that are called by other functions are emitted as symbols, although those symbol names can be stripped with strip -x ..., probably nondestructively?

There's a bunch of __attribute__((visibility("..."))) stuff too, and "configure these 7 settings in Xcode exactly like this, only works in old versions of Xcode", but the consensus from users on StackOverflow wanting to hide all their secret proprietary function names in their static library files is that there's no simple way to do this and you're basically getting everything unless you put a lot of time and effort into it or write a script that parses the .o format and mangles the symbols.

How much information about symbol names is preserved? Can you meaningfully enumerate types, e.g., subclasses of X, at runtime?

This is kind of a loaded question, since classes aren't symbols at this level.

Full function names appear to be preserved. C++ class functions get mangled? C++ namespaces also get mangled. This can be somewhat undone with nm --demangle .... It looks like there's a whole complicated set of sometimes-per-compiler mangling rules.

extern "C" and various _cdecl directives can control name mangling.

With C++ RTTI you can get a bunch of class/type information embedded into class vtables (I know some of these words) but this really only powers RTTI and doesn't support enumeration.

But, you can likely roll your own with a bit of code generation and then dlopen()?

Can a binary enumerate its own symbols?

There doesn't seem to be any standard support for symbol enumeration. It's obviously possible since nm does it, but there's no dllist(...) or similar.

At least on Linux, a binary can inspect libraries it has loaded with dl_iterate_phdr(), but this seems a few steps removed from symbol introspection.

Roll Your Own Introspection

A library can seemingly provide a symbol listing like this:

  • Have the library define a "libname_get_symbols()" function.
  • Have the application dlopen(...) the library, then dlsym(...) the function, then call it.

You need the symbol to be namespaced with libname so it doesn't collide with other libraries, and the application needs to know both "libname" and the path to the library. I also thought a binary couldn't dlopen() itself, but dlopen(NULL, ...) returns a handle for the main program, so this may be workable.
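A minimal sketch of the listing side of this pattern, without the dlopen() and dlsym() steps. The library name "libdemo", the two exported functions, and the libdemo_find() helper are all made up for illustration; a real library would generate the table rather than maintain it by hand.

```c
#include <stddef.h>
#include <string.h>

/* Signature shared by every function this hypothetical library exports. */
typedef int (*libdemo_fn)(void);

/* One row of the hand-maintained (or code-generated) symbol listing. */
typedef struct {
  const char *name;
  libdemo_fn fn;
} libdemo_symbol;

static int get_random_number(void) { return 5; }
static int get_answer(void) { return 42; }

/* The "libname_get_symbols()" entry point. Returns a table terminated by
 * an entry whose name is NULL. "libdemo" is a made-up library name. */
const libdemo_symbol *libdemo_get_symbols(void) {
  static const libdemo_symbol symbols[] = {
    {"get_random_number", get_random_number},
    {"get_answer", get_answer},
    {NULL, NULL},
  };
  return symbols;
}

/* Application side: walk the listing and resolve a function by name.
 * Returns NULL when the library does not list that name. */
libdemo_fn libdemo_find(const char *name) {
  const libdemo_symbol *s;
  for (s = libdemo_get_symbols(); s->name != NULL; s++) {
    if (strcmp(s->name, name) == 0) {
      return s->fn;
    }
  }
  return NULL;
}
```

Because the table carries function pointers as well as names, the application only ever needs to dlsym() the one namespaced entry point; everything else is reachable through the listing.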

An issue is that once you know the name of the class, you can't new it. So the library also needs to export a new_classname(...) symbol that just does something like:

extern "C" Animal *new_Zebra(void) {
  return new Zebra();
}

There's also a pattern called "Self-Registering Classes" where libraries just run code when loaded by referencing global symbols, but I think this is quite distasteful. This is somewhat less bad than it otherwise might be because of RTLD_LAZY, though.

This is also the API you get for error checking of dl*() functions:

char *dlerror(void);

It may be possible to automate generating the symbol listing in the general case with gcc -fdump-lang-class or similar.

How can PHP be built statically? How hard is this?

At least on macOS, this is trivial:

./configure --enable-static --enable-cli --disable-all

PHP even has an "embed" SAPI which just emits a library with no main():

./configure --enable-static --enable-embed=static --disable-all

It's then trivial to provide a main() and link it statically. This produces a ~5MB binary with only required dynamic dependencies:

$ otool -L embed.a
	/usr/lib/libresolv.9.dylib (compatibility version 1.0.0, current version 1.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.0.0)

This binary doesn't have any extensions -- particularly, cURL -- but it's trivial to make it invoke arcanist.php from source.

I find this very intriguing.

Phabricator would benefit from having access to certain services (full-text search, full-codebase search, repository graph storage) that very likely aren't practical to write in PHP

Is the purpose of implementing introspection of libraries to allow for building native components that would get picked up and auto-loaded by arcanist?

Is the purpose of implementing introspection of libraries to allow for building native components that would get picked up and auto-loaded by arcanist?

It might enable this, but it's hard for me to think of an Arcanist component that would really benefit from being native (outside of a handful of very narrow blocks of code in T2312).

The specific root problem I'm solving ("robot that has blinking lights") is that I want a coffee maker that automatically refills the water reservoir near the room I use as an office. (I have a few other similar problems I'd like to solve, but they're all approximately in this vein.)

I started solving this problem by building a 21' x 8' deck in my garage. The floor falls about 11" over 21 feet, which made it impossible and/or unsafe to use as a workspace, since things would roll/slide away (or one end needed to be lifted way up on blocks). The front of the deck frame is ~11" high and the back is flush with the floor:

66226204266__9237AE74-4AEB-45CF-BC35-AD5EBA259AF3.jpeg (1×768 px, 119 KB)

Technically, my garage already had a deck in it: some previous owner had built a deck to (according to the realtor) allow an antique car with very low ground clearance to open its doors over a sort of curb on one side of the garage (not visible in the photo). This already-existing deck mostly made the levelness problem worse, since it started with a relatively steeper ramp than the underlying floor slope (and no part of it was level). So I removed this deck, then replaced it with a substantially similar deck with a slightly different shape.

The resulting surface isn't perfectly level, but it's about as good as any other floor or wall in the house and things don't roll or slide off it.

The frame is redwood, in the hopes of improving moisture resistance. Redwood is much more expensive than pressure-treated wood, but it was impossible to legally dispose of pressure-treated lumber in California for approximately 9 months during 2021 and redwood smells nice.

I put underlayment, OSB subfloor, and vinyl tile on top of the frame and then built carts for my table saw, drill press, and miter saw. I was able to squeeze in a jointer and planer (I'd been milling lumber on the table saw before). I ran a bunch of 2.5" ductwork on the ceiling for dust collection and ended up with a minimal-but-adequate setup where I can safely process 8' boards with infeed or outfeed support. I have to move things around a little bit between operations, but the whole thing isn't excessively annoying or dangerous or unbearably difficult to clean up:

67243942457__DA576CC8-D560-4CA7-AFC1-DB6452349F04 2.jpeg (800×600 px, 123 KB)

Now that I could run all the machines, I built a simple vanity to create some counter space (and storage space) in the bathroom. The old vanity was an all-porcelain pedestal-style vanity with an overhang that I ripped out after our 2-year-old, who I guess was maybe 1.5 at the time, stood up into another one and hurt himself pretty convincingly:

IMG_1041.jpeg (768×1 px, 143 KB)

vanity.jpg (768×1 px, 122 KB)

I also made the little tray on the right to hold the coffee.

I have a float sensor in the reservoir and a 1/4in filtered water line running into it right now, with an RO filter under the sink:

IMG_1119.jpeg (1×768 px, 100 KB)

So I've made some progress, but I still have to manually open that blue valve and then manually close it a few moments later to fill the reservoir. Instead, I want a microcontroller to read the float sensor and open a solenoid-actuated valve when the water level drops to automatically refill it.
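The control logic itself can be sketched as a small state machine that is polled on a timer. This is only a guess at a reasonable shape: the refill_state struct, the refill_step() function, and the 15-second MAX_FILL_MS safety cutoff are assumptions, not anything built yet, and reading the real sensor or driving the real solenoid is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed safety limit: never hold the valve open longer than this. */
#define MAX_FILL_MS 15000u

typedef struct {
  bool valve_open;
  uint32_t opened_at_ms;
  bool faulted; /* latched if a fill ran past MAX_FILL_MS */
} refill_state;

/* Decide the valve position for one polling tick.
 * water_low: true when the float sensor reports the level has dropped.
 * now_ms: monotonic millisecond clock supplied by the caller.
 * Returns true if the valve should be open after this tick. */
bool refill_step(refill_state *st, bool water_low, uint32_t now_ms) {
  if (st->faulted) {
    st->valve_open = false;
    return false;
  }
  if (st->valve_open) {
    if (!water_low) {
      st->valve_open = false; /* reservoir is full again */
    } else if ((uint32_t)(now_ms - st->opened_at_ms) >= MAX_FILL_MS) {
      st->valve_open = false; /* sensor never tripped: stop and latch */
      st->faulted = true;
    }
  } else if (water_low) {
    st->valve_open = true;
    st->opened_at_ms = now_ms;
  }
  return st->valve_open;
}
```

The latched fault is there because a stuck float sensor would otherwise flood the bathroom; once it trips, the valve stays closed until something (a human, or an OTA update) resets the state.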

An absolute minimum requirement for this microcontroller is that it support over-the-air updates: if I have to run a USB cable into my bathroom to fix my coffee maker, what was the point of any of this?

OTA updates need some kind of server component that can send the new builds to the microcontroller. If I wrote this component using Phabricator as a platform, most of my work would already be done for me (by past me).

But if I make this interaction speak (say) MQTT as a protocol, I need one MQTT parser in C (on the microcontroller) and one in PHP (in the Phabricator application). I don't want to write two copies of the same code in different languages: that would be a silly waste of effort.

However, if I simply replace "PHP" in the Phabricator environment with "C that includes the PHP runtime", both components can run the same code.

This is approximately the same as the simpler approach of "write PHP bindings for an MQTT extension written in C", but I think it's more powerful to think of it as "a native runtime that can call PHP code" because the startup behavior does not need to be "invoke a PHP script". For example, the microcontroller could talk to an MQTT server written in C that could mostly do a bunch of good wholesome select() and memcpy() type stuff on wholesomely unsafe memory, but call into PHP to save and load data through the Lisk storage layer, do auth, etc.

I just want to do the introspection part (or at least, to know if I can do the introspection part) because I think it has worked really well as a pattern for building modular software in Phabricator.

Can we build a single binary with a bunch of data in it (e.g., a picture of a cat) without breaking anything?

Obviously we can, like this:

const char cat[] = "...";

So this question is more like: can we do better than this?

It looks like the answer is possibly "no", or "not significantly". We can use the objcopy binary to generate a .o file with the cat picture in a .data or .rodata section, but the .o file we get out seems like it's pretty much the same as what we get out of compiling picture_of_a_cat.c, and this approach is probably less portable than gcc -c picture_of_a_cat.c -o cat.o (e.g., macOS doesn't have objcopy, although there's likely some equivalent).

Since the binary data we'd want to include in the executable is probably more like const simple_hash_table *file_data_table[] = ..., just code-generating .c and then compiling that with gcc is probably reasonable.
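Here is a rough sketch of what the generated .c file might look like. The embedded_file struct, the embedded_find() lookup, and both file entries are placeholders invented for illustration; a linear scan stands in for the hash table, and a real generator would emit one array per file.

```c
#include <stddef.h>
#include <string.h>

/* One embedded file: its path, its bytes, and its length in bytes. */
typedef struct {
  const char *path;
  const unsigned char *data;
  size_t size;
} embedded_file;

/* In a real build these arrays would be emitted by a generator script.
 * The paths and contents here are placeholders. */
static const unsigned char file_0[] = "<?php echo \"hello\";\n";
static const unsigned char file_1[] = {0x89, 'P', 'N', 'G'};

static const embedded_file embedded_files[] = {
  {"scripts/hello.php", file_0, sizeof(file_0) - 1}, /* drop the NUL */
  {"resources/cat.png", file_1, sizeof(file_1)},
};

/* Linear lookup by path. Returns NULL when the file is not embedded. */
const embedded_file *embedded_find(const char *path) {
  size_t i;
  size_t n = sizeof(embedded_files) / sizeof(embedded_files[0]);
  for (i = 0; i < n; i++) {
    if (strcmp(embedded_files[i].path, path) == 0) {
      return &embedded_files[i];
    }
  }
  return NULL;
}
```

Storing an explicit size next to each pointer matters for binary data like the cat picture, which can contain NUL bytes and so can't be treated as a C string.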

Does the system always load an entire binary into memory at startup, motivating separation of large chunks of data?

As far as I can tell, the answer is "generally, no": modern operating systems map the binary into virtual address space but don't copy it into memory until pages are actually accessed. There are some special cases and exceptions but there's no obvious strong argument against putting pictures of cats into your binary that I can find in 20 seconds of Googling.

All of the PHP code in Arcanist is ~3.5MB (uncompressed, with comments, etc) and the PHP runtime binary is ~4MB so this isn't really a significant concern anyway.

Can binaries load objects at runtime?

Yes, with dlopen(), etc.

Is this rare?

Yes? Maybe?

Why?

See below?

What happens if a binary loads x.o and y.o and they each define a symbol with the same name?

Seems like: the first one loaded usually wins, possibly with an error depending on exactly where this is occurring in the linking or execution steps. In cases where you're intentionally replacing a symbol, the symbol in y.o may be accessible from x.o with dlsym(RTLD_NEXT, ...) without knowing the identity of y.o (they're probably .so by now).

Why isn't more stuff compiled statically? Why [isn't dlopen() used more often]?
Why does ./configure spend 15 minutes compiling 800 programs to figure out if my system supports integers in 2022?

Theory: no one else knows how computers work either, and when people encounter a problem with computers they almost always build an abstraction layer on top of the problem instead of fixing the root problem?


  1. PHP doesn't link cURL statically and doesn't appear (?) to provide any easy way to link it statically.
    • Possible solution: build static cURL.
    • Possible solution: link cURL dynamically.
    • Possible solution: replace cURL with mbedTLS + a first-party HTTP client, since we don't care about 99% of what cURL does. Or implement first-party TLS hahaha except I'm half-serious? If HTTP/HTTPS is going to happen from an embedded context on my coffee maker I need a healthy embeddable TLS + HTTP stack anyway.
  2. The STDERR and STDOUT constants are defined by the CLI SAPI, and not present in the embed SAPI.
    • Possible solution: define them in the C wrapper.
    • Possible solution: polyfill them at startup in PHP (I'm not entirely sure this is possible).
    • Possible solution: abstract around them and use php://stderr and php://stdout instead. See also T13556. This is likely desirable anyway.
  3. When arc tries to load PHP code, it needs to read data out of the executable binary in some set of conditions (e.g., "if we miss on disk").
    • Possible solution: hook zend_compile_file(), which seems to be the expected way to approach this. The flow in phar_compile_file() seems similar.
  4. To run unit tests, arc depends on the presence of a php binary on the system.
    • Possible solution: accept that you must also have PHP to run arc's tests (and that system PHP may differ from arc php).
    • Possible solution: also include the PHP CLI wrapper and invoke it when arc is executed as php, providing what is essentially a fallback toolset.
    • Possible solution: provide a "unit test helper" fallback toolset.
    • Possible solution: rewrite the 5 trivial cases where we need this (support/unit/*) in shell script (but: the reason to do these in PHP in the first place was so that they're portable to Windows).

I generated D21794 with a native binary that has no dependency on system PHP (but does depend on system cURL).

It still depends on all the ".php" files on disk, but maybe this is basically fine? The idea of distributing a single binary is philosophically appealing, but these components seem sort of silly to build into a binary:

  • the "arc-hg" extension for Mercurial;
  • the shell completion support files;
  • config examples;
  • lint metadata (PHP symbol information, English spelling data);
  • SSL certificate data;
  • XHPAST; and
  • the arc anoid game script.

All this stuff can be squished into a single binary, but I'm not ultimately sure how much value there is in doing it. Making it appear that there's no PHP happening anywhere in arcanist/ has some arguable value for convincing people with very strong language preferences that the tool won't bite them, but I don't really have a horse in that race anymore.

One alternative is to implement parts of T2312 (like making "3" == 3 a runtime error), call the resulting not-quite-compatible-with-PHP language "PHP for You", and use the ".py" extension to identify files written in "PHP for You".

What are the practical limits of multi-system or multi-architecture binaries?

There seem to be three general dimensions that matter here:

  1. What format is the executable in? Linux executes files in ELF format ("Executable and Linkable Format"), macOS executes Mach-O, and Windows runs PE ("Portable Executable").
  2. What instruction set is the actual code in the binary using? For our purposes, "arm64" (M1 Macs) and "x86_64" (pretty much everything else) are most relevant, since (AFAIK) very few humans develop computer software on systems with other architectures today (even if they are building software which will run on other architectures).
  3. What libraries and symbols does the executable link against and call?
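For dimension (1), the format can usually be guessed from the first few bytes of the file. A small sketch, assuming only the well-known magic numbers; the exe_format enum and detect_exe_format() are names made up for this example, and the check is a guess rather than real validation.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
  FORMAT_UNKNOWN,
  FORMAT_ELF,
  FORMAT_MACHO,
  FORMAT_MACHO_UNIVERSAL,
  FORMAT_PE
} exe_format;

/* Classify a file by the magic bytes at its start. Only the first four
 * bytes are inspected, so this is a rough guess rather than validation. */
exe_format detect_exe_format(const unsigned char *buf, size_t len) {
  static const unsigned char elf[] = {0x7f, 'E', 'L', 'F'};
  /* Little-endian Mach-O, as found on x86_64 and arm64 Macs. */
  static const unsigned char macho64[] = {0xcf, 0xfa, 0xed, 0xfe};
  static const unsigned char macho32[] = {0xce, 0xfa, 0xed, 0xfe};
  /* Universal ("fat") header. Java class files share this magic. */
  static const unsigned char fat[] = {0xca, 0xfe, 0xba, 0xbe};

  if (len >= 4) {
    if (memcmp(buf, elf, 4) == 0) return FORMAT_ELF;
    if (memcmp(buf, macho64, 4) == 0) return FORMAT_MACHO;
    if (memcmp(buf, macho32, 4) == 0) return FORMAT_MACHO;
    if (memcmp(buf, fat, 4) == 0) return FORMAT_MACHO_UNIVERSAL;
  }
  /* PE files begin with a DOS stub whose first two bytes are "MZ". */
  if (len >= 2 && buf[0] == 'M' && buf[1] == 'Z') return FORMAT_PE;
  return FORMAT_UNKNOWN;
}
```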

Executable Format: For (1), it's possible, although not necessarily advisable, to distribute a single executable binary that runs almost everywhere:

This is extremely clever, but probably not worth the complexity. To make it work on Windows, PHP would also have to be able to link against Cosmopolitan libc.

Instruction Set: For (2), I believe APE (Actually Portable Executable, the format Cosmopolitan produces) doesn't build binaries that can execute on multiple architectures (although perhaps the approach could). In practice, building x86_64 is probably good enough because macOS can emulate it on M1 chips.

This is primarily relevant for macOS, where you can build a "Universal Binary" by gluing together an x86_64 binary (for older Intel Macs) and an arm64 binary (for newer M1 Macs):

System Libraries: See below.

Can a binary built on Ubuntu14 run on Ubuntu20 on the same hardware? Can it run on Debian? In what cases will it be unable to run?

The internet claims the answer is roughly "it works in theory, but almost never in practice". The issue is that the library versions and symbols on these systems differ slightly, so the stuff you link against on Ubuntu20 may not exist or may not work the same way on Ubuntu14 or Debian. But most of it will be the same, so programs might work fine if they minimize dynamic links and get lucky with whatever they do link against.

My naive guess is that maybe this does actually kind of work a lot of the time if you're just linking against libc, since a lot of effort is put into making libc binary compatible and pretty much everything uses a mutually-binary-compatible libc implementation? And Ubuntu and Debian both switched from glibc to eglibc -- and then switched back -- so it can't be that disruptive?

Except that it looks like libc symbols all get versioned, and when you build an executable it links against all the newest versions? See also this (which seems clever but terrible?):

None of this seems more compelling than the simple approach of:

  • build a macOS Universal binary;
  • build a Windows 64-bit binary;
  • maybe build a couple of popular Linux binaries but I'm probably not really going to delve into this myself;
  • other systems can run php -f arcanist.php (or build from source themselves if arc ever becomes incompatible with upstream php).

FWIW I've found by far the easiest way to work with microcontrollers is using micropython / circuitpython on any of these chips: ESP32, ATSAMD21/ATSAMD51 and RP2040. The esp32 is in many ways the easiest and most practical because it's extremely cheap and includes a wifi radio.

I wonder if PHP could be made to work on the ESP32? I guess if it can run python anything's possible.

Just for my own notes:

I don't know if this is any good, but this project appears to be a generic server written in C that can call into PHP to run application code, somewhat similar to what I imagine writing above:

I've never heard of it before and it has a mere ~450 GitHub stars despite appearing to be technically impressive, although maybe it's rougher than it looks at first glance (e.g., open issue list seems to include it having no ability to stream large requests).

These are also projects in the same general vein:

The MQTT protocol appears to be extremely simple and generally sensible. I think Aphlict could reasonably speak standards-compliant MQTT (Aphlict is pretty much a subset of MQTT semantically, albeit wrapped in JSON). It's not clear there's much reason to do this, but if Phabricator eventually has an MQTT component maybe that would lead to less code overall.
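As a taste of how simple the wire format is, here is a sketch of the variable-length "Remaining Length" integer that follows the first byte of every MQTT packet: seven bits of value per byte, least significant group first, with the high bit meaning "more bytes follow". The function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode an MQTT "Remaining Length" value. Writes 1 to 4 bytes into out
 * and returns the count, or 0 if value exceeds the protocol maximum. */
size_t mqtt_encode_length(uint32_t value, uint8_t out[4]) {
  size_t n = 0;
  if (value > 268435455u) {
    return 0;
  }
  do {
    uint8_t byte = (uint8_t)(value % 128u);
    value /= 128u;
    if (value > 0) {
      byte |= 0x80; /* continuation bit: more bytes follow */
    }
    out[n++] = byte;
  } while (value > 0);
  return n;
}

/* Decode a Remaining Length from buf. Returns the number of bytes
 * consumed, or 0 if the input is truncated or runs past four bytes. */
size_t mqtt_decode_length(const uint8_t *buf, size_t len, uint32_t *value) {
  uint32_t result = 0;
  uint32_t multiplier = 1;
  size_t i;
  for (i = 0; i < len && i < 4; i++) {
    result += (uint32_t)(buf[i] & 0x7f) * multiplier;
    if ((buf[i] & 0x80) == 0) {
      *value = result;
      return i + 1;
    }
    multiplier *= 128u;
  }
  return 0;
}
```

This is the kind of code that would be shared between the microcontroller and the server: it has no allocation and no dependencies beyond fixed-width integers.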

(One reason not to put Aphlict over MQTT is that dealing with binary in Javascript is a bit messy, and MQTT is a binary protocol.)

MQTT is not entirely ideal for delivering large (relatively speaking) binaries for OTA updates, since it doesn't have a first-class concept of anything like file downloads. Lots of people are doing it, but seemingly in a bit of a hacky way where you (for example) create a one-off unique topic and then push all the firmware chunks in sequence. Since the client and server both have to know what they're doing, it seems cleaner to smuggle this into a SUBSCRIBE with a magic user property or something?

On the ESP32, doing an OTA via the builtin easy-mode wrapper esp_https_ota(...) once the WiFi radio is connected is sort of unreasonably simple. The full process I went through was:

  • Use arduino-cli compile --export-binaries ... to get a build/xyz/xyz.ino.bin file. You get other files too, but this one appears sufficient. There are some esptool elf2image and esptool make_image workflows, but those appear unnecessary.
  • Throw that up on an HTTPS server somewhere.
  • Drag and drop the root certificate for that server's cert out of Chrome, then convert it to CRT like this:
$ openssl x509 -inform der -in in.cer -out out.crt
  • Turn that into a C string and embed it in the program.
  • Include <esp_https_ota.h> and call esp_https_ota(...) more or less according to the documentation, after connecting the radio.

After that, the board just OTA updated flawlessly without further tweaking.

I'm still cheating a lot here, in the sense of "gluing together demo code in a fragile way":

ESP netif: I'm using ESP netif easy-mode esp_netif_create_default_wifi_sta(), to get DHCP and get an IP address assigned (and maybe it's doing a bit more than this for me). Since netif only wraps lwIP (?) this seems like an unnecessary layer with no likely upsides except that it's easier to get working initially?

No Network Interface Selection: I'm using socket() (from lwIP?) directly without passing context from the WiFi stack to identify which network interface should be used. This seems wrong (i.e., the code should not be guessing that it should use the WiFi radio via magic)? But this also seems very standard, and it's not clear the APIs even let you choose which network device you want to work with?

Some answer on StackOverflow says this is just what lwIP does and you can inject some hooks if you want multiple network interfaces.

I'd prefer APIs that don't infer important information from secret globals, but I think that ship sailed ~50 years ago in this case. I don't think I am quite dedicated enough to yak shaving to write a TCP stack. Or am I? haha no haha... but what if?

ESP Easy Mode OTA: esp_https_ota() is an unnecessarily heavy abstraction over a lot of stuff, like an entire HTTPS stack, possibly including a TLS stack that isn't mbed? Looks like it's a fork of mbed:

ESP-IDF uses a fork of Mbed TLS which includes a few patches (related to hardware routines of certain modules like bignum (MPI) and ECC) over vanilla Mbed TLS.

FreeRTOS/Events: I have a lot of non-evented code and a murky model of how I should be using tasks and events on FreeRTOS.

Arduino CLI/IDE: The entire Arduino IDE/environment seems like an unnecessarily heavy abstraction over esptool, and I should probably be using the ESP IDF? The ESP IDF also seems to be a lot of Python code wrapping lower-level things, so maybe that layer can also be at least partially removed? I'm mainly hoping to get faster builds here.

Faster Builds

I'd like to reduce the duration of the build-deploy iteration loop. It's not bad right now (and OTA seems faster than USB?), but also not great, especially compared to PHP. One part of this happens on the development machine, and improving this is probably mostly about unwrapping layers of Python over layers of autoconf or whatever. The other part happens on the controller itself. I think this part is mostly limited by bandwidth and flash write speed, although it actually seems pretty fast for OTA? (How fast is it actually?)

Can performance be improved by tracking which version of the firmware is available locally and just downloading a delta? That is:

  • When booting for the first time, copy firmware to the second firmware zone. Both now have version X.
  • When doing an OTA update to version Y, delta X vs Y and just patch X in zone 2, then reboot with zone 2 active.
  • On the next OTA update, delta X vs Z and patch X in zone 1, then reboot with zone 1 active.

This seems "free", i.e. we get to keep a rollback version of the firmware and just download and write less data. Of course, this assumes that incremental updates produce a relatively similar binary (or can be made to), and/or operations like firmware-to-firmware copies are much cheaper than network-to-firmware copies (so insert/delete can be meaningful operations). It's possible that producing firmware versions X and Y that have any kind of binary similarity is extremely difficult and/or that flash write rate is the limiting factor so "insert" isn't a realistic operation.
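To make "insert" and "copy" concrete, here is a sketch of applying a delta against the old image. The two-operation format (COPY from the old image, INSERT of new bytes), the delta_op struct, and delta_apply() are all invented for illustration; real delta formats are more compact, and on a device the output would be written to the inactive flash zone rather than a RAM buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { DELTA_COPY, DELTA_INSERT } delta_kind;

/* One delta operation. COPY reuses bytes already present in the old
 * image; INSERT carries new bytes that had to come over the network. */
typedef struct {
  delta_kind kind;
  size_t offset;       /* COPY: start offset in the old image */
  size_t length;       /* number of bytes to copy or insert */
  const uint8_t *data; /* INSERT: the new bytes; unused for COPY */
} delta_op;

/* Rebuild the new image from the old image plus a list of operations.
 * Returns the new image size, or 0 if an operation is out of bounds. */
size_t delta_apply(const uint8_t *old_img, size_t old_len,
                   const delta_op *ops, size_t op_count,
                   uint8_t *out, size_t out_cap) {
  size_t written = 0;
  size_t i;
  for (i = 0; i < op_count; i++) {
    const delta_op *op = &ops[i];
    if (op->length > out_cap - written) return 0;
    if (op->kind == DELTA_COPY) {
      if (op->offset > old_len || op->length > old_len - op->offset) {
        return 0;
      }
      memcpy(out + written, old_img + op->offset, op->length);
    } else {
      memcpy(out + written, op->data, op->length);
    }
    written += op->length;
  }
  return written;
}
```

Only the INSERT payloads cross the network, so the win depends entirely on how much of version Y can be expressed as COPY runs from version X, which is exactly the open question about binary similarity.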

Here's someone doing exactly this, although I don't see any real answers about the stuff I care about, e.g.:

  • How fast are: reading from flash, writing to flash, and reading from the network? (What are the per-operation overheads and the throughput rate?)

Obviously, a delta update will be smaller, but it's not obvious that it's actually faster.


  • How do modern servers written in C/C++ handle parallelizing requests? What model is the best fit if we assume requests may invoke PHP code?

How do modern servers written in C/C++ handle parallelizing requests?

These might all be completely wrong, but:

PHP-FPM: fork() children, each child handles one request at a time.

I think the children lock the listening fd, accept(), then process the request.

H2O: I've never heard of this, but it came up while searching. Uses threads with libuv. I think multiple requests per thread?

Node: Threads with libuv, not sure about requests-per-thread.

Deno: Threads (?) with tokio because Rust, I don't expect to touch Rust.

Apache: Configurable, but generally modern Apache forks children, then each child runs threads? It's not obvious to me why this is desirable.

In practice, the default is prefork and everyone seems to run prefork, which means no threads. See also below.

One request per thread.

mod_php: Theoretically works with threaded apache but a lot of people with blogs seem to think this is, charitably, optimistic, and that you can't actually run PHP with threads. Unless you jump through hoops, this is prefork + single-threaded.

  • Is ZTS fine?
  • Is ZTS an unsalvageable broken mess?
  • Is ZTS mostly okay, but has some rough spots that could be fixed?

Nginx: Worker processes rather than threads, typically one worker per core, with an event loop handling multiple requests per worker.

Varnish: Threaded, separate acceptor and worker threads. One worker thread per session.

Swoole: I, uh, can't immediately figure out what this is doing. I think "coroutines" are application-level and swap only (?) when you wait for I/O. Phabricator can already do this (less generally, but probably more usably) with Futures in application space. Swoole has separate "Process" stuff, but it's just a wrapper around fork(). It has "Table", but that's APC. The actual server is probably (?) process-per-request.

What model is the best fit if we assume requests may invoke PHP code?

For workloads like Aphlict/MQTT (many long-lived connections with low activity per connection) we'd probably prefer an nginx-style model with ~1 thread per core.
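A minimal sketch of the multiple-connections-per-thread idea, using Python's `selectors` module as a stand-in for whatever event loop a real server would use. The function names and the uppercase-echo "protocol" are invented for illustration, and `socketpair()` connections stand in for real network clients.

```python
import selectors
import socket


def serve_until_idle(connections):
    """Service many connections from one thread until every peer hangs up."""
    sel = selectors.DefaultSelector()
    for conn in connections:
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ)

    handled = 0
    while sel.get_map():
        # Sleep until at least one connection has something to read.
        for key, _events in sel.select():
            conn = key.fileobj
            data = conn.recv(4096)
            if data:
                # Tiny replies only; a real server would also buffer writes.
                conn.sendall(data.upper())
                handled += 1
            else:
                # Empty read means the peer closed its side.
                sel.unregister(conn)
                conn.close()
    sel.close()
    return handled


def demo():
    pairs = [socket.socketpair() for _ in range(3)]
    for _server, client in pairs:
        client.sendall(b"ping")
        client.shutdown(socket.SHUT_WR)
    handled = serve_until_idle([server for server, _client in pairs])
    replies = [client.recv(16) for _server, client in pairs]
    for _server, client in pairs:
        client.close()
    return handled, replies
```

Idle connections cost only a registered fd here, which is the property that makes this model attractive for long-lived, low-activity connections.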

For PHP, unless ZTS actually works -- and probably even then -- the "prefork" model seems like the only real option. Since prefork (or even fork-on-demand) is much simpler, I'll likely start there.
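For comparison, a minimal prefork sketch: the parent opens the listening socket, forks the children, and each child calls `accept()` on the inherited socket and handles one request at a time. This is an illustration, not how PHP-FPM or Apache are actually implemented; it skips any locking around `accept()` and lets the kernel pick which child gets each connection, and each child exits after a single request so that the demo terminates.

```python
import os
import socket


def prefork_demo(num_children=2):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # Port 0: let the OS pick a free port.
    listener.listen(16)
    port = listener.getsockname()[1]

    children = []
    for _ in range(num_children):
        pid = os.fork()
        if pid == 0:
            # Child: block in accept() on the inherited listening socket.
            try:
                conn, _addr = listener.accept()
                conn.sendall(str(os.getpid()).encode())
                conn.close()
            finally:
                os._exit(0)  # Never fall through into the parent's code.
        children.append(pid)

    # Parent plays the client: one connection per child.
    replies = []
    for _ in range(num_children):
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            replies.append(int(client.recv(64)))

    for pid in children:
        os.waitpid(pid, 0)
    listener.close()
    return children, replies
```

Because each child here serves exactly one connection, every reply carries a different child PID. A real prefork server would loop in the child and have the parent respawn children as they exit.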

Here's how I'm thinking about overengineering this:

I believe some server responsibilities are bottlenecked by process overhead if they are one-process-per-request. This is stuff like servicing persistent connections, actual network I/O, etc. Most of the time, these connections are sitting idle. I believe per-process overhead is very low so these bottlenecks are not very constraining (e.g., they only occur at a very large request/connection rate), but the preferable design is multiple-request-per-process where possible, likely with threading.

In contrast, some server responsibilities are best in their own process. This is mostly "PHP", but other application behavior might not be thread safe or might not be worthwhile to make thread safe, or might be thread safe within a process but benefit from process separation (e.g., a service holding an in-memory repository full-text index).

It's desirable for a server to be able to handle both types of responsibility, e.g. both Aphlict connections and HTTP + PHP connections, which makes a hybrid model where some components are threaded and some components are forked into separate processes seem reasonable-ish.

In this hybrid model, a thread would accept() and a separate process would (sometimes) actually handle the request. This isn't how, e.g., php-fpm appears to work (I think it calls accept() in the child process).

  • What options exist for inter-process communication?
  • Can we pass an accepted fd to a subprocess?
  • Why does dup() exist? Why does dup2() exist?
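On the second question: Unix domain sockets can carry open file descriptors between processes as `SCM_RIGHTS` ancillary data. A minimal sketch, assuming Python 3.9+ (for `socket.send_fds()` and `socket.recv_fds()`) on a Unix-like system; the function name and messages are invented for illustration. The worker is forked before the connection is accepted, so it can only get the connection by having it passed.

```python
import os
import socket


def pass_accepted_fd_demo():
    # Control channel between the acceptor and the worker.
    parent_chan, child_chan = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    pid = os.fork()
    if pid == 0:
        # Worker: it never calls accept(); it waits to be handed a connection.
        try:
            parent_chan.close()
            listener.close()
            _msg, fds, _flags, _addr = socket.recv_fds(child_chan, 16, 1)
            with socket.socket(fileno=fds[0]) as conn:
                conn.sendall(b"handled by worker")
        finally:
            os._exit(0)

    child_chan.close()
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    conn, _addr = listener.accept()

    # Hand the accepted connection to the worker, then drop our own copy.
    socket.send_fds(parent_chan, [b"fd"], [conn.fileno()])
    conn.close()

    chunks = []
    while True:
        chunk = client.recv(64)
        if not chunk:
            break
        chunks.append(chunk)

    client.close()
    parent_chan.close()
    listener.close()
    os.waitpid(pid, 0)
    return b"".join(chunks)
```

The receiving process gets a new descriptor number that refers to the same underlying connection, much like `dup()` across a process boundary.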

A desirable capability is graceful restarting, i.e. seamlessly upgrading from one software version to another. Normally, this drains old children and sends new connections to new children.

  • Can we hold a connection open across a subprocess upgrade, so that even persistent connections can be seamlessly upgraded (assuming there are no protocol differences, of course)?
  • What does execve() do? Can we pass things across execve() through some sidechannel? Is this a terrible idea?
  • Generally, what are the limits of passing accepted connections across processes?
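On the `execve()` question: descriptors that are not marked close-on-exec stay open, with the same numbers, in the new program image. A minimal sketch using `subprocess` (which does a fork plus exec) with `pass_fds`; the child program and its message are invented for illustration, and a pipe stands in for a listening socket or client connection.

```python
import os
import subprocess
import sys

# Hypothetical "new version" of the program: it is told which fd it inherited
# via argv and writes to it.
CHILD_SOURCE = (
    "import os, sys; "
    "os.write(int(sys.argv[1]), b'hello from the new image')"
)


def fd_survives_exec():
    read_fd, write_fd = os.pipe()

    # Python marks fds non-inheritable by default; pass_fds keeps this one
    # open, under the same number, across the exec.
    subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE, str(write_fd)],
        pass_fds=[write_fd],
        check=True,
    )

    os.close(write_fd)
    data = os.read(read_fd, 1024)
    os.close(read_fd)
    return data
```

The fd number has to reach the new image somehow; argv is used here, and an environment variable would work the same way.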

Assuming enough of the answers to these questions are "yeah, computers can do that", a message bus with a bunch of source/sink components seems like a reasonable architecture? Sending messages across a (possibly inter-process) bus is probably much slower than, say, zero-copy receive wizardry, but given that I'm planning for one part of this thing to run PHP, it seems unlikely that the rest of it is ever going to get to a place where the biggest performance improvement is zero-copy receives.

Another general question is:

  • Why isn't there a (set of?) modular "everything server" project with pluggable application/protocol logic already? Or are there some, and I just don't know about them? Surely anyone excited about microservices (is anyone still excited about microservices?) isn't writing a unique accept() loop per microservice? Are they just all HTTP?

One possible complexity is that downstream components might need to backpressure upstream components? But this should be possible as long as the message queues are designed for it.
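A minimal in-process sketch of a queue "designed for it": the queue is bounded, and a full queue refuses new messages instead of growing, which is the signal an upstream component needs in order to slow down. The function names are invented for illustration, and an inter-process bus would have to carry the same signal over its transport.

```python
import queue


def offer_all(bus, messages):
    """Enqueue without blocking; return the messages the bus refused."""
    refused = []
    for message in messages:
        try:
            bus.put_nowait(message)
        except queue.Full:
            # Backpressure: the upstream keeps the message and retries later.
            refused.append(message)
    return refused


def demo():
    bus = queue.Queue(maxsize=2)

    # The producer outruns the consumer: only two messages fit.
    refused = offer_all(bus, ["a", "b", "c", "d"])

    # The consumer catches up...
    drained = [bus.get_nowait(), bus.get_nowait()]

    # ...and the producer's retry now succeeds.
    still_refused = offer_all(bus, refused)
    return refused, drained, still_refused
```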

What options exist for inter-process communication?

Aside: it has become impossible to search for anything technical on the internet because the top 50 results are 45 pages of blogspam and 5 stack overflow questions with no useful information.

Signals: Processes can send signals to other processes. These are somewhat easy to send from the CLI but terrible in all other respects.

Sharing Files on Disk: You could technically do this.

Shared Memory: Multiple processes can access the same memory region (but need to synchronize this access in some way).

POSIX Semaphores: Semaphores in <semaphore.h>.

System V Semaphores: More semaphores in <sys/sem.h>.

POSIX Message Queues: There's a set of mq_open() APIs in <mqueue.h> that I've never seen anything use.

System V Message Queues: There's a set of msgget() APIs in <sys/msg.h> that I've never seen anything use.

Unix Sockets: You can use AF_UNIX local sockets. This is widely used: MySQL uses these by default, and ZeroMQ seems to use them internally to do IPC. These can be stream oriented or datagram oriented. I'm not immediately sure exactly what ZeroMQ is doing with these because SOCK_STREAM seems incompatible with having multiple queue consumers? But since ZeroMQ is mostly networked maybe it just accepts that the subprocess needs to send a "pull" message across the socket first? This is probably a reasonable sort of thing to do, I suppose, and makes separating the queue across hosts easy if that's ever desirable.

Mach Stuff: macOS implements a subset of the above, plus mach_* stuff.
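On the stream-versus-datagram distinction for Unix sockets: a datagram socket preserves message boundaries, so each `recv()` returns exactly one whole message, whereas a stream socket is just bytes and needs framing on top. A minimal sketch with a `SOCK_DGRAM` socketpair; the message contents are invented, and this says nothing about what ZeroMQ actually does.

```python
import socket


def datagram_boundaries():
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

    # Two sends produce two separate messages, not one merged byte stream.
    sender.send(b"job-1")
    sender.send(b"job-2")

    first = receiver.recv(1024)
    second = receiver.recv(1024)

    sender.close()
    receiver.close()
    return first, second
```

With whole messages per `recv()`, several consumers reading from the same socket would each get complete messages rather than arbitrary slices of a stream.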

I am suspicious that D21794 may have broken something subtle with unix magic, since I'm seeing some hangs out of deployment scripts wrapping daemon management scripts. I think the issue is probably one of:

  • We may fopen() an additional stdout and/or stderr handle, but do not fclose() it?
  • Passing fopen('php://stdout', ...) to a passthru command may have different behavior from passing STDOUT?

We may fopen() an additional stdout and/or stderr handle, but do not fclose() it?

Yes, this appears to be the problem. If we fopen("php://stderr", ...) and do not fclose() the resource that is returned, processes that fork and daemonize will not exit if stderr satisfies some set of conditions like "is not a tty", although the real condition is probably more complicated. Given this script:

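$ cat bin/fork
#!/usr/bin/env php
<?php

$pid = pcntl_fork();

if (!$pid) {
  // Child: open a second handle to stdout and never fclose() it.
  $stdout = fopen('php://stdout', $ignored_mode = '');

  // The rest of the script is missing from this copy. The 5-second wait
  // shown below suggests the child sleeps here, so sleep(5) is an assumption.
  sleep(5);
}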

...we get this behavior:

$ ./bin/fork # Exits immediately.
$ ./bin/fork | cat # Waits 5 seconds before exiting.

I think the approach in D21794 can't actually work when daemonizing, because I think we must fclose() the magic STDOUT and STDERR file descriptors. It's not good enough to fopen() them and then fclose() the new fds we've opened, since we're just closing another file descriptor pointing at the same underlying resource, and the hang must be in the vein of "does any open fd reference this resource".
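The "does any open fd reference this resource" behavior can be reproduced with a plain pipe. In this sketch `write_fd` plays the role of the original STDOUT and `extra_fd` plays the role of the handle returned by `fopen('php://stdout', ...)`; that mapping is an analogy, not a trace of what PHP does internally. The reader (the `cat` in the example above) only sees EOF once every descriptor for the write end is closed.

```python
import os
import select


def eof_requires_all_writers_closed():
    read_fd, write_fd = os.pipe()

    # A second descriptor pointing at the same write end of the pipe.
    extra_fd = os.dup(write_fd)

    # Closing the duplicate alone is not enough: the original still holds
    # the pipe open, so the reader sees neither data nor EOF.
    os.close(extra_fd)
    readable, _, _ = select.select([read_fd], [], [], 0.1)
    eof_after_closing_duplicate = bool(readable)

    # Once the original is closed too, the reader gets EOF (an empty read).
    os.close(write_fd)
    readable, _, _ = select.select([read_fd], [], [], 0.1)
    eof_after_closing_original = bool(readable) and os.read(read_fd, 1) == b""

    os.close(read_fd)
    return eof_after_closing_duplicate, eof_after_closing_original
```

The same holds in the other direction: closing the original while a duplicate stays open also leaves the reader waiting.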

To close the original STDOUT and STDERR, the runtime must have access to them, so the SAPI needs to provide them.

I'm going to keep about half of D21794 and unwind the other half.