Restructure base85 unit tests to support inlining and multiple encoding pathways
ClosedPublic
Actions

Authored by epriestley on Apr 25 2018, 2:33 PM.

Details

Reviewers

amckinley

Maniphest Tasks

T13130: Plans: 2018 Week 17/18 Bonus Content

Commits

rARCbcab677a7a8c: Restructure base85 unit tests to support inlining and multiple encoding pathways

Summary

Ref T13130. I want to take a crack at improving performance here, but two possible approaches (inlining the actual encoding; using integers if they're big enough) aren't easy to test right now.

Restructure the tests so they can support these kinds of refactoring.

The "32bit" and "64bit" modes currently do the same thing, but I expect to introduce introduce separate encoding pathways in a future change, if the profiler says it actually helps.

(I'll hold this and everything that comes after it until I make meaningful performance improvements.)

Test Plan

Ran arc unit, got passes on tests.

Diff Detail

Repository

rARC Arcanist

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

epriestley created this revision.Apr 25 2018, 2:33 PM

Harbormaster completed remote builds in B20229: Diff 46420.Apr 25 2018, 2:33 PM

epriestley requested review of this revision.Apr 25 2018, 2:33 PM

epriestley mentioned this in D19408: Slightly improve base85 performance for 64-bit systems.Apr 25 2018, 4:30 PM

epriestley mentioned this in T2312: Modifying PHP.Apr 25 2018, 4:31 PM

Why do all the lines in the expect files start with z now?

src/parser/__tests__/ArcanistBundleTestCase.php
957	My kingdom for an #IFDEF! I googled for a second, and I'm shocked to learn that this is apparently the standard way to detect a 64-bit OS ¯\_(ツ)_/¯

This revision is now accepted and ready to land.Apr 25 2018, 4:57 PM

The "z" just means "this line has 52 more bytes encoded on it". If you look carefully, the last lines in the file actually have a different marker, meaning "this line has (some specific number of bytes smaller than 52) on it" (e.g., "y" = 51, "x" = 50, something like that).

The reason it needs this is that it always produces output in 5-byte chunks, so the output for these inputs is exactly the same:

X<eof>
X\0<eof>
X\0\0<eof>
X\0\0\0<eof>

Say that output is "ABCDE". The encoder will produce something like this:

X<eof>       -> aABCDE
X\0<eof>     -> bABCDE
X\0\0<eof>   -> cABCDE
X\0\0\0<eof> -> dABCDE

The "number of bytes on this line" marker lets the decoder later figure out how many trailing NULL bytes it needs to stick on the end. This is a bit simplified, but I think it's roughly what's going on. The "z" thing is also maybe kind of a checksum to help tell if the file got damaged in transit?

Before, we were just testing base85($data), but now we're testing kind-of-format-as-a-git-patch(base85($data)) so that I can inline base85() later.

The behavior of kind-of-format-as-a-git-patch(...) is to break the data into 52-character lines, encode them, then stick "z" in front of each of them (if the line has 52 bytes worth of data) or some other character (if it has fewer), so that's why the new stuff is line-wrapped and has a bunch of "z". The actual text should otherwise be the same.

src/parser/__tests__/ArcanistBundleTestCase.php
957	We could do something like `define('IS_64_BIT', PHP_INT_SIZE >= 8);` but I'm hesitant to start using `define()` for anything. Or we could do something like `phutil_is_64bit()`, but I think we only do this in ~4 places (two here, two around newsfeed?) and it seems unlikely to break.