Page MenuHomePhabricator

Restructure base85 unit tests to support inlining and multiple encoding pathways
ClosedPublic

Authored by epriestley on Apr 25 2018, 2:33 PM.
Tags
None
Referenced Files
F13061615: D19407.diff
Fri, Apr 19, 7:43 PM
F13059146: D19407.id46429.diff
Fri, Apr 19, 3:45 PM
Unknown Object (File)
Tue, Apr 9, 11:56 AM
Unknown Object (File)
Tue, Apr 2, 10:24 PM
Unknown Object (File)
Tue, Apr 2, 4:50 PM
Unknown Object (File)
Wed, Mar 27, 7:45 AM
Unknown Object (File)
Wed, Mar 27, 7:45 AM
Unknown Object (File)
Wed, Mar 27, 7:44 AM
Subscribers
None

Details

Summary

Ref T13130. I want to take a crack at improving performance here, but two possible approaches (inlining the actual encoding; using integers if they're big enough) aren't easy to test right now.

Restructure the tests so they can support these kinds of refactoring.

The "32bit" and "64bit" modes currently do the same thing, but I expect to introduce introduce separate encoding pathways in a future change, if the profiler says it actually helps.

(I'll hold this and everything that comes after it until I make meaningful performance improvements.)

Test Plan

Ran arc unit, got passes on tests.

Diff Detail

Repository
rARC Arcanist
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Why do all the lines in the expect files start with z now?

src/parser/__tests__/ArcanistBundleTestCase.php
957

My kingdom for an #IFDEF! I googled for a second, and I'm shocked to learn that this is apparently the standard way to detect a 64-bit OS ¯\_(ツ)_/¯

This revision is now accepted and ready to land.Apr 25 2018, 4:57 PM

The "z" just means "this line has 52 more bytes encoded on it". If you look carefully, the last lines in the file actually have a different marker, meaning "this line has (some specific number of bytes smaller than 52) on it" (e.g., "y" = 51, "x" = 50, something like that).

The reason it needs this is that it always produces output in 5-byte chunks, so the output for these inputs is exactly the same:

X<eof>
X\0<eof>
X\0\0<eof>
X\0\0\0<eof>

Say that output is "ABCDE". The encoder will produce something like this:

X<eof>       -> aABCDE
X\0<eof>     -> bABCDE
X\0\0<eof>   -> cABCDE
X\0\0\0<eof> -> dABCDE

The "number of bytes on this line" marker lets the decoder later figure out how many trailing NULL bytes it needs to stick on the end. This is a bit simplified, but I think it's roughly what's going on. The "z" thing is also maybe kind of a checksum to help tell if the file got damaged in transit?

Before, we were just testing base85($data), but now we're testing kind-of-format-as-a-git-patch(base85($data)) so that I can inline base85() later.

The behavior of kind-of-format-as-a-git-patch(...) is to break the data into 52-character lines, encode them, then stick "z" in front of each of them (if the line has 52 bytes worth of data) or some other character (if it has fewer), so that's why the new stuff is line-wrapped and has a bunch of "z". The actual text should otherwise be the same.

src/parser/__tests__/ArcanistBundleTestCase.php
957

We could do something like define('IS_64_BIT', PHP_INT_SIZE >= 8); but I'm hesitant to start using define() for anything.

Or we could do something like phutil_is_64bit(), but I think we only do this in ~4 places (two here, two around newsfeed?) and it seems unlikely to break.