
Try to route cluster writes to nodes which won't need to synchronize first
ClosedPublic

Authored by epriestley on Oct 5 2018, 8:35 PM.
Tokens
"Doubloon" token, awarded by joshuaspence.

Details

Summary

Ref T13109. Ref T13202. See PHI905. See PHI889. When we receive a write to a repository cluster, we currently send it to a random writable node.

Instead, we can prefer:

  • the node currently holding the write lock; or
  • any node which is already up to date.

These should simply be better nodes to take writes in all cases. The write lock is global for the repository, so there's no scaling benefit to spreading writes across different nodes, and these particular nodes will be able to accept the write more quickly.
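
For illustration, here's a rough sketch of that preference order (hypothetical names; not the actual diff):

<?php

// Hypothetical helper: prefer the lock holder, then any up-to-date node,
// then fall back to the old behavior (a random writable node).
function select_write_node(array $writable_nodes, $lock_holder, array $up_to_date) {
  if ($lock_holder !== null && in_array($lock_holder, $writable_nodes, true)) {
    return $lock_holder;
  }

  $candidates = array_values(array_intersect($writable_nodes, $up_to_date));
  if ($candidates) {
    return $candidates[array_rand($candidates)];
  }

  return $writable_nodes[array_rand($writable_nodes)];
}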

Test Plan
  • Added fprintf(STDERR, "%s\n", ...) logging in the routing logic around which nodes were being selected, then ran git push to observe the routing. (I'd like to pull this routing logic out of PhabricatorRepository at some point, probably into a dedicated ClusterURIQuery sort of class, but that is a larger change.)
  • Added a sleep(10) after grabbing the write lock.
  • In one window, pushed. Then pushed in a second window.
    • Saw the second window select the lock holder as the write target based on it currently holding the lock.
    • Without a concurrent push, saw pushes select up-to-date nodes based on their up-to-date-ness.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

amckinley added inline comments.
src/applications/repository/storage/PhabricatorRepository.php
2030

Is there any way we could make this shuffle() more reproducible? It would be nice if we could replay requests and get identical routing behavior for tracing down weirdness.

2044

Add a break here?

2074

Won't this result in duplicates in $results?

This revision is now accepted and ready to land. Oct 15 2018, 9:37 PM

We can make shuffle() predictable, but should probably implement our own state-maintaining PRNG to do so. If we use srand() / mt_srand() without maintaining our own state:

  • seeding affects later calls to rand() / mt_rand() (until the next srand() / mt_srand()), not just the call we're hoping to target, so we may cause collateral damage;
  • if our code looks like mt_srand(123); $engine->route();, we can never introduce a call to shuffle(), rand()/mt_rand(), etc., between the seed call and the "intended" shuffle() call, or that new call will eat the seed and break the test (see the sketch after this list);
  • seeding is not stable across versions: shuffle() output for a given seed changed in PHP 7.1 (when shuffle() switched to use the MT core) and changed again in PHP 7.2 (when an MT modulus bias bug was fixed).
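
As a hypothetical illustration of the second point, global seeding makes the "seeded" result depend on every intervening RNG call:

<?php

// With a global seed, the second shuffle()'s result depends on how many
// values the first call consumed from the seeded MT stream: adding or
// removing any RNG call above it silently changes the outcome.
mt_srand(123);

$a = range(1, 5);
shuffle($a); // Consumes values from the seeded stream.

$b = range(1, 5);
shuffle($b); // The "target" call: its output shifts if the line above changes.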

This would let us make requests reproducible in the "unit test" sense, but I'm not sure this will ever make requests practically replayable in a "reproduce a bug" sense. Even though we can make shuffle() predictable and could find some way to pass a seed in, you'd have to somehow freeze the entire cluster state immediately when you observe an anomaly, since the list coming in to shuffle() may differ if cluster state differs. We can freeze the state for unit tests, but probably not in the real world. I'm not sure this tool would be very useful in the real world, and I'm not sure a hypothetical REQUEST_READ_ROUTING_SEED=123 git fetch ... would be more useful than a hypothetical FORCE_READ_TO_HOST=repo123.phacility.net git fetch ....

If we did something like this, we could seed an entire request once and then generate different independent PRNGs from that seed by specifying different PRNG roles while keeping existing PRNGs stable across the introduction of new PRNGs. This solves the problem where you'd like to provide REQUEST_SEED=123, not SEED_A=123 SEED_B=234 SEED_C=345 ..., but would also like REQUEST_SEED=123 to produce the same results regardless of how many values the PRNG generated before you reached the desired shuffle(). I think this isn't really a complete fix, but might stabilize things enough to make behavior reasonably consistent-ish:

<?php

// A sketch of the state-maintaining PRNG: one master seed per request, with
// an independent stream per "role" so introducing a new consumer doesn't
// disturb existing streams.
class PRNG {

  private $masterSeed;
  private $seeds = array();

  public function __construct($master_seed) {
    $this->masterSeed = $master_seed;
  }

  public function generate($role, $min = null, $max = null) {
    if (!isset($this->seeds[$role])) {
      // Derive an independent initial seed for this role. (hash_hmac()
      // requires an algorithm as its first argument.)
      $this->seeds[$role] = hash_hmac('sha256', $role, $this->masterSeed);
    }

    $seed = $this->seeds[$role];
    $result = $this->mt($seed);
    $this->seeds[$role] = $result;

    return $this->clamp($result, $min, $max);
  }

  private function mt($seed) {
    // Do Mersenne Twister here. As a stand-in, hash the current seed to
    // produce the next 31-bit value; unlike mt_srand()/mt_rand(), this
    // keeps all state private to this object.
    return hexdec(substr(hash('sha256', (string)$seed), 0, 8)) & 0x7FFFFFFF;
  }

  private function clamp($result, $min, $max) {
    if ($min === null || $max === null) {
      return $result;
    }
    // Still has modulo bias, so this doesn't produce a perfectly uniform
    // distribution, but it is inclusive of both endpoints.
    return $min + ($result % ($max - $min + 1));
  }
}
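
Hypothetical usage (the role strings are made up): one request-level seed produces stable, independent streams per role, so adding a new role later doesn't perturb existing ones.

$prng = new PRNG('request-seed-123');

$shuffle_roll = $prng->generate('repository.route.shuffle', 0, 99);
$retry_jitter = $prng->generate('cluster.retry.jitter', 0, 500);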

That seems reasonable for unit tests, but it doesn't help us with the "freeze the entire world" problem for reproducibility. But, for unit tests specifically, I'm not sure this (seedable PRNG) is the right primitive or the best way to think about the problem. We could write a test like this:

$prng = new PRNG(123);
$engine->setPRNG($prng);
$result = $engine->route();
assertEqual('repo234.phacility.net', $result);

...and it looks like we're testing things properly. However, this test would still pass if we forgot the shuffle() call entirely. So we add a second copy with new PRNG(124) and get a different node. But if there's a bug which prevents the last element in the list from ever being returned (say, a careless off-by-one error in a loop) we might never catch it. And this strategy would definitely never catch an error where one entry is doubled in the list and selected twice as often. I believe that it's very hard to write a test which runs in finite time and makes exhaustive assertions about the behavior of a random algorithm.

We don't really care about asserting that a specific PRNG seed produces a specific routing result. The assertion we really want to make is that all routing results are included and are equally likely. We could do this in a deterministic way by returning probabilistic results and resolving them at the last second. Imagine:

// $result is a PhutilWeightedList.
$result = $engine->route();

assertEqual(4, count($result));
assertEqual(array('repo1', 'repo2', 'repo3', 'repo4'), $result->getSortedKeys());
assertEqual(0.25, $result->getKeyWeight('repo1'));
assertEqual(0.25, $result->getKeyWeight('repo2'));
assertEqual(0.25, $result->getKeyWeight('repo3'));
assertEqual(0.25, $result->getKeyWeight('repo4'));

This describes all the behavior we actually care about from the routing engine. Then we only need to trust that the $result->getOneWeightedResultAtRandom() function actually works (e.g., shuffles the list properly), which is simpler to convince ourselves of and can be tested in isolation, and we can test all similar routing algorithms by describing the entire behavior we expect instead of just asserting that particular seeds produce particular known results.
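
To make the shape concrete, here's a minimal sketch of that primitive. PhutilWeightedList and these method names come from the hypothetical test above; none of this exists in libphutil today.

<?php

final class PhutilWeightedList implements Countable {

  private $weights = array();

  public function addKey($key, $weight) {
    $this->weights[$key] = $weight;
    return $this;
  }

  public function count() {
    return count($this->weights);
  }

  public function getSortedKeys() {
    $keys = array_keys($this->weights);
    sort($keys);
    return $keys;
  }

  public function getKeyWeight($key) {
    return $this->weights[$key];
  }

  // The only randomized method, testable in isolation: pick one key with
  // probability proportional to its weight.
  public function getOneWeightedResultAtRandom() {
    $total = array_sum($this->weights);
    $roll = (mt_rand() / mt_getrandmax()) * $total;
    foreach ($this->weights as $key => $weight) {
      $roll -= $weight;
      if ($roll <= 0) {
        return $key;
      }
    }
    $keys = array_keys($this->weights);
    return end($keys); // Guard against floating point rounding.
  }
}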

I think this is possibly desirable down the road, but it's probably a lot of work, and the time is likely better spent on algorithm improvements (T13211, T10884) today. Still, it might be worthwhile to restructure the API around a WeightedList sort of return type now, even if we don't actually test all the routing yet.


src/applications/repository/storage/PhabricatorRepository.php
2044

It's possible that multiple results may have the same device PHID. Today, you can configure one device to have several SSH bindings (say, on ports 123 and 234) and it will get two separate URI entries and be selectable twice (it will also receive twice as much traffic as devices with only one binding).

It's possible that this shouldn't be allowed, but if there are two bindings we don't otherwise have a way to figure out which one is the "right" binding, and this use case doesn't seem totally useless (e.g., if you're testing a new SSHD configuration, maybe the easiest way to do that really is just to start it on a different port and take a bit of traffic to see if anything explodes). Under this configuration, I think that sending traffic to both bindings seems like the best interpretation of likely user intent.

(It's also possible that this should be allowed but that we should normalize traffic per-device instead of per-binding, so that even if a device has 100 bindings, it receives "one device" worth of traffic instead of 100x "one binding" worth of traffic, but this seems very hard without some kind of probabilistic list primitives.)
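
For what it's worth, with a weighted-list primitive like the sketch above, per-device normalization might look like this (the $bindings_by_device shape is assumed):

// Each device contributes a total weight of 1.0, split evenly across its
// bindings, so a device with 100 bindings still receives "one device"
// worth of traffic.
$list = new PhutilWeightedList();
foreach ($bindings_by_device as $device_phid => $binding_uris) {
  foreach ($binding_uris as $uri) {
    $list->addKey($uri, 1.0 / count($binding_uris));
  }
}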

I could also vaguely imagine cases where repolb1:123 and repolb1:234 are really an LB-like node which routes traffic to VMs on different hardware (or something) that share a NAS disk? This seems bad and silly, but I think there's enough room to imagine useful things here that a configuration with multiple bindings to the same device isn't obviously wrong or clearly invalid.

2074

No -- for arrays, A + B means "take A, then add all the keys in B which do not yet exist". The + operator never changes the key for any value, even if it has to discard a value (because the key is already in use).

This is different from array_merge(A, B), which always retains all the values, even if it has to change keys.

In the construction R = array_select_keys(X, Y) + X, the arrays R and X always have exactly the same keys and values.

A way to think about + is $result = $values + $defaults;, e.g.:

$values = array(
  'name' => 'Mona Lisa',
);

$defaults = array(
  'name' => 'Untitled',
  'medium' => 'Oil on Canvas',
  'dimensions' => '77cm x 53cm',
);

var_dump($values + $defaults);

...produces:

array(3) {
  ["name"]=>
  string(9) "Mona Lisa"
  ["medium"]=>
  string(13) "Oil on Canvas"
  ["dimensions"]=>
  string(11) "77cm x 53cm"
}

I believe + is equivalent to this function in all cases:

function array_plus($u, $v) {
  $r = $u;
  foreach ($v as $key => $value) {
    if (array_key_exists($key, $r)) {
      continue;
    }
    $r[$key] = $value;
  }
  return $r;
}

I do think it could be reasonable to write a phutil_reorder_array_bringing_keys_to_front($array, $keys) or similar, since the behavior of this construction isn't obvious even though it's correct (although that name isn't very good).
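
A sketch of that helper under the placeholder name, using the existing array_select_keys() from libphutil:

function phutil_reorder_array_bringing_keys_to_front(array $array, array $keys) {
  // array_select_keys() picks the named keys in the order given; "+" then
  // appends every remaining key/value pair without changing any of them.
  return array_select_keys($array, $keys) + $array;
}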

To "sort of" test this in a real environment, I'm going to do this:

  1. Push a file with 128MB of garbage to rGITTEST.
  2. Immediately, time how long it takes to push a small followup change to a different branch.

I'll repeat these steps until writes (1) and (2) go to different nodes to get some rough sense of the synchronize cost when a followup write hits an unsynchronized node.

Then I'll deploy this change and repeat the test. With this change, I expect the second push to be unable to reach an unsynchronized node.

Big push is:

git-test-secure/ $ head -c 134217728 /dev/urandom > junk.128MB.random && git commit -am 'junk' && git push

Little push is:

git-test-secure-2/ $ echo stuff >> stuff.txt && git commit -am 'More Stuff' && time git push

I'm running these commands serially so that the second push's timing isn't inflated by waiting for the first push to finish writing, which emphasizes the timing difference between the two cases. I expect that running them in parallel would produce similar results, as long as the small push starts after the large push acquires the repository write lock.

I got lucky on the first attempt: the small push hit the wrong node and got a slow write. Here's the big push:

epriestley@orbital ~/scratch/git-test-secure $ head -c 134217728 /dev/urandom > junk.128MB.random && git commit -am 'junk' && git push
[write-chunks c16fb0b] junk
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 junk.128MB.random
# Push received by "secure003.phacility.net", forwarding to cluster host.
# Acquiring write lock for repository "rGITTEST"...
# Acquired write lock immediately.
# Acquiring read lock for repository "rGITTEST" on device "secure001.phacility.net"...
# Acquired read lock immediately.
# Device "secure001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "secure001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 128.04 MiB | 11.67 MiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
# Released cluster write lock.
To ssh://secure.phabricator.com/diffusion/GITTEST/git-test.git
   75cff16..c16fb0b  write-chunks -> write-chunks

Note that it went to secure001:

# Device "secure001.phacility.net" is already a cluster leader and does not need to be synchronized.

Here's the little push:

$ echo stuff >> stuff.txt && git commit -am 'More Stuff' && time git push
[small-changes b7dd276] More Stuff
 1 file changed, 1 insertion(+)
# Push received by "secure001.phacility.net", forwarding to cluster host.
# Acquiring write lock for repository "rGITTEST"...
# Acquired write lock immediately.
# Acquiring read lock for repository "rGITTEST" on device "secure002.phacility.net"...
# Acquired read lock after 9 second(s).
# Device "secure002.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "secure002.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 265 bytes | 265.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
# Released cluster write lock.
To ssh://secure.phabricator.com/diffusion/GITTEST/git-test.git
   97c8829..b7dd276  small-changes -> small-changes

real	0m17.111s
user	0m0.021s
sys	0m0.018s

Note that it went to secure002 (the unsynchronized node) and spent 9 seconds waiting for a read lock (although the synchronize had already started by the time I pushed -- the push did not cause the synchronize):

# Acquired read lock after 9 second(s).
# Device "secure002.phacility.net" is already a cluster leader and does not need to be synchronized.

Compare to a push when the node is already synchronized, which takes 4s instead of 17s:

$ echo stuff >> stuff.txt && git commit -am 'More Stuff' && time git push
[small-changes 95c84be] More Stuff
 1 file changed, 1 insertion(+)
# Push received by "secure002.phacility.net", forwarding to cluster host.
# Acquiring write lock for repository "rGITTEST"...
# Acquired write lock immediately.
# Acquiring read lock for repository "rGITTEST" on device "secure001.phacility.net"...
# Acquired read lock immediately.
# Device "secure001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "secure001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 267 bytes | 267.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
# Released cluster write lock.
To ssh://secure.phabricator.com/diffusion/GITTEST/git-test.git
   b7dd276..95c84be  small-changes -> small-changes

real	0m3.913s
user	0m0.024s
sys	0m0.018s

Now I'll deploy this and repeat the test. My expectation is that I'll no longer be able to hit the unsynchronized case (writes will never be sent to a node which needs to synchronize) and all writes will be closer to 4s than 17s.

This revision was automatically updated to reflect the committed changes.

I made another large push, then a series of small pushes. The big push hit secure001. The small pushes then hit:

secure001 8s
secure001 6s
secure001 6s
secure001 4s
secure001 5s
secure002 5s

This is consistent with expectations, i.e. that I won't be able to get a write to go to secure002 until it synchronizes and won't be able to get a 17-ish-second push anymore. Another round of the same also sent the big push to secure001 and then produced these small push timings:

secure001 5s
secure001 5s
secure001 7s
secure001 7s
secure002 5s

Finally, I pushed small changes without pushing a large change first. The expectation is that these will go to the two nodes randomly, as long as I wait a bit between pushes to let the cluster sync:

secure002 5s
secure001 5s
secure002 4s
secure002 5s
secure002 4s
secure001 5s
secure002 5s

So that also appears to align with expectations. We can't really make any general statements based on this data, but the routing changes appear to have the expected/intended behavior and improve at least one real-world use case.