
Improve cluster write feedback and write routing
Open, Normal, Public

Description

See PHI480. An install with a master/master repository cluster has seen feedback from users about this message on the SSH wire flow:

# Waiting up to 120 second(s) for a cluster write lock...

Specifically:

  • This message is missing an English translation.
  • This message isn't as clear as it could be that the lock will be acquired instantaneously when it's available (i.e., we aren't wasting time in sleep(...)).
  • This message isn't as clear as it could be about the scope of the lock (global per repository).

It would be more helpful if waiting for a write lock gave you more information about who holds the lock, so it's clear that the server is doing something and that you aren't being made to wait for no reason or because of a bug:

Acquiring the repository write lock on rXYZ...
Waiting for epriestley to finish pushing abcdef1234 to master...
...

Beyond this, we could improve the write routing algorithm. The write process looks like this:

  • Choose a random node.
  • Acquire the repository write lock.
  • Acquire the host read lock.
  • Synchronize.
  • Release the host read lock.
  • Accept the write.
  • Release the repository write lock.

This synchronization step can be skipped if we get lucky and randomly choose the same node that the writer ahead of us chose, since that node already has the write we just waited for.
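The flow above can be sketched as follows. This is an illustrative model only, not Phabricator code: `synchronize`, `accept_write`, and the log structure are invented names, and the locks themselves are elided.

```python
import random

def synchronize(node, log):
    """Stand-in for pulling the latest writes onto `node` under the
    host read lock (steps 3-5 above)."""
    log.append(('sync', node))

def accept_write(nodes, last_write_node, log):
    """One pass through the write flow. `last_write_node` is the node
    the previous writer pushed to; returns the node used now."""
    node = random.choice(nodes)        # 1. Choose a random node.
    # 2. Acquire the repository write lock (elided in this sketch).
    if node != last_write_node:
        # The chosen node is stale: it must synchronize before
        # accepting the write.
        synchronize(node, log)
    # If we happened to pick the previous writer's node, the sync
    # above is skipped entirely: that node already has the write.
    log.append(('write', node))        # 6. Accept the write.
    return node                        # 7. Release the write lock.
```

With a single-node pool this is deterministic, which makes the lucky case easy to see: writing to the same node twice produces no `sync` entry, while switching nodes does.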

We can make clients exceptionally lucky by routing like this:

  • If possible, choose the node with a writer holding the lock.
  • If possible, choose a leader node.
  • Choose a random node.

Since writes are always one-at-a-time, there's effectively no downside to this: we can't hit a "thundering herd" problem with write traffic.
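The preference order above amounts to a small selection function. The sketch below is hypothetical (the parameter names and the notion of a "leader" list as a plain collection are assumptions, not Phabricator APIs), but it captures the three-tier fallback:

```python
import random

def choose_write_node(nodes, lock_holder_node=None, leader_nodes=()):
    """Preference-ordered node selection for cluster writes.

    `lock_holder_node`: the node the writer currently holding the lock
    is pushing to, if any. `leader_nodes`: nodes known to be up to date.
    """
    # 1. Prefer the lock holder's node: once the lock is released,
    #    that node needs no synchronization at all.
    if lock_holder_node in nodes:
        return lock_holder_node
    # 2. Otherwise prefer any up-to-date "leader" node.
    leaders = [n for n in nodes if n in leader_nodes]
    if leaders:
        return random.choice(leaders)
    # 3. Fall back to the current behavior: a random node.
    return random.choice(nodes)
```

Because writes are serialized by the repository write lock, this never concentrates concurrent load on one node the way biased read routing can.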

See also T10884 for read routing, which is a little trickier, since a bad read routing algorithm can kill a cluster by focusing too much traffic on a small set of nodes.


I plan to make these changes:

  • Modify the PushEvent table to contain more information about locks and waits (e.g., active queue status; how long the push waited for the write lock; how long the push spent synchronizing).
  • Modify the "Waiting..." prompt to show more information by querying the PushEvent table.

Based on this data, we can make a decision about whether smarter write routing is worthwhile by seeing how much time pushes are spending synchronizing after acquiring the write lock.

(I suspect it will be, and the bar to implement this isn't very high, but it would still be nice to have some data suggesting that it will have a real effect: although I think the smarter routing is strictly better, it is more complex than routing with shuffle().)
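The measurement itself is simple in principle: wrap the two phases with a monotonic clock and store both durations on the push record. This sketch assumes hypothetical callables for the three phases; none of these names exist in Phabricator.

```python
import time

def timed_push(acquire_write_lock, synchronize, accept_write):
    """Run a push and return the two durations an expanded PushEvent
    record would store: time waiting for the write lock, and time
    spent synchronizing after acquiring it."""
    t0 = time.monotonic()
    acquire_write_lock()
    lock_wait = time.monotonic() - t0

    t1 = time.monotonic()
    synchronize()
    sync_time = time.monotonic() - t1

    accept_write()
    # If sync_time is a large share of total push time across many
    # pushes, smarter routing is worth implementing.
    return {'lock_wait': lock_wait, 'sync_time': sync_time}
```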

Event Timeline

epriestley triaged this task as Normal priority. Mar 19 2018, 2:07 PM
epriestley created this task.

Modify the PushEvent table to contain more information about locks and waits (e.g., active queue status; how long the push waited for the write lock; how long the push spent synchronizing).

This is tricky because the PushEvent is actually written by DiffusionCommitHookEngine. This runs in a subprocess two levels down:

- sshd
  - ssh-exec             <--- Where we can measure locks.
    - git receive-pack
      - commit-hook      <--- Where we write PushEvent.

So there are a couple of issues:

  • we don't have a great way to pass timing information down to the commit hook;
  • the timing isn't actually complete when PushEvent finishes writing.

I can come up with two approaches to deal with this:

  1. We can write to the SSH log instead. This isn't as nice since it disconnects timing information from push information and puts application-specific information into the SSH log. The advantage is that ssh-exec fully manages the SSH log.
  2. We can try to tie the SSH request to the PushEvent by passing some request identifier down or some log ID up.

I tentatively favor approach (2), and passing something down seems better than passing something up, so I'm going to see if I can get traction on that.
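One natural channel for passing an identifier down is the process environment, since child processes inherit it across the whole `git receive-pack` → commit-hook chain. The sketch below assumes that channel is acceptable; the variable name `CLUSTER_REQUEST_ID` and both function names are invented for illustration.

```python
import os
import subprocess
import sys
import uuid

# Name invented for this sketch, not a real Phabricator variable.
REQUEST_ID_VAR = 'CLUSTER_REQUEST_ID'

def run_with_request_id(argv):
    """Run a child process (e.g. `git receive-pack`) with a fresh
    request identifier exported in its environment. This is where
    ssh-exec, which measures the locks, would generate the ID."""
    env = dict(os.environ)
    env[REQUEST_ID_VAR] = uuid.uuid4().hex
    return subprocess.run(argv, env=env)

def read_request_id():
    """Called two levels down, inside the commit hook, to tie the
    PushEvent row it writes back to the originating SSH request."""
    return os.environ.get(REQUEST_ID_VAR)
```

With the identifier on both ends, ssh-exec can update the matching PushEvent row with lock and synchronization timings after the hook finishes, addressing the "timing isn't complete when PushEvent is written" problem.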

@epriestley: our current Phabricator version is c46be2a70b4db72d61e76690ef095384de2c3f91, which landed on 04/13. Will there be any issues if we directly cherry-pick the two recent commits without pulling any other changes?
rP51073b972ece: Try to route cluster writes to nodes which won't need to synchronize first. (Wed, Oct 17, 3:08 PM)
rPbc6c8c0e93a7: Explicitly shuffle nodes before selecting one for cluster sync.

Also, are there any dependent changes to arc, lib, or the schema if we cherry-pick these two commits?

I think you can cherry-pick those changes safely. I suspect further changes connected to T13211 and/or T10884 in the near future are likely to be difficult to cherry-pick, though, so ideally you should try to upgrade.

(It's possible you'll also need D19701 and D19702, and those need D19720 (in libphutil/), so even just picking these changes may be a bit of a mess.)