Make cluster repositories more resistant to freezing
Closed, Public

Authored by epriestley on Apr 24 2016, 5:59 PM.

Details

Reviewers

chad

Maniphest Tasks

T10860: After an inconsistent cluster repository write, consider just ignoring the lock

Commits

rP892a9a1f07d9: Make cluster repositories more resistant to freezing

Summary

Ref T10860. This allows us to recover if the connection to the database is lost during a push.

Previously, if we lost the connection to the master database during a push, we froze the repository. This is very safe, but not very operator-friendly, since someone has to unfreeze it manually.

We don't need to be quite this aggressive about freezing things. The repository state is still consistent after we've "upgraded" the lock by setting isWriting = 1, so we're actually fine even if we lose the global lock.

Instead of freezing the repository immediately, wait in a loop for a few minutes for the master to come back up. If it recovers, we can release the lock and everything will be OK again.
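To illustrate the retry idea, here is a minimal sketch in Python. It is not the code from this change, which lives in Phabricator's PHP codebase. The names `release_with_retry` and `LockReleaseFailed`, the `release` callable, and the use of `ConnectionError` to signal a lost master connection are all assumptions made for the example. The 300 second default mirrors the recovery window shown in the test plan output.

```python
import time


class LockReleaseFailed(Exception):
    """Raised when the lock could not be released before the deadline."""


def release_with_retry(release, max_wait=300.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call release() until it succeeds or max_wait seconds have elapsed.

    `release` is a hypothetical callable that raises ConnectionError while
    the master database is unreachable. Returns the number of attempts made.
    """
    deadline = clock() + max_wait
    attempts = 0
    while True:
        attempts += 1
        try:
            release()
            return attempts
        except ConnectionError:
            if clock() >= deadline:
                # Out of time: the caller falls back to the old behavior
                # (leaving the repository frozen for an operator to fix).
                raise LockReleaseFailed(
                    "gave up after %d attempt(s)" % attempts)
            sleep(interval)
```

The clock and sleep functions are injectable only so the loop can be exercised without real waiting. A caller would normally pass just the release callable.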

Basically, the changes are:

If we can't release the lock at first, sit in a loop trying really hard to release it for a while.
Add a unique lock identifier so we can be certain we're only releasing our lock no matter what else is going on.
Do the version reads on the same connection holding the lock, so we can be sure we haven't lost the lock before we do that read.
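The unique lock identifier can be sketched as an owner token that must match for a release to take effect. This is a rough Python illustration against an in-memory SQLite database, not the actual schema or code from this change: the table `repo_version`, its columns, and the functions `open_db`, `begin_write`, and `finish_write` are hypothetical. Everything runs over a single connection, loosely mirroring the point about doing the version reads on the connection that holds the lock.

```python
import sqlite3
import uuid


def open_db():
    """Create an in-memory database with one (hypothetical) version row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repo_version ("
        " repo_id INTEGER PRIMARY KEY,"
        " version INTEGER NOT NULL,"
        " is_writing INTEGER NOT NULL,"
        " lock_owner TEXT)"
    )
    conn.execute("INSERT INTO repo_version VALUES (1, 0, 0, NULL)")
    return conn


def begin_write(conn, repo_id):
    """Mark the repository as being written and return our owner token."""
    owner = uuid.uuid4().hex
    cur = conn.execute(
        "UPDATE repo_version SET is_writing = 1, lock_owner = ?"
        " WHERE repo_id = ? AND is_writing = 0",
        (owner, repo_id),
    )
    if cur.rowcount != 1:
        raise RuntimeError("repository is already being written")
    return owner


def finish_write(conn, repo_id, owner):
    """Clear the write marker only if the row still carries our token."""
    cur = conn.execute(
        "UPDATE repo_version"
        " SET version = version + 1, is_writing = 0, lock_owner = NULL"
        " WHERE repo_id = ? AND lock_owner = ?",
        (repo_id, owner),
    )
    # False means the row was not ours to release, so nothing was changed.
    return cur.rowcount == 1
```

Because the release is conditional on the token, retrying it after a reconnect cannot clear a marker set by some other writer in the meantime.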

Test Plan

Added a sleep(10) after accepting the write but before releasing the lock so I could run mysqld stop and force this issue to occur.
Pushed like this:

$ echo D >> record && git commit -am D && git push
[master 707ecc3] D
 1 file changed, 1 insertion(+)
# Push received by "local001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "local001.phacility.net"...
# Acquired read lock immediately.
# Device "local001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "local001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 254 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
BEGIN SLEEP

Here, I stopped mysqld from the CLI in another terminal window.

END SLEEP
# CRITICAL. Failed to release cluster write lock!
# The connection to the master database was lost while receiving the write.
# This process will spend 300 more second(s) attempting to recover, then give up.

Here, I started mysqld again.

# RECOVERED. Link to master database was restored.
# Released cluster write lock.
To ssh://local@localvault.phacility.com/diffusion/26/locktopia.git
   2cbf87c..707ecc3  master -> master