
Make cluster repositories more resistant to freezing
Closed, Public

Authored by epriestley on Apr 24 2016, 5:59 PM.
Tags
None
Subscribers
None

Details

Summary

Ref T10860. This allows us to recover if the connection to the database is lost during a push.

Previously, if we lost the connection to the master database during a push, we froze the repository. This is very safe, but not very operator-friendly, because someone has to unfreeze it manually.

We don't need to be quite this aggressive about freezing things. The repository state is still consistent after we've "upgraded" the lock by setting isWriting = 1, so we're actually fine even if we lose the global lock.

Instead of freezing the repository immediately, wait in a loop for a few minutes for the master to come back up. If it recovers, we can release the lock and everything will be OK again.

Basically, the changes are:

  • If we can't release the lock at first, sit in a loop trying really hard to release it for a while.
  • Add a unique lock identifier so we can be certain we're only releasing our lock no matter what else is going on.
  • Do the version reads on the same connection holding the lock, so we can be sure we haven't lost the lock before we do that read.
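
The first two changes above can be illustrated with a minimal sketch. This is not the Phabricator implementation (which is PHP); it is a Python model under stated assumptions. `FakeLockTable`, `ConnectionLost`, and `release_with_retry` are hypothetical names invented for illustration, and the in-memory table stands in for the database row that records the lock holder and the isWriting flag.

```python
import time
import uuid


class ConnectionLost(Exception):
    """Hypothetical error: the master database is unreachable."""


class FakeLockTable:
    """In-memory stand-in for the row that records the write lock holder."""

    def __init__(self):
        self.owner = None    # unique identifier of the current holder
        self.is_writing = 0
        self.outage = 0      # number of upcoming calls that should fail

    def _check(self):
        # Simulate a database outage for the next `outage` calls.
        if self.outage > 0:
            self.outage -= 1
            raise ConnectionLost("master database unreachable")

    def acquire(self):
        self._check()
        if self.owner is not None:
            raise RuntimeError("lock is already held")
        self.owner = uuid.uuid4().hex
        self.is_writing = 1
        return self.owner

    def release(self, lock_id):
        self._check()
        # Only clear the row if it still carries our identifier.
        if self.owner != lock_id:
            return False
        self.owner = None
        self.is_writing = 0
        return True


def release_with_retry(table, lock_id, max_wait=300.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Keep trying to release our lock until it works or max_wait elapses.

    Returns True if our lock was released. Returns False if we gave up or
    the lock was not ours; the caller would then fall back to freezing.
    """
    deadline = clock() + max_wait
    while True:
        try:
            return table.release(lock_id)
        except ConnectionLost:
            if clock() >= deadline:
                return False
            sleep(interval)
```

Passing `clock` and `sleep` as parameters is only a convenience so the loop can be exercised without real delays. The 300-second default mirrors the recovery window reported in the test plan output.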

Test Plan

  • Added a sleep(10) after accepting the write but before releasing the lock so I could run mysqld stop and force this issue to occur.
  • Pushed like this:
$ echo D >> record && git commit -am D && git push
[master 707ecc3] D
 1 file changed, 1 insertion(+)
# Push received by "local001.phacility.net", forwarding to cluster host.
# Waiting up to 120 second(s) for a cluster write lock...
# Acquired write lock immediately.
# Waiting up to 120 second(s) for a cluster read lock on "local001.phacility.net"...
# Acquired read lock immediately.
# Device "local001.phacility.net" is already a cluster leader and does not need to be synchronized.
# Ready to receive on cluster host "local001.phacility.net".
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 254 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
BEGIN SLEEP
  • Here, I stopped mysqld from the CLI in another terminal window.
END SLEEP
# CRITICAL. Failed to release cluster write lock!
# The connection to the master database was lost while receiving the write.
# This process will spend 300 more second(s) attempting to recover, then give up.
  • Here, I started mysqld again.
# RECOVERED. Link to master database was restored.
# Released cluster write lock.
To ssh://local@localvault.phacility.com/diffusion/26/locktopia.git
   2cbf87c..707ecc3  master -> master

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

epriestley retitled this revision to "Make cluster repositories more resistant to freezing".
epriestley updated this object.
epriestley edited the test plan for this revision.
epriestley added a reviewer: chad.
chad edited edge metadata.
This revision is now accepted and ready to land. (Apr 25 2016, 3:22 PM)
This revision was automatically updated to reflect the committed changes.