
After an inconsistent cluster repository write, consider just ignoring the lock
Closed, Resolved · Public


Currently, we freeze repositories after a committed but unacknowledged write is aborted. This errs very heavily on the side of safety.

The situation is:

  • A user begins pushing a change.
  • Before the push finishes, the receiving server dies or the repository/database link is severed.

We currently respond to this by freezing the repository and making an operator sort it out.

This rule may be too paranoid. In particular, I suspect we may end up freezing repositories too aggressively if there is a push in flight while the database is restarted, since the global lock will drop and we won't release the durable lock. In general:

  • We don't acknowledge these pushes, so the client doesn't see them succeed. It should be reasonable to roll them back.
  • We can sometimes automatically roll back if there are 2+ nodes in the cluster. Even with many nodes, it's possible we were the only leader before the push, so we may not be able to roll back reliably, and we can never roll back reliably with only one node. That is, it's technically possible to implement "roll back", but it's complicated, there's no Git command for it so we'd have to do it ourselves, and it wouldn't be atomic.
  • The actual confusing case here is probably not a data loss case, since we won't acknowledge the push and data loss is what the client should expect -- it's fine and consistent if we lose the data. The actual problem is the opposite: accepting the push even though we didn't acknowledge it. For human pushers, this is probably fine. For automated pushers, it might not be so great. And it's definitely inconsistent: if we didn't say "yes, we got the push", the push shouldn't land.

A possible softening of the protocol might be:

  • When we acquire the lock, write a unique process ID to the lock table.
  • After we receive a push, try very very very very hard to release the lock if we're the thread holding it, even if we lost the global lock (e.g., sit there for 15 minutes just retrying to connect to the database so we can unfreeze the repository).

This would still be very paranoid and allow no inconsistencies, but should reduce the number of cases where we get stuck with a frozen repository.
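The two-part softening above (write a unique holder ID when acquiring the lock, then retry releasing it until a long deadline) could be sketched as follows. This is a hedged illustration, not Phabricator's actual code: `DurableLockTable`, `release_with_retry`, and the repository name are hypothetical, and the in-memory table stands in for the real database-backed lock table.

```python
import os
import time
import uuid


class DurableLockTable:
    """Hypothetical stand-in for the database-backed lock table."""

    def __init__(self):
        self.rows = {}          # repository name -> holder ID
        self.fail_releases = 0  # simulate a severed database link

    def acquire(self, repository):
        # When we acquire the lock, write a unique process ID to the lock table.
        holder = '%d:%s' % (os.getpid(), uuid.uuid4().hex)
        self.rows[repository] = holder
        return holder

    def release(self, repository, holder):
        if self.fail_releases > 0:
            self.fail_releases -= 1
            raise ConnectionError('database link severed')
        # Only release if we're the thread holding the lock: the stored
        # holder ID must match the one we wrote on acquire.
        if self.rows.get(repository) == holder:
            del self.rows[repository]


def release_with_retry(table, repository, holder, deadline_s=900.0,
                       base_delay_s=0.01):
    """Try very hard to release the durable lock, retrying with backoff
    until a deadline (the text suggests on the order of 15 minutes)."""
    deadline = time.monotonic() + deadline_s
    delay = base_delay_s
    while True:
        try:
            table.release(repository, holder)
            return True
        except ConnectionError:
            if time.monotonic() >= deadline:
                # Give up: the repository stays frozen for an operator.
                return False
            time.sleep(delay)
            delay = min(delay * 2, 1.0)
```

The holder ID is what makes the retry safe: a later process that re-acquires the lock writes a different ID, so a stale releaser can't break someone else's lock.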

Maybe also helpful would be an operational way to drain VCS write traffic before a database server restart. We can do this in a practical sense by acquiring every write lock, then writing a stateful lock.

When you need to restart the master database, the protocol would become something like:

$ ./bin/repository write-lock --enable # Drain and lock ALL VCS write traffic.
$ service mysqld restart
$ ./bin/repository write-lock --disable # Release VCS write traffic lock.
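The drain step described above (acquire every per-repository write lock so in-flight pushes finish, then write a stateful lock) could be sketched like this. This is an illustrative assumption, not Phabricator's implementation: `WriteLockRegistry` and its method names are hypothetical, and `threading.Lock` stands in for the real durable locks.

```python
import threading
from contextlib import contextmanager


class WriteLockRegistry:
    """Hypothetical registry of per-repository write locks plus one
    stateful global write lock."""

    def __init__(self, repositories):
        self.repo_locks = {name: threading.Lock() for name in repositories}
        self.global_write_locked = False

    @contextmanager
    def drain_all_writes(self):
        acquired = []
        try:
            # Acquire every write lock in a stable order, so any push
            # already in flight completes and new pushes block.
            for name in sorted(self.repo_locks):
                lock = self.repo_locks[name]
                lock.acquire()
                acquired.append(lock)
            # All writes drained: now record the stateful global lock.
            self.global_write_locked = True
            yield
        finally:
            self.global_write_locked = False
            for lock in reversed(acquired):
                lock.release()
```

The operator-facing flow then maps onto this directly: `--enable` enters the context (drain, then lock), the database restart happens inside it, and `--disable` exits it.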

Event Timeline

epriestley created this object with edit policy "All Users".
epriestley lowered the priority of this task from Normal to Low. · Apr 24 2016, 6:39 PM

> After we receive a push, try very very very very hard to release the lock if we're the thread holding it, even if we lost the global lock

D15789 + D15790 + D15791 + D15792 implement this.

I don't currently plan to implement any automatic method of ignoring/breaking a frozen lock, since it's potentially unsafe and I think "try very hard to release it" may be good enough to make remaining failures very rare/legitimate.

I don't currently plan to implement a global administrative write lock, but am open to it if we run into issues in the wild that it would help resolve. I suspect "try harder to release locks" will just fix this problem automatically in reasonable cases (routine restarts of MySQL).

Going any further than this risks some inconsistency / data loss, and we haven't managed to break this yet, so I'm going to consider this sufficiently robust until we see evidence otherwise.