
After an inconsistent cluster repository write, consider just ignoring the lock
Closed, Resolved · Public

Description

Currently, we freeze repositories after a committed but unacknowledged write is aborted. This errs very heavily on the side of safety.

The situation is:

  • A user begins pushing a change.
  • Before the push finishes, the receiving server dies or the repository/database link is severed.

We currently respond to this by freezing the repository and making an operator sort it out.

This rule may be too paranoid. In particular, I suspect we may end up freezing repositories too aggressively if there is a push in flight while the database is restarted, since the global lock will drop and we won't release the durable lock. In general:

  • We don't acknowledge these pushes, so the client doesn't see them succeed. It should be reasonable to roll them back.
  • We can sometimes roll back automatically if there are 2+ nodes in the cluster. But even with a lot of nodes, it's possible we were the only leader before the push, so we may not be able to roll back reliably, and we can never roll back reliably with only one node. That is, it's technically possible to implement "roll back", but it's complicated, there's no git command for it so we'd have to do it ourselves, and it wouldn't be atomic (see the sketch after this list).
  • The genuinely confusing case here is probably not a data loss case: we won't acknowledge the push, so data loss is exactly what the client should expect -- it's fine and consistent if we lose the data. The actual problem is the opposite: keeping the push even though we didn't acknowledge it. For human pushers, this is probably fine. For automated pushers, it might not be so great. And it's definitely inconsistent: if we didn't say "yes, we got the push", we shouldn't end up with the push.
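
To make the rollback point concrete, here's a rough illustration (in Python, not Phabricator's actual code) of what a manual rollback would have to do: snapshot every ref before the push, then reset or delete refs one at a time afterward. None of these helper names exist in Phabricator, and the ref-by-ref loop is exactly why the operation can't be atomic.

import subprocess

def snapshot_refs(repo_path):
    # Record every ref and its current commit hash before accepting the push.
    out = subprocess.check_output(
        ['git', 'for-each-ref', '--format=%(refname) %(objectname)'],
        cwd=repo_path, text=True)
    return dict(line.split() for line in out.splitlines())

def roll_back(repo_path, snapshot):
    # Reset each ref to its pre-push value, one ref at a time. If this
    # process dies partway through, the repository is left with a mixture
    # of old and new refs: this is the non-atomicity problem.
    current = snapshot_refs(repo_path)
    for ref, old_hash in snapshot.items():
        subprocess.check_call(['git', 'update-ref', ref, old_hash],
                              cwd=repo_path)
    # Refs created by the push have no old value; delete them instead.
    for ref in set(current) - set(snapshot):
        subprocess.check_call(['git', 'update-ref', '-d', ref],
                              cwd=repo_path)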

A possible softening of the protocol might be:

  • When we acquire the lock, write a unique process ID to the lock table.
  • After we receive a push, try very very very very hard to release the lock if we're the thread holding it, even if we lost the global lock (e.g., sit there for 15 minutes just retrying to connect to the database so we can unfreeze the repository). A rough sketch of this follows below.

This would still be very paranoid and allow no inconsistencies, but should reduce the number of cases where we get stuck with a frozen repository.
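
As a rough sketch of the second bullet -- using a hypothetical lock table and helper names, not anything that exists in Phabricator today -- the idea is to record a unique owner when we take the lock, then keep reconnecting and retrying for a long time to clear that owner after the push:

import os
import time
import uuid

RETRY_WINDOW = 15 * 60   # keep retrying for up to 15 minutes
RETRY_DELAY = 5          # seconds between reconnection attempts

def acquire_write_lock(conn, repo_id):
    # Write a unique owner into the lock row, so we can later prove that
    # we (and not some other writer) are the one holding the lock.
    lock_id = '%s:%d:%s' % (os.uname().nodename, os.getpid(), uuid.uuid4())
    conn.execute(
        'UPDATE repository_lock SET owner = %s'
        ' WHERE repo = %s AND owner IS NULL',
        (lock_id, repo_id))
    return lock_id

def release_write_lock_hard(connect, repo_id, lock_id):
    # Try very hard to release the lock, even if the original database
    # connection was lost, by reconnecting and retrying for a while.
    deadline = time.time() + RETRY_WINDOW
    while time.time() < deadline:
        try:
            conn = connect()   # fresh connection; the old one may be dead
            conn.execute(
                'UPDATE repository_lock SET owner = NULL'
                ' WHERE repo = %s AND owner = %s',
                (repo_id, lock_id))
            return True
        except Exception:
            time.sleep(RETRY_DELAY)
    # We could not release the lock; leave the repository frozen for an
    # operator to sort out, exactly as today.
    return False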

It might also help to have an operational way to drain VCS write traffic before a database server restart. In practice, we can do this by acquiring every write lock, then writing a stateful lock.

When you needed to restart the master database, the protocol would become something like:

$ ./bin/repository write-lock --enable # Drain and lock ALL VCS write traffic.
$ service mysqld restart
$ ./bin/repository write-lock --disable # Release VCS write traffic lock.
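
Internally, the write-lock --enable step could amount to something like the following sketch (hypothetical names only): acquiring each per-repository write lock waits out in-flight pushes, and the durable row keeps new pushes out across the restart.

def enable_global_write_lock(db, repositories):
    # Drain VCS write traffic: taking every per-repository write lock waits
    # out any in-flight pushes; the durable row then keeps new pushes out
    # even after the database restarts.
    for repo in repositories:
        repo.write_lock.acquire()
    db.execute("INSERT INTO global_lock (name) VALUES ('vcs-write')")
    for repo in repositories:
        repo.write_lock.release()

def disable_global_write_lock(db):
    # Re-enable VCS write traffic after the restart.
    db.execute("DELETE FROM global_lock WHERE name = 'vcs-write'")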

Event Timeline

epriestley created this object with edit policy "All Users".
epriestley lowered the priority of this task from Normal to Low. Apr 24 2016, 6:39 PM

After we receive a push, try very very very very hard to release the lock if we're the thread holding it, even if we lost the global lock

D15789 + D15790 + D15791 + D15792 implement this.

I don't currently plan to implement any automatic method of ignoring/breaking a frozen lock, since it's potentially unsafe and I think "try very hard to release it" may be good enough to make the remaining failures very rare and legitimate.

I don't currently plan to implement a global administrative write lock, but am open to it if we run into issues in the wild that it would help resolve. I suspect "try harder to release locks" will just fix this problem automatically in reasonable cases (routine restarts of MySQL).

Going any further than this risks some inconsistency or data loss, and we haven't managed to break the current approach yet, so I'm going to consider it sufficiently robust until we see evidence otherwise.