Currently, we freeze repositories after a committed but unacknowledged write is aborted. This errs very heavily on the side of safety.
The situation is:
- A user begins pushing a change.
- Before the push finishes, the receiving server dies or the repository/database link is severed.
We currently respond to this by freezing the repository and making an operator sort it out.
This rule may be too paranoid. In particular, I suspect we may end up freezing repositories too aggressively if there is a push in flight while the database is restarted, since the global lock will drop and we won't release the durable lock. In general:
- We don't acknowledge these pushes, so the client doesn't see them succeed. It should be reasonable to roll them back.
- We can sometimes automatically roll back if there are 2+ nodes in the cluster. Even if there are a lot of nodes, it's possible we were the only leader before the push, so we may not be able to roll back reliably. And we can never roll back reliably with only one node. That is, it's technically possible to implement "roll back", but it's complicated and there's no git command for it so we'd have to do it ourselves and it wouldn't be atomic.
- The actual confusing case here is probably not a data loss case, since we won't acknowledge the push and data loss is what the client should expect -- it's fine and consistent if we lose the data. The actual problem is the opposite: accepting the push even though we didn't acknowledge it. For human pushers, this is probably fine. For automated pushers, this might not be so great. And it's definitely inconsistent: if we didn't say "yes, we got the push", we shouldn't get the push.
A possible softening of the protocol might be:
- When we acquire the lock, write a unique process ID to the lock table.
- After we receive a push, try very very very very hard to release the lock if we're the thread holding it, even if we lost the global lock (e.g., sit there for 15 minutes just retrying to connect to the database so we can unfreeze the repository).
This would still be very paranoid and allow no inconsistencies, but should reduce the number of cases where we get stuck with a frozen repository.
Maybe also helpful would be an operational way to drain VCS write traffic before a database server restart. We can do this in a practical sense by acquiring every write lock, then writing a stateful lock.
When you needed to restart the master database, the protocol would become something like:
$ ./bin/repository write-lock --enable # Drain and lock ALL VCS write traffic. $ service mysqld restart $ ./bin/repository write-lock --disable # Release VCS write traffic lock.