Page MenuHomePhabricator

Until MySQL 5.7, each MySQL connection may hold only one simultaneous lock
Closed, ResolvedPublic

Description

See PHI2009. Until MySQL 5.7, GET_LOCK(...) releases other locks held on the same connection as a side effect:

Before MySQL 5.7, only a single simultaneous lock can be acquired and GET_LOCK() releases any existing lock.
https://dev.mysql.com/doc/refman/5.7/en/locking-functions.html#function_get-lock

That is:

mysql> GET_LOCK('A', 0);
msyql> GET_LOCK('B', 0); # Releases lock "A"!

I think this has escaped notice through a combination of factors:

  • The behavior is wildly surprising.
  • Most development/testing occurs against newer versions of MySQL that don't have this behavior.
  • We currently only stack locks in one specific workflow (when accepting writes against non-leader repository nodes).
  • This is only observable in a multi-master writable cluster with a fairly high write rate.
  • (See Below) The code accidentally attempts to prevent it.
  • (See Below) The reproduction case is even more complicated and subtle than I initially believed, and requires an external connection be improperly returned to the connection pool after a failure to acquire a write lock.

Most test environments don't have this behavior, and secure doesn't have a high enough write rate to hit it.

In the short term, the fix is:

  • Never issue GET_LOCK() on a connection already holding a lock.

In the longer term, perhaps:

  • Require MySQL 5.7 or newer, or condition this logic on old versions of MySQL, since it has a small performance cost and the old GET_LOCK() behavior is wholly ridiculous.

Event Timeline

epriestley triaged this task as Normal priority.Mar 2 2021, 7:27 PM
epriestley created this task.

This is actually very subtle.

In addition to the above, we usually dodged this because PhabricatorGlobalLock attempts to always use new connections. This is desirable given MySQL's behavior, but also entirely by accident, because the first version of PhabricatorGlobalLock (in D2864) just held a transaction open instead of using GET_LOCK(), which would have required a unique connection.

This means that it's actually fairly difficult to accidentally acquire two simultaneous locks on the same connection. However, we can do it like this:

  1. Set an external connection.
  2. Attempt to acquire lock A.
  3. Lock acquisition fails.
  4. Attempt to acquire lock A.
  5. Lock acquisition succeeds.
  6. Attempt to acquire lock B.

In step (3), the external connection is incorrectly returned to the connection pool, since D21369. In step (4), we acquire the first lock on the connection. In step (6), we acquire a second lock on the same connection. In MySQL versions older than 5.5, this releases the lock from step (4).

Since we use an external connection to guarantee that the ephemeral and durable write locks are on the same host, this particular perfect storm of conditions can occur under high write volume on a multi-master cluster.

I deployed this to the hosts affected by PHI2009 yesterday, and it appears to have resolved the problem.

It may still be worthwhile to try to navigate the MySQL 5.7 issue as part of some future change, like T11908 (in MySQL 5.7 or newer, we do not need a unique connection per lock) but the impact is small and the code is generally in a reasonable state after these changes.