Page MenuHomePhabricator

PhabricatorTaskmasterDaemon getting into MySQL deadlocks when executing PhabricatorApplicationTransactionPublishWorker tasks
Closed, ResolvedPublic

Description

After I've upgraded to "36 week stable release", PhabricatorTaskmasterDaemon started to fail when executing PhabricatorApplicationTransactionPublishWorker tasks.

Tried to launch workers separately, tried with bin/phd start, tried launch single worker - still getting same errors. Some tasks are failing few times, some are failing over 20 times. Some get executed sooner or later - some do not. This fills the queue, workers needs to be restarted because exception and other tasks are delayed.

Currently I'm using python script to cancel tasks via bin/worker cancel if their class is PhabricatorApplicationTransactionPublishWorker and failure count is at least 1.

Any ideas what can cause this issue and how to solve it?

[2015-09-09 23:30:09] EXCEPTION: (PhutilProxyException) Error while executing Task ID 35057595. {>} (AphrontDeadlockQueryException) #1213: Deadlock found when trying to get lock; try restarting transaction at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:293]
  <...>
  #0 <#2> AphrontBaseMySQLDatabaseConnection::throwQueryCodeException(integer, string) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:275]
  #1 <#2> AphrontBaseMySQLDatabaseConnection::throwQueryException(mysqli) called at [<phutil>/src/aphront/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:181]
  #2 <#2> AphrontBaseMySQLDatabaseConnection::executeRawQuery(string) called at [<phutil>/src/xsprintf/queryfx.php:6]
  #3 <#2> queryfx(AphrontMySQLiDatabaseConnection, string, string, string) called at [<phabricator>/src/applications/feed/PhabricatorFeedStoryPublisher.php:185]
  #4 <#2> PhabricatorFeedStoryPublisher::insertNotifications(integer, array) called at [<phabricator>/src/applications/feed/PhabricatorFeedStoryPublisher.php:134]
  #5 <#2> PhabricatorFeedStoryPublisher::publish() called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:2764]
  #6 <#2> PhabricatorApplicationTransactionEditor::publishFeedStory(DifferentialRevision, array, array) called at [<phabricator>/src/applications/transactions/editor/PhabricatorApplicationTransactionEditor.php:1068]
  #7 <#2> PhabricatorApplicationTransactionEditor::publishTransactions(DifferentialRevision, array) called at [<phabricator>/src/applications/transactions/worker/PhabricatorApplicationTransactionPublishWorker.php:21]
  #8 <#2> PhabricatorApplicationTransactionPublishWorker::doWork() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorWorker.php:91]
  #9 <#2> PhabricatorWorker::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/storage/PhabricatorWorkerActiveTask.php:162]
  #10 <#2> PhabricatorWorkerActiveTask::executeTask() called at [<phabricator>/src/infrastructure/daemon/workers/PhabricatorTaskmasterDaemon.php:22]
  #11 PhabricatorTaskmasterDaemon::run() called at [<phutil>/src/daemon/PhutilDaemon.php:183]
  #12 PhutilDaemon::execute() called at [<phutil>/scripts/daemon/exec/exec_daemon.php:125]

Event Timeline

Pawka assigned this task to epriestley.
Pawka updated the task description. (Show Details)
Pawka added a project: Daemons.
Pawka added subscribers: Pawka, joshuaspence.

My previous Phabricator version was early 2015-May. I see, a bit later FeedStoryNotificationGarbageCollector was introduced (D13233).

Following query is running ~15-30s:

DELETE FROM `feed_storynotification` WHERE chronologicalKey < (1434093288 << 32)
	ORDER BY chronologicalKey ASC LIMIT 100;

I guess this might be my issue.

Yes, this looks consistent with the GC and publisher deadlocking.

The expected behavior is that:

  • the DELETE FROM ... should complete very quickly (a few milliseconds; there should be a key on the table);
  • immediately after the introduction of a GC, it may have to do a lot of work to catch up because there's a lot of old data to delete which wasn't previously GC'd;
  • once the GC catches up (deleting all notifications older than 90 days) the DELETE query should only run a few times a day, with about 4 hours between invocations.

If your table is somehow missing the chronologicalKey key, that might explain the slow queries, but bin/storage adjust should add this key if it is missing.

If you have the key, I'm not sure why the queries would be slow.

You can omit the LIMIT 100 on the query (or change it to LIMIT 10000 or LIMIT 100000) and run the query manually to try to do more of the backlog of work at once:

DELETE FROM `feed_storynotification` WHERE chronologicalKey < (1434093288 << 32)
	ORDER BY chronologicalKey ASC LIMIT 10000;

Since 95% of the rows in the table are probably scheduled for GC, you could also try moving just the good rows out, truncating, and then putting them back:

CREATE TABLE notify_temp SELECT * FROM feed_storynotification WHERE chronologicalKey >= (1434093288 << 32);
/* Make sure a "reasonable" number of rows got selected, i.e. likely many thousands, but only 10-20ish percent of the table. */
TRUNCATE TABLE feed_storynotification;
INSERT INTO feed_storynotification SELECT * FROM notify_temp;
DROP TABLE notify_temp;

This is riskier, but might be faster. You could also just truncate the table completely. This will clear all users notifications (both read and unread) but doesn't do anything else bad.

Yes, I've solved this way. chronologicalKey already existed on the table but we had over 13 million records there. And performing DELETE query took quite long. Apparently it locked inserts to feed_storynotification table what caused deadlocks.

Truncating feed_storynotification solved the issue. No more deadlocks and Phabricator works faster. I'd suggest at least remove ORDER BY chronologicalKey ASC from PhabricatorApplicationTransactionPublishWorker at least - this makes query ~twice faster (I've tested that). You do not need older than 90 days records anyway - so it is not important in which order you will delete them.

My workflow (if somebody will get into the same trouble):

CREATE TABLE `feed_storynotification_new` LIKE `feed_storynotification`;
RENAME TABLE `feed_storynotification` TO `feed_storynotification_old`, `feed_storynotification_new` TO `feed_storynotification`;
/* Query below will import all notifications from recent 30 days */
INSERT INTO `feed_storynotification` SELECT * FROM `feed_storynotification_old` WHERE `chronologicalKey` > (UNIX_TIMESTAMP() - 60*60*24*30 << 32);
DROP TABLE `feed_storynotification_old`;

Hmm, that's surprising. At least on our dataset, I get a worse-looking query plan without the ORDER BY:

mysql> EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey < 6175228766710272490 LIMIT 100;
+----+-------------+------------------------+-------+-------------------+------------+---------+------+-------+--------------------------+
| id | select_type | table                  | type  | possible_keys     | key        | key_len | ref  | rows  | Extra                    |
+----+-------------+------------------------+-------+-------------------+------------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | feed_storynotification | index | key_chronological | userPHID_2 | 133     | NULL | 57978 | Using where; Using index |
+----+-------------+------------------------+-------+-------------------+------------+---------+------+-------+--------------------------+
1 row in set (0.00 sec)

With the ORDER BY, the query plan improves:

mysql> EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey < 6175228766710272490 ORDER BY chronologicalKey ASC LIMIT 100;
+----+-------------+------------------------+-------+-------------------+-------------------+---------+------+------+-------------+
| id | select_type | table                  | type  | possible_keys     | key               | key_len | ref  | rows | Extra       |
+----+-------------+------------------------+-------+-------------------+-------------------+---------+------+------+-------------+
|  1 | SIMPLE      | feed_storynotification | index | key_chronological | key_chronological | 8       | NULL |  200 | Using where |
+----+-------------+------------------------+-------+-------------------+-------------------+---------+------+------+-------------+
1 row in set (0.00 sec)

This makes sense to me conceptually, in that the ORDER can be satisfied entirely with the key.

I'm getting different results:

mysql> EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey < 6175228766710272490 LIMIT 100;
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+--------------------------+
| id | select_type | table                  | type  | possible_keys | key        | key_len | ref  | rows    | Extra                    |
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+--------------------------+
|  1 | SIMPLE      | feed_storynotification | index | NULL          | userPHID_2 | 133     | NULL | 2941801 | Using where; Using index |
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+--------------------------+
1 row in set (0.01 sec)

mysql> EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey < 6175228766710272490 ORDER BY chronologicalKey ASC LIMIT 100;
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
| id | select_type | table                  | type  | possible_keys | key        | key_len | ref  | rows    | Extra                                    |
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
|  1 | SIMPLE      | feed_storynotification | index | NULL          | userPHID_2 | 133     | NULL | 2941801 | Using where; Using index; Using filesort |
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
1 row in set (0.00 sec)

Out of curiosity, what's the query plan for this (DESC instead of ASC)?

EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey < 6175228766710272490 ORDER BY chronologicalKey DESC LIMIT 100;

Same

mysql> EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey < 6175228766710272490 ORDER BY chronologicalKey DESC LIMIT 100;
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
| id | select_type | table                  | type  | possible_keys | key        | key_len | ref  | rows    | Extra                                    |
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
|  1 | SIMPLE      | feed_storynotification | index | NULL          | userPHID_2 | 133     | NULL | 2941950 | Using where; Using index; Using filesort |
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
1 row in set (0.00 sec)

Yuck. Can you come up with any query where it uses the key? What if we do something really dumb like this to try to trick it?

EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey BETWEEN 0 AND 6175228766710272490 ORDER BY chronologicalKey ASC LIMIT 100;

Getting the same as above still.

mysql> EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey BETWEEN 0 AND 6175228766710272490 ORDER BY chronologicalKey ASC LIMIT 100;
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
| id | select_type | table                  | type  | possible_keys | key        | key_len | ref  | rows    | Extra                                    |
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
|  1 | SIMPLE      | feed_storynotification | index | NULL          | userPHID_2 | 133     | NULL | 2942224 | Using where; Using index; Using filesort |
+----+-------------+------------------------+-------+---------------+------------+---------+------+---------+------------------------------------------+
1 row in set (0.00 sec)

I'm tentatively inclined to think that this is a bug in Percona, since:

  • vanilla MySQL uses the key; and
  • the query really looks like it should definitely be able to use the key.

It's possible that this isn't really a Percona issue, and that the MySQL optimizer would make the same choice on your dataset, but I haven't previously encountered any similar behavior.

This table is unusual, in that it does not have an AUTO_INCREMENT PRIMARY KEY. We have encountered some unusual behaviors in InnoDB on tables which are missing a primary key or have an unusual primary key, and generally provide one on new tables (this also improves consistency and simplicity).

If you create an alternate table with a standard primary key, does that help the optimizer? (I think this should only take a minute or two now that the table is GC'd, but feel free to not bother if you don't want to keep fishing.)

CREATE TABLE `feed_storynotification_prikey` LIKE `feed_storynotification`;
ALTER TABLE `feed_storynotification_prikey` ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
INSERT INTO feed_storynotification_prikey SELECT * FROM feed_storynotification;
EXPLAIN SELECT * FROM feed_storynotification_prikey WHERE chronologicalKey < 6175228766710272490 ORDER BY chronologicalKey ASC LIMIT 100;

The fact that it's not using the key at all explains why the GC was slow for you and why removing the ORDER BY improved performance, at least, but it seems like it should be able to use the key to execute this query efficiently, and appears to do so in MySQL (we don't have as much data as you do, but I don't recall the GC having any issues when we pushed this change here, and other large installs haven't reported issues but I'd expect them to if the GC was not able to use the key).

I doubt it is Percona issue. I've just tried perform same queries on my laptop with MySQL 5.6 and small data set. Still getting the same results as above.

Results on table with prikey:

mysql> EXPLAIN SELECT * FROM feed_storynotification_prikey WHERE chronologicalKey < 6175228766710272490 ORDER BY chronologicalKey ASC LIMIT 100;
+----+-------------+-------------------------------+------+---------------+------+---------+------+---------+-----------------------------+
| id | select_type | table                         | type | possible_keys | key  | key_len | ref  | rows    | Extra                       |
+----+-------------+-------------------------------+------+---------------+------+---------+------+---------+-----------------------------+
|  1 | SIMPLE      | feed_storynotification_prikey | ALL  | NULL          | NULL | NULL    | NULL | 2584335 | Using where; Using filesort |
+----+-------------+-------------------------------+------+---------------+------+---------+------+---------+-----------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT * FROM feed_storynotification_prikey WHERE chronologicalKey < 6175228766710272490  LIMIT 100;
+----+-------------+-------------------------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table                         | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------------------------------+------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | feed_storynotification_prikey | ALL  | NULL          | NULL | NULL    | NULL | 2584335 | Using where |
+----+-------------+-------------------------------+------+---------------+------+---------+------+---------+-------------+
1 row in set (0.00 sec)

I get some odd results locally, in MySQL too, although I have a very small table on my laptop (~1100 rows):

mysql> EXPLAIN SELECT * FROM feed_storynotification WHERE chronologicalKey < 6175228766710272490 ORDER BY chronologicalKey ASC LIMIT 100;
+----+-------------+------------------------+-------+-------------------+------------+---------+------+------+------------------------------------------+
| id | select_type | table                  | type  | possible_keys     | key        | key_len | ref  | rows | Extra                                    |
+----+-------------+------------------------+-------+-------------------+------------+---------+------+------+------------------------------------------+
|  1 | SIMPLE      | feed_storynotification | index | key_chronological | userPHID_2 | 133     | NULL | 1078 | Using where; Using index; Using filesort |
+----+-------------+------------------------+-------+-------------------+------------+---------+------+------+------------------------------------------+
1 row in set (0.00 sec)

I don't really understand why this query is ever not using key_chronological, so maybe we're just at the limits of my ability to use MySQL correctly here.

What if we do this?

EXPLAIN SELECT chronologicalKey, id FROM feed_storynotification_prikey WHERE chronologicalKey < 6175228766710272490 LIMIT 100;

That surely uses the key, right?

This was false alarm. key_chronological index was not added on feed_storynotification table on production db (but was added on local). I'm sorry for wasting your time :-)

What is interesting, that index existed on localhost with smaller data set and still query plans on both tables were the same.