Find ways to detect/fix/warn about MySQL indexes with poor cardinality estimates
Closed, ResolvedPublic

Description

An install with a large overall datasize saw MySQL estimate the cardinality of commitID in repository_pathchange as 18, when the real cardinality was probably more like 18 million.

It would be desirable to prevent (or at least detect) these errors, but the best pathway forward isn't clear.


Original Report

We're seeing this query:

# 2016-03-04T19:17:45 KILL 19676457 (Query 42 sec) SELECT commitID, pathID FROM `repository_pathchange`
        WHERE commitID IN (1474828, 1474829, 1474830, 1474831, 1474832, 1474833, 1474834, 1474836, 1474837, 1474838, 1474839, 1474840, 1474841, 1474842, 1474843, 1474844, 1474845, 1474847, 1474848, 1474849, 1474850, 1474851, 1474852, 1474853, 1474854, 1474855, 1474856, 1474857, 1474858, 1474859, 1474860, 1474862, 1474863, 1474864, 1474865, 1474866, 1474867, 1474868, 1474870, 1474871, 1474872, 1474873, 1474874, 1474875, 1474876, 1474877, 1474878, 1474879, 1474880, 1474881, 1474882, 1474883, 1474884, 1474885, 1474886, 1474887, 1474889, 1474891, 1474892, 1474893, 1474894, 1474895, 1474896, 1474897, 1474898, 1474899, 1474900, 1474901, 1474902, 1474903, 1474904, 1474905, 1474906, 1474907, 1474908, 1474909, 1474910, 1474911, 1474912, 1474914, 1474915, 1474916, 1474917, 1474918, 1474919, 1474920, 1474921, 1474922, 1474923, 1474924, 1474925, 1474926, 1474928, 1474929, 1474930, 1474931, 1474932, 1474933, 1474934, 1474935, 1474936, 1474937, 1474938, 1474939, 1474940, 1474941, 1474942, 1474943, 1474944, 1474945, 1474946, 1474947, 1474948, 1474949, 1474950, 1474951, 1474952, 1474953, 1474955, 1474956, 1474957, 1474958, 1474959, 1474960, 1474961, 1474962, 1474963, 1474964, 1474965, 1474967, 1474968, 1474969, 1474970, 1474971, 1474972, 1474973, 1474974, 1474975, 1474976, 1474977, 1474978, 1474979, 1474980, 1474981, 1474982, 1474983, 1474984, 1474985, 1474986, 1474987, 1474988, 1474989, 1474990, 1474992, 1474993, 1474994, 1474995, 1474996, 1474997, 1474998, 1474999, 1475000, 1475001, 1475002, 1475004, 1475005, 1475006, 1475007, 1475008, 1475009, 1475010, 1475011, 1475013, 1475014, 1475015, 1475016, 1475017, 1475018, 1475019, 1475020, 1475021, 1475022, 1475023, 1475024, 1475025, 1475026, 1475027, 1475028, 1475029, 1475030, 1475031, 1475032, 1475034, 1475035, 1475036, 1475037, 1475038, 1475039, 1475041, 1475042, 1475043, 1475044, 1475045, 1475046, 1475047, 1475048, 1475049, 1475050, 1475052, 1475053, 1475054, 1475055, 1475056, 1475057, 1475058, 1475059, 1475060, 1475061, 1475062, 1475063, 1475064, 1475065, 1475066, 1475067, 1475068, 1475069, 1475070, 1475072, 1475073, 1475074, 1475075, 1475076, 1475077, 1475078, 1475079, 1475080, 1475081, 1475082, 1475083, 1475084, 1475085, 1475086, 1475087, 1475088, 1475089, 1475091, 1475092, 1475093, 1475094, 1475095, 1475096, 1475097, 1475098, 1475099, 1475100, 1475101, 1475102, 1475104, 1475105, 1475106, 1475107, 1475108, 1475109, 1475110, 1475111, 1475112, 1475113, 1475114, 1475115, 1475116, 1475118, 1475119, 1475120, 1475121, 1475122, 1475123, 1475124, 1475125, 1475126, 1475127, 1475128, 1475129, 1475130, 1475131, 1475132, 1475134, 1475135, 1475136, 1475137, 1475138, 1475139, 1475140, 1475141, 1475142, 1475143, 1475144, 1475145, 1475146, 1475147, 1475148, 1475149, 1475150, 1475151, 1475152, 1475153, 1475154, 1475155, 1475156, 1475158, 1475159, 1475160, 1475161, 1475162, 1475163, 1475164, 1475165, 1475166, 1475167, 1475168, 1475169, 1475170, 1475171, 1475172, 1475173, 1475175, 1475180, 1475181, 1475182, 1475184, 1475185, 1475186, 1475187, 1475188, 1475189, 1475190, 1475191, 1475192, 1475193, 1475194, 1475195, 1475196, 1475197, 1475198, 1475199, 1475200, 1475201, 1475202, 1475204, 1475205, 1475206, 1475207, 1475208, 1475209, 1475210, 1475213, 1475214, 1475215, 1475216, 1475217, 1475218, 1475219, 1475220, 1475221, 1475222, 1475223, 1475224, 1475225, 1475226, 1475227, 1475229, 1475230, 1475231, 1475232, 1475233, 1475234, 1475235, 1475236, 1475237, 1475239, 1475240, 1475241, 1475242, 1475243, 1475244, 1475246, 1475248, 1475249, 1475250, 1475251, 1475252, 1475253, 1475254, 1475255, 1475256, 1475257, 1475258, 1475260, 1475261, 1475262, 1475263, 1475265, 1475266, 1475267, 1475268, 1475269, 1475270, 1475271, 1475272, 1475273)
        AND (isDirect = 1 OR changeType = 10)

take 2+ minutes.

If we add an index:

CREATE TABLE `test_rp` (
  `repositoryID` int(10) unsigned NOT NULL,
  `pathID` int(10) unsigned NOT NULL,
  `commitID` int(10) unsigned NOT NULL,
  `targetPathID` int(10) unsigned DEFAULT NULL,
  `targetCommitID` int(10) unsigned DEFAULT NULL,
  `changeType` int(10) unsigned NOT NULL,
  `fileType` int(10) unsigned NOT NULL,
  `isDirect` tinyint(1) NOT NULL,
  `commitSequence` int(10) unsigned NOT NULL,
  PRIMARY KEY (`commitID`,`pathID`),
  KEY `repositoryID` (`repositoryID`,`pathID`,`commitSequence`),
  KEY `commitID` (`commitID`,`isDirect`,`changeType`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
5139 rows in set (2 min 2.24 sec)

vs

5139 rows in set (0.02 sec)

then it takes a reasonable amount of time

eadler created this task.Mar 10 2016, 10:13 PM
eadler added a project: Restricted Project.
eadler moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.

to clarify

KEY `commitID` (`commitID`,`isDirect`,`changeType`)

is the new index.

Ah, nice!

That key seems like it's probably not great for the OR changeType part of the query. If you still have the table handy and playing with it doesn't take an unreasonably huge amount of time, is this better?

  • Add KEY (commitID, isDirect).
  • Add KEY (commitID, changeType).

Then run this variant of the query:

SELECT commitID, pathID FROM `repository_pathchange` WHERE commitID IN (...) AND isDirect = 1
UNION ALL
SELECT commitID, pathID FROM `repository_pathchange` WHERE commitID IN (...) AND changeType = 10;

I think that should go fast, and maybe have a better query plan than the triple (commitID, isDirect, changeType) key?

That variant of the query is slow as well.

Hmm, weird. Can you show me the EXPLAIN SELECT ... for that?

eadler added a comment.EditedMar 11 2016, 1:08 AM
id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	PRIMARY	repository_pathchange	index	PRIMARY,cidid,cidct	cidid	5	NULL	303320447	Using where; Using index
2	UNION	repository_pathchange	index	PRIMARY,cidid,cidct	cidct	8	NULL	303320447	Using where; Using index
NULL	UNION RESULT	<union1,2>	ALL	NULL	NULL	NULL	NULL	NULL	Using temporary

is the explanation of the query

id	select_type	table	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	PRIMARY	repository_pathchange	index	PRIMARY,cidid,cidct	cidid	5	NULL	303320519	100.00	Using where; Using index
2	UNION	repository_pathchange	index	PRIMARY,cidid,cidct	cidct	8	NULL	303320519	100.00	Using where; Using index
NULL	UNION RESULT	<union1,2>	ALL	NULL	NULL	NULL	NULL	NULL	NULL	Using temporary
KEY `cidid` (`commitID`,`isDirect`),
KEY `cidct` (`commitID`,`changeType`)

are the keys created

Hrrm, let me see if I can reproduce anything like that, that's totally bizarre to me.

to clarify: I removed the key I originally added

KEY `commitID` (`commitID`,`isDirect`,`changeType`)

was I not supposed to do that?

That's fine -- it looks like it's picking up the keys, I'm just not sure why it's then examining 303 million rows when it has the keys available.

It also seems like the table primary key should get us almost all the way there regardless of these other keys.

We only have ~300K rows locally (so the table is 1000x smaller) but I get reasonable query plans even with the unmodified query, using the primary key:

mysql> explain SELECT commitID, pathID FROM repository_pathchange WHERE commitID IN (18102, 18101, 18100, 18099, 18098, 18097, 18096, 18095, 18094, 18093, 18092, 18091, 18090, 18089, 18088, 18087, 18086, 18085, 18084, 18083, 18082, 18081, 18080, 18079, 18078, 18077, 18076, 18075, 18074, 18073, 18072, 18071, 18070, 18069, 18068, 18067, 18066, 18065, 18064, 18063, 18062, 18061, 18060, 18059, 18058, 18057, 18056, 18055, 18054, 18053, 18052, 18051, 18050, 18049, 18048, 18047, 18046, 18045, 18044, 18043, 18042, 18041, 18040, 18039, 18038, 18037, 18036, 18035, 18034, 18033, 18032, 18031, 18030, 18029, 18028, 18027, 18026, 18025, 18024, 18023, 18022, 18021, 18020, 18019, 18018, 18017, 18016, 18015, 18014, 18013, 18012, 18011, 18010, 18009, 18008, 18007, 18006, 18005, 18004, 18003) AND (isDirect = 1 OR changeType = 10);
+----+-------------+-----------------------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table                 | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-----------------------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | repository_pathchange | range | PRIMARY       | PRIMARY | 4       | NULL | 1784 | Using where |
+----+-------------+-----------------------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)

Actual number of result rows is 1,761, so the examined row value is reasonable.

If you remove the (isDirect = 1 OR changeType = 10) clause, does the query return an unreasonably ridiculous number of rows?

I suppose the originally proposed key may be better than I think. I was imagining that it couldn't use the third part of the key (on changeType) but I suppose it actually can. I'm still confused by the UNION ALL behavior, but if the key's working on your dataset I think it's fine to add it -- it takes <1 second to add to our dataset, so even for large datasets it shouldn't be wildly unreasonable, even if we do find a better key later.

I'll leave this open for now since the behavior still feels a bit spooky to me, let me know how things go once you get the key in place? If they're smooth, we can chalk this up to be being bad at MySQL.

Hrm, I'm going to do some followup with our DBAs as well. The performance numbers above were copied from the DBA analysis; however, when I make the change in production, rather than in staging, some of the behavior does not directly replicate.

Hopefully we'll figure this out and be able to provide more detailed information.

Quick update here. We've found that adding FORCE INDEX (primary) results in 79.4ms in production even without the key_history index.

I'm still working with our DBAs to discover (a) the difference between our production and staging environments which cause the performance differences and (b) why the query optimizer isn't picking the primary key on its own.

Without the key_history index and without FORCE INDEX the result of EXPLAIN EXTENDED shows that primary is a candidate, but not selected.

Interesting (and weird). Thanks for the update -- and good luck figuring things out.

The discrepancy between staging + production is almost certainly caused by the size of the data.

You do not need to create a secondary index on changeId as that is the leftmost part of the primary key. The secondary index just points to the primary index where the data is stored.

The underlying cause of the slowness is because the stats were completely off in the production table (was showing cardnality of 18 for changeId).

https://bugs.mysql.com/bug.php?id=36513 is also related.

Should we get rid of the <commitID, isDirect, changeType> key, then?

We could possibly work around MySQL 36513 by doing something like this periodically (say, daily as part of GC):

  • Set innodb_stats_sample_pages to a larger value than 8.
  • Explicitly ANALYZE any tables which we know are prone to cardinality errors.

I'm not sure if that would help or not. If the alternative is something along the lines of a DBA doing this manually this might be an improvement, though.

Well, the docs suggest not doing that:

Because the statistics are automatically recalculated at various times other than on execution of ANALYZE TABLE, it does not make sense to increase the index sample size, run ANALYZE TABLE, then decrease sample size again. The more accurate statistics calculated by ANALYZE running with a high value of innodb_stats_sample_pages can be wiped away later.
https://dev.mysql.com/doc/refman/5.5/en/innodb-statistics-estimation.html

We could also consider suggesting users raise the default value of 8, but I'm not sure what the accuracy/performance tradeoff curve looks like.

Should we get rid of the <commitID, isDirect, changeType> key, then?

I think so. Sorry for the originally misleading claim.

We could possibly work around MySQL 36513 by doing something like this periodically (say, daily as part of GC):

  • Explicitly ANALYZE any tables which we know are prone to cardinality errors.

    I'm not sure if that would help or not. If the alternative is something along the lines of a DBA doing this manually this might be an improvement, though.

I think triggering ANALYZE as a part of T9554 is most reasonable here. In addition it may be helpful to periodically check table cardinality and run ANALYZE if it seems wildly off. One concern with running it daily is that depending on the version of MySQL it may cause a table lock to be taken.

I think so. Sorry for the originally misleading claim.Sorry for the originally misleading claim.

No problem, this stuff is easy to swap out and I hadn't seen the cardinality issue before so I'll have another idea of something to look for in the future.

We could also make bin/storage adjust just run ANALYZE on every table, since we know it's safe to take table locks during a schema change anyway. Let me see if that takes an unreasonably long amount of time for a moderately-sized install. I'm not sure if it'll really help, but it probably shouldn't hurt.

Here's a diff to make bin/storage adjust always run ANALYZE on every table after it makes other adjustments, but I'm not really sure if that's actually helpful or not (e.g., from the documentation, it sounds like the good analysis can be immediately overwritten by a bad one in some cases?):

https://secure.phabricator.com/differential/diff/37323/

It only adds 1-2 seconds to the cost of storage adjust on my local dataset so the cost is very small, but I'd like a better understanding of exactly when ANALYZE can fix a problem before bringing it upstream.

eadler added a comment.EditedMar 16 2016, 10:05 PM

We'll try that in our next upgrade. Thanks for the help!

[ we consider this issue resolved, but I think you had some plans for adding additional setup checks so I won't change the status ]

epriestley renamed this task from missing index on repository_pathchange table to Find ways to detect/fix/warn about MySQL indexes with poor cardinality estimates.Mar 16 2016, 10:46 PM
epriestley updated the task description. (Show Details)
epriestley edited projects, added Infrastructure; removed Diffusion, Bug Report.
epriestley updated the task description. (Show Details)Mar 16 2016, 10:49 PM
eadler removed a project: Restricted Project.Mar 28 2016, 8:18 PM
BYK added a subscriber: BYK.Sep 12 2016, 7:18 PM
epriestley closed this task as Resolved.Sep 11 2017, 12:04 PM
epriestley claimed this task.

bin/storage adjust and bin/storage upgrade now run ANALYZE, unless users opt-out with flags.

bin/storage analyze does this in a standalone way.

See T12819 for some additional context.