Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 10.4.17
Environment: debian
Description
We have two hosts on the same replication topology, with the same hardware:
pc2007: runs 10.4.15
pc2010: runs 10.4.17
These two hosts do not receive read traffic; all their traffic comes through replication and is mostly REPLACE statements.
However, they do receive DELETEs from time to time.
After upgrading to 10.4.17 we noticed that pc2010 (and another host) started to lag frequently, roughly matching the times when the DELETEs arrived.
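For context, the purging DELETEs look roughly like the statement below. This is only a sketch inferred from the "(NULL) < (exptime) < (2020-11-21 18:06:30)" range visible in the optimizer traces further down; the exact replicated statement may differ.
-- Assumed shape of the replicated purge DELETE, inferred from the exptime
-- range in the optimizer traces below; not a verbatim copy of the real statement.
DELETE FROM pc000 WHERE exptime < '2020-11-21 18:06:30';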
While investigating, we saw that the host running 10.4.17 showed a very strange pattern for Handler_read_next compared to the host running 10.4.15:
10.4.17: https://phab.wmfusercontent.org/file/data/442ozs4b36ctxwzqdse5/PHID-FILE-hp6qf2cv6je4zoks3fvn/Captura_de_pantalla_2020-11-23_a_las_12.25.47.png
10.4.15: https://phab.wmfusercontent.org/file/data/byag7m3a4z3v5r4pe7l3/PHID-FILE-gt6pten2ty6no3czi23j/Captura_de_pantalla_2020-11-23_a_las_12.26.06.png
Both hosts run with the same global variables, and a diff does not show any meaningful difference:
https://phabricator.wikimedia.org/P13361
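For reference, that comparison can be reproduced by dumping the global variables on each host and diffing the two dumps; a minimal sketch (file names are illustrative):
-- Run on each host and diff the resulting files, e.g. with the mysql client:
--   mysql -BN -e "SHOW GLOBAL VARIABLES" > pc2007.vars   (10.4.15 host)
--   mysql -BN -e "SHOW GLOBAL VARIABLES" > pc2010.vars   (10.4.17 host)
--   diff pc2007.vars pc2010.vars
SHOW GLOBAL VARIABLES;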
The only addition is innodb_max_purge_lag_wait, which was introduced in 10.4.16, but I asked Marko at https://jira.mariadb.org/browse/MDEV-16952 and he kindly pointed out that it should make no difference.
However, digging into the optimizer's behaviour for those DELETEs, we can see there is a big difference between 10.4.15 and 10.4.17.
These are the handler differences:
10.4.17: https://phabricator.wikimedia.org/P13362
10.4.15: https://phabricator.wikimedia.org/P13363
We can see the big difference there in the scans:
10.4.17:
| Handler_read_next | 15218 |
10.4.15:
| Handler_read_next | 0 |
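For anyone reproducing this, handler counters like these can be compared per statement with a before/after snapshot of the session status; a minimal sketch, assuming the DELETE shape inferred above:
-- Reset the session status counters, run the statement, then read them back.
FLUSH STATUS;
DELETE FROM pc000 WHERE exptime < '2020-11-21 18:06:30';
SHOW SESSION STATUS LIKE 'Handler_read%';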
Checking the optimizer trace, we can see there is a big difference in rows and cost:
Optimizer trace for 10.4.17: https://phabricator.wikimedia.org/P13365
Optimizer trace for 10.4.15: https://phabricator.wikimedia.org/P13364
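The traces come from MariaDB's optimizer trace facility (available since 10.4); a minimal sketch of collecting one, using the assumed DELETE from above (EXPLAIN DELETE can be traced instead if you do not want to modify data):
-- Enable the optimizer trace for the session, run the statement,
-- then read the trace from information_schema.
SET optimizer_trace='enabled=on';
DELETE FROM pc000 WHERE exptime < '2020-11-21 18:06:30';
SELECT * FROM information_schema.OPTIMIZER_TRACE\G
SET optimizer_trace='enabled=off';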
This difference is interesting:
10.4.17:
          "rowid_ordered": true,
          "using_mrr": false,
          "index_only": false,
          "rows": 100,
          "cost": 65515,
          "chosen": true
        },
        {
          "index": "exptime",
          "ranges": ["(NULL) < (exptime) < (2020-11-21 18:06:30)"],
          "rowid_ordered": false,
          "using_mrr": false,
          "index_only": false,
          "rows": 26088,
          "cost": 33264,
          "chosen": true
        }
      ],
      "analyzing_index_merge_union": []
    },
    "chosen_range_access_summary": {
      "range_access_plan": {
        "type": "range_scan",
        "index": "exptime",
        "rows": 26088,
        "ranges": ["(NULL) < (exptime) < (2020-11-21 18:06:30)"]
      },
      "rows_for_plan": 26088,
      "cost_for_plan": 33264,
      "chosen": true
Versus the much lighter 10.4.15 plan:
"rowid_ordered": true,
|
"using_mrr": false,
|
"index_only": false,
|
"rows": 100,
|
"cost": 120.13,
|
"chosen": true
|
},
|
{
|
"index": "exptime",
|
"ranges": ["(NULL) < (exptime) < (2020-11-21 18:06:30)"],
|
"rowid_ordered": false,
|
"using_mrr": false,
|
"index_only": false,
|
"rows": 1,
|
"cost": 2.5033,
|
"chosen": true
|
}
|
],
|
"analyzing_index_merge_union": []
|
},
|
"chosen_range_access_summary": {
|
"range_access_plan": {
|
"type": "range_scan",
|
"index": "exptime",
|
"rows": 1,
|
"ranges": ["(NULL) < (exptime) < (2020-11-21 18:06:30)"]
|
},
|
"rows_for_plan": 1,
|
"cost_for_plan": 2.5033,
|
"chosen": true
|
We are not sure whether this is the culprit of our lag, but it is definitely something that has regressed in 10.4.17.
The table schema:
CREATE TABLE `pc000` (
  `keyname` varbinary(255) NOT NULL DEFAULT '',
  `value` mediumblob DEFAULT NULL,
  `exptime` datetime DEFAULT NULL,
  UNIQUE KEY `keyname` (`keyname`),
  KEY `exptime` (`exptime`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
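Given that schema, a lighter way to compare the row estimates on both hosts, without collecting a full trace, is to EXPLAIN the (assumed) DELETE and compare the rows column for the exptime range scan:
-- Assumed statement shape; for this range the 10.4.15 trace above estimates
-- 1 row, while the 10.4.17 trace estimates 26088 rows.
EXPLAIN DELETE FROM pc000 WHERE exptime < '2020-11-21 18:06:30';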
This is being tracked publicly on our tracking system: https://phabricator.wikimedia.org/T268457
Issue Links
duplicates:
- MDEV-24275 InnoDB persistent stats analyze forces full scan forcing lock crash (Closed)
- MDEV-24438 Primary KEY not used in range lookups (Closed)