We have two hosts in the same replication topology, with the same hardware:
pc2007: runs 10.4.15
pc2010: runs 10.4.17
These two hosts do not receive read traffic. All traffic comes through replication and consists mostly of REPLACE statements.
However, they do receive DELETEs from time to time.
After upgrading to 10.4.17, we noticed that pc2010 (and another host) started to lag frequently, roughly at the times when the DELETEs arrived.
While investigating, we saw that the host running 10.4.17 showed a very strange handler_read_next pattern compared to the host running 10.4.15.
Both hosts run with the same global variables, and a diff shows no meaningful difference:
The only addition is +innodb_max_purge_lag_wait, which was introduced in 10.4.16, but I asked Marko at https://jira.mariadb.org/browse/MDEV-16952 and he kindly pointed out that it should make no difference.
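For anyone wanting to repeat the variable comparison on their own hosts, here is a minimal sketch. It assumes each host's variables were dumped as tab-separated `name<TAB>value` lines (for example with `mysql -BN -e "SHOW GLOBAL VARIABLES"`); the function names `parse_variables` and `diff_variables` are hypothetical helpers, not something we actually ran.

```python
from __future__ import annotations


def parse_variables(dump: str) -> dict[str, str]:
    """Parse tab-separated `name<TAB>value` lines into a dict."""
    variables = {}
    for line in dump.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, value = line.partition("\t")
        variables[name.strip().lower()] = value.strip()
    return variables


def diff_variables(old: dict[str, str], new: dict[str, str]) -> dict[str, dict]:
    """Return variables removed, added, or changed between two hosts."""
    return {
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "changed": {
            k: (old[k], new[k])
            for k in old.keys() & new.keys()
            if old[k] != new[k]
        },
    }
```

Variables such as the version string or hostname will always show up under "changed" and can be ignored; the interesting entries are anything else that differs or appears on only one side.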
However, digging into how the optimizer handles the DELETEs, we can see a big difference between 10.4.15 and 10.4.17.
These are the differences in the handler counters:
We can see the big difference there with the scans:
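A comparison of this kind can be reproduced by taking a snapshot of the status counters before and after a DELETE and looking at what moved. The following is only a sketch under that assumption: the snapshots are plain dicts of counter name to value (as returned by `SHOW GLOBAL STATUS LIKE 'Handler%'`), and `handler_delta` is a hypothetical helper name.

```python
def handler_delta(before, after):
    """Return the Handler_* counters that moved between two snapshots, largest first."""
    deltas = {}
    for name, end in after.items():
        if not name.lower().startswith("handler_"):
            continue  # ignore unrelated status counters
        moved = int(end) - int(before.get(name, 0))
        if moved:
            deltas[name] = moved
    # Sort so the counter that grew the most (e.g. a scan) comes first.
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))
```

On a busy replica the global counters also include other replicated writes, so the deltas are only meaningful when the snapshots are taken tightly around the statement of interest.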
Checking the optimizer, we can see a big difference in the estimated rows and cost:
Optimizer trace for 10.4.17: https://phabricator.wikimedia.org/P13365
Optimizer trace for 10.4.15: https://phabricator.wikimedia.org/P13364
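To compare two traces like these side by side, one option is to pull out every node that carries both a row estimate and a cost. This is a rough sketch, assuming the trace has been saved as JSON text and that the relevant nodes use the keys `"rows"` and `"cost"`; `rows_and_cost` is a hypothetical helper and makes no other assumption about the trace layout.

```python
import json


def rows_and_cost(trace_json):
    """Collect (path, rows, cost) for every object that has both keys."""
    found = []

    def walk(node, path):
        if isinstance(node, dict):
            if "rows" in node and "cost" in node:
                found.append((path, node["rows"], node["cost"]))
            for key, value in node.items():
                walk(value, f"{path}/{key}")
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, f"{path}[{i}]")

    walk(json.loads(trace_json), "")
    return found
```

Running it on the trace from each version and diffing the two result lists shows at which step the estimates start to diverge.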
This difference is interesting:
Versus the much lighter 10.4.15 plan:
We are not sure whether this is the cause of our lag, but it is definitely something that has regressed in 10.4.17.
The table schema:
This is being tracked publicly in our tracking system: https://phabricator.wikimedia.org/T268457