[MDEV-24275] InnoDB persistent stats analyze forces full scan forcing lock crash Created: 2020-11-24  Updated: 2021-10-28  Resolved: 2020-11-25

Status: Closed
Project: MariaDB Server
Component/s: Admin statements, Storage Engine - InnoDB
Affects Version/s: 10.2.35, 10.2.36, 10.3.26, 10.3.27, 10.4.16, 10.4.17, 10.5.7, 10.5.8
Fix Version/s: 10.2.37, 10.3.28, 10.4.18, 10.5.9

Type: Bug Priority: Major
Reporter: Jukka Santala Assignee: Eugene Kosov (Inactive)
Resolution: Fixed Votes: 0
Labels: crash, optimizer, performance, regression, replication
Environment:

Tested on CentOS 7, CentOS 8, MariaDB 10.4.16, 10.4.17, 10.5.8


Issue Links:
Duplicate
duplicates MDEV-24504 [FATAL] InnoDB: Semaphore wait has la... Closed
is duplicated by MDEV-24266 Possible optimizer regression on 10.4... Closed
is duplicated by MDEV-24438 Primary KEY not used in range lookups Closed
is duplicated by MDEV-25955 InnoDB: Semaphore wait has lasted > 6... Closed
PartOf
Problem/Incident
is caused by MDEV-23991 dict_table_stats_lock() has unnecessa... Closed
Relates
relates to MDEV-24606 InnoDB: Semaphore wait has lasted > 6... Closed
relates to MDEV-24869 The replication suddenly stops for N ... Closed
relates to MDEV-25111 Long semaphore wait (> 800 secs), ser... Closed

 Description   

MDEV-23991 reduced the lock scope of ANALYZE TABLE and background statistics analysis. In doing so, the result of btr_get_size(index, BTR_N_LEAF_PAGES, &mtr) was stored temporarily in result.n_leaf_pages instead of index->stat_n_leaf_pages, to avoid needing a lock.

However, the comparison that follows still uses index->stat_n_leaf_pages to determine whether a full table scan is necessary. At that point this variable is neither protected by a lock nor calculated correctly: it reads as 1 no matter how many leaf pages the index has.

This causes an unnecessary full scan of the table, locking the index for write access. At least when a replication thread attempts to write into a larger table, the resulting 600-second semaphore wait triggers a server crash to produce a core dump. Because the table analysis does not complete, automated table analysis is re-triggered after crash recovery, causing an endless crash loop.

The fix appears to be to use result.n_leaf_pages instead of index->stat_n_leaf_pages in the comparison that decides whether sampling the whole table has been requested, since it is local to the running thread and holds the value that was used before the patch.

if (root_level == 0
    || N_SAMPLE_PAGES(index) * n_uniq > result.n_leaf_pages) {
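To make the difference concrete, here is a minimal sketch of the two comparisons. It is not the InnoDB source: IndexSketch, ResultSketch and both function names are hypothetical stand-ins, and the sample-page count is modelled as a plain field rather than the N_SAMPLE_PAGES macro. It assumes, as described above, that the shared statistic reads as 1 while the thread-local result holds the real leaf page count.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-ins for the InnoDB structures; only the
// fields needed to illustrate the comparison are modelled.
struct IndexSketch {
    std::uint64_t stat_n_leaf_pages;  // shared statistic; reads as 1 during analysis per the report
    std::uint64_t n_sample_pages;     // stand-in for N_SAMPLE_PAGES(index)
};

struct ResultSketch {
    std::uint64_t n_leaf_pages;       // thread-local value obtained from btr_get_size()
};

// Comparison as described in the report: still reads the shared field.
bool full_scan_before_fix(unsigned root_level, const IndexSketch& index,
                          std::uint64_t n_uniq) {
    return root_level == 0
        || index.n_sample_pages * n_uniq > index.stat_n_leaf_pages;
}

// Comparison as proposed in the report: reads the thread-local result.
bool full_scan_after_fix(unsigned root_level, const IndexSketch& index,
                         const ResultSketch& result, std::uint64_t n_uniq) {
    return root_level == 0
        || index.n_sample_pages * n_uniq > result.n_leaf_pages;
}
```

With illustrative values (20 sample pages, 2 unique columns, 1,000,000 actual leaf pages, root level 3), the first function compares 40 against 1 and requests a full scan, while the second compares 40 against 1,000,000 and falls back to sampling. A single-level tree (root_level == 0) is fully scanned in both cases.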


 Comments   
Comment by Eugene Kosov (Inactive) [ 2020-11-25 ]

Thanks a lot!

Comment by Faustin Lammler [ 2021-03-30 ]

You can find some more information on that bug at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=985821
