[MDEV-28122] OPTIMIZE TABLE crash Created: 2022-03-18  Updated: 2024-01-15  Resolved: 2023-10-16

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Storage Engine - InnoDB
Affects Version/s: 10.6.0, 10.6.7, 10.11.4
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3

Type: Bug Priority: Critical
Reporter: Axel Dörfler Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 1
Labels: crash, regression
Environment:

Windows Server 2022 Standard
Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz 2.99 GHz
128 GB RAM


Issue Links:
Duplicate
is duplicated by MDEV-26199 Assertion `n_null_bytes >= n_core_nul... Closed
Problem/Incident
is caused by MDEV-515 innodb bulk insert Closed
Relates
relates to MDEV-17644 DRAFT: Assertion `0' failed in row_lo... Open
relates to MDEV-18361 row0log.cc:3057: dberr_t row_log_tabl... Closed
relates to MDEV-19044 mysqld: storage/innobase/row/row0log.... Open
relates to MDEV-26198 Assertion `0' failed in row_log_table... Closed

 Description   

The InnoDB table file is 144 GB; the #sql-alter-2964-a.ibd file is 56 GB in size, which might very well be its final size; about 700 million rows had been deleted from the table. There are currently 174 GB free on the data device (with both files in place).

I received the following crash:

220318  9:01:01 [ERROR] mysqld got exception 0xc0000005 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.6.7-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=28
max_threads=65537
thread_count=12
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142742881 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x270ab8fc038
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
server.dll!row_log_table_apply_op()[row0log.cc:2475]
server.dll!row_log_table_apply_ops()[row0log.cc:3066]
server.dll!row_log_table_apply()[row0log.cc:3174]
server.dll!ha_innobase::inplace_alter_table()[handler0alter.cc:8485]
server.dll!mysql_inplace_alter_table()[sql_table.cc:7409]
server.dll!mysql_alter_table()[sql_table.cc:10266]
server.dll!mysql_recreate_table()[sql_table.cc:11319]
server.dll!admin_recreate_table()[sql_admin.cc:67]
server.dll!mysql_admin_table()[sql_admin.cc:1130]
server.dll!Sql_cmd_optimize_table::execute()[sql_admin.cc:1515]
server.dll!mysql_execute_command()[sql_parse.cc:6008]
server.dll!mysql_parse()[sql_parse.cc:8033]
server.dll!dispatch_command()[sql_parse.cc:1898]
server.dll!do_command()[sql_parse.cc:1404]
server.dll!threadpool_process_request()[threadpool_common.cc:403]
server.dll!tp_callback()[threadpool_common.cc:203]
KERNEL32.DLL!QueryPerformanceCounter()
ntdll.dll!TpReleaseWork()
ntdll.dll!RtlInitializeResource()
KERNEL32.DLL!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x270a8fd3480): OPTIMIZE TABLE node_log
Connection ID (thread ID): 10
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file at C:\ProgramData\MariaDB 10.6\
Minidump written to C:\ProgramData\MariaDB 10.6\mysqld.dmp



 Comments   
Comment by Alice Sherepa [ 2022-03-18 ]

probably the same bug as MDEV-26198

Comment by Marko Mäkelä [ 2022-03-22 ]

The latest trace for MDEV-26198 involves a table in ROW_FORMAT=REDUNDANT on which columns had been previously instantly added.
What does SHOW TABLE STATUS display for this table?

Comment by Axel Dörfler [ 2022-03-23 ]

The server has garbled its boot disk to a point where it could not be repaired. It has now been reset from a backup. While the server uses a RAID-1 and ECC memory, and did not log anything suspicious, there is a good chance that this is a hardware error. And if that's true, the crash could well be caused by that, too.

I have now resurrected the server's predecessor, and a SHOW TABLE STATUS shows this:

Name Engine Version Row_format Rows Avg_row_length Data_length Max_data_length Index_length Data_free Auto_increment Create_time Update_time Check_time Collation Checksum Create_options Comment
node_log InnoDB 10 Compact 449973282 186 84136689664 0 60923265024 36799774720 1084647359 2017-07-31 13:24:01 \N \N utf8_bin \N    

But this was before I deleted more than half the table (its oldest entries).

Comment by Axel Dörfler [ 2022-03-23 ]

I will try the same procedure on the old server, and see how it goes there...

Comment by Marko Mäkelä [ 2023-09-19 ]

MDEV-18361 may be specific to ROW_FORMAT=REDUNDANT, and I was unable to reproduce it.

Both in 10.6.7 and 10.11.4, the crash occurs here in row_log_table_apply_op():

        switch (*mrec++) {
        default:
                ut_ad(0);
                *error = DB_CORRUPTION;
                return(NULL);
        case ROW_T_INSERT:
                extra_size = *mrec++;

I now realize that we do not check for mrec > mrec_end until some steps later. In another case we do have a proper check for end-of-buffer:

        case ROW_T_DELETE:
                /* 1 (extra_size) + at least 1 (payload) */
                if (mrec + 2 >= mrec_end) {
                        return(NULL);
                }

Also in ROW_T_UPDATE we are potentially reading after the end of the buffer. In row_log_apply_op() (which is used during online CREATE INDEX) we have proper overflow checks in place.

Comment by Marko Mäkelä [ 2023-09-19 ]

I failed to notice that there actually is a check right before the switch statement:

	/* 3 = 1 (op type) + 1 (extra_size) + at least 1 byte payload */
	if (mrec + 3 >= mrec_end) {
		return(NULL);
	}

So, the check after case ROW_T_DELETE: is actually redundant, and the reason for this crash remains a mystery until we have a reproducible test case.

Comment by Marko Mäkelä [ 2023-10-13 ]

The check that I quoted had been removed as part of MDEV-515 in MariaDB Server 10.6.0, related to the introduction of a ROW_T_EMPTY record, which was later removed in MDEV-15250.

I think that we just need to put back the check right before the switch(*mrec++) statement, like this:

diff --git a/storage/innobase/row/row0log.cc b/storage/innobase/row/row0log.cc
index 6842a093b5c..e1ce8161a02 100644
--- a/storage/innobase/row/row0log.cc
+++ b/storage/innobase/row/row0log.cc
@@ -2167,6 +2167,11 @@ row_log_table_apply_op(
 
 	*error = DB_SUCCESS;
 
+	/* 3 = 1 (op type) + 1 (extra_size) + at least 1 byte payload */
+	if (mrec + 3 >= mrec_end) {
+		return(NULL);
+	}
+
 	const bool is_instant = log->is_instant(dup->index);
 	const mrec_t* const mrec_start = mrec;
 
@@ -2214,11 +2219,6 @@ row_log_table_apply_op(
 		break;
 
 	case ROW_T_DELETE:
-		/* 1 (extra_size) + at least 1 (payload) */
-		if (mrec + 2 >= mrec_end) {
-			return(NULL);
-		}
-
 		extra_size = *mrec++;
 		ut_ad(mrec < mrec_end);
 

It would be great to have a test for this in our regression test suite.

Generated at Thu Feb 08 09:58:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.