[MDEV-28122] OPTIMIZE TABLE crash - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6.0, 10.6.7, 10.11.4
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3
Component/s: Data Definition - Alter Table, Storage Engine - InnoDB
Labels:
- crash
- regression
Environment:
Windows Server 2022 Standard
Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz 2.99 GHz
128 GB RAM

Description

The InnoDB table file is 144 GB; the #sql-alter-2964-a.ibd file is 56 GB in size, which might very well be its final size; about 700 million rows had been deleted from the table. There are currently 174 GB free on the data device (with both files in place).

I received the following crash:

220318  9:01:01 [ERROR] mysqld got exception 0xc0000005 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.6.7-MariaDB

key_buffer_size=134217728

read_buffer_size=131072

max_used_connections=28

max_threads=65537

thread_count=12

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142742881 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x270ab8fc038

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

server.dll!row_log_table_apply_op()[row0log.cc:2475]

server.dll!row_log_table_apply_ops()[row0log.cc:3066]

server.dll!row_log_table_apply()[row0log.cc:3174]

server.dll!ha_innobase::inplace_alter_table()[handler0alter.cc:8485]

server.dll!mysql_inplace_alter_table()[sql_table.cc:7409]

server.dll!mysql_alter_table()[sql_table.cc:10266]

server.dll!mysql_recreate_table()[sql_table.cc:11319]

server.dll!admin_recreate_table()[sql_admin.cc:67]

server.dll!mysql_admin_table()[sql_admin.cc:1130]

server.dll!Sql_cmd_optimize_table::execute()[sql_admin.cc:1515]

server.dll!mysql_execute_command()[sql_parse.cc:6008]

server.dll!mysql_parse()[sql_parse.cc:8033]

server.dll!dispatch_command()[sql_parse.cc:1898]

server.dll!do_command()[sql_parse.cc:1404]

server.dll!threadpool_process_request()[threadpool_common.cc:403]

server.dll!tp_callback()[threadpool_common.cc:203]

KERNEL32.DLL!QueryPerformanceCounter()

ntdll.dll!TpReleaseWork()

ntdll.dll!RtlInitializeResource()

KERNEL32.DLL!BaseThreadInitThunk()

ntdll.dll!RtlUserThreadStart()

Trying to get some variables.

Some pointers may be invalid and cause the dump to abort.

Query (0x270a8fd3480): OPTIMIZE TABLE node_log

Connection ID (thread ID): 10

Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains

information that should help you find out what is causing the crash.

Writing a core file at C:\ProgramData\MariaDB 10.6\

Minidump written to C:\ProgramData\MariaDB 10.6\mysqld.dmp

Attachments

Issue Links

is caused by

MDEV-515 innodb bulk insert

Closed

is duplicated by

MDEV-26199 Assertion `n_null_bytes >= n_core_null_bytes || n_core < index->n_core_fields' in rec_init_offsets_comp_ordinary

Closed

relates to

MDEV-17644 DRAFT: Assertion `0' failed in row_log_table_apply_update

Open

MDEV-18361 row0log.cc:3057: dberr_t row_log_table_apply_ops(que_thr_t*, row_merge_dup_t*, ut_stage_alter_t*): Assertion `0' failed.

Closed

MDEV-19044 Alter table corrupts while applying the modification log

Closed

MDEV-26198 Assertion `0' failed in row_log_table_apply_op during redundant table rebuild

Closed

(1 relates to)

Activity

Ascending order - Click to sort in descending order

Alice Sherepa added a comment - 2022-03-18 15:29

probably the same bug as ~~MDEV-26198~~

Alice Sherepa added a comment - 2022-03-18 15:29 probably the same bug as MDEV-26198

Marko Mäkelä added a comment - 2022-03-22 10:00

The latest trace for ~~MDEV-26198~~ involves a table in ROW_FORMAT=REDUNDANT on which columns had been previously instantly added.
What does SHOW TABLE STATUS display for this table?

Marko Mäkelä added a comment - 2022-03-22 10:00 The latest trace for MDEV-26198 involves a table in ROW_FORMAT=REDUNDANT on which columns had been previously instantly added. What does SHOW TABLE STATUS display for this table?

Axel Dörfler added a comment - 2022-03-23 14:00 - edited

The server has garbled its boot disk to a point where it could not be repaired. It has now been reset from a backup. While the server uses a RAID-1 and ECC memory, and did not log anything suspicious, there is a good chance that this is a hardware error. And if that's true, the crash could well be caused by that, too.

I have now resurrected the server's predecessor, and a SHOW TABLE STATUS shows this:

Name

Engine

Version

Row_format

Rows

Avg_row_length

Data_length

Max_data_length

Index_length

Data_free

Auto_increment

Create_time

Update_time

Check_time

Collation

Checksum

Create_options

Comment

node_log

InnoDB

Compact

449973282

186

84136689664

60923265024

36799774720

1084647359

2017-07-31 13:24:01

utf8_bin

But this was before I deleted more than half the table (its oldest entries).

Axel Dörfler added a comment - 2022-03-23 14:00 - edited The server has garbled its boot disk to a point where it could not be repaired. It has now been reset from a backup. While the server uses a RAID-1 and ECC memory, and did not log anything suspicious, there is a good chance that this is a hardware error. And if that's true, the crash could well be caused by that, too. I have now resurrected the server's predecessor, and a SHOW TABLE STATUS shows this: Name Engine Version Row_format Rows Avg_row_length Data_length Max_data_length Index_length Data_free Auto_increment Create_time Update_time Check_time Collation Checksum Create_options Comment node_log InnoDB 10 Compact 449973282 186 84136689664 0 60923265024 36799774720 1084647359 2017-07-31 13:24:01 \N \N utf8_bin \N But this was before I deleted more than half the table (its oldest entries).

Axel Dörfler added a comment - 2022-03-23 14:03

I will try the same procedure on the old server, and see how it goes there...

Axel Dörfler added a comment - 2022-03-23 14:03 I will try the same procedure on the old server, and see how it goes there...

Marko Mäkelä added a comment - 2023-09-19 09:53

~~MDEV-18361~~ may be specific to ROW_FORMAT=REDUNDANT, and I was unable to reproduce it.

Both in 10.6.7 and 10.11.4, the crash occurs here in row_log_table_apply_op():

        switch (*mrec++) {

        default:

                ut_ad(0);

                *error = DB_CORRUPTION;

                return(NULL);

        case ROW_T_INSERT:

                extra_size = *mrec++;

I now realize that we do not check for mrec > mrec_end until some steps later. In another case we do have a proper check for end-of-buffer:

        case ROW_T_DELETE:

                /* 1 (extra_size) + at least 1 (payload) */

                if (mrec + 2 >= mrec_end) {

                        return(NULL);

Also in ROW_T_UPDATE we are potentially reading after the end of the buffer. In row_log_apply_op() (which is used during online CREATE INDEX) we have proper overflow checks in place.

Marko Mäkelä added a comment - 2023-09-19 09:53 MDEV-18361 may be specific to ROW_FORMAT=REDUNDANT , and I was unable to reproduce it. Both in 10.6.7 and 10.11.4, the crash occurs here in row_log_table_apply_op() : switch (*mrec++) { default : ut_ad(0); *error = DB_CORRUPTION; return (NULL); case ROW_T_INSERT: extra_size = *mrec++; I now realize that we do not check for mrec > mrec_end until some steps later. In another case we do have a proper check for end-of-buffer: case ROW_T_DELETE: /* 1 (extra_size) + at least 1 (payload) */ if (mrec + 2 >= mrec_end) { return (NULL); } Also in ROW_T_UPDATE we are potentially reading after the end of the buffer. In row_log_apply_op() (which is used during online CREATE INDEX ) we have proper overflow checks in place.

Marko Mäkelä added a comment - 2023-09-19 10:11

I failed to notice that there actually is a check right before the switch statement:

	/* 3 = 1 (op type) + 1 (extra_size) + at least 1 byte payload */

	if (mrec + 3 >= mrec_end) {

		return(NULL);

So, the check after case ROW_T_DELETE: is actually redundant, and the reason for this crash remains a mystery until we have a reproducible test case.

Marko Mäkelä added a comment - 2023-09-19 10:11 I failed to notice that there actually is a check right before the switch statement: /* 3 = 1 (op type) + 1 (extra_size) + at least 1 byte payload */ if (mrec + 3 >= mrec_end) { return (NULL); } So, the check after case ROW_T_DELETE: is actually redundant, and the reason for this crash remains a mystery until we have a reproducible test case.

Marko Mäkelä added a comment - 2023-10-13 11:04

The check that I quoted had been removed as part of ~~MDEV-515~~ in MariaDB Server 10.6.0, related to the introduction of a ROW_T_EMPTY record, which was later removed in ~~MDEV-15250~~.

I think that we just need to put back the check right before the switch(*mrec++) statement, like this:

diff --git a/storage/innobase/row/row0log.cc b/storage/innobase/row/row0log.cc

index 6842a093b5c..e1ce8161a02 100644

--- a/storage/innobase/row/row0log.cc

+++ b/storage/innobase/row/row0log.cc

@@ -2167,6 +2167,11 @@ row_log_table_apply_op(

 	*error = DB_SUCCESS;

+	/* 3 = 1 (op type) + 1 (extra_size) + at least 1 byte payload */

+	if (mrec + 3 >= mrec_end) {

+		return(NULL);

+	}

 	const bool is_instant = log->is_instant(dup->index);

 	const mrec_t* const mrec_start = mrec;

@@ -2214,11 +2219,6 @@ row_log_table_apply_op(

 		break;

 	case ROW_T_DELETE:

-		/* 1 (extra_size) + at least 1 (payload) */

-		if (mrec + 2 >= mrec_end) {

-			return(NULL);

-		}

 		extra_size = *mrec++;

 		ut_ad(mrec < mrec_end);

It would be great to have a test for this in our regression test suite.

Marko Mäkelä added a comment - 2023-10-13 11:04 The check that I quoted had been removed as part of MDEV-515 in MariaDB Server 10.6.0, related to the introduction of a ROW_T_EMPTY record, which was later removed in MDEV-15250 . I think that we just need to put back the check right before the switch(*mrec++) statement, like this: diff --git a/storage/innobase/row/row0log.cc b/storage/innobase/row/row0log.cc index 6842a093b5c..e1ce8161a02 100644 --- a/storage/innobase/row/row0log.cc +++ b/storage/innobase/row/row0log.cc @@ -2167,6 +2167,11 @@ row_log_table_apply_op( *error = DB_SUCCESS; + /* 3 = 1 (op type) + 1 (extra_size) + at least 1 byte payload */ + if (mrec + 3 >= mrec_end) { + return(NULL); + } + const bool is_instant = log->is_instant(dup->index); const mrec_t* const mrec_start = mrec; @@ -2214,11 +2219,6 @@ row_log_table_apply_op( break; case ROW_T_DELETE: - /* 1 (extra_size) + at least 1 (payload) */ - if (mrec + 2 >= mrec_end) { - return(NULL); - } - extra_size = *mrec++; ut_ad(mrec < mrec_end); It would be great to have a test for this in our regression test suite.

People

Assignee:: Thirunarayanan Balathandayuthapani

Reporter:: Axel Dörfler

Votes:: 1 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 2022-03-18 14:26

Updated:: 2024-07-07 20:41

Resolved:: 2023-10-16 15:45

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.