[MDEV-28167] MariaDB Core Dump / [ERROR] InnoDB: Corruption of an index tree

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.5.10
Fix Version/s: None
Component/s: Galera SST
Labels: None
Environment: Compiled from source on FreeBSD 12.3

Description

My mariadb instance(s) crashed with the following error:

2022-03-22 21:24:13 0 [ERROR] InnoDB: Corruption of an index tree: table `xxxx`.`yyy` index `UNIQ_1483A5E965AB1D88`, father ptr page no 3306, child page no 1420

PHYSICAL RECORD: n_fields 2; compact format; info bits 0

 0: len 28; hex 593346463266526354774d6250504a337766623837334e65675a5832; asc Y3FF2fRcTwMbPPJ3wfb873NegZX2;;

 1: len 30; hex 31346266653932392d393139622d346364662d396337312d323064333331; asc 14bfe929-919b-4cdf-9c71-20d331; (total 36 bytes);

2022-03-22 21:24:13 0 [Note] InnoDB: n_owned: 0; heap_no: 2; next rec: 200

PHYSICAL RECORD: n_fields 3; compact format; info bits 0

 0: len 28; hex 4f5044356775356f30634e55536f536d445657617236514635336f31; asc OPD5gu5o0cNUSoSmDVWar6QF53o1;;

 1: len 30; hex 31383661376531652d303631662d346538372d396430662d626430646437; asc 186a7e1e-061f-4e87-9d0f-bd0dd7; (total 36 bytes);

 2: len 4; hex 00000cea; asc     ;;

2022-03-22 21:24:13 0 [Note] InnoDB: n_owned: 0; heap_no: 99; next rec: 8184

2022-03-22 21:24:13 0 [ERROR] [FATAL] InnoDB: You should dump + drop + reimport the table to fix the corruption. If the crash happens at database startup. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery. Then dump + drop + reimport.

220322 21:24:13 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.5.10-MariaDB-log

key_buffer_size=33554432

read_buffer_size=8388608

max_used_connections=49

max_threads=2002

thread_count=16

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 32883467 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x128471b698

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x7fffded4de50 thread_stack 0x49000

0x13155bc <my_print_stacktrace+0x3c> at /usr/local/libexec/mariadbd

0xc77f2f <handle_fatal_signal+0x28f> at /usr/local/libexec/mariadbd

0x80190cb70 <_pthread_sigmask+0x530> at /lib/libthr.so.3

Trying to get some variables.

Some pointers may be invalid and cause the dump to abort.

Query (0x0): (null)

Connection ID (thread ID): 0

Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains

information that should help you find out what is causing the crash.

We think the query pointer is invalid, but we will try to print it anyway.

Query:
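
For readers trying to interpret the PHYSICAL RECORD lines in the log above, here is a minimal Python sketch that decodes two of the hex fields. It rests on two assumptions that the log itself does not state: that the 4-byte last field of the 3-field record is a child page number, and that InnoDB stores such integers big-endian. The helper names `decode_ascii` and `decode_page_no` are made up for illustration.

```python
# Hex values copied verbatim from the PHYSICAL RECORD lines in the log.
KEY_FIELD_HEX = "593346463266526354774d6250504a337766623837334e65675a5832"
LAST_FIELD_HEX = "00000cea"  # field 2 of the 3-field record


def decode_ascii(hex_str: str) -> str:
    """Decode a hex dump of an ASCII column value into text."""
    return bytes.fromhex(hex_str).decode("ascii")


def decode_page_no(hex_str: str) -> int:
    """Interpret a hex dump as an unsigned big-endian integer.

    Assumption: the field holds a page number stored big-endian.
    """
    return int.from_bytes(bytes.fromhex(hex_str), "big")


if __name__ == "__main__":
    print(decode_ascii(KEY_FIELD_HEX))     # Y3FF2fRcTwMbPPJ3wfb873NegZX2
    print(decode_page_no(LAST_FIELD_HEX))  # 3306
```

The decoded integer, 3306, is the same number reported as "father ptr page no 3306" in the error line, which suggests (but does not prove) that the second record is the node pointer InnoDB found while looking for the parent of page 1420.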

I tried opening the core file with gdb to get more insight, but it did not give me any real clues:

warning: exec file is newer than core file.

[New LWP 100784]

[New LWP 100847]

[New LWP 100921]

[New LWP 100922]

[New LWP 100923]

[New LWP 100932]

[New LWP 100936]

[New LWP 100938]

[New LWP 100968]

[New LWP 100970]

[New LWP 100982]

[New LWP 100983]

[New LWP 100997]

[New LWP 100999]

[New LWP 101044]

[New LWP 100929]

[New LWP 101219]

[New LWP 100822]

[New LWP 100947]

[New LWP 101490]

[New LWP 101658]

[New LWP 100858]

[New LWP 100722]

[New LWP 100992]

[New LWP 101097]

[New LWP 101137]

[New LWP 101164]

[New LWP 101220]

[New LWP 101084]

[New LWP 100606]

[New LWP 101227]

Core was generated by `/usr/local/libexec/mariadbd --defaults-extra-file=/usr/local/etc/mysql/my.cnf --'.

Program terminated with signal SIGABRT, Aborted.

Sent by kill() from pid 46919 and user 88.

#0  0x0000000801b7cbda in ?? ()

[Current thread is 1 (LWP 100784)]

(gdb) thread 1

[Switching to thread 1 (LWP 100784)]

#0  0x0000000801b7cbda in ?? ()

(gdb) bt

#0  0x0000000801b7cbda in ?? ()

#1  0x0000000000c7814b in ?? ()

#2  0x06fa6e0a0000000d in ?? ()

#3  0x000000095d9fb578 in ?? ()

#4  0x0000000100000000 in ?? ()

#5  0x000000180000000d in ?? ()

#6  0x0000001600000015 in ?? ()

#7  0x0000007a00000002 in ?? ()

#8  0x0000005000000002 in ?? ()

#9  0x00007fff00000000 in ?? ()

#10 0x0000000000000e10 in ?? ()

#11 0x0000000801e212a1 in ?? ()

#12 0x00000000623a306d in ?? ()

#13 0x0000000000000008 in ?? ()

#14 0x0065726f632e4e25 in ?? ()

#15 0x00000009a5a85b80 in ?? ()

#16 0x00007fffded4bb00 in ?? ()

#17 0x0000000801beb017 in ?? ()

#18 0x00007fffded4bb40 in ?? ()

#19 0x000000000000004d in ?? ()

#20 0x0000000000000000 in ?? ()

I guess SIGABRT is in libc?

Attachments

Activity

Michael Landin added a comment - 2022-03-30 14:58

Lol.. I am not able to reproduce the crash on the production system. I have no idea what caused it (which is what I am trying to figure out).
The build system I used for the new version of mariadbd had the same libc, the same source build path, etc. It was built from the FreeBSD ports system, so it should be "OK".

But regardless, I would say the focus should be on these lines: "[ERROR] InnoDB: Corruption of an index tree: table", right? Those caused InnoDB to invoke abort(). Or did I miss something?

Marko Mäkelä added a comment - 2022-03-30 15:45

The fatal message is output when the internal links of a table are found to be corrupted. CHECK TABLE without QUICK should exercise this code. You could start by executing CHECK TABLE on every InnoDB table. Once you have identified the corrupted table, the schema of that table would be helpful to know.

I do not have any idea what could cause this type of corruption in normal circumstances.

Abnormal circumstances could include the following:

Memory corruption due to faulty hardware, or a software bug. InnoDB only updates or validates page checksums when writing or reading from a data file; there is no checksum validation while the data remains cached in the buffer pool. So, it could happily compute a valid checksum right before writing out a page that was corrupted while it resided in the buffer pool.
Unsafe copying of the data directory while the server is running. Use mariadb-backup or a file system snapshot.
Deleting the ib_logfile0 to ‘fix’ crash recovery trouble.
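
As a rough illustration of the "CHECK TABLE on every InnoDB table" suggestion above, the following Python sketch only builds the statements; it does not connect to a server. The helper names `quote_ident` and `check_table_statements` are hypothetical, and the list of (schema, table) pairs is assumed to come from somewhere else, for example a query against `information_schema.TABLES` filtered on `ENGINE = 'InnoDB'`.

```python
from typing import Iterable, List, Tuple


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def check_table_statements(tables: Iterable[Tuple[str, str]]) -> List[str]:
    """Build one CHECK TABLE statement per (schema, table) pair.

    No QUICK option is added, following the advice in the comment above.
    """
    return [
        f"CHECK TABLE {quote_ident(schema)}.{quote_ident(table)};"
        for schema, table in tables
    ]


if __name__ == "__main__":
    # Placeholder names taken from the anonymised log line in the report.
    for stmt in check_table_statements([("xxxx", "yyy")]):
        print(stmt)
```

The generated statements can then be fed to the client of your choice. Since a corrupted index can abort the server, as it did in this report, it is probably wiser to run them one at a time and on a copy of the data where possible.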

Michael Landin added a comment - 2022-03-30 15:53

Thank you for the insights.

I restarted mariadb with the --innodb-force-recovery=2 flag. After that I ran OPTIMIZE on the database showing problems, and then everything was fine again.

I do not believe it was faulty hardware, as I run galera on top of my mariadb and the change caused all 3 cluster nodes to core dump with the same error.
No operations (like deleting or copying files) were done.

Marko Mäkelä added a comment - 2022-03-30 18:15

Thank you for clarifying that Galera snapshot transfers are involved. I think that there have been some problems with that, and also some recent fixes by sysprg.

FreeBSD does not support asynchronous I/O (or at least we do not implement any API for that), and therefore my remarks in MDEV-24845 about innodb_disallow_writes not blocking InnoDB page writes should not apply to FreeBSD.

Which Galera snapshot transfer method do you use?

Michael Landin added a comment - 2022-03-30 18:47

I use rsync

People

Assignee: Unassigned

Reporter: Michael Landin

Votes: 1

Watchers: 4

Dates

Created: 2022-03-24 11:39

Updated: 2022-05-01 23:14
