[MDEV-28167] MariaDB Core Dump / [ERROR] InnoDB: Corruption of an index tree Created: 2022-03-24  Updated: 2022-05-01

Status: Open
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.5.10
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Michael Landin Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: None
Environment:

Compiled from source on FreeBSD 12.3



 Description   

My mariadb instance(s) crashed with the following error:

2022-03-22 21:24:13 0 [ERROR] InnoDB: Corruption of an index tree: table `xxxx`.`yyy` index `UNIQ_1483A5E965AB1D88`, father ptr page no 3306, child page no 1420
PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 28; hex 593346463266526354774d6250504a337766623837334e65675a5832; asc Y3FF2fRcTwMbPPJ3wfb873NegZX2;;
 1: len 30; hex 31346266653932392d393139622d346364662d396337312d323064333331; asc 14bfe929-919b-4cdf-9c71-20d331; (total 36 bytes);
2022-03-22 21:24:13 0 [Note] InnoDB: n_owned: 0; heap_no: 2; next rec: 200
PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 28; hex 4f5044356775356f30634e55536f536d445657617236514635336f31; asc OPD5gu5o0cNUSoSmDVWar6QF53o1;;
 1: len 30; hex 31383661376531652d303631662d346538372d396430662d626430646437; asc 186a7e1e-061f-4e87-9d0f-bd0dd7; (total 36 bytes);
 2: len 4; hex 00000cea; asc     ;;
2022-03-22 21:24:13 0 [Note] InnoDB: n_owned: 0; heap_no: 99; next rec: 8184
2022-03-22 21:24:13 0 [ERROR] [FATAL] InnoDB: You should dump + drop + reimport the table to fix the corruption. If the crash happens at database startup. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery. Then dump + drop + reimport.
220322 21:24:13 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.10-MariaDB-log
key_buffer_size=33554432
read_buffer_size=8388608
max_used_connections=49
max_threads=2002
thread_count=16
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 32883467 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x128471b698
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fffded4de50 thread_stack 0x49000
0x13155bc <my_print_stacktrace+0x3c> at /usr/local/libexec/mariadbd
0xc77f2f <handle_fatal_signal+0x28f> at /usr/local/libexec/mariadbd
0x80190cb70 <_pthread_sigmask+0x530> at /lib/libthr.so.3
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 0
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
 
We think the query pointer is invalid, but we will try to print it anyway.
Query:
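
The hex dumps in the PHYSICAL RECORD lines of the log can be decoded by hand. A minimal Python sketch, using only values copied from the log: the last field of the second record (hex 00000cea) decodes to 3306, the same number the error reports as "father ptr page no". Reading that field as a page number is an assumption based on the matching value, and the helper names `decode_ascii` and `decode_page_no` are made up for this illustration.

```python
# Decode fields copied from the "PHYSICAL RECORD" lines in the error log.
# Treating the 4-byte field as a page number is an assumption; the log
# itself only prints the raw hex.

def decode_ascii(hex_str: str) -> str:
    """Return the ASCII text encoded by a hex string."""
    return bytes.fromhex(hex_str).decode("ascii")


def decode_page_no(hex_str: str) -> int:
    """Interpret a hex string as a big-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_str), byteorder="big")


# Field 0 of the second record in the log.
key_prefix = decode_ascii(
    "4f5044356775356f30634e55536f536d445657617236514635336f31"
)
# Field 2 of the second record in the log.
page_no = decode_page_no("00000cea")

print(key_prefix)  # OPD5gu5o0cNUSoSmDVWar6QF53o1
print(page_no)     # 3306
```

The decoded text matches the "asc" column printed next to the hex, which is a quick way to confirm the dump was copied without damage.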

I tried opening the core file with gdb to get more insight, but did not really get any clues:

warning: exec file is newer than core file.
[New LWP 100784]
[New LWP 100847]
[New LWP 100921]
[New LWP 100922]
[New LWP 100923]
[New LWP 100932]
[New LWP 100936]
[New LWP 100938]
[New LWP 100968]
[New LWP 100970]
[New LWP 100982]
[New LWP 100983]
[New LWP 100997]
[New LWP 100999]
[New LWP 101044]
[New LWP 100929]
[New LWP 101219]
[New LWP 100822]
[New LWP 100947]
[New LWP 101490]
[New LWP 101658]
[New LWP 100858]
[New LWP 100722]
[New LWP 100992]
[New LWP 101097]
[New LWP 101137]
[New LWP 101164]
[New LWP 101220]
[New LWP 101084]
[New LWP 100606]
[New LWP 101227]
Core was generated by `/usr/local/libexec/mariadbd --defaults-extra-file=/usr/local/etc/mysql/my.cnf --'.
Program terminated with signal SIGABRT, Aborted.
Sent by kill() from pid 46919 and user 88.
#0  0x0000000801b7cbda in ?? ()
[Current thread is 1 (LWP 100784)]
(gdb) thread 1
[Switching to thread 1 (LWP 100784)]
#0  0x0000000801b7cbda in ?? ()
(gdb) bt
#0  0x0000000801b7cbda in ?? ()
#1  0x0000000000c7814b in ?? ()
#2  0x06fa6e0a0000000d in ?? ()
#3  0x000000095d9fb578 in ?? ()
#4  0x0000000100000000 in ?? ()
#5  0x000000180000000d in ?? ()
#6  0x0000001600000015 in ?? ()
#7  0x0000007a00000002 in ?? ()
#8  0x0000005000000002 in ?? ()
#9  0x00007fff00000000 in ?? ()
#10 0x0000000000000e10 in ?? ()
#11 0x0000000801e212a1 in ?? ()
#12 0x00000000623a306d in ?? ()
#13 0x0000000000000008 in ?? ()
#14 0x0065726f632e4e25 in ?? ()
#15 0x00000009a5a85b80 in ?? ()
#16 0x00007fffded4bb00 in ?? ()
#17 0x0000000801beb017 in ?? ()
#18 0x00007fffded4bb40 in ?? ()
#19 0x000000000000004d in ?? ()
#20 0x0000000000000000 in ?? ()

I guess SIGABRT is in libc?



 Comments   
Comment by Sergei Golubchik [ 2022-03-30 ]

gdb complained that

warning: exec file is newer than core file.

Are you sure the executable matches the core file?

Comment by Marko Mäkelä [ 2022-03-30 ]

InnoDB invokes abort() when encountering a fatal error.

Two known sources of secondary index corruption are the InnoDB change buffer (see MDEV-27734) and indexed virtual columns (MDEV-5800, sometimes via MDEV-371). Note that disabling future use of the change buffer will not fix corrupted indexes; ALTER TABLE tablename FORCE should do that.

Comment by Michael Landin [ 2022-03-30 ]

@sergei - I created a new build box with the same version of mariadb (with debug symbols) so that I could debug the core file properly. That is the cause of this warning.

Comment by Marko Mäkelä [ 2022-03-30 ]

michbsd, are you sure that it was a reproducible build?

Comment by Michael Landin [ 2022-03-30 ]

Euh.. ?
I just wanted a version of mariadbd with debugging symbols, so I could inspect the core file from my production machine - and possibly get a clue as to why it cored.

But, I guess you answered my question on your earlier comment - InnoDB threw a FATAL error, that invoked abort() - and that is why we got a core.

Now, I need to try to understand what could have caused the FATAL error - no schema changes or other potentially breaking things were happening at the moment. We were just doing normal INSERT/UPDATE/SELECT statements when the corruption of the index occurred.

Comment by Marko Mäkelä [ 2022-03-30 ]

michbsd, if you build something from source code without using an equivalent environment and tools that were used for building an executable, you will likely not get an executable that has the same addresses. For example, source code file names (with full paths) may be embedded in the executable. If the source code directory name or the build directory name differs from the original build, you could already have lost. That is what reproducible builds are about.

A stack trace will appear corrupted if the top of the stack does not match the libraries or executables. On GNU/Linux, the typical cause of this is when an executable and core dump are copied to a different system where libc.so differs from the one that generated the core dump. Here, the more likely cause for corrupted stack traces could be that the mariadbd executable differs.

Can you reproduce the crash with your self-built executable, to get a proper stack trace?
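
As a rough first check before trusting a stack trace, the modification times of the executable and the core file can be compared, which is what the gdb warning quoted above is about. A minimal Python sketch follows; the function name `exec_newer_than_core` and the throwaway file names are hypothetical. A newer executable is only a hint of a mismatch, and an older one does not prove that the binary is the one that produced the core.

```python
import os
import tempfile


def exec_newer_than_core(exec_path: str, core_path: str) -> bool:
    """Return True if the executable was modified after the core file."""
    return os.stat(exec_path).st_mtime > os.stat(core_path).st_mtime


# Demo with two empty throwaway files standing in for the binary and the core.
with tempfile.TemporaryDirectory() as tmp:
    core = os.path.join(tmp, "core")
    exe = os.path.join(tmp, "mariadbd")
    for path in (core, exe):
        open(path, "w").close()
    os.utime(core, (1000, 1000))  # core written first
    os.utime(exe, (2000, 2000))   # executable rebuilt later
    demo_result = exec_newer_than_core(exe, core)

print(demo_result)  # True
```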

Comment by Michael Landin [ 2022-03-30 ]

Lol.. I am not able to reproduce the crash on the production system. I have no idea what caused it (that is what I am trying to figure out).
The build system I used for the new build of mariadbd had the same libc, same source build path, etc. - done from the FreeBSD ports system - so it should be "OK".

But regardless, I would say the focus should be on these lines: "[ERROR] InnoDB: Corruption of an index tree: table" right? Those caused InnoDB to invoke abort().. Or did I miss something?

Comment by Marko Mäkelä [ 2022-03-30 ]

The fatal message is output when the internal links of a table are found to be corrupted. CHECK TABLE without QUICK should exercise this code. You could start by executing CHECK TABLE on every InnoDB table. Once you have identified the corrupted table, the schema of that table would be helpful to know.
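
Running CHECK TABLE on every InnoDB table is easier with generated statements. A minimal Python sketch, assuming the list of (schema, table) pairs has already been collected, for example from information_schema.TABLES filtered on ENGINE = 'InnoDB'. The helper names and the second table in the sample list are hypothetical; QUICK is deliberately left out of the generated statements.

```python
def quote_identifier(name: str) -> str:
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def check_table_statements(tables):
    """Build one CHECK TABLE statement (without QUICK) per (schema, table) pair."""
    return [
        f"CHECK TABLE {quote_identifier(schema)}.{quote_identifier(table)};"
        for schema, table in tables
    ]


# Hypothetical table list; the first entry is the anonymized table from the log.
tables = [("xxxx", "yyy"), ("shop", "orders")]
for stmt in check_table_statements(tables):
    print(stmt)
```

The printed statements can be saved to a file and fed to the mariadb client; any table whose result is not OK is a candidate for the rebuild mentioned earlier in the thread.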

I do not have any idea what could cause this type of corruption in normal circumstances.

Abnormal circumstances could include the following:

  • Memory corruption due to faulty hardware, or a software bug. InnoDB only updates or validates page checksums when writing or reading from a data file; there is no checksum validation while the data remains cached in the buffer pool. So, it could happily compute a valid checksum right before writing out a page that was corrupted while it resided in the buffer pool.
  • Unsafe copying of the data directory while the server is running. Use mariadb-backup or a file system snapshot.
  • Deleting the ib_logfile0 to ‘fix’ crash recovery trouble.

Comment by Michael Landin [ 2022-03-30 ]

Thank you for the insights.

I restarted mariadb with --innodb-force-recovery=2 flag - after that I ran OPTIMIZE on the database showing problems and then everything was fine again.

I do not believe it was faulty hardware (as I run galera on top of my mariadb - and the change caused all 3 cluster nodes to core dump with the same error).
No operations (like deleting or copying files) were done.

Comment by Marko Mäkelä [ 2022-03-30 ]

Thank you for clarifying that Galera snapshot transfers are involved. I think that there have been some problems with that, and also some recent fixes by sysprg.

FreeBSD does not support asynchronous I/O (or at least we do not implement any API for that), and therefore my remarks in MDEV-24845 about innodb_disallow_writes not blocking InnoDB page writes should not apply to FreeBSD.

Which Galera snapshot transfer method do you use?

Comment by Michael Landin [ 2022-03-30 ]

I use rsync

Generated at Thu Feb 08 09:58:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.