[MDEV-27156] MariaDB 10.5.13 Galera Node core dumped Created: 2021-12-02  Updated: 2024-01-05

Status: Open
Project: MariaDB Server
Component/s: Galera, Server
Affects Version/s: 10.5.13
Fix Version/s: 10.5, 10.6

Type: Bug Priority: Major
Reporter: Rumen Palov Assignee: Jan Lindström
Resolution: Unresolved Votes: 1
Labels: crash, galera, replication, untable_to_start_again
Environment:

MariaDB 10.5.13, Galera provider 26.4.10, FreeBSD 12 and 13. ZFS storage, 1500G RAM, 64 or 96 cores



 Description   

Hello,

one of our Galera nodes died a few hours ago with the following state:

2021-12-02 21:34:38 41 [ERROR] [FATAL] InnoDB: Page old data size 15457 new data size 15959, page old max ins size 528 new max ins size 26
211202 21:34:38 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.13-MariaDB-log
key_buffer_size=5242880
read_buffer_size=131072
max_used_connections=6
max_threads=1502
thread_count=102
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3310938 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x129b99cb318
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fffdce99f30 thread_stack 0xc0000
2021-12-02 21:34:41 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.54933S), skipping check
0x132bb7c <my_print_stacktrace+0x3c> at /usr/local/libexec/mariadbd
0xc8b250 <handle_fatal_signal+0x290> at /usr/local/libexec/mariadbd
2021-12-02 21:34:42 0 [Note] WSREP: (7422f571-baaf, 'tcp://10.10.70.221:4567') turning message relay requesting on, nonlive peers: tcp://10.10.70.211:4567 tcp://10.10.70.231:4567 tcp://10.10.70.241:4567 tcp://10.10.70.251:4567
2021-12-02 21:34:42 0 [Note] WSREP: (7422f571-baaf, 'tcp://10.10.70.221:4567') connection established to 32fb230c-8d6c tcp://10.10.70.251:4567
2021-12-02 21:34:42 0 [Note] WSREP: (7422f571-baaf, 'tcp://10.10.70.221:4567') connection established to 98597714-a0b1 tcp://10.10.70.211:4567
2021-12-02 21:34:42 0 [Note] WSREP: (7422f571-baaf, 'tcp://10.10.70.221:4567') connection established to b764b1d2-b98b tcp://10.10.70.231:4567
2021-12-02 21:34:42 0 [Note] WSREP: (7422f571-baaf, 'tcp://10.10.70.221:4567') connection established to b82f6161-9027 tcp://10.10.70.241:4567
0x801924b70 <_pthread_sigmask+0x530> at /lib/libthr.so.3
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x1c9a48adc4): INSERT INTO QUERY SCRABLED
 
Connection ID (thread ID): 41
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Core pattern: %N.core
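
For triage, the four numbers in the fatal InnoDB line above can be compared directly. The following is a minimal sketch, not part of the original report: the function name `page_size_deltas` and the regular expression are illustrative assumptions, and the log line is copied from the output above.

```python
import re

# Fatal message copied verbatim from the error log in this report.
LOG_LINE = (
    "2021-12-02 21:34:38 41 [ERROR] [FATAL] InnoDB: Page old data size 15457 "
    "new data size 15959, page old max ins size 528 new max ins size 26"
)

# Illustrative pattern (an assumption): captures the four sizes in order.
PATTERN = re.compile(
    r"Page old data size (\d+) new data size (\d+), "
    r"page old max ins size (\d+) new max ins size (\d+)"
)


def page_size_deltas(line):
    """Return (data_growth, max_ins_shrink) in bytes, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    old_data, new_data, old_ins, new_ins = map(int, match.groups())
    return new_data - old_data, old_ins - new_ins


if __name__ == "__main__":
    print(page_size_deltas(LOG_LINE))  # (502, 502)
```

In this log the reported data size grew by 502 bytes and the reported maximum insert size shrank by the same 502 bytes. The arithmetic alone does not establish the root cause.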

If we try to rejoin the node to the cluster, the same behavior repeats.

Earlier, we had the same situation with a node that was a GTID replica (slave) of this one.

What can we do to supply more useful information? The core dump file is 150G.



 Comments   
Comment by Daniel Black [ 2021-12-02 ]

Looks similar to MDEV-26141 in the message. Are you doing similar SQL operations?

From the core, can you obtain a backtrace? See https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#analyzing-a-core-file-with-gdb-on-linux (I assume the Linux method is similar on FreeBSD. Is it compiled with debug symbols there, or are those a separate package? They will be needed to make a bit more sense of it.)

That aside and out of interest, I'm wondering why the core is so big, since MADV_NOCORE is used on the large allocations. How big is the core compared to a running memory resident mariadbd process?
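
As a rough illustration of the backtrace request above, the sketch below builds a non-interactive gdb invocation that dumps a full backtrace of every thread from a core file. It assumes gdb is installed and on the PATH. The binary path is taken from the stack trace in the description, while the core file path and the helper name `gdb_backtrace_command` are hypothetical.

```python
import subprocess


def gdb_backtrace_command(binary, core, gdb="gdb"):
    """Build a gdb command line that prints all thread backtraces and exits."""
    return [
        gdb,
        "--batch",                          # run the commands, then exit
        "-ex", "thread apply all bt full",  # full backtrace of every thread
        binary,
        core,
    ]


if __name__ == "__main__":
    cmd = gdb_backtrace_command(
        "/usr/local/libexec/mariadbd",  # path as shown in the crash log
        "/path/to/mariadbd.core",       # hypothetical core file location
    )
    # Write the backtrace to a file so it can be attached to the ticket.
    with open("backtrace.txt", "w") as out:
        subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT, check=False)
```

Without debug symbols the output will mostly be raw addresses, which is why the question about symbols matters.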

Comment by Rumen Palov [ 2021-12-03 ]

Hello Daniel,
thank you for the fast response.

The RES memory of the process is between 200G and 500G, depending on whether it is a write-accepting node or not. In our case it was not write-accepting.

MariaDB was not compiled with debug symbols, which is the default in the port.
I will compile it with debug symbols and try to get a backtrace from the core file.

The core dump from 10.5.13 was deleted during the production recovery procedure.

We have the same situation with 10.5.9: identical output in the error log, with the core dump preserved.

I will try to get a backtrace from it.

Cheers
Rumen

Comment by Marko Mäkelä [ 2024-01-05 ]

First, MariaDB Server 10.6 is not supposed to crash on corrupted data, ever since MDEV-13542 and some related bugs were fixed. It is not feasible to port these fixes to earlier major versions.

There used to be a problem with the default wsrep_sst_method=rsync, which allowed InnoDB to write to data files while a snapshot transfer (SST) was in progress. This was fixed by me in MDEV-24845 by a rewrite. However, based on MDEV-32115 it is known that this rewritten logic sometimes fails on MariaDB Server 10.4. I don’t know how much effort janlindstrom spent to try to reproduce that on 10.5 or later versions.

Every now and then we reproduce (typically in 10.6 or later) and fix (in 10.5 or a later applicable release) some bugs in crash recovery and mariadb-backup. Such bugs could affect all other modes than wsrep_sst_method=mysqldump.

Generated at Thu Feb 08 09:50:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.