[MDEV-25835] mariadb 10.3.29 galera cluster crashes with errors like: "[ERROR] WSREP: Trx 236236 tries to abort slave trx 236238." Created: 2021-06-01  Updated: 2022-02-10  Resolved: 2021-10-27

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.2.36, 10.3.29
Fix Version/s: 10.2.41, 10.3.32, 10.4.22, 10.5.13, 10.6.5

Type: Bug Priority: Critical
Reporter: Jan Horstmann Assignee: Seppo Jaakola
Resolution: Fixed Votes: 8
Labels: None
Environment:

KVM machine with ubuntu bionic cloud image. Dockerised deployment of mariadb 10.3.29 from mariadb repository based on ubuntu bionic docker image:
$ uname -a
Linux os-ctrl-1 4.15.0-109-generic #110-Ubuntu SMP Tue Jun 23 02:39:32 UTC 2020 x8
$ docker exec -it mariadb mysql --version
mysql Ver 15.1 Distrib 10.3.29-MariaDB, for debian-linux-gnu (x86_64) using readline 5.26_64 x86_64 x86_64 GNU/Linux


Attachments: File os-ctrl-0-galera.cnf     File os-ctrl-0-mariadb.log     File os-ctrl-1-galera.cnf     File os-ctrl-1-mariadb.log     File os-ctrl-2-galera.cnf     File os-ctrl-2-mariadb.log    
Issue Links:
Duplicate
is duplicated by MDEV-26697 MariaDB keeps rebooting after upgrade... Closed
Relates
relates to MDEV-24978 SIGABRT in __libc_message Closed

 Description   

Starting with mariadb 10.3.29 when deploying a three node galera cluster, we are seeing crashes with errors like in [1].
Please find attached the log files from all three nodes and their configurations. A core dump of the crash in [1] can also be provided.

[1]

2021-06-01  9:06:16 0 [ERROR] WSREP: Trx 236236 tries to abort slave trx 236238. This could be caused by:
	1) unsupported configuration options combination, please check documentation.
	2) a bug in the code.
	3) a database corruption.
 Node consistency compromized, need to abort. Restart the node to resync with cluster.
210601  9:06:16 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.3.29-MariaDB-1:10.3.29+maria~bionic-log
key_buffer_size=67108864
read_buffer_size=131072
max_used_connections=1
max_threads=10002
thread_count=12
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 22055535 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5647d689bb8e]
/usr/sbin/mysqld(handle_fatal_signal+0x515)[0x5647d63308c5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fd6316ed980]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fd631328fb7]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fd63132a921]
/usr/sbin/mysqld(+0x90a914)[0x5647d64ba914]
/usr/sbin/mysqld(+0x90f83c)[0x5647d64bf83c]
/usr/sbin/mysqld(handle_manager+0x1f3)[0x5647d6142a83]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fd6316e26db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fd63140b71f]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1048576              1048576              files     
Max locked memory         16777216             16777216             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       64055                64055                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core



 Comments   
Comment by Walter Doekes [ 2021-06-03 ]

Okay, so we notice this too:

2021-06-02 20:58:51 0 [Note] WSREP: Signalling provider to continue.
2021-06-02 20:58:51 0 [Note] WSREP: SST received: 13af80aa-836a-11eb-8c23-32f57cb601d4:54087553
2021-06-02 20:58:51 2 [Note] WSREP: Receiving IST: 1683 writesets, seqnos 54087553-54089236
2021-06-02 20:58:51 0 [Note] WSREP: Receiving IST...  0.0% (   0/1683 events) complete.
2021-06-02 20:58:51 0 [Note] Reading of all Master_info entries succeeded
2021-06-02 20:58:51 0 [Note] Added new Master_info '' to hash table
2021-06-02 20:58:51 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.3.29-MariaDB-1:10.3.29+maria~bionic-log'  socket: '/run/mysqld/MY_DEPLOYMENT-mysqld.sock'  port: 3306  mariadb.org binary distribution
2021-06-02 20:58:51 0 [ERROR] WSREP: Trx 54088471 tries to abort slave trx 54088527. This could be caused by:
        1) unsupported configuration options combination, please check documentation.
        2) a bug in the code.
        3) a database corruption.
 Node consistency compromized, need to abort. Restart the node to resync with cluster.

It went into a crashloop, with the same: Trx 54088471 tries to abort slave trx 54088527.

First core dump looks like:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f30f06fa9fb in __GI_abort () at abort.c:100
#2  0x000055d85b55d914 in wsrep_abort_slave_trx (bf_seqno=<optimized out>, victim_seqno=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:18893
#3  0x000055d85b56283c in bg_wsrep_kill_trx (void_arg=0x7f1088023830) at ./storage/innobase/handler/ha_innodb.cc:19058
#4  0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
#5  0x000055d85b8ea67a in pfs_spawn_thread (arg=0x55d8c61e5228) at ./storage/perfschema/pfs.cc:1869
#6  0x00007f30f0ab26db in start_thread (arg=0x7f1153399700) at pthread_create.c:463

        case QUERY_IDLE:
        {
                WSREP_DEBUG("kill IDLE for " TRX_ID_FMT, victim_trx->id);      
 
                if (wsrep_thd_exec_mode(thd) == REPL_RECV) {
                        WSREP_DEBUG("kill BF IDLE, seqno: %lld",
                                    (long long)wsrep_thd_trx_seqno(thd));      
                        wsrep_abort_slave_trx(arg->bf_seqno,     // <-- CRASH
                                              wsrep_thd_trx_seqno(thd));
                        goto ret_unlock;
                }
                /* This will lock thd from proceeding after net_read() */      
                wsrep_thd_set_conflict_state(thd, ABORTING);

(gdb) print *arg
$2 = {thd_id = 67, trx_id = 7493842400, bf_seqno = 54088471, signal = 1}
 
(gdb) print *(unsigned*)((void*)thd + 0x58e0)
$4 = 54088527
 
(gdb) print thd->wsrep_trx_meta
$22 = {gtid = {uuid = {data = "\023\257\200\252\203j\021\353\214#2\365|\266", <incomplete sequence \324>}, seqno = 54088527}, depends_on = 54088417}
 
(gdb) print thd->proc_info
$23 = 0x7f0fb0006570 "Applied write set 54088527"
 
(gdb) info locals
rcode = <optimized out>
arg = 0x7f1088023830
thd = 0x7f0fb0000c08
victim_trx = 0x7f242978d1f8
awake = false

(gdb) up
#4  0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
 
(gdb) info locals
next = 0x0
cb = 0x7f1088023868
error = 0
abstime = {tv_sec = 94387747591648, tv_nsec = 139848169358893}
reset_flush_time = true

Went back to 10.3.25 because of MDEV-25190 (among others).


Let me know if I can get you more debug info from the dump.

Comment by Jan Horstmann [ 2021-06-04 ]

@wdoekes would you consider it to be save to go back from 10.3.29 to 10.3.28 or lower? Did you do a full SST per node and which method do you use?

Comment by Walter Doekes [ 2021-06-04 ]

If you're not affected by MDEV-25190, then I guess 10.3.28 is fine. We've had a bad streak after 10.3.25, so we're reverting back to that.

I'm not aware what ops did when reverting. SSTs generally start automatically here after certain failures, more often than appears necessary. So if there was an SST, I'm not sure if it was intentional either.

The crash loop started on the second node, apparently. ("Rebooted, started maria on the 2nd node. Crashloop. Downgrading made it happy again.")

Comment by Patrick Schlirf [ 2021-06-25 ]

Hey guys,

are there any updates?

We downgraded after the first three times this happened and freeze the patch cycle of a big number of database servers to prevent to hit this bug again.

Kind regards,
Patrick

Comment by Jan Lindström (Inactive) [ 2021-07-26 ]

This issue will be fixed when (if)MDEV-25114 is reviewed and applied.

Comment by Jan Lindström (Inactive) [ 2021-09-20 ]

I need more information. Can you run with --wsrep-debug=1 and provide error logs ? Naturally, steps how to reproduce would be most useful.

Comment by Seppo Jaakola [ 2021-10-20 ]

This issue seems to cause applier-applier conflicts, configuring for only one applier (wsrep_slave_threads=1) should be immediate help for this.
I suspect this is a duplicate of MDEV-24978, as the table definition has foreign key constraint over multi-byte character set

Comment by Seppo Jaakola [ 2021-10-20 ]

probably a duplicate

Comment by Daniel Black [ 2021-10-29 ]

With 47ba5523046094db33e68c92a182491a629bbd56 reverting MDEV-24978, is this still closed?

Comment by Jan Lindström (Inactive) [ 2021-10-29 ]

Yes, MDEV-24978 was fixed on d5a15f04f4ab0738b0a5f993b208bcfaad522fd4

Generated at Thu Feb 08 09:40:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.