[MDEV-24846] Rollback crashes active nodes in galera cluster Created: 2021-02-11  Updated: 2023-04-11  Resolved: 2023-04-11

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.3.25
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Lars Mikkelsen
Assignee: Jan Lindström (Inactive)
Resolution: Won't Fix
Votes: 0
Labels: None
Environment:

Red Hat Enterprise Linux Server release 7.5 (Maipo)



 Description   

MariaDB 10.3.25, installed from the public repositories.
3-node cluster; only 2 of the nodes have active connections.
The master node crashes with:

2021-02-09  8:23:32 1960824 [ERROR] WSREP: FSM: no such a transition ROLLED_BACK -> ROLLED_BACK
210209  8:23:32 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.3.25-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=559
max_threads=602
thread_count=291
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1454565 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f4a348bc4b8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4ae4506d30 thread_stack 0x49000
terminate called after throwing an instance of 'gu::Exception'
  what():  gu_mutex_destroy(): 16 (Device or resource busy)
         at galerautils/src/gu_mutex.hpp:~Mutex():32

The other node crashed with almost the same output:

2021-02-09  8:23:30 1960770 [ERROR] WSREP: FSM: no such a transition ROLLED_BACK -> ROLLED_BACK
210209  8:23:30 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.3.25-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=563
max_threads=602
thread_count=294
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1454565 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f3b08218728
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
2021-02-09  8:23:30 1960773 [ERROR] WSREP: FSM: no such a transition ROLLED_BACK -> ROLLED_BACK

The third node did not crash, but after the first node had completed its SST it could not join the cluster because of the state of the third node. We had to start a new cluster and wait for the second SST to complete.

Let me know if I can provide any additional info.
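
For readers unfamiliar with the `FSM: no such a transition ROLLED_BACK -> ROLLED_BACK` message in the logs above, the following is a minimal sketch of how a transition-table state machine can reject a repeated shift into the same state. It is not Galera's implementation: the `FSM` class, the `NoSuchTransition` exception, the `ALLOWED` table and every state name other than `ROLLED_BACK` are hypothetical, and the sketch raises an exception where the server in this report aborted with signal 6.

```python
from enum import Enum, auto


class TrxState(Enum):
    # Hypothetical subset of transaction states, for illustration only.
    EXECUTING = auto()
    MUST_ABORT = auto()
    ABORTING = auto()
    ROLLED_BACK = auto()


class NoSuchTransition(RuntimeError):
    """Raised when a state change is not in the allowed-transition table."""


class FSM:
    def __init__(self, initial, allowed):
        self.state = initial
        self._allowed = frozenset(allowed)

    def shift_to(self, target):
        # Only (from, to) pairs registered up front are legal.
        if (self.state, target) not in self._allowed:
            raise NoSuchTransition(
                f"FSM: no such a transition {self.state.name} -> {target.name}"
            )
        self.state = target


# Assumed transition table: note there is no ROLLED_BACK -> ROLLED_BACK entry.
ALLOWED = {
    (TrxState.EXECUTING, TrxState.MUST_ABORT),
    (TrxState.MUST_ABORT, TrxState.ABORTING),
    (TrxState.ABORTING, TrxState.ROLLED_BACK),
}


def demo():
    """Roll a transaction back once, then try to roll it back again."""
    fsm = FSM(TrxState.EXECUTING, ALLOWED)
    for step in (TrxState.MUST_ABORT, TrxState.ABORTING, TrxState.ROLLED_BACK):
        fsm.shift_to(step)
    try:
        fsm.shift_to(TrxState.ROLLED_BACK)  # second rollback of the same trx
    except NoSuchTransition as exc:
        return str(exc)
    return None
```

In this sketch the second `shift_to(TrxState.ROLLED_BACK)` is rejected because the same-state pair was never registered, which is one way a rollback that runs twice for the same transaction could surface as this kind of error.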



 Comments   
Comment by Lars Mikkelsen [ 2021-05-03 ]

Happened again:

2021-05-03 14:04:39 1669694 [ERROR] WSREP: FSM: no such a transition ROLLED_BACK -> ROLLED_BACK
210503 14:04:39 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.3.25-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=701
max_threads=602
thread_count=382
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1461324 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f310845b2a8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f31a29bad00 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55562a271cde]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x555629d0720f]
sigaction.c:0(__restore_rt)[0x7f497ab0d5d0]
:0(__GI_raise)[0x7f4978de02c7]
:0(__GI_abort)[0x7f4978de19b8]
/usr/lib64/galera/libgalera_smm.so(+0x1a281c)[0x7f4954b2d81c]
src/fsm.hpp:104(galera::FSM<galera::TrxHandle::State, galera::TrxHandle::Transition, galera::EmptyGuard, galera::EmptyAction>::shift_to(galera::TrxHandle::State))[0x7f4954b23cc6]
src/gu_atomic.hpp:59(gu::Atomic<long long>::operator++())[0x7f4954b34975]
/usr/sbin/mysqld(_Z12wsrep_commitP10handlertonP3THDb+0xd2)[0x555629c72162]
/usr/sbin/mysqld(+0x7b4475)[0x555629d08475]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDb+0x4e6)[0x555629d0a906]
/usr/sbin/mysqld(_Z12trans_commitP3THD+0x4a)[0x555629c116da]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2e76)[0x555629b24396]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x36d)[0x555629b2a64d]
/usr/sbin/mysqld(+0x4db110)[0x555629a2f110]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x2eac)[0x555629b2ddac]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x11b)[0x555629b2e17b]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1d6)[0x555629c04a96]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x555629c04bad]
/usr/sbin/mysqld(+0xcce97d)[0x55562a22297d]
pthread_create.c:0(start_thread)[0x7f497ab05dd5]
2021-05-03 14:04:40 1669201 [Warning] Aborted connection 1669201 to db: 'catalogservice' user: 'catalogservice' host: 'maxscale01.java.jysk.netic.dk' (CLOSE_CONNECTION)
/lib64/libc.so.6(clone+0x6d)[0x7f4978ea7f6d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f310859a660): COMMIT
Connection ID (thread ID): 1669694
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /data/mysql/data
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             127405               127405               processes
Max open files            16384                16384                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       127405               127405               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: core
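
Note on the resource limits above: the soft limit for `Max core file size` is 0, so on Linux the `Writing a core file...` step most likely produced no core file. The snippet below is a small sketch for checking that value in a pasted limits table like the one in this log. The helper names `soft_core_limit` and `core_dump_possible` are made up for this example, and the parsing assumes columns separated by two or more spaces, as printed here.

```python
import re


def soft_core_limit(limits_text):
    """Return the soft 'Max core file size' value from a limits table, or None."""
    for line in limits_text.splitlines():
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if fields[0] == "Max core file size" and len(fields) >= 3:
            return fields[1]
    return None


def core_dump_possible(limits_text):
    """True if the soft core limit is present and not zero."""
    soft = soft_core_limit(limits_text)
    return soft is not None and soft != "0"


# Two rows copied from the log above, plus the header row.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max core file size        0                    unlimited            bytes
Max open files            16384                16384                files
"""

if __name__ == "__main__":
    print("soft core limit:", soft_core_limit(SAMPLE))
    print("core dump possible:", core_dump_possible(SAMPLE))
```

If a core file is wanted for a future crash, the soft limit would need to be raised for the mysqld process before it starts; how to do that depends on how the service is launched and is not covered in this report.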

Comment by Lars Mikkelsen [ 2021-05-03 ]

No previous warnings or errors except some users being disconnected.

Comment by Lars Mikkelsen [ 2021-05-03 ]

This time only one node crashed.

Comment by Lars Mikkelsen [ 2021-05-04 ]

Could it be caused by the fix in MDEV-21473?

Comment by Jan Lindström (Inactive) [ 2021-05-04 ]

Do you use asynchronous replication? There is no mention of this fact in this bug report.

Comment by Lars Mikkelsen [ 2021-05-04 ]


Two of the nodes act as masters, but none of the crashing nodes have slaves running.

Comment by Jan Lindström [ 2023-04-11 ]

10.3 will EOL soon.

Generated at Thu Feb 08 09:33:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.