[MDEV-17243] Galera Server crashes with "WSREP: FSM: no such a transition ABORTING -> REPLICATING" on loading data
Created: 2018-09-19  Updated: 2021-03-10  Resolved: 2019-03-18

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1, 10.2.14
Fix Version/s: 10.2.23, 10.3.14, 10.4.4

Type: Bug
Priority: Critical
Reporter: Zdravelina Sokolovska (Inactive)
Assignee: Jan Lindström (Inactive)
Resolution: Fixed
Votes: 1
Labels: None
Environment:

3x Master-Master servers; OS: Fedora 27


Issue Links:
Relates
relates to MDEV-25104 Galera crashes with "FSM: no such a t... Closed

 Description   

Galera Server crashes with "WSREP: FSM: no such a transition ABORTING -> REPLICATING" while loading data.

The crash occurred during concurrent loading of several tables, after interrupting the previous session, dropping the database, recreating the schema, and restarting the load.

2018-09-19 15:27:45 139794208356096 [ERROR] WSREP: FSM: no such a transition ABORTING -> REPLICATING
180919 15:27:45 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.2.14-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=25
max_threads=153
thread_count=36
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467245 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f23d40008d8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f2460216cd8 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55a0f3a1818e]
/usr/sbin/mysqld(handle_fatal_signal+0x5a3)[0x55a0f349e5f3]
/lib64/libpthread.so.0(+0x121c0)[0x7f252a6761c0]
/lib64/libc.so.6(gsignal+0x110)[0x7f25285a6750]
/lib64/libc.so.6(abort+0x151)[0x7f25285a7d31]
/usr/lib64/galera/libgalera_smm.so(_ZN6galera3FSMINS_9TrxHandle5StateENS1_10TransitionENS_10EmptyGuardENS_11EmptyActionEE8shift_toES2_+0x190)[0x7f25246c6850]
/usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM9replicateEPNS_9TrxHandleEP14wsrep_trx_meta+0x277)[0x7f25246bf457]
/usr/lib64/galera/libgalera_smm.so(galera_pre_commit+0xb3)[0x7f25246df303]
/usr/sbin/mysqld(wsrep_run_wsrep_commit+0x987)[0x55a0f3433147]
/usr/sbin/mysqld(+0x5e9d73)[0x55a0f3433d73]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDb+0x31f)[0x55a0f34a172f]
/usr/sbin/mysqld(_Z17trans_commit_stmtP3THD+0x5d)[0x55a0f33e6f4d]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x3bf)[0x55a0f3307bff]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x2f3)[0x55a0f3310543]
/usr/sbin/mysqld(+0x4c6d16)[0x55a0f3310d16]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x17b0)[0x55a0f3312ad0]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x230)[0x55a0f3313fa0]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x20a)[0x55a0f33d830a]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55a0f33d84dd]
/lib64/libpthread.so.0(+0x750b)[0x7f252a66b50b]
/lib64/libc.so.6(clone+0x3f)[0x7f252866738f]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f23d401ea30): LOAD DATA LOCAL INFILE '/root/QA/mariadb-columnstore-tpcds/insert-data-tables/data/tpcds_2000/item.tbl' INTO TABLE `item` FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 0 LINES
Connection ID (thread ID): 428
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.



 Comments   
Comment by Jan Lindström (Inactive) [ 2018-09-24 ]

On the surface, this looks similar to MDEV-17262.

Comment by Julius Goryavsky [ 2018-10-16 ]

https://github.com/MariaDB/galera/pull/3

The TrxMap structure does not account for two trx objects that have
the same trx_id (2^64 - 1, which is the default trx_id) but belong
to two different connections.

As a result, the same trx object can end up shared between two
unrelated connections, which causes the state inconsistency that
leads to the crash (a race condition).

This problem could be solved by also taking the conn-id into account,
but that would require an interface change. To avoid this, we should
maintain a separate map of such trx objects keyed by gu_thread_id.
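
The idea in the comment above can be illustrated with a minimal sketch. This is not the code from the pull request: TrxRegistry, Trx, ThreadKey and kDefaultTrxId are hypothetical names chosen for illustration, and ThreadKey is a plain integer standing in for a gu_thread_id value. The sketch only shows why keying default-id transactions by thread keeps two connections from receiving the same object.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-ins; none of these names come from the Galera sources.
using TrxId = std::uint64_t;
using ThreadKey = std::uint64_t;  // assumed stand-in for a gu_thread_id value

// The default trx_id mentioned in the comment: 2^64 - 1.
constexpr TrxId kDefaultTrxId = std::numeric_limits<TrxId>::max();

struct Trx {
    TrxId id;
    ThreadKey owner;  // thread that first created this object
};

class TrxRegistry {
public:
    // Returns the trx object for (id, thread), creating it if needed.
    // Objects with the default id are looked up by thread, so two
    // connections running on different threads never share one.
    std::shared_ptr<Trx> get_or_create(TrxId id, ThreadKey thread) {
        std::lock_guard<std::mutex> lock(mutex_);
        const bool is_default = (id == kDefaultTrxId);
        auto& map = is_default ? by_thread_ : by_trx_id_;
        const std::uint64_t key = is_default ? thread : id;

        auto it = map.find(key);
        if (it == map.end()) {
            it = map.emplace(key, std::make_shared<Trx>(Trx{id, thread})).first;
        }
        // Invariant: an entry always carries the id it was requested with.
        assert(it->second->id == id);
        return it->second;
    }

private:
    std::mutex mutex_;
    // Regular transactions, keyed by their unique trx_id.
    std::unordered_map<std::uint64_t, std::shared_ptr<Trx>> by_trx_id_;
    // Default-id transactions, keyed by the owning thread instead.
    std::unordered_map<std::uint64_t, std::shared_ptr<Trx>> by_thread_;
};
```

With a single map keyed only by trx_id, the two calls `get_or_create(kDefaultTrxId, 1)` and `get_or_create(kDefaultTrxId, 2)` would return the same object; in this sketch they return distinct objects, while lookups with a real trx_id behave as before.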

Comment by Jan Lindström (Inactive) [ 2019-01-17 ]

bar, please review the latest version. If you already have, please mark both the PR and this issue accordingly.

Comment by Alexander Barkov [ 2019-01-21 ]

jplindst, sorry, I can't review this change. I suggest asking someone more familiar with this code. Perhaps Sergey Vojtovich could review it.

Comment by Jan Lindström (Inactive) [ 2019-01-22 ]

svoj, can you review this?
