[MDEV-26177] WSREP assertion failure in bf_abort when replicating certain DDL against in-memory tables Created: 2021-07-19  Updated: 2022-10-03  Resolved: 2022-10-03

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.11
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Kent Hoover Assignee: Jan Lindström (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

CentOS7 / RH7



 Description   

The bug is triggered by TRUNCATE of ENGINE=MEMORY tables: the TRUNCATE TABLE statement is (erroneously?) replicated, fails to apply because the in-memory tables are node-local, and the transaction is brute-force aborted; the "local or streaming tx" assertion in WSREP then fails and the node crashes with an error similar to the one below. This may eventually lead to a cluster lockup as WSREP goes completely haywire, with repeated node crashes caused by the same transaction being replicated again and again. Perhaps mode_ == m_local should evaluate to true here? (A minimal reproduction sketch follows the crash log below.)

2021-07-17 10:26:43 10 [ERROR] Slave SQL: Error 'Table 'lsa.ttm$etl_dimension_discovery_etl06_lsa_001' doesn't exist' on query. Default database: 'lsa'. Query: '/* DATABASE_EXECUTE_DDL*/ TRUNCATE TABLE `ttm$etl_dimension_discovery_etl06_lsa_001` /* PROCESS(LSA [ETLs] <lsa@etl06>, metadata_background|abandoned[1h]|time_to_live[4h]|maximum_reuse[15m])*/', Internal MariaDB error code: 1146
2021-07-17 10:26:43 10 [Warning] WSREP: Ignoring error 'Table 'lsa.ttm$etl_dimension_discovery_etl06_lsa_001' doesn't exist' on query. Default database: 'lsa'. Query: '/* DATABASE_EXECUTE_DDL*/ TRUNCATE TABLE `ttm$etl_dimension_discovery_etl06_lsa_001` /* PROCESS(LSA [ETLs] <lsa@etl06>, metadata_background|abandoned[1h]|time_to_live[4h]|maximum_reuse[15m])*/', Error_code: 1146
mariadbd: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.5.11/wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.5.11-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=165
max_threads=3010
thread_count=182
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 6732055 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
2021-07-17 10:27:18 12 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table lsa.configuration_group; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log FIRST, end_log_pos 713, Internal MariaDB error code: 1213
??:0(my_print_stacktrace)[0x55c7d751179e]
??:0(handle_fatal_signal)[0x55c7d6f16457]
sigaction.c:0(__restore_rt)[0x7f6ad94c1630]
:0(__GI_raise)[0x7f6ad890c387]
:0(__GI_abort)[0x7f6ad890da78]
:0(__assert_fail_base)[0x7f6ad89051a6]
:0(__GI___assert_fail)[0x7f6ad8905252]
??:0(wsrep_bf_abort(THD const*, THD*))[0x55c7d71d9dc6]
??:0(wsrep_thd_bf_abort)[0x55c7d71e000f]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55c7d72076cd]
??:0(handle_manager)[0x55c7d6d05fde]
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x55c7d716356d]
pthread_create.c:0(start_thread)[0x7f6ad94b9ea5]
??:0(__clone)[0x7f6ad89d49fd]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /data/mysql
Resource Limits:
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 127953 127953 processes
Max open files 32768 32768 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 127953 127953 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Core pattern: core
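
A minimal sketch of the suspected failure path (all object names are hypothetical, and the assumption that the MEMORY table is missing on at least one node, e.g. because it was created while replication was switched off, is ours rather than something the log confirms):

-- Hypothetical reproduction sketch on a 2+ node Galera cluster.
-- Assumes test.mem_t is absent on some node; one way to get there is
-- creating it with session-level replication disabled on node 1:
SET SESSION wsrep_on = OFF;
CREATE TABLE test.mem_t (id INT PRIMARY KEY) ENGINE=MEMORY;
SET SESSION wsrep_on = ON;

-- TRUNCATE is DDL, so it is replicated to all nodes via TOI.
-- On a node where test.mem_t does not exist, the applier fails with
-- error 1146 ('Table ... doesn't exist'), matching the log above.
TRUNCATE TABLE test.mem_t;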



 Comments   
Comment by Kent Hoover [ 2021-09-13 ]

Hello? Anything?

Comment by Gabor Orosz [ 2021-09-22 ]

Hi,

I think the problem is not the DDL itself, but the fact that schema changes are non-transactional and require special care. By default, Galera uses Total Order Isolation (TOI) to execute such changes, which means the cluster members have to reach a synchronization point where the application of ordinary DML transactions is suspended, and then resumed after the DDL has completed on all nodes. The crash happens in this resume phase: the applier threads try to process transactions in parallel, and two of them end up in lock contention due to their replay order. I suspect this is the same issue as the one reported in MDEV-26099.
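
For illustration, the DDL replication method and the applier parallelism can be inspected like this (a diagnostic sketch, not a fix; the workaround at the end is an untested assumption):

-- Check how DDL is replicated and how many parallel applier threads
-- could contend during the TOI resume phase described above.
SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';    -- TOI (default) or RSU
SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads'; -- number of parallel appliers

-- If the crash really is applier lock contention, reducing applier
-- parallelism might work around it (assumption, untested):
SET GLOBAL wsrep_slave_threads = 1;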

Best regards,
GOro

Comment by Kent Hoover [ 2021-12-28 ]

Have you concluded that this issue was indeed the same as MDEV-26099 (and has therefore been resolved)?

Cheers,
Kent

Comment by Jan Lindström (Inactive) [ 2022-10-03 ]

I did not see any problem using TRUNCATE on MEMORY tables. Note that for MEMORY tables DDL is replicated, but DML is not.
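
A short sketch illustrating that behaviour (hypothetical names):

-- On node 1:
CREATE TABLE test.mem_t2 (id INT) ENGINE=MEMORY;  -- DDL: replicated via TOI
INSERT INTO test.mem_t2 VALUES (1);               -- DML on MEMORY: not replicated
TRUNCATE TABLE test.mem_t2;                       -- DDL: replicated, no error here

-- On node 2:
SELECT COUNT(*) FROM test.mem_t2;  -- table exists (DDL was replicated),
                                   -- and is empty (the DML stayed node-local)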
