[MDEV-33129] Crash in wsrep::wsrep_provider_v26::replay when setting gtid_slave_pos Created: 2023-12-27  Updated: 2024-01-03

Status: Open
Project: MariaDB Server
Component/s: Galera, Replication
Affects Version/s: 10.6.16
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Michał Trzciński Assignee: Roel Van de Paar
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Kubernetes


Issue Links:
Relates
relates to MDEV-30550 Assertion `state() == s_executing || ... Confirmed
relates to MDEV-32779 galera_concurrent_ctas: assertion in ... Open

 Description   

We are running a 3-node Galera cluster (A) with replication to another Galera cluster (B).

When I set gtid_slave_pos on the slave (cluster B), the server crashes:

MariaDB [(none)]> SET GLOBAL gtid_slave_pos = "0-1-872101,10-1-827994,11-1-1159,12-1-240";
ERROR 2013 (HY000): Lost connection to server during query

When the slave starts up after the crash, gtid_slave_pos is set to:

MariaDB [(none)]> select @@gtid_slave_pos;
+------------------+
| @@gtid_slave_pos |
+------------------+
| 0-1-872101       |
+------------------+
1 row in set (0.001 sec)

and I cannot START SLAVE because the GTID does not match.

This works fine on <=10.6.15, so as a workaround I started the slave on 10.6.15 and then upgraded it to 10.6.16.
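In essence, the workaround sequence run on a 10.6.15 binary (before upgrading back to 10.6.16) looks like this; the SHOW SLAVE STATUS check is an added verification step, not part of the original report:

```sql
-- On the slave (cluster B), running MariaDB 10.6.15:
SET GLOBAL gtid_slave_pos = "0-1-872101,10-1-827994,11-1-1159,12-1-240";
START SLAVE;
-- Verify Slave_IO_Running / Slave_SQL_Running are both Yes:
SHOW SLAVE STATUS\G
-- Then shut down cleanly and upgrade the binaries to 10.6.16.
```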

Error log:

231227 22:32:21 [ERROR] mysqld got signal 11 ;
Sorry, we probably made a mistake, and this is a bug.
 
Your assistance in bug reporting will enable us to fix this for the next release.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.6.16-MariaDB-log source revision: b83c379420a8846ae4b28768d3c81fa354cca056
key_buffer_size=10485760
read_buffer_size=131072
max_used_connections=3
max_threads=111
thread_count=13
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 254674 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fefc8384618
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fefac37fd98 thread_stack 0x50000
/opt/bitnami/mariadb/sbin/mysqld(my_print_stacktrace+0x2e)[0x560f9344323e]
/opt/bitnami/mariadb/sbin/mysqld(handle_fatal_signal+0x475)[0x560f92f72a55]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7feff4866140]
/opt/bitnami/mariadb/lib/libgalera_smm.so(+0x66bf8)[0x7fefefe33bf8]
/opt/bitnami/mariadb/lib/libgalera_smm.so(+0x445eb)[0x7fefefe115eb]
Printing to addr2line failed
/opt/bitnami/mariadb/sbin/mysqld(_ZN5wsrep18wsrep_provider_v266replayERKNS_9ws_handleEPNS_21high_priority_serviceE+0x29)[0x560f934cb989]
/opt/bitnami/mariadb/sbin/mysqld(_ZN20Wsrep_client_service6replayEv+0xfe)[0x560f9321e45e]
/opt/bitnami/mariadb/sbin/mysqld(_ZN5wsrep11transaction6replayERSt11unique_lockINS_5mutexEE+0x96)[0x560f934c9616]
/opt/bitnami/mariadb/sbin/mysqld(_ZN5wsrep11transaction15after_statementERSt11unique_lockINS_5mutexEE+0xbd)[0x560f934c9b8d]
/opt/bitnami/mariadb/sbin/mysqld(_ZN5wsrep12client_state15after_statementEv+0x6a)[0x560f934aeaba]
/opt/bitnami/mariadb/sbin/mysqld(+0x7932f4)[0x560f92d452f4]
/opt/bitnami/mariadb/sbin/mysqld(+0x7a2060)[0x560f92d54060]
/opt/bitnami/mariadb/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjb+0x1df2)[0x560f92d56552]
/opt/bitnami/mariadb/sbin/mysqld(_Z10do_commandP3THDb+0x131)[0x560f92d56e41]
/opt/bitnami/mariadb/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECTb+0x3a7)[0x560f92e586b7]
/opt/bitnami/mariadb/sbin/mysqld(handle_one_connection+0x5d)[0x560f92e58a1d]
/opt/bitnami/mariadb/sbin/mysqld(+0xc105f9)[0x560f931c25f9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7)[0x7feff485aea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7feff4447a2f]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 1
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=off,cset_narrowing=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
information that should help you find out what is causing the crash.
 
We think the query pointer is invalid, but we will try to print it anyway.
Query:
 
Writing a core file...
Working directory at /bitnami/mariadb/data
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       117314               117314               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E
 
Kernel version: Linux version 5.15.0-91-generic (buildd@lcy02-amd64-045) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023



 Comments   
Comment by Roel Van de Paar [ 2024-01-03 ]

Demangled stack

/opt/bitnami/mariadb/sbin/mysqld(my_print_stacktrace+0x2e)[0x560f9344323e]
/opt/bitnami/mariadb/sbin/mysqld(handle_fatal_signal+0x475)[0x560f92f72a55]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7feff4866140]
/opt/bitnami/mariadb/lib/libgalera_smm.so(+0x66bf8)[0x7fefefe33bf8]
/opt/bitnami/mariadb/lib/libgalera_smm.so(+0x445eb)[0x7fefefe115eb]
Printing to addr2line failed
/opt/bitnami/mariadb/sbin/mysqld(wsrep::wsrep_provider_v26::replay(wsrep::ws_handle const&, wsrep::high_priority_service*)+0x29)[0x560f934cb989]
/opt/bitnami/mariadb/sbin/mysqld(Wsrep_client_service::replay()+0xfe)[0x560f9321e45e]
/opt/bitnami/mariadb/sbin/mysqld(wsrep::transaction::replay(std::unique_lock<wsrep::mutex>&)+0x96)[0x560f934c9616]
/opt/bitnami/mariadb/sbin/mysqld(wsrep::transaction::after_statement(std::unique_lock<wsrep::mutex>&)+0xbd)[0x560f934c9b8d]
/opt/bitnami/mariadb/sbin/mysqld(wsrep::client_state::after_statement()+0x6a)[0x560f934aeaba]
/opt/bitnami/mariadb/sbin/mysqld(+0x7932f4)[0x560f92d452f4]
/opt/bitnami/mariadb/sbin/mysqld(+0x7a2060)[0x560f92d54060]
/opt/bitnami/mariadb/sbin/mysqld(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool)+0x1df2)[0x560f92d56552]
/opt/bitnami/mariadb/sbin/mysqld(do_command(THD*, bool)+0x131)[0x560f92d56e41]
/opt/bitnami/mariadb/sbin/mysqld(do_handle_one_connection(CONNECT*, bool)+0x3a7)[0x560f92e586b7]
/opt/bitnami/mariadb/sbin/mysqld(handle_one_connection+0x5d)[0x560f92e58a1d]
/opt/bitnami/mariadb/sbin/mysqld(+0xc105f9)[0x560f931c25f9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7)[0x7feff485aea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7feff4447a2f]

Comment by Roel Van de Paar [ 2024-01-03 ]

Findings:
1) The issue cannot be readily reproduced, and may be related to the contents of the binary log at the given GTID coordinates, especially given that:
2) The query is empty:

Query (0x0): (null)

3) Yet the actual crash is in wsrep code (wsrep::wsrep_provider_v26::replay).
I wonder whether temporarily setting wsrep_on=OFF would help, but I do not know enough about this area. ramesh, do you have further thoughts?
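For what it's worth, a sketch of that wsrep_on=OFF idea (untested, and whether skipping Galera replication for this statement actually avoids the replay path that crashes is exactly the open question):

```sql
-- Hypothetical workaround sketch, not verified against this bug:
SET SESSION wsrep_on = OFF;  -- keep the next statement out of Galera replication
SET GLOBAL gtid_slave_pos = "0-1-872101,10-1-827994,11-1-1159,12-1-240";
SET SESSION wsrep_on = ON;   -- restore normal replication for this session
```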

Generated at Thu Feb 08 10:36:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.