[MDEV-24306] src/replicator_smm.cpp:1054(galera::ReplicatorSMM::replay_trx(galera::TrxHandleMaster&, galera::TrxHandleLock&, void*)) Created: 2020-11-30  Updated: 2021-05-23  Resolved: 2021-05-23

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.6
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Allen Lee (Inactive)
Assignee: Seppo Jaakola
Resolution: Incomplete
Votes: 1
Labels: galera_4, need_feedback
Environment: RedHat Enterprise Linux, On-Premise, Virtualized
Attachments: File server.cnf    
Issue Links:
Relates

 Description   

The user is hitting signal 11 repeatedly (114 times) after applying SSL (TDE). Here is the stack trace.

201124 12:14:54 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.5.6-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=11
max_threads=153
thread_count=34
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467800 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f19d40b1408
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f1a5ec24c90 thread_stack 0x49000
??:0(my_print_stacktrace)[0x55ac7a77d0ae]
??:0(handle_fatal_signal)[0x55ac7a17e5d7]
sigaction.c:0(__restore_rt)[0x7f1a735a9630]
src/replicator_smm.cpp:1054(galera::ReplicatorSMM::replay_trx(galera::TrxHandleMaster&, galera::TrxHandleLock&, void*))[0x7f1a6e5dce19]
src/trx_handle.hpp:1126(~TrxHandleLock)[0x7f1a6e5f177d]
??:0(wsrep::wsrep_provider_v26::replay(wsrep::ws_handle const&, wsrep::high_priority_service*))[0x55ac7a81d259]
??:0(Wsrep_client_service::replay())[0x55ac7a424a56]
??:0(wsrep::transaction::replay(wsrep::unique_lock<wsrep::mutex>&))[0x55ac7a8172a8]
??:0(wsrep::transaction::after_statement())[0x55ac7a8187d4]
??:0(wsrep::client_state::after_statement())[0x55ac7a7f44fd]
??:0(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool))[0x55ac79f86bfb]
??:0(do_command(THD*))[0x55ac79f88feb]
??:0(do_handle_one_connection(CONNECT*, bool))[0x55ac7a071b29]
??:0(handle_one_connection)[0x55ac7a071db4]
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x55ac7a3c627d]
pthread_create.c:0(start_thread)[0x7f1a735a1ea5]
??:0(__clone)[0x7f1a716ca96d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): 
Connection ID (thread ID): 1
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /MariaDB/base/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             63412                63412                processes 
Max open files            16384                16384                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       63412                63412                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
 
2020-11-24 12:15:03 0 [Note] WSREP: Loading provider /usr/lib64/galera-4/libgalera_smm.so initial position: 33c3e167-0a8e-11eb-bf65-f322c7adba58:646
2020-11-24 12:15:03 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-4/libgalera_smm.so'
2020-11-24 12:15:03 0 [Note] WSREP: wsrep_load(): Galera 26.4.5(rb3764ab) by Codership Oy <info@codership.com> loaded successfully.
2020-11-24 12:15:03 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-11-24 12:15:03 0 [Note] WSREP: Found saved state: 33c3e167-0a8e-11eb-bf65-f322c7adba58:-1, safe_to_bootstrap: 0
2020-11-24 12:15:03 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: 33c3e167-0a8e-11eb-bf65-f322c7adba58
Seqno: -1 - -1
Offset: -1
Synced: 0

This is a 5-node Galera cluster. Part of the Galera configuration is shown below.

# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address="gcomm://xxx-oitmrdb01,xxx-oitmrdb02,xxx-oitmrdb03,xxx-oitmrdb04,d2-us-oitmrdb01,d2-us-oitmrdb02,d2-us-oitmrdb03"
wsrep_cluster_name='D1_D2_MARIADB_STAGE_CLUSTER'
wsrep_node_address='xxx-oitmrdb01'
wsrep_node_name='xxx-oitmrdb01'
wsrep_sst_method=mariabackup
wsrep_sst_auth=sst:xxx
wsrep_log_conflicts=ON
#wsrep_debug=ON
 
# Optional setting
wsrep_slave_threads=20
wsrep_certify_nonPK=1
wsrep_max_ws_size=1073741824
wsrep_convert_LOCK_to_trx=0
wsrep_retry_autocommit=1
wsrep_auto_increment_control=1
 
#The following parameters can tolerate 30 second connectivity outages on a wan replication
wsrep_provider_options = "evs.keepalive_period = PT3S"
wsrep_provider_options = "evs.suspect_timeout = PT30S"
wsrep_provider_options = "evs.inactive_timeout = PT5S"
wsrep_provider_options = "evs.install_timeout = PT5S"
wsrep_provider_options = "pc.weight=1"
wsrep_provider_options = "evs.join_retrans_period=PT0.10S"
wsrep_provider_options = "gcache.recover = yes; gcache.size = 1G;gcs.fc_factor = 0.8; gcs.fc_limit = 64; gcs.fc_master_slave = yes;"
wsrep_provider_options = "ist.recv_addr=xxx-oitmrdb01"
wsrep_provider_options = "socket.ssl_key=/etc/mysql/ssl/mariadb-server-stage.key;socket.ssl_cert=/etc/mysql/ssl/mariadb-server-stage.cer;socket.ssl_ca=/etc/mysql/ssl/SEC_Root_CA.cer"
wsrep_provider_options = "socket.checksum=2"
wsrep_provider_options = "socket.ssl_cipher=ALL:!EXP:!NULL:!ADH:!LOW:!SSLv2:!SSLv3:!MD5:!RC4:!RSA"
wsrep_provider_options = "cert.log_conflicts=ON"
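
Note on the configuration above: `wsrep_provider_options` is assigned on many separate lines. Assuming the usual option-file behaviour, in which a later assignment of the same variable replaces the earlier ones, only the last line (`cert.log_conflicts=ON`) would take effect and the SSL and EVS settings would be silently dropped. This has not been checked against the attached server.cnf. Below is a minimal Python sketch that folds such lines into the single semicolon-separated string the provider expects. The helper name `merge_provider_options` and the sample text are hypothetical and used only for illustration.

```python
import re

# Matches an active (uncommented) wsrep_provider_options assignment.
OPTION_LINE = re.compile(r'^\s*wsrep_provider_options\s*=\s*(.*)$')


def merge_provider_options(cnf_text):
    """Join every wsrep_provider_options value found in cnf_text into one line."""
    merged = []
    for line in cnf_text.splitlines():
        match = OPTION_LINE.match(line)
        if not match:
            continue
        # Drop surrounding quotes, even when one of them is missing.
        value = match.group(1).strip().strip('"\'')
        for option in value.split(';'):
            option = option.strip()
            if not option:
                continue
            key, _, val = option.partition('=')
            merged.append(f"{key.strip()}={val.strip()}")
    return 'wsrep_provider_options = "' + '; '.join(merged) + '"'


sample = '''
wsrep_provider_options = "evs.suspect_timeout = PT30S"
wsrep_provider_options = "gcache.recover = yes; gcache.size = 1G;"
wsrep_provider_options = "socket.checksum=2"
'''
print(merge_provider_options(sample))
```

The sketch only rewrites text. It does not check whether the option values are valid for the installed Galera version.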



 Comments   
Comment by Roger Eisentrager [ 2021-02-22 ]

CS0168361 - I let the client know to upgrade and retest. Java errors are still thrown (from the MariaDB connector), followed by disconnects. A new log of these errors is in the CS ticket I linked. The client really needs advice here, as moving them off Galera might not be a viable solution. I am still unsure whether this is an issue with the connector, MariaDB server, Galera, or a combination of them. This is impacting the client's rollout to production.

Comment by Roger Eisentrager [ 2021-02-22 ]

Caused by: java.sql.SQLException: Connection reset
at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.handleIoException(AbstractQueryProtocol.java:1789)
at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeQuery(AbstractQueryProtocol.java:201)
at org.mariadb.jdbc.MariaDbStatement.executeInternal(MariaDbStatement.java:328)
... 46 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:886)
at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:857)
at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
at org.mariadb.jdbc.internal.io.output.StandardPacketOutputStream.flushBuffer(StandardPacketOutputStream.java:110)
at org.mariadb.jdbc.internal.io.output.AbstractPacketOutputStream.flush(AbstractPacketOutputStream.java:172)
at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.executeQuery(AbstractQueryProtocol.java:195)

Comment by Roger Eisentrager [ 2021-02-22 ]

Error code 1789 refers to: "The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires." (1789 HY000 ER_MASTER_HAS_PURGED_REQUIRED_GTIDS). Not sure if that helps at all.
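
Note: in the Java trace in the previous comment, `1789` appears inside `(AbstractQueryProtocol.java:1789)`. In the standard Java stack-trace format that position holds the source line number of the frame, so it is probably not a server error code, and the match with ER_MASTER_HAS_PURGED_REQUIRED_GTIDS may be a coincidence. A minimal Python sketch that splits one frame into its parts follows. The helper name `parse_frame` is hypothetical, and the regular expression covers only plain `.java` frames like the ones quoted here.

```python
import re

# "at <qualified.method>(<File>.java:<line>)" as printed by Java stack traces.
FRAME = re.compile(
    r'^\s*at\s+(?P<method>[\w.$<>]+)\((?P<file>[\w$]+\.java):(?P<line>\d+)\)\s*$'
)


def parse_frame(text):
    """Return (method, source file, line number) for a frame line, else None."""
    match = FRAME.match(text)
    if match is None:
        return None
    return match.group('method'), match.group('file'), int(match.group('line'))


frame = ("at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol"
         ".handleIoException(AbstractQueryProtocol.java:1789)")
print(parse_frame(frame))
```

Under this reading, the only error reported by the driver in that trace is the `Connection reset` message itself.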

Comment by Seppo Jaakola [ 2021-03-31 ]

I tried some test scenarios on a 10.5.8 cluster using conflicting XA load, and could reproduce a crash fairly easily. However, the stack trace is not identical to the one posted in this Jira tracker:

210331 15:58:06 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.5.7-MariaDB-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=4
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63636 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fbb00000c58
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fbb45258dd8 thread_stack 0x49000
/home/seppo/work/wsrep/mariadb-server/sql/mariadbd(my_print_stacktrace+0x32)[0x55b994fadce2]
/home/seppo/work/wsrep/mariadb-server/sql/mariadbd(handle_fatal_signal+0x485)[0x55b994a2fe85]
sigaction.c:0(__restore_rt)[0x7fbb5885a3c0]
/home/seppo/work/wsrep/mariadb-server/sql/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x61f)[0x55b99482e9ef]
sql/sql_class.h:4202(THD::reset_kill_query())[0x55b994831ab6]
sql/sql_parse.cc:1348(do_command(THD*))[0x55b99492f3a8]
sql/sql_connect.cc:1410(do_handle_one_connection(CONNECT*, bool))[0x55b99492f84d]
sql/sql_connect.cc:1312(handle_one_connection)[0x55b994c8e596]
nptl/pthread_create.c:478(start_thread)[0x7fbb5884e609]
x86_64/clone.S:97(_GI__clone)[0x7fbb58422293]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 9
Status: NOT_KILLED

I suspect there are more crash issues under different XA test load scenarios. As the cluster configuration appears to at least prepare the cluster for XA transaction usage, I assume the crash in this tracker is also due to XA transactions being run on the cluster.

Later 10.5 versions reject XA transactions in a cluster, but 10.5.8 still allows this harmful cluster usage.
