[MDEV-26099] MariaDB 10.5.10/10.5.11 Galera assertion crash Created: 2021-07-06  Updated: 2021-10-07  Resolved: 2021-10-07

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.10
Fix Version/s: 10.5.13

Type: Bug Priority: Critical
Reporter: Enrico Kern Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File coredump.maria.gz     File mysqld.log    
Issue Links:
Problem/Incident
is caused by MDEV-25114 Crash: WSREP: invalid state ROLLED_BA... Closed

 Description   

With MariaDB 10.5.10 we have recently been seeing a lot of full cluster crashes with assertion errors. The workload didn't really change much. The write queries we identified mainly update a minor session table or cronjob tables; none of them is anything complex.

Error in mysql log:

mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.5.10-MariaDB
key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=0
max_threads=502
thread_count=5
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 350423 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x40000
??:0(my_print_stacktrace)[0x55b9447ee45e]
??:0(handle_fatal_signal)[0x55b94427db75]
sigaction.c:0(__restore_rt)[0x7fb9cfd00b20]
:0(__GI_raise)[0x7fb9cf04537f]
:0(__GI_abort)[0x7fb9cf02fdb5]
loadmsgcat.c:0(_nl_load_domain.cold.0)[0x7fb9cf02fc89]
assert.c:0(.annobin_assert.c_end)[0x7fb9cf03da76]
??:0(wsrep_bf_abort(THD const*, THD*))[0x55b94451b903]
??:0(wsrep_thd_bf_abort)[0x55b94452348f]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55b94454a78c]
??:0(handle_manager)[0x55b94407c5bb]
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x55b9444a07aa]
pthread_create.c:0(start_thread)[0x7fb9cfcf614a]
:0(__GI___clone)[0x7fb9cf10adc3]

Coredump is attached. I see in gdb:

#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1  0x000055b9447ee257 in my_write_core (sig=sig@entry=6) at /usr/src/debug/MariaDB-/src_0/mysys/stacktrace.c:424
#2  0x000055b94427db90 in handle_fatal_signal (sig=6) at /usr/src/debug/MariaDB-/src_0/sql/signal_handler.cc:343
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5  0x00007fb9cf02fdb5 in __GI_abort () at abort.c:79
#6  0x00007fb9cf02fc89 in __assert_fail_base (fmt=0x7fb9cf1985f8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55b944a1f138 "mode_ == m_local || transaction_.is_streaming()",
    file=0x55b9448acf18 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.5.10/wsrep-lib/include/wsrep/client_state.hpp", line=668, function=<optimized out>)
    at assert.c:92
#7  0x00007fb9cf03da76 in __GI___assert_fail (assertion=assertion@entry=0x55b944a1f138 "mode_ == m_local || transaction_.is_stream
#8  0x000055b015144903 in wsrep::client_state::bf_abort (bf_seqno=..., this=0x7f3570006ea0) at /usr/src/debug/MariaDB-/src_0/sql/sql_class.h:4828
#9  wsrep_bf_abort (bf_thd=0x7f356c000c58, bf_thd@entry=0x88e8d6fb, victim_thd=victim_thd@entry=0x7f3570000c58) at /usr/src/debug/MariaDB-/src_0/sql/wsrep_thd.cc:362
#10 0x000055b01514c48f in wsrep_thd_bf_abort (bf_thd=0x88e8d6fb, bf_thd@entry=0x7f356c000c58, victim_thd=victim_thd@entry=0x7f3570000c58, signal=<optimized out>)
    at /usr/src/debug/MariaDB-/src_0/sql/service_wsrep.cc:222
#11 0x000055b01517378c in bg_wsrep_kill_trx (void_arg=0x88e8d6fb) at /usr/src/debug/MariaDB-/src_0/storage/innobase/handler/ha_innodb.cc:18846
#12 0x0000000000000000 in ?? ()

my.cnf:

 
[client]
port = 3306
socket = /var/lib/mysql/mysql.sock
ssl-ca = /etc/mysql/ssl/ca-cert.pem
ssl-cert = /etc/mysql/ssl/client-cert.pem
ssl-key = /etc/mysql/ssl/client-key.pem
 
[isamchk]
key_buffer = 16M
key_buffer_size = 16M
 
[mysqld]
basedir = /usr
bind-address = ::
binlog_format = row
binlog_row_image = minimal
core_file
datadir = /var/lib/mysql
default_storage_engine = InnoDB
expire_logs_days = 14
ft_min_word_len = 3
general_log = 0
general_log_file = /var/log/mysql/general.log
gtid_domain_id = 23236823
gtid_strict_mode = ON
ignore_db_dirs = lost+found
innodb_autoinc_lock_mode = 2
innodb_buffer_pool_size = 3908M
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = fsync
innodb_io_capacity = 1000
innodb_io_capacity_max = 2500
innodb_log_buffer_size = 64M
innodb_log_file_size = 1024M
innodb_open_files = 8192
innodb_purge_threads = 1
innodb_thread_concurrency = 0
key_buffer_size = 16M
local-infile = 0
log-error = /var/log/mysql/mysqld.log
log_slave_updates
log_warnings = 1
long_query_time = 5
max_allowed_packet = 1G
max_binlog_size = 100M
max_connect_errors = 5000
max_connections = 500
net_read_timeout = 240
open_files_limit = 32768
performance_schema = ON
pid-file = /var/lib/mysql/mysqld.pid
plugin_load_add = server_audit
port = 3306
query_cache_size = 0
query_cache_type = 0
read_buffer_size = 128k
server-id = 23236823
server_audit = FORCE_PLUS_PERMANENT
server_audit_events = CONNECT,QUERY,TABLE
server_audit_excl_users = mysqld_exporter,
server_audit_file_path = server_audit.log
server_audit_logging = false
server_audit_output_type = SYSLOG
server_audit_query_log_limit = 4096
server_audit_syslog_facility = LOG_LOCAL6
server_audit_syslog_ident = mysql-server_auditing
server_audit_syslog_info = db04vp.payback.noris.de
server_audit_syslog_priority = LOG_INFO
skip-external-locking
skip_name_resolve = 1
skip_show_database = 1
slow_query_log = 0
slow_query_log_file = /var/log/mysql/mariadb-slow.log
socket = /var/lib/mysql/mysql.sock
sort_buffer_size = 512k
ssl
ssl-ca = /etc/mysql/ssl/ca.pem
ssl-cert = /etc/mysql/ssl/server-cert.pem
ssl-key = /etc/mysql/ssl/server-key.pem
sync_binlog = 1
thread_cache_size = 8
thread_stack = 256K
tmpdir = /tmp
user = mysql
wsrep_auto_increment_control = 1
wsrep_cluster_address = ###STRIPPED###
wsrep_cluster_name = UniqueClusterName
wsrep_gtid_domain_id = 0
wsrep_gtid_mode = ON
wsrep_node_address = ###STRIPPED###
wsrep_node_name = ###STRIPPED###
wsrep_on
wsrep_provider = /usr/lib64/galera-4/libgalera_smm.so
wsrep_provider_options = gcache.size=10G;gcache.name=/var/lib/mysql/galera.cache;gcs.fc_limit=20;gcs.fc_factor=0.8;
wsrep_slave_threads = 4
wsrep_sst_auth = sst_user:###STRIPPED###
wsrep_sst_method = mariabackup
 
[mysqld-5.0]
myisam-recover = BACKUP
 
[mysqld-5.1]
myisam-recover = BACKUP
 
[mysqld-5.5]
myisam-recover = BACKUP
query_cache_limit = 1M
query_cache_size = 16M
 
[mysqld-5.6]
myisam-recover-options = BACKUP
query_cache_limit = 1M
query_cache_size = 16M
 
[mysqld-5.7]
myisam-recover-options = BACKUP
query_cache_limit = 1M
query_cache_size = 16M
 
[mysqld_safe]
log-error = /var/log/mysql/mysqld.log
nice = 0
pid-file = /var/lib/mysql/mysqld.pid
socket = /var/lib/mysql/mysql.sock
 
[mysqldump]
max_allowed_packet = 1G
quick = true
quote-names = true



 Comments   
Comment by Alice Sherepa [ 2021-07-06 ]

Could you please try 10.5.11 - MDEV-25551 is probably the same issue (fixed in 10.6.2, 10.4.20, 10.5.11)

Comment by Enrico Kern [ 2021-07-06 ]

I will try. Thank you. Sorry that I missed that.

Comment by Enrico Kern [ 2021-07-08 ]

The upgrade to 10.5.11 didn't help. The issue is still present. I will try to get a new coredump and upload it here.

Comment by Enrico Kern [ 2021-07-09 ]

It is basically still the same. Are you sure that it got fixed in 10.5.11?

(gdb) backtrace
#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1  0x000055f5e0dbc117 in my_write_core (sig=sig@entry=6) at /usr/src/debug/MariaDB-/src_0/mysys/stacktrace.c:424
#2  0x000055f5e084bf30 in handle_fatal_signal (sig=6) at /usr/src/debug/MariaDB-/src_0/sql/signal_handler.cc:343
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5  0x00007f6825093db5 in __GI_abort () at abort.c:79
#6  0x00007f6825093c89 in __assert_fail_base (fmt=0x7f68251fc5f8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55f5e0fed258 "mode_ == m_local || transaction_.is_streaming()", 
    file=0x55f5e0e7af18 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.5.11/wsrep-lib/include/wsrep/client_state.hpp", line=668, function=<optimized out>)
    at assert.c:92
#7  0x00007f68250a1a76 in __GI___assert_fail (assertion=assertion@entry=0x55f5e0fed258 "mode_ == m_local || transaction_.is_streaming()", 
    file=file@entry=0x55f5e0e7af18 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.5.11/wsrep-lib/include/wsrep/client_state.hpp", line=line@entry=668, 
    function=function@entry=0x55f5e0fed540 <wsrep::client_state::bf_abort(wsrep::seqno)::__PRETTY_FUNCTION__> "int wsrep::client_state::bf_abort(wsrep::seqno)") at assert.c:101
#8  0x000055f5e0ae9df3 in wsrep::client_state::bf_abort (bf_seqno=..., this=0x7f6580006eb0) at /usr/src/debug/MariaDB-/src_0/sql/sql_class.h:4825
#9  wsrep_bf_abort (bf_thd=0x7f6444000c58, bf_thd@entry=0x89fb8ea1, victim_thd=victim_thd@entry=0x7f6580000c58) at /usr/src/debug/MariaDB-/src_0/sql/wsrep_thd.cc:362
#10 0x000055f5e0af197f in wsrep_thd_bf_abort (bf_thd=0x89fb8ea1, bf_thd@entry=0x7f6444000c58, victim_thd=victim_thd@entry=0x7f6580000c58, signal=<optimized out>)
    at /usr/src/debug/MariaDB-/src_0/sql/service_wsrep.cc:222
#11 0x000055f5e0b1864c in bg_wsrep_kill_trx (void_arg=0x89fb8ea1) at /usr/src/debug/MariaDB-/src_0/storage/innobase/handler/ha_innodb.cc:18768
#12 0x0000000000000000 in ?? ()
(gdb) quit

The actual coredump is too big to upload here, even zipped. I guess I need to figure out how to replicate the crash behavior.

Comment by Mario Karuza (Inactive) [ 2021-07-15 ]

Hi,

Can you enable WSREP debug mode with wsrep_debug=3 and try to reproduce the assertion?
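
For reference, a minimal sketch of how the requested setting could be added to the my.cnf shown in the description. It assumes the numeric value 3 is accepted as written in the request above, and that the option belongs in the existing [mysqld] section next to the other wsrep_* options; please verify both against the documentation for your server version.

```ini
[mysqld]
# Assumption: value taken verbatim from the request above (wsrep_debug=3).
# Expect noticeably more output in the error log
# (log-error = /var/log/mysql/mysqld.log in the posted config).
wsrep_debug = 3
```

A setting in my.cnf only takes effect after the node is restarted. Remember to remove it again once the assertion has been captured, since the extra logging can be verbose.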

Comment by Enrico Kern [ 2021-07-27 ]

Sorry for the late response. The downgrade to 10.5.9 solved the issue for now, so there must be some breaking change between 10.5.9 and 10.5.10 in the new Galera plugin.

I wasn't able to reproduce a crash in time with the debug parameter. We will create a new cluster with 10.5.11 and copy traffic to it to reproduce the crashes; then I can hopefully provide the requested information.

Comment by Gabor Orosz [ 2021-09-15 ]

Hei!

We are in the same situation after upgrading from 10.4.17 (+ MDEV-23851 patch) to 10.5.12. I've uploaded some excerpts from our logs that contain several occurrences; the recent ones were captured with wsrep_debug set to 3. To me, this seems to be the very same issue that got fixed with MDEV-23851.
I'll try to put together a simple reproducer and examine the cores while setting up a build environment to produce builds from the upstream code base (we are using SLES 15 SP3 packages) for further testing.
If you need further details for the investigation or have any kind of request, please just say so here.

Thanks and regards,
GOro

Comment by Danilo Spinella [ 2021-09-30 ]

Hi there! Is there any plan to work on this issue in the near future?

Comment by Jan Lindström (Inactive) [ 2021-09-30 ]

danyspin97 It will be fixed in MDEV-25114.

Comment by Gabor Orosz [ 2021-09-30 ]

Hi,

My observation, at least in our case, is that the issue concerns transactions that try to modify tables related through a foreign key constraint defined over a multi-byte (UTF-8) character column. Similar issues that I've found on this topic are MDEV-26298, MDEV-26518 and maybe MDEV-26177.

Best regards,
GOro
