[MDEV-25170] Mariadb/Galera crashes after FreeSWITCH shutdown Created: 2021-03-16  Updated: 2021-12-23  Resolved: 2021-12-23

Status: Closed
Project: MariaDB Server
Component/s: Galera, Server
Affects Version/s: 10.5.9
Fix Version/s: 10.5.13

Type: Bug Priority: Major
Reporter: Jerry Kendall Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 1
Labels: None
Environment:

FreeSWITCH v1.10 VoIP Server talking to MariaDB v10.5.9 Cluster of 2 nodes
All systems Debian 10

[galera]
wsrep_on=ON
wsrep_slave_threads=16
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://ip1,ip2"

wsrep_provider_options="gmcast.segment=88; "

binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
wsrep_cluster_name="TEST Cluster"
wsrep_node_address="IP1"
wsrep_node_name='SQL-1'
wsrep_sst_method=rsync


Issue Links:
Relates
relates to MDEV-23851 Galera assertion at lock0lock.cc line... Closed
relates to MDEV-24923 Port selected Galera conflict resolut... Closed
relates to MDEV-25579 MariaDB 10.4.18 crashes with InnoDB: ... Closed

 Description   

/var/log/mysql/error.log

2021-03-16 16:27:16 20 [ERROR] InnoDB: Conflicting lock on table: `FreeSWITCH`.`interfaces` index: GEN_CLUST_INDEX that has lock
RECORD LOCKS space id 66 page no 4 n bits 168 index GEN_CLUST_INDEX of table `FreeSWITCH`.`interfaces` trx id 2939 lock_mode X locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 10; compact format; info bits 0
0: len 6; hex 0000000016b8; asc ;;
1: len 6; hex 000000000000; asc ;;
2: len 7; hex 80000000000000; asc ;;
3: len 5; hex 74696d6572; asc timer;;
4: len 4; hex 736f6674; asc soft;;
5: len 0; hex ; asc ;;
6: len 21; hex 434f52455f534f465454494d45525f4d4f44554c45; asc CORE_SOFTTIMER_MODULE;;
7: len 25; hex 2f434f52455f534f465454494d45525f4d4f44554c452e736f; asc /CORE_SOFTTIMER_MODULE.so;;
8: len 0; hex ; asc ;;
9: len 18; hex 46532d467265655377697463682d494e2d30; asc FS-FreeSwitch-IN-0;;

Record lock, heap no 3 PHYSICAL RECORD: n_fields 10; compact format; info bits 0
0: len 6; hex 0000000016b9; asc ;;
1: len 6; hex 000000000000; asc ;;
2: len 7; hex 80000000000000; asc ;;
3: len 5; hex 636f646563; asc codec;;
4: len 24; hex 50524f585920564944454f20504153532d5448524f554748; asc PROXY VIDEO PASS-THROUGH;;
5: len 0; hex ; asc ;;
6: len 15; hex 434f52455f50434d5f4d4f44554c45; asc CORE_PCM_MODULE;;
7: len 19; hex 2f434f52455f50434d5f4d4f44554c452e736f; asc /CORE_PCM_MODULE.so;;
8: len 0; hex ; asc ;;
9: len 18; hex 46532d467265655377697463682d494e2d30; asc FS-FreeSwitch-IN-0;;

Record lock, heap no 4 PHYSICAL RECORD: n_fields 10; compact format; info bits 0
0: len 6; hex 0000000016ba; asc ;;
1: len 6; hex 000000000000; asc ;;
2: len 7; hex 80000000000000; asc ;;
3: len 5; hex 636f646563; asc codec;;
4: len 18; hex 50524f585920504153532d5448524f554748; asc PROXY PASS-THROUGH;;
5: len 0; hex ; asc ;;
6: len 15; hex 434f52455f50434d5f4d4f44554c45; asc CORE_PCM_MODULE;;
7: len 19; hex 2f434f52455f50434d5f4d4f44554c452e736f; asc /CORE_PCM_MODULE.so;;
8: len 0; hex ; asc ;;
9: len 18; hex 46532d467265655377697463682d494e2d30; asc FS-FreeSwitch-IN-0;;

....... many more of the above ------

2021-03-16 16:27:16 20 [ERROR] InnoDB: WSREP state:
2021-03-16 16:27:16 20 [ERROR] WSREP: Thread BF trx_id: 2942 thread: 20 seqno: 1901 client_state: exec client_mode: high priority transaction_mode: executing applier: 1 toi: 0 local: 0 query: delete from interfaces where type='api' and name='vm_fsdb_msg_delete' and hostname='Host1'<97>^VQ`^S^A
2021-03-16 16:27:16 20 [ERROR] WSREP: Thread BF trx_id: 2939 thread: 18 seqno: 1888 client_state: exec client_mode: high priority transaction_mode: executing applier: 1 toi: 0 local: 0 query: delete from interfaces where type='application' and name='voicemail' and hostname='Host1'<97>^VQ`^S^A
2021-03-16 16:27:16 0x7f8020336700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.5.9/storage/innobase/lock/lock0lock.cc line 674
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
210316 16:27:16 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.5.9-MariaDB-1:10.5.9+maria~buster
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=18
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467864 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f7fd80133c8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f8020335d98 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x55d08b25315e]
Printing to addr2line failed
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x55d08ace0895]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f820451f730]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f82039567bb]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f8203941535]
/usr/sbin/mariadbd(+0x62b9f4)[0x55d08a9c69f4]
/usr/sbin/mariadbd(+0x608630)[0x55d08a9a3630]
/usr/sbin/mariadbd(+0xc5d97d)[0x55d08aff897d]
/usr/sbin/mariadbd(+0xc62785)[0x55d08affd785]
/usr/sbin/mariadbd(+0xceb6e4)[0x55d08b0866e4]
/usr/sbin/mariadbd(+0xcf1702)[0x55d08b08c702]
/usr/sbin/mariadbd(+0xc1c800)[0x55d08afb7800]
/usr/sbin/mariadbd(+0xc1cad3)[0x55d08afb7ad3]
/usr/sbin/mariadbd(_ZN7handler11ha_rnd_nextEPh+0x13f)[0x55d08ace65cf]
/usr/sbin/mariadbd(_ZN14Rows_log_event8find_rowEP14rpl_group_info+0x586)[0x55d08ade9aa6]
/usr/sbin/mariadbd(_ZN21Delete_rows_log_event11do_exec_rowEP14rpl_group_info+0x99)[0x55d08ade9c49]
/usr/sbin/mariadbd(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x27f)[0x55d08adde5df]
/usr/sbin/mariadbd(_Z18wsrep_apply_eventsP3THDP14Relay_log_infoPKvm+0x1dc)[0x55d08af8339c]
/usr/sbin/mariadbd(_ZN21Wsrep_applier_service15apply_write_setERKN5wsrep7ws_metaERKNS0_12const_bufferERNS0_14mutable_bufferE+0xa4)[0x55d08af6bea4]
/usr/sbin/mariadbd(+0xf4028a)[0x55d08b2db28a]
/usr/sbin/mariadbd(_ZN5wsrep12server_state8on_applyERNS_21high_priority_serviceERKNS_9ws_handleERKNS_7ws_metaERKNS_12const_bufferE+0x15d)[0x55d08b2dbddd]
/usr/sbin/mariadbd(+0xf54104)[0x55d08b2ef104]
/usr/lib/galera/libgalera_smm.so(+0x535fb)[0x7f82034365fb]
/usr/lib/galera/libgalera_smm.so(+0x63e70)[0x7f8203446e70]
/usr/lib/galera/libgalera_smm.so(+0x77e42)[0x7f820345ae42]
/usr/lib/galera/libgalera_smm.so(+0x78396)[0x7f820345b396]
/usr/lib/galera/libgalera_smm.so(+0x66d53)[0x7f8203449d53]
/usr/lib/galera/libgalera_smm.so(+0x4248b)[0x7f820342548b]
/usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0xe)[0x55d08b2efdfe]
/usr/sbin/mariadbd(+0xbea623)[0x55d08af85623]
/usr/sbin/mariadbd(_Z15start_wsrep_THDPv+0x2cf)[0x55d08af769cf]
/usr/sbin/mariadbd(+0xb6e74b)[0x55d08af0974b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3)[0x7f8204514fa3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f8203a184cf]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f81f4197123): delete from interfaces where type='api' and name='vm_fsdb_msg_delete' and hostname='Tor01'

Connection ID (thread ID): 20
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 39861 39861 processes
Max open files 16384 16384 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 39861 39861 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Core pattern: core



 Comments   
Comment by Jerry Kendall [ 2021-03-16 ]

Update....

I changed the 'wsrep_slave_threads' from 16 to 1 and the crashing has improved - still doing repeated start/stop on the FreeSWITCH server.

I had set the wsrep_slave-threads to 16 as each mariadb server has 4 cores and everything I found indicates 4x #cores....

anyone have an official opinion on this setting ?

Comment by Jerry Kendall [ 2021-03-17 ]

So, after 12 hours of constant start/stop of the FreeSWITCH server with wsrep_slave_threads set to '1', the DB cluster does not crash.

Anyone have any comments as to what the wsrep_slave_threads should be set to for a server with 12GB RAM and 4 Cores ?

Comment by Jerry Kendall [ 2021-03-17 ]

So, after 16 hours of of constant start/stop of the FreeSWITCH server with wsrep_slave_threads set to '1', the DB cluster does not crash.

I then set wsrep_slave_threads set to '2' and the DB crashes after only 10 minutes of start/stop freeswitch.....

Clearly something not quite right here.

Comment by Jerry Kendall [ 2021-03-17 ]

So, here is the latest galera.cnf file

[galera]
wsrep_on=ON
wsrep_slave_threads=1
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://IP1,IP2,IP3"

wsrep_provider_options="gmcast.segment=88; gcache.size=1G;"
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
wsrep_cluster_name="VoIP TEST Cluster"
wsrep_node_address="IP2"
wsrep_node_name='SQL-3'
wsrep_sst_method=rsync

[mysqld]

  1. InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
  2. Read the manual for more InnoDB related options. There are many!
  3. Most important is to give InnoDB 80 % of the system RAM for buffer use:
  4. https://mariadb.com/kb/en/innodb-system-variables/#innodb_buffer_pool_size
    innodb_buffer_pool_size = 6G
    innodb_read_io_threads=4
    innodb_write_io_threads=4
    innodb_log_buffer_size=128M
Comment by Jerry Kendall [ 2021-03-17 ]

Hosts are all the same,

Debian 10
12GB Ram
4 Core
250GB STD Platter disk
10GB NIC

Comment by Jerry Kendall [ 2021-03-17 ]

When FreeSWITCH is shutdown (systemctl stop freeswitch), it removes a few records from a few tables.

Comment by Jerry Kendall [ 2021-03-18 ]

is there are formula for these params?

innodb_buffer_pool_size
innodb_read_io_threads
innodb_write_io_threads
innodb_log_buffer_size

and

wsrep_slave_threads

Comment by Jerry Kendall [ 2021-03-18 ]

Also,

with wsrep_slave_threads=2, it regularly crashed the DB when FreeSWITCH was stopping.

Comment by Nathan Neulinger [ 2021-04-04 ]

I'm seeing the exact same failure after upgrading cluster from 10.5.5 to 10.5.9.

I have been running this and a whole bunch of other clusters with wsrep_slave_threads=8 and just started noticing this issue after it failed two nights in a row. Unclear what is triggering it, but same trace including the same triggering table.

Comment by Marko Mäkelä [ 2021-05-03 ]

Starting with MDEV-24923, the function wsrep_assert_no_bf_bf_wait() that would intentionally kill the server would only be available in debug builds.

Generated at Thu Feb 08 09:35:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.