[MDEV-19791] MariaDB Galera4 Crashes Created: 2019-06-18  Updated: 2020-12-08  Resolved: 2019-08-28

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.5
Fix Version/s: 10.3.17, 10.4.7

Type: Bug Priority: Major
Reporter: Tibor Korocz Assignee: Stepan Patryshev (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

ec2, ubuntu



 Description   

Hi,

I have 3 node MariDB Galera 4 cluster for testing.
I was running sysbench trough ProxySQL.
All 3 nodes were in the same hostgroup to loadbalance the queries on all the node. When a deadlock appears 2 nodes were crashing out of tree and the cluster goes into a non-Primary sate.

The whole point of the test was to see what happens if we are writing all the nodes, looks like the cluster dies.

2019-06-18  6:01:01 0 [Note] WSREP: Writing down CC checksum: 09e5753c 59fa600d 2e63f8a0 053de98b at offset 176
2019-06-18  6:01:01 0 [Note] WSREP: Flow-control interval: [23, 23]
2019-06-18  6:01:01 0 [Note] WSREP: Trying to continue unpaused monitor
2019-06-18  6:01:01 12 [Note] WSREP: ####### processing CC 187746, local, ordered
2019-06-18  6:01:01 12 [Note] WSREP: ####### drain monitors upto 187745
2019-06-18  6:01:01 12 [Note] WSREP: REPL Protocols: 10 (5, 3)
2019-06-18  6:01:01 12 [Note] WSREP: ####### My UUID: 269e2eb0-918e-11e9-8882-cb66bb34dda1
2019-06-18  6:01:01 12 [Note] WSREP: ####### ST not required
2019-06-18  6:01:01 12 [Note] WSREP: Skipping cert index reset
2019-06-18  6:01:01 12 [Note] WSREP: ####### Adjusting cert position: 187745 -> 187746
2019-06-18  6:01:01 0 [Note] WSREP: Service thread queue flushed.
2019-06-18  6:01:01 12 [Note] WSREP: ####### Setting monitor position to 187746
2019-06-18  6:01:01 12 [Note] WSREP: Lowest cert indnex boundary for CC from group: 187726
2019-06-18  6:01:01 12 [Note] WSREP: Min available from gcache for CC from group: 187593
2019-06-18  6:01:01 12 [Note] WSREP: ================================================
View:
  id: 7fdaf3de-9147-11e9-84de-22f8161d0a7c:187746
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(2):
	0: 269e2eb0-918e-11e9-8882-cb66bb34dda1, Node2
	1: 96783e50-9154-11e9-ab2b-560064419d5a, Node3
=================================================
2019-06-18  6:01:01 12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
terminate called after throwing an instance of 'wsrep::runtime_error'
  what():  client_state: Unallowed state transition: 2 -> 0
190618  6:01:01 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.4.5-MariaDB-1:10.4.5+maria~bionic-log
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=7
max_threads=102
thread_count=20
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 760240 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f804c01d018
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f8378214dd8 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x557c294232ce]
/usr/sbin/mysqld(handle_fatal_signal+0x515)[0x557c28e9e275]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f838f283890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f838e593e97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f838e595801]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f838ed70957]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ab6)[0x7f838ed76ab6]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92af1)[0x7f838ed76af1]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d24)[0x7f838ed76d24]
/usr/sbin/mysqld(_ZN5wsrep12client_state5stateERNS_11unique_lockINS_5mutexEEENS0_5stateE+0x1f6)[0x557c29490046]
/usr/sbin/mysqld(_ZN5wsrep12client_state7cleanupEv+0x60)[0x557c294904e0]
/usr/sbin/mysqld(_ZN3THD7cleanupEv+0x204)[0x557c28c467d4]
/usr/sbin/mysqld(_ZN3THD11change_userEv+0x17d)[0x557c28c472cd]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x1260)[0x557c28c9b080]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x148)[0x557c28c9c8f8]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x2c2)[0x557c28d764c2]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x557c28d7658d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f838f2786db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f838e67688f]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0):
Connection ID (thread ID): 69
Status: KILL_CONNECTION
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
 
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             63501                63501                processes
Max open files            16364                16364                files
Max locked memory         16777216             16777216             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63501                63501                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/share/apport/apport %p %s %c %d %P
 
2019-06-18  6:01:10 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: 7fdaf3de-9147-11e9-84de-22f8161d0a7c:187784
wsrep loader: [INFO] wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
wsrep loader: [INFO] wsrep_load(): Galera 26.4.2(r4498) by Codership Oy <info@codership.com> loaded successfully.
2019-06-18  6:01:10 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2019-06-18  6:01:10 0 [Note] WSREP: Found saved state: 7fdaf3de-9147-11e9-84de-22f8161d0a7c:-1, safe_to_bootstrap: 0
2019-06-18  6:01:10 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: 7fdaf3de-9147-11e9-84de-22f8161d0a7c
Seqno: -1 - -1
Offset: -1
Synced: 0
2019-06-18  6:01:10 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 7fdaf3de-9147-11e9-84de-22f8161d0a7c, offset: -1
2019-06-18  6:01:10 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2019-06-18  6:01:10 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2019-06-18  6:01:10 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 187593-187785
2019-06-18  6:01:10 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (     0/408960 bytes) complete.
2019-06-18  6:01:10 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (408960/408960 bytes) complete.
2019-06-18  6:01:10 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 0/193 locked buffers
2019-06-18  6:01:10 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 408960/134217728
2019-06-18  6:01:10 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.30.2.150; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum =
2019-06-18  6:01:10 0 [Note] WSREP: Start replication
2019-06-18  6:01:10 0 [Note] WSREP: Connecting with bootstrap option: 0
2019-06-18  6:01:10 0 [Note] WSREP: Setting GCS initial position to 7fdaf3de-9147-11e9-84de-22f8161d0a7c:187784
2019-06-18  6:01:10 0 [Note] WSREP: protonet asio version 0
2019-06-18  6:01:10 0 [Note] WSREP: Using CRC-32C for message checksums.
2019-06-18  6:01:10 0 [Note] WSREP: backend: asio
2019-06-18  6:01:10 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2019-06-18  6:01:10 0 [Note] WSREP: restore pc from disk successfully
2019-06-18  6:01:10 0 [Note] WSREP: GMCast version 0
2019-06-18  6:01:10 0 [Note] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2019-06-18  6:01:10 0 [Note] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2019-06-18  6:01:10 0 [Note] WSREP: EVS version 1
2019-06-18  6:01:10 0 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer '172.30.2.167:,172.30.2.150:,172.30.2.68:'
2019-06-18  6:01:10 0 [Note] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') connection established to 269e2eb0 tcp://172.30.2.150:4567
2019-06-18  6:01:10 0 [Warning] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') address 'tcp://172.30.2.150:4567' points to own listening address, blacklisting
2019-06-18  6:01:10 0 [Note] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') connection established to 646e290e tcp://172.30.2.167:4567
2019-06-18  6:01:10 0 [Note] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2019-06-18  6:01:10 0 [Note] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') connection established to 96783e50 tcp://172.30.2.68:4567
2019-06-18  6:01:11 0 [Note] WSREP: EVS version upgrade 0 -> 1
2019-06-18  6:01:11 0 [Note] WSREP: declaring 646e290e at tcp://172.30.2.167:4567 stable
2019-06-18  6:01:11 0 [Note] WSREP: declaring 96783e50 at tcp://172.30.2.68:4567 stable
2019-06-18  6:01:11 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2019-06-18  6:01:11 0 [Note] WSREP: view(view_id(NON_PRIM,269e2eb0,34) memb {
	269e2eb0,0
	646e290e,0
	96783e50,0
} joined {
} left {
} partitioned {
	269e2eb0,0
})
2019-06-18  6:01:11 0 [Warning] WSREP: node uuid: 269e2eb0 last_prim(type: 3, uuid: 269e2eb0) is inconsistent to restored view(type: V_NON_PRIM, uuid: 269e2eb0
2019-06-18  6:01:11 0 [Note] WSREP: gcomm: connected
2019-06-18  6:01:11 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2019-06-18  6:01:11 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2019-06-18  6:01:11 0 [Note] WSREP: Opened channel 'galera_cluster'
2019-06-18  6:01:11 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 3
2019-06-18  6:01:11 0 [Note] WSREP: Writing down CC checksum: 90b061aa c20b88bc 88b0f096 8dd8ac7b at offset 256
2019-06-18  6:01:11 0 [Note] WSREP: Flow-control interval: [28, 28]
2019-06-18  6:01:11 0 [Note] WSREP: Trying to continue unpaused monitor
2019-06-18  6:01:11 0 [Note] WSREP: Received NON-PRIMARY.
2019-06-18  6:01:11 1 [Note] WSREP: Starting rollbacker thread 1
2019-06-18  6:01:11 2 [Note] WSREP: Starting post rollbacker thread 2
2019-06-18  6:01:11 3 [Note] WSREP: Starting applier thread 3
2019-06-18  6:01:11 3 [Note] WSREP: ####### processing CC -1, local, unordered
2019-06-18  6:01:11 3 [Note] WSREP: ####### drain monitors upto -1
2019-06-18  6:01:11 3 [Note] WSREP: ####### My UUID: 269e2eb0-918e-11e9-8883-cb66bb34dda1
2019-06-18  6:01:11 3 [Note] WSREP: Server Node2 connected to cluster at position 7fdaf3de-9147-11e9-84de-22f8161d0a7c:-1 with ID 269e2eb0-918e-11e9-8883-cb66bb34dda1
2019-06-18  6:01:11 3 [Note] WSREP: Server status change disconnected -> connected
2019-06-18  6:01:11 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-06-18  6:01:11 3 [Note] WSREP: ####### ST not required
2019-06-18  6:01:11 3 [Note] WSREP: ================================================
View:
  id: 7fdaf3de-9147-11e9-84de-22f8161d0a7c:-1
  status: non-primary
  protocol_version: -1
  capabilities:
  final: no
  own_index: 0
  members(3):
	0: 269e2eb0-918e-11e9-8883-cb66bb34dda1, Node2
	1: 646e290e-915a-11e9-a9fa-ca11152ebb52, unspecified
	2: 96783e50-9154-11e9-ab2b-560064419d5a, unspecified
=================================================
2019-06-18  6:01:11 3 [Note] WSREP: Non-primary view
2019-06-18  6:01:11 3 [Note] WSREP: Server status change connected -> connected
2019-06-18  6:01:11 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-06-18  6:01:11 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-06-18  6:01:13 0 [Note] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') connection to peer 269e2eb0 with addr tcp://172.30.2.150:4567 timed out, no messages seen in PT3S
2019-06-18  6:01:13 0 [Note] WSREP: (269e2eb0, 'tcp://0.0.0.0:4567') turning message relay requesting off

Sysbench command:

sysbench \
                /usr/share/sysbench/oltp_write_only.lua \
                --db-driver=mysql \
                --db-ps-mode=disable \
                --skip_trx=off \
                --mysql-user=sysbuser \
                --mysql-password=xxxxx \
                --mysql-db=sysbench \
                --mysql-ignore-errors \
                --mysql-port=6033 \
                --mysql-host=127.0.0.1 \
                --tables=1 \
                --table_size=100 \
                --report-interval=1 \
                --threads=3 \
                --time=60 \
                --forced-shutdown=1 \
                --events=0 \
                --debug \
                --db-debug \
                run



 Comments   
Comment by Jan Lindström (Inactive) [ 2019-07-09 ]

Can you provide full error log from nodes that fail and their configuration. I did run oltp_read_write to 2 node galera cluster for hours and did not see any problem. But then I did not use load balance. Does it send writes to several nodes ?

Comment by Dimitri Gavrilov [ 2019-07-11 ]

Hello,

We have exactly same behaviour while upgrading 3 nodes MariaDB cluster from version 10.3.16 (1:10.3.16+maria~stretch) to 10.4.6 (1:10.4.6+maria~stretch). We have tried to do rolling upgrade and after upgrading first node noticed that it started to rejoin cluster every few seconds:

2019-07-11 14:04:27 0 [Note] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') connection to peer fbf8461e with addr tcp://172.21.0.35:4567 timed out, no messages seen in PT3S
2019-07-11 14:04:28 0 [Note] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') turning message relay requesting off
terminate called after throwing an instance of 'wsrep::runtime_error'
  what():  client_state: Unallowed state transition: 2 -> 0
190711 14:04:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.6-MariaDB-1:10.4.6+maria~stretch
key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=2
max_threads=386
thread_count=11
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 865773 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7ff8f80009a8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7ff9840b8cf8 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55d54de5184e]
/usr/sbin/mysqld(handle_fatal_signal+0x3af)[0x55d54d8e256f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7ff98901d0e0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7ff9876bcfff]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7ff9876be42a]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7ff987dbe0ad]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8f066)[0x7ff987dbc066]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8f0b1)[0x7ff987dbc0b1]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8f2c9)[0x7ff987dbc2c9]
/usr/sbin/mysqld(_ZN5wsrep12client_state5stateERNS_11unique_lockINS_5mutexEEENS0_5stateE+0x1eb)[0x55d54deddd5b]
/usr/sbin/mysqld(_ZN5wsrep12client_state7cleanupEv+0x60)[0x55d54dede200]
/usr/sbin/mysqld(_ZN3THD7cleanupEv+0x204)[0x55d54d682ed4]
/usr/sbin/mysqld(_ZN3THD11change_userEv+0x17d)[0x55d54d683a2d]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x18d4)[0x55d54d6d80c4]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x11c)[0x55d54d6d921c]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x25a)[0x55d54d7b578a]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55d54d7b58cd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7ff9890134a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ff987772d0f]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): 
Connection ID (thread ID): 17
Status: KILL_CONNECTION
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
 
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /opt/mariadb
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             15728                15728                processes 
Max open files            16364                16364                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       15728                15728                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
 
2019-07-11 14:04:40 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: bfbe369e-a3c6-11e9-839d-17b5ea93774f:20
wsrep loader: [INFO] wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
wsrep loader: [INFO] wsrep_load(): Galera 26.4.2(r4498) by Codership Oy <info@codership.com> loaded successfully.
2019-07-11 14:04:40 0 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2019-07-11 14:04:40 0 [Note] WSREP: Found saved state: bfbe369e-a3c6-11e9-839d-17b5ea93774f:-1, safe_to_bootstrap: 0
2019-07-11 14:04:40 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: bfbe369e-a3c6-11e9-839d-17b5ea93774f
Seqno: -1 - -1
Offset: -1
Synced: 0
2019-07-11 14:04:40 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: bfbe369e-a3c6-11e9-839d-17b5ea93774f, offset: -1
2019-07-11 14:04:40 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2019-07-11 14:04:40 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2019-07-11 14:04:40 0 [Note] WSREP: Recovering GCache ring buffer: didn't recover any events.
2019-07-11 14:04:40 0 [Note] WSREP: Passing config to GCS: base_dir = /opt/mariadb/; base_host = 172.21.0.35; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /opt/mariadb/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false
2019-07-11 14:04:40 0 [Note] WSREP: GCache history reset: old(bfbe369e-a3c6-11e9-839d-17b5ea93774f:0 -> bfbe369e-a3c6-11e9-839d-17b5ea93774f:20
2019-07-11 14:04:40 0 [Note] WSREP: Start replication
2019-07-11 14:04:40 0 [Note] WSREP: Connecting with bootstrap option: 0
2019-07-11 14:04:40 0 [Note] WSREP: Setting GCS initial position to bfbe369e-a3c6-11e9-839d-17b5ea93774f:20
2019-07-11 14:04:40 0 [Note] WSREP: protonet asio version 0
2019-07-11 14:04:40 0 [Note] WSREP: Using CRC-32C for message checksums.
2019-07-11 14:04:40 0 [Note] WSREP: backend: asio
2019-07-11 14:04:40 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2019-07-11 14:04:40 0 [Note] WSREP: restore pc from disk successfully
2019-07-11 14:04:40 0 [Note] WSREP: GMCast version 0
2019-07-11 14:04:40 0 [Note] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2019-07-11 14:04:40 0 [Note] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2019-07-11 14:04:40 0 [Note] WSREP: EVS version 1
2019-07-11 14:04:40 0 [Note] WSREP: gcomm: connecting to group 'fc_r02_wsrep_cluster', peer '172.21.0.33:,172.21.0.34:,172.21.0.35:'
2019-07-11 14:04:40 0 [Note] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') connection established to fbf8461e tcp://172.21.0.35:4567
2019-07-11 14:04:40 0 [Warning] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') address 'tcp://172.21.0.35:4567' points to own listening address, blacklisting
2019-07-11 14:04:40 0 [Note] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') connection established to fe046845 tcp://172.21.0.33:4567
2019-07-11 14:04:40 0 [Note] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2019-07-11 14:04:40 0 [Note] WSREP: (fbf8461e, 'tcp://0.0.0.0:4567') connection established to df5acba8 tcp://172.21.0.34:4567
2019-07-11 14:04:40 0 [Note] WSREP: declaring df5acba8 at tcp://172.21.0.34:4567 stable
2019-07-11 14:04:40 0 [Note] WSREP: declaring fe046845 at tcp://172.21.0.33:4567 stable
2019-07-11 14:04:40 0 [Note] WSREP: Node df5acba8 state prim
2019-07-11 14:04:40 0 [Note] WSREP: view(view_id(PRIM,df5acba8,112) memb {
	df5acba8,0
	fbf8461e,0
	fe046845,0
} joined {
} left {
} partitioned {
})
2019-07-11 14:04:40 0 [Note] WSREP: save pc into disk
2019-07-11 14:04:40 0 [Note] WSREP: clear restored view
2019-07-11 14:04:41 0 [Note] WSREP: gcomm: connected
2019-07-11 14:04:41 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2019-07-11 14:04:41 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2019-07-11 14:04:41 0 [Note] WSREP: Opened channel 'fc_r02_wsrep_cluster'
2019-07-11 14:04:41 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2019-07-11 14:04:41 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2019-07-11 14:04:41 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 10037f9e-a3d4-11e9-9806-d732cdfbc239
2019-07-11 14:04:41 0 [Note] WSREP: STATE EXCHANGE: got state msg: 10037f9e-a3d4-11e9-9806-d732cdfbc239 from 0 (fc-r02-mariadb-02)
2019-07-11 14:04:41 0 [Note] WSREP: STATE EXCHANGE: got state msg: 10037f9e-a3d4-11e9-9806-d732cdfbc239 from 2 (fc-r02-mariadb-01)
2019-07-11 14:04:41 1 [Note] WSREP: Starting rollbacker thread 1
2019-07-11 14:04:41 2 [Note] WSREP: Starting post rollbacker thread 2
2019-07-11 14:04:41 3 [Note] WSREP: Starting applier thread 3
2019-07-11 14:04:41 0 [Note] WSREP: STATE EXCHANGE: got state msg: 10037f9e-a3d4-11e9-9806-d732cdfbc239 from 1 (fc-r02-mariadb-03)
2019-07-11 14:04:41 0 [Note] WSREP: Quorum results:
	version    = 4,
	component  = PRIMARY,
	conf_id    = 30,
	members    = 3/3 (joined/total),
	act_id     = 20,
	last_appl. = 0,
	protocols  = 0/9/3 (gcs/repl/appl),
	vote policy= 1,
	group UUID = bfbe369e-a3c6-11e9-839d-17b5ea93774f
2019-07-11 14:04:41 0 [Note] WSREP: Writing down CC checksum: e15e8fc7 ebc14909 bc09f31a 6ba21824 at offset 288
2019-07-11 14:04:41 0 [Note] WSREP: Flow-control interval: [28, 28]
2019-07-11 14:04:41 0 [Note] WSREP: Trying to continue unpaused monitor
2019-07-11 14:04:41 0 [Note] WSREP: Restored state OPEN -> JOINED (20)
2019-07-11 14:04:41 3 [Note] WSREP: ####### processing CC 20, local, unordered
2019-07-11 14:04:41 3 [Note] WSREP: ####### drain monitors upto -1
2019-07-11 14:04:41 3 [Note] WSREP: REPL Protocols: 9 (4, 2)
2019-07-11 14:04:41 3 [Note] WSREP: ####### My UUID: fbf8461e-a3d3-11e9-afe5-52552c33620b
2019-07-11 14:04:41 3 [Note] WSREP: Server fc-r02-mariadb-03 connected to cluster at position bfbe369e-a3c6-11e9-839d-17b5ea93774f:20 with ID fbf8461e-a3d3-11e9-afe5-52552c33620b
2019-07-11 14:04:41 3 [Note] WSREP: Server status change disconnected -> connected
2019-07-11 14:04:41 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-07-11 14:04:41 3 [Note] WSREP: ####### ST not required
2019-07-11 14:04:41 3 [Note] WSREP: Cert index reset to bfbe369e-a3c6-11e9-839d-17b5ea93774f:20 (proto: 9), state transfer needed: no
2019-07-11 14:04:41 0 [Note] WSREP: Service thread queue flushed.
2019-07-11 14:04:41 3 [Note] WSREP: ####### Assign initial position for certification: bfbe369e-a3c6-11e9-839d-17b5ea93774f:20, protocol version: 4
2019-07-11 14:04:41 3 [Note] WSREP: ================================================
View:
  id: bfbe369e-a3c6-11e9-839d-17b5ea93774f:20
  status: primary
  protocol_version: 3
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 1
  members(3):
	0: df5acba8-a3c6-11e9-b76e-a31c7a69d5a3, fc-r02-mariadb-02
	1: fbf8461e-a3d3-11e9-afe5-52552c33620b, fc-r02-mariadb-03
	2: fe046845-a3c6-11e9-9394-62c40d2e282a, fc-r02-mariadb-01
=================================================
2019-07-11 14:04:41 3 [Note] WSREP: Server status change connected -> joiner
2019-07-11 14:04:41 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-07-11 14:04:41 0 [Note] WSREP: Member 1.0 (fc-r02-mariadb-03) synced with group.

We have cancelled upgrade and rolled back to 10.3 version where the cluster is running stable. We also using proxysql, but we have single writemaster setup.

UPDATE:

Since my first post there were updates for the both MariaDB major versions, so we tried to do major upgrade after updating 10.3 cluster to the latest minor version (10.3.16 -> 10.3.17). At this time everything run as expected and we were able to upgrade our clusters from MariaDB 10.3.17 to MariaDB 10.4.7 without any issues.

@Tibor Korocz: We suspect our issues were related to the, probably, different galera libraries (either not completely updated or from different Debian reporitory). But it is also possible that the problem has been fixed in the last minor release.

Comment by Jan Lindström (Inactive) [ 2019-08-28 ]

Thanks for the update, closing as fixed.

Generated at Thu Feb 08 08:54:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.