Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Not a Bug
-
10.5.8
-
None
Description
The startup of mariadb from clean shutdown with 3-node galera takes in our case ~40 minutes per node on 10.5.8 at the moment, even if there has been no significant sql traffic. The startup pauses like in following:
021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 0 (cluster-A3)
|
2021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 1 (cluster-A2)
|
2021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 2 (cluster-A1)
|
2021-02-25 12:14:28 0 [Note] WSREP: Quorum results:
|
version = 5,
|
component = PRIMARY,
|
conf_id = 271,
|
members = 2/3 (joined/total),
|
act_id = 9126769098,
|
last_appl. = 0,
|
protocols = 1/10/4 (gcs/repl/appl),
|
vote policy= 0,
|
group UUID = bfdd2f74-608e-11eb-9fc6-2225f4a2f752
|
2021-02-25 12:14:28 0 [Note] WSREP: Writing down CC checksum: 33bfadb3 e7b66ed0 be18a877 3cb50d6e at offset 288
|
2021-02-25 12:14:28 0 [Note] WSREP: Flow-control interval: [28, 28]
|
2021-02-25 12:14:28 0 [Note] WSREP: Trying to continue unpaused monitor
|
2021-02-25 12:14:28 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 9126769099)
|
2021-02-25 12:14:28 2 [Note] WSREP: ####### processing CC 9126769099, local, ordered
|
2021-02-25 12:14:28 2 [Note] WSREP: ####### drain monitors upto -1
|
2021-02-25 12:14:31 0 [Note] WSREP: (01d3cc6f, 'tcp://0.0.0.0:4567') connection to peer 01d3cc6f with addr tcp://10.30.136.53:4567 timed out, no messages seen in PT3S
|
2021-02-25 12:14:31 0 [Note] WSREP: (01d3cc6f, 'tcp://0.0.0.0:4567') turning message relay requesting off
|
2021-02-25 12:56:42 2 [Note] WSREP: REPL Protocols: 10 (5, 3)
|
2021-02-25 12:56:42 2 [Note] WSREP: ####### My UUID: 01d3cc6f-7763-11eb-8346-2e3f91c6b8c1
|
2021-02-25 12:56:42 2 [Note] WSREP: Server cluster-A3 connected to cluster at position bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769099 with ID 01d3cc6f-7763-11eb-8346-2e3f91c6b8c1
|
2021-02-25 12:56:42 2 [Note] WSREP: Server status change disconnected -> connected
|
2021-02-25 12:56:42 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
|
2021-02-25 12:56:42 2 [Note] WSREP: State transfer required:
|
Group state: bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769099
|
Local state: bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769097
|
2021-02-25 12:56:42 2 [Note] WSREP: Server status change connected -> joiner
|
2021-02-25 12:56:42 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
|
Tracking back on our logs, the pause is directly proportionate to the sequence number shown in logs:
WSREP: Shifting OPEN -> PRIMARY (TO: 22080534) 4s
|
WSREP: Shifting OPEN -> PRIMARY (TO: 158600621) 43s
|
WSREP: Shifting OPEN -> PRIMARY (TO: 507421267) 2m 20s
|
WSREP: Shifting OPEN -> PRIMARY (TO: 1841613431) 8m 44s
|
WSREP: Shifting OPEN -> PRIMARY (TO: 5111289582) 24m 32s
|
WSREP: Shifting OPEN -> PRIMARY (TO: 9028845037) 42m 50s
|
So there appears to be an empty loop which processes sequence numbers with speed of ~3500000/s.
Using the http://poormansprofiler.org/, it shows that the looping happens at
#2 galera::Monitor<galera::ReplicatorSMM::CommitOrder>::drain_common (this=this@entry=0x5609e2d01538, seqno=seqno@entry=-1, lock=...) at galera/src/monitor.hpp:507
|
#3 0x00007f4c2975ffd0 in drain (seqno=-1, this=0x5609e2d01538) at galera/src/monitor.hpp:326
|
#4 galera::ReplicatorSMM::drain_monitors (this=this@entry=0x5609e2d00630, upto=upto@entry=-1) at galera/src/replicator_smm.cpp:2196
|
#5 0x00007f4c29767a55 in galera::ReplicatorSMM::process_conf_change (this=0x5609e2d00630, recv_ctx=0x7f4c2821cbf0, cc=...) at galera/src/replicator_smm.cpp:2448
|
#6 0x00007f4c297418a0 in galera::GcsActionSource::dispatch (this=this@entry=0x5609e2d081c0, recv_ctx=recv_ctx@entry=0x7f4c2821cbf0, act=..., exit_loop=@0x7f4c2821c840: false) at galera/src/gcs_action_source.cpp:124
|
#7 0x00007f4c29741a4e in galera::GcsActionSource::process (this=0x5609e2d081c0, recv_ctx=0x7f4c2821cbf0, exit_loop=@0x7f4c2821c840: false) at galera/src/gcs_action_source.cpp:182
|
#8 0x00007f4c2976702b in galera::ReplicatorSMM::async_recv (this=0x5609e2d00630, recv_ctx=0x7f4c2821cbf0) at galera/src/replicator_smm.cpp:383
|
#9 0x00007f4c29779ba8 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at galera/src/wsrep_provider.cpp:236
|
#10 0x00005609e1011abe in wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*) ()
|
#11 0x00005609e0c3f8a8 in wsrep_replication_process(THD*, void*) ()
|
#12 0x00005609e0c33853 in start_wsrep_THD(void*) ()
|
#13 0x00005609e0bc979d in pfs_spawn_thread ()
|
#14 0x00007f4c2e728ea5 in start_thread () from /lib64/libpthread.so.0
|
#15 0x00007f4c2c85196d in clone () from /lib64/libc.so.6
|