[MDEV-24983] Slow startup when galera sequence number grows - Jira

XML

Word

Printable

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Not a Bug
Affects Version/s: 10.5.8
Fix Version/s: N/A
Component/s: Galera
Labels:
None

Description

The startup of mariadb from clean shutdown with 3-node galera takes in our case ~40 minutes per node on 10.5.8 at the moment, even if there has been no significant sql traffic. The startup pauses like in following:

021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 0 (cluster-A3)

2021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 1 (cluster-A2)

2021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 2 (cluster-A1)

2021-02-25 12:14:28 0 [Note] WSREP: Quorum results:

	version    = 5,

	component  = PRIMARY,

	conf_id    = 271,

	members    = 2/3 (joined/total),

	act_id     = 9126769098,

	last_appl. = 0,

	protocols  = 1/10/4 (gcs/repl/appl),

	vote policy= 0,

	group UUID = bfdd2f74-608e-11eb-9fc6-2225f4a2f752

2021-02-25 12:14:28 0 [Note] WSREP: Writing down CC checksum: 33bfadb3 e7b66ed0 be18a877 3cb50d6e at offset 288

2021-02-25 12:14:28 0 [Note] WSREP: Flow-control interval: [28, 28]

2021-02-25 12:14:28 0 [Note] WSREP: Trying to continue unpaused monitor

2021-02-25 12:14:28 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 9126769099)

2021-02-25 12:14:28 2 [Note] WSREP: ####### processing CC 9126769099, local, ordered

2021-02-25 12:14:28 2 [Note] WSREP: ####### drain monitors upto -1

2021-02-25 12:14:31 0 [Note] WSREP: (01d3cc6f, 'tcp://0.0.0.0:4567') connection to peer 01d3cc6f with addr tcp://10.30.136.53:4567 timed out, no messages seen in PT3S

2021-02-25 12:14:31 0 [Note] WSREP: (01d3cc6f, 'tcp://0.0.0.0:4567') turning message relay requesting off

2021-02-25 12:56:42 2 [Note] WSREP: REPL Protocols: 10 (5, 3)

2021-02-25 12:56:42 2 [Note] WSREP: ####### My UUID: 01d3cc6f-7763-11eb-8346-2e3f91c6b8c1

2021-02-25 12:56:42 2 [Note] WSREP: Server cluster-A3 connected to cluster at position bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769099 with ID 01d3cc6f-7763-11eb-8346-2e3f91c6b8c1

2021-02-25 12:56:42 2 [Note] WSREP: Server status change disconnected -> connected

2021-02-25 12:56:42 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

2021-02-25 12:56:42 2 [Note] WSREP: State transfer required:

	Group state: bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769099

	Local state: bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769097

2021-02-25 12:56:42 2 [Note] WSREP: Server status change connected -> joiner

2021-02-25 12:56:42 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

Tracking back on our logs, the pause is directly proportionate to the sequence number shown in logs:

WSREP: Shifting OPEN -> PRIMARY (TO: 22080534) 4s

WSREP: Shifting OPEN -> PRIMARY (TO: 158600621) 43s

WSREP: Shifting OPEN -> PRIMARY (TO: 507421267) 2m 20s

WSREP: Shifting OPEN -> PRIMARY (TO: 1841613431) 8m 44s

WSREP: Shifting OPEN -> PRIMARY (TO: 5111289582) 24m 32s

WSREP: Shifting OPEN -> PRIMARY (TO: 9028845037) 42m 50s

So there appears to be an empty loop which processes sequence numbers with speed of ~3500000/s.

Using the http://poormansprofiler.org/, it shows that the looping happens at

#2  galera::Monitor<galera::ReplicatorSMM::CommitOrder>::drain_common (this=this@entry=0x5609e2d01538, seqno=seqno@entry=-1, lock=...) at galera/src/monitor.hpp:507

#3  0x00007f4c2975ffd0 in drain (seqno=-1, this=0x5609e2d01538) at galera/src/monitor.hpp:326

#4  galera::ReplicatorSMM::drain_monitors (this=this@entry=0x5609e2d00630, upto=upto@entry=-1) at galera/src/replicator_smm.cpp:2196

#5  0x00007f4c29767a55 in galera::ReplicatorSMM::process_conf_change (this=0x5609e2d00630, recv_ctx=0x7f4c2821cbf0, cc=...) at galera/src/replicator_smm.cpp:2448

#6  0x00007f4c297418a0 in galera::GcsActionSource::dispatch (this=this@entry=0x5609e2d081c0, recv_ctx=recv_ctx@entry=0x7f4c2821cbf0, act=..., exit_loop=@0x7f4c2821c840: false) at galera/src/gcs_action_source.cpp:124

#7  0x00007f4c29741a4e in galera::GcsActionSource::process (this=0x5609e2d081c0, recv_ctx=0x7f4c2821cbf0, exit_loop=@0x7f4c2821c840: false) at galera/src/gcs_action_source.cpp:182

#8  0x00007f4c2976702b in galera::ReplicatorSMM::async_recv (this=0x5609e2d00630, recv_ctx=0x7f4c2821cbf0) at galera/src/replicator_smm.cpp:383

#9  0x00007f4c29779ba8 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at galera/src/wsrep_provider.cpp:236

#10 0x00005609e1011abe in wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*) ()

#11 0x00005609e0c3f8a8 in wsrep_replication_process(THD*, void*) ()

#12 0x00005609e0c33853 in start_wsrep_THD(void*) ()

#13 0x00005609e0bc979d in pfs_spawn_thread ()

#14 0x00007f4c2e728ea5 in start_thread () from /lib64/libpthread.so.0

#15 0x00007f4c2c85196d in clone () from /lib64/libc.so.6

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Pasi Pentikäinen

Votes:: 1 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2021-02-25 14:16

Updated:: 2021-03-28 17:29

Resolved:: 2021-03-28 17:29

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.