Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24983

Slow startup when galera sequence number grows

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Not a Bug
    • 10.5.8
    • N/A
    • Galera
    • None

    Description

      The startup of mariadb from clean shutdown with 3-node galera takes in our case ~40 minutes per node on 10.5.8 at the moment, even if there has been no significant sql traffic. The startup pauses like in following:

      021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 0 (cluster-A3)
      2021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 1 (cluster-A2)
      2021-02-25 12:14:28 0 [Note] WSREP: STATE EXCHANGE: got state msg: 02212542-7763-11eb-9187-22a6981ff9b7 from 2 (cluster-A1)
      2021-02-25 12:14:28 0 [Note] WSREP: Quorum results:
      	version    = 5,
      	component  = PRIMARY,
      	conf_id    = 271,
      	members    = 2/3 (joined/total),
      	act_id     = 9126769098,
      	last_appl. = 0,
      	protocols  = 1/10/4 (gcs/repl/appl),
      	vote policy= 0,
      	group UUID = bfdd2f74-608e-11eb-9fc6-2225f4a2f752
      2021-02-25 12:14:28 0 [Note] WSREP: Writing down CC checksum: 33bfadb3 e7b66ed0 be18a877 3cb50d6e at offset 288
      2021-02-25 12:14:28 0 [Note] WSREP: Flow-control interval: [28, 28]
      2021-02-25 12:14:28 0 [Note] WSREP: Trying to continue unpaused monitor
      2021-02-25 12:14:28 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 9126769099)
      2021-02-25 12:14:28 2 [Note] WSREP: ####### processing CC 9126769099, local, ordered
      2021-02-25 12:14:28 2 [Note] WSREP: ####### drain monitors upto -1
      2021-02-25 12:14:31 0 [Note] WSREP: (01d3cc6f, 'tcp://0.0.0.0:4567') connection to peer 01d3cc6f with addr tcp://10.30.136.53:4567 timed out, no messages seen in PT3S
      2021-02-25 12:14:31 0 [Note] WSREP: (01d3cc6f, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2021-02-25 12:56:42 2 [Note] WSREP: REPL Protocols: 10 (5, 3)
      2021-02-25 12:56:42 2 [Note] WSREP: ####### My UUID: 01d3cc6f-7763-11eb-8346-2e3f91c6b8c1
      2021-02-25 12:56:42 2 [Note] WSREP: Server cluster-A3 connected to cluster at position bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769099 with ID 01d3cc6f-7763-11eb-8346-2e3f91c6b8c1
      2021-02-25 12:56:42 2 [Note] WSREP: Server status change disconnected -> connected
      2021-02-25 12:56:42 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2021-02-25 12:56:42 2 [Note] WSREP: State transfer required: 
      	Group state: bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769099
      	Local state: bfdd2f74-608e-11eb-9fc6-2225f4a2f752:9126769097
      2021-02-25 12:56:42 2 [Note] WSREP: Server status change connected -> joiner
      2021-02-25 12:56:42 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      

      Tracking back on our logs, the pause is directly proportionate to the sequence number shown in logs:

      WSREP: Shifting OPEN -> PRIMARY (TO: 22080534) 4s
      WSREP: Shifting OPEN -> PRIMARY (TO: 158600621) 43s
      WSREP: Shifting OPEN -> PRIMARY (TO: 507421267) 2m 20s
      WSREP: Shifting OPEN -> PRIMARY (TO: 1841613431) 8m 44s
      WSREP: Shifting OPEN -> PRIMARY (TO: 5111289582) 24m 32s
      WSREP: Shifting OPEN -> PRIMARY (TO: 9028845037) 42m 50s
      

      So there appears to be an empty loop which processes sequence numbers with speed of ~3500000/s.

      Using the http://poormansprofiler.org/, it shows that the looping happens at

      #2  galera::Monitor<galera::ReplicatorSMM::CommitOrder>::drain_common (this=this@entry=0x5609e2d01538, seqno=seqno@entry=-1, lock=...) at galera/src/monitor.hpp:507
      #3  0x00007f4c2975ffd0 in drain (seqno=-1, this=0x5609e2d01538) at galera/src/monitor.hpp:326
      #4  galera::ReplicatorSMM::drain_monitors (this=this@entry=0x5609e2d00630, upto=upto@entry=-1) at galera/src/replicator_smm.cpp:2196
      #5  0x00007f4c29767a55 in galera::ReplicatorSMM::process_conf_change (this=0x5609e2d00630, recv_ctx=0x7f4c2821cbf0, cc=...) at galera/src/replicator_smm.cpp:2448
      #6  0x00007f4c297418a0 in galera::GcsActionSource::dispatch (this=this@entry=0x5609e2d081c0, recv_ctx=recv_ctx@entry=0x7f4c2821cbf0, act=..., exit_loop=@0x7f4c2821c840: false) at galera/src/gcs_action_source.cpp:124
      #7  0x00007f4c29741a4e in galera::GcsActionSource::process (this=0x5609e2d081c0, recv_ctx=0x7f4c2821cbf0, exit_loop=@0x7f4c2821c840: false) at galera/src/gcs_action_source.cpp:182
      #8  0x00007f4c2976702b in galera::ReplicatorSMM::async_recv (this=0x5609e2d00630, recv_ctx=0x7f4c2821cbf0) at galera/src/replicator_smm.cpp:383
      #9  0x00007f4c29779ba8 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at galera/src/wsrep_provider.cpp:236
      #10 0x00005609e1011abe in wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*) ()
      #11 0x00005609e0c3f8a8 in wsrep_replication_process(THD*, void*) ()
      #12 0x00005609e0c33853 in start_wsrep_THD(void*) ()
      #13 0x00005609e0bc979d in pfs_spawn_thread ()
      #14 0x00007f4c2e728ea5 in start_thread () from /lib64/libpthread.so.0
      #15 0x00007f4c2c85196d in clone () from /lib64/libc.so.6
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            pp Pasi Pentikäinen
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.