Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-11513

Database Cluster gets stopped by flowcontrol

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Incomplete
    • 10.1.9
    • N/A
    • Galera
    • Galera 25.3.18(r3632), Ubuntu 14.04.5 LTS, mysqld Ver 10.1.19-MariaDB-1~trusty for debian-linux-gnu on x86_64 (mariadb.org binary distribution)

    Description

      Hi there

      We’re running on a galera cluster since over a year now. Until recently we didn’t see bigger issues but lately we encountered that Flow Control is stopping our clusters replication regularly.

      After reading and researching we did following configuration changes:

      • cert.log_conflicts=on
      • Gcache.size = 4G
      • gcs.fc_limit=2048
      • fcs.fc_master_slave=YES
      • wsrep_slave_threads=16

      We also have a second Database Cluster setup in order to test things out but it's hard to reproduce because we don't know what exactly leads to the Flow Control issue.

      Our wsrep log shows following behaviour during the time when the flow control invokes the pausing

      The database connection is handled via maxscale. First there is the log of the master node based on maxscale where all the reads and writes end up:

      xx-cluster-zh1 / database2 (idx: 0) / Galera 25.3.18(r3632)
               Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
          time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
      14:15:46 P 323  3 Sync 1.4ms    0   0b   8    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:47 P 323  3 Sync 1.4ms    0   0b   9    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:48 P 323  3 Sync 1.4ms    0   0b   9    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:49 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:50 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:51 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:52 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:53 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:54 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:55 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:56 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:57 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:58 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:15:59 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:00 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:01 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:02 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:03 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:04 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:05 P 323  3 Sync 0.9ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:06 P 323  3 Sync 0.9ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:07 P 323  3 Sync 0.9ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
      14:16:08 P 323  3 Sync 2.0ms   79 785M   0    0   0b   0   0ns   0   0   0 66358  30k  11%
      14:16:09 P 323  3 Sync 2.3ms  191 3.8M   0    0   0b   0   0ns   0   0   0 66540  30k  11%
      14:16:10 P 323  3 Sync 2.2ms  204 2.2M   0    0   0b   0   0ns   0   0   0 66744  30k  11%
      14:16:11 P 323  3 Sync 2.1ms  152 2.5M   0    0   0b   0   0ns   0   0   0 66889  30k  11%
      14:16:12 P 323  3 Sync 2.1ms  244 1.4M   0    0   0b   0   0ns   0   0   0 67125  30k  11%
      14:16:13 P 323  3 Sync 2.2ms  340 4.2M   0    0   0b   0   0ns   0   0   0 67458  30k  11%
      14:16:14 P 323  3 Sync 2.1ms  399 3.2M   0    0   0b   0   0ns   0   0   0 67848  30k  11%
      14:16:15 P 323  3 Sync 2.0ms  286 2.4M   0    0   0b   0   0ns   0   0   0 68121  31k  11%
      14:16:16 P 323  3 Sync 2.0ms  173 2.2M   5    0   0b   0 344ms   0   0   0 68296  31k  11%
      14:16:17 P 323  3 Sync 2.0ms    0   0b   8    0   0b   0 999ms   0   0   0 68296  31k  11%
      14:16:18 P 323  3 Sync 2.0ms    0   0b   9    0   0b   0 1.00s   0   0   0 68296  31k  11%
      14:16:19 P 323  3 Sync 2.0ms    0   0b   9    0   0b   0 1.00s   0   0   0 68296  31k  11%
      14:16:20 P 323  3 Sync 2.0ms    0   0b  10    0   0b   0 1.00s   0   0   0 68296  31k  11%
      

      And the wsrep log on the slaves during that time

      xx-cluster-zh1 / database1 (idx: 1) / Galera 25.3.18(r3632)
               Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
          time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
      14:15:46 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:47 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:48 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:49 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:50 P 323  3 Sync 0.6ms    0   0b   0   10 784M   0   0ns   0   0   0 66279  30k   9%
      14:15:51 P 323  3 Sync 0.6ms    0   0b   0    1 306b   0   0ns   0   0   0 66279  30k   9%
      14:15:52 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279  30k   9%
      14:15:53 P 323  3 Sync 0.6ms    0   0b   0    1 297b   0   0ns   0   0   0 66279  30k   9%
      14:15:54 P 323  3 Sync 0.6ms    0   0b   0    1 573b   0   0ns   0   0   0 66279  30k   9%
      14:15:55 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279  30k   9%
      14:15:56 P 323  3 Sync 0.6ms    0   0b   0    3 2.1K   0   0ns   0   0   0 66279  30k   9%
      14:15:57 P 323  3 Sync 0.6ms    0   0b   0    0   0b   1   0ns   0   0   0 66279  30k   9%
      14:15:58 P 323  3 Sync 0.6ms    0   0b   0    0   0b   1   0ns   0   0   0 66279  30k   9%
      14:15:59 P 323  3 Sync 0.6ms    0   0b   0    0   0b   2   0ns   0   0   0 66279  30k   9%
      14:16:00 P 323  3 Sync 0.6ms    0   0b   0    0   0b   3   0ns   0   0   0 66279  30k   9%
      14:16:01 P 323  3 Sync 0.6ms    0   0b   0    0   0b   3   0ns   0   0   0 66279  30k   9%
      14:16:02 P 323  3 Sync 0.6ms    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
      14:16:03 P 323  3 Sync 0.6ms    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
      14:16:04 P 323  3 Sync 0.6ms    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
      14:16:05 P 323  3 Sync 0.6ms    0   0b   0    0   0b   5   0ns   0   0   0 66279  30k   9%
      14:16:06 P 323  3 Sync 0.6ms    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
      14:16:07 P 323  3 Sync 0.6ms    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
      14:16:08 P 323  3 Sync 0.6ms    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
      14:16:09 P 323  3 Sync 0.6ms    0   0b   0    0   0b 146   0ns   0   0   0 66271  30k   9%
      14:16:10 P 323  3 Sync 0.6ms    0   0b   0    0   0b 348   0ns   0   0   0 66271  30k   9%
      14:16:11 P 323  3 Sync 0.6ms    0   0b   0    0   0b 523   0ns   0   0   0 66271  30k   9%
      14:16:12 P 323  3 Sync 0.6ms    0   0b   0    0   0b 717   0ns   0   0   0 66257  30k   9%
      14:16:13 P 323  3 Sync 0.6ms    0   0b   0    0   0b 983   0ns   0   0   0 66257  30k   9%
      14:16:14 P 323  3 Sync 0.6ms    0   0b   0    0   0b  1k   0ns   0   0   0 66240  30k   9%
      14:16:15 P 323  3 Sync 0.6ms    0   0b   0    0   0b  2k   0ns   0   0   0 66240  30k   9%
      14:16:16 P 323  3 Sync 0.6ms    0   0b   0    0   0b  2k   0ns   0   0   0 66228  30k   9%
      14:16:17 P 323  3 Sync 0.7ms    0   0b   0    0   0b  2k 771ms   1   0   0 66228  30k   9%
      14:16:18 P 323  3 Sync 0.7ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
      14:16:19 P 323  3 Sync 0.7ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
      14:16:20 P 323  3 Sync 0.7ms    0   0b   0    0   0b  2k 995ms   0   0   0 66228  30k   9%
      

      xx-cluster-zh1 / database3 (idx: 2) / Galera 25.3.18(r3632)
               Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
          time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
      14:15:46 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:47 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:48 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:49 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:50 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:51 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:52 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:53 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:54 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:55 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:56 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:57 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:58 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
      14:15:59 P 323  3 Sync 0.0ns    0   0b   0   16 784M   3   0ns   0   0   0 66279  30k   9%
      14:16:00 P 323  3 Sync 0.0ns    0   0b   0    0   0b   3   0ns   0   0   0 66279  30k   9%
      14:16:01 P 323  3 Sync 0.0ns    0   0b   0    0   0b   3   0ns   0   0   0 66279  30k   9%
      14:16:02 P 323  3 Sync 0.0ns    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
      14:16:03 P 323  3 Sync 0.0ns    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
      14:16:04 P 323  3 Sync 0.0ns    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
      14:16:05 P 323  3 Sync 0.0ns    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
      14:16:06 P 323  3 Sync 0.0ns    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
      14:16:07 P 323  3 Sync 0.0ns    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
      14:16:08 P 323  3 Sync 0.0ns    0   0b   0    0   0b  55   0ns   0   0   0 66279  30k   9%
      14:16:09 P 323  3 Sync 0.0ns    0   0b   0    0   0b 244   0ns   0   0   0 66271  30k   9%
      14:16:10 P 323  3 Sync 0.0ns    0   0b   0    0   0b 439   0ns   0   0   0 66271  30k   9%
      14:16:11 P 323  3 Sync 0.0ns    0   0b   0    0   0b 597   0ns   0   0   0 66271  30k   9%
      14:16:12 P 323  3 Sync 0.0ns    0   0b   0    0   0b 828   0ns   0   0   0 66257  30k   9%
      14:16:13 P 323  3 Sync 0.0ns    0   0b   0    0   0b  1k   0ns   0   0   0 66249  30k   9%
      14:16:14 P 323  3 Sync 0.0ns    0   0b   0    0   0b  2k   0ns   0   0   0 66240  30k   9%
      14:16:15 P 323  3 Sync 0.0ns    0   0b   0    0   0b  2k   0ns   0   0   0 66229  30k   9%
      14:16:16 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 261ms   1   0   0 66228  30k   9%
      14:16:17 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
      14:16:18 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 997ms   0   0   0 66228  30k   9%
      14:16:19 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
      14:16:20 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
      

      What we noticed

      • First the IDX - wsrep_cert_index_size raises until 31k
      • Afterwards the queue on the slaves shoots up until the limit of 2048 is reached, afterwards flow control is invoked to stop the replication.

      We're trying to find out why this happens but till now we didn't find a solution.

      Is there a way to find out which writesets cause the IDX increase?

      An earlier post in the galera group something was posted which sounds like a familiar issue:
      https://groups.google.com/forum/#!msg/codership-team/iieFnDSBUNY/6CwsD_3qXxYJ

      Feel free to get back to me if i should provide further information. I'm happy to provide more information.

      Attachments

        Activity

          People

            jplindst Jan Lindström (Inactive)
            dasrecht Bastian Widmer
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.