[MDEV-11513] Database Cluster gets stopped by flowcontrol Created: 2016-12-09  Updated: 2019-05-20  Resolved: 2019-05-20

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1.9
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Bastian Widmer Assignee: Jan Lindström (Inactive)
Resolution: Incomplete Votes: 0
Labels: galera
Environment:

Galera 25.3.18(r3632), Ubuntu 14.04.5 LTS, mysqld Ver 10.1.19-MariaDB-1~trusty for debian-linux-gnu on x86_64 (mariadb.org binary distribution)



 Description   

Hi there

We’re running on a galera cluster since over a year now. Until recently we didn’t see bigger issues but lately we encountered that Flow Control is stopping our clusters replication regularly.

After reading and researching we did following configuration changes:

  • cert.log_conflicts=on
  • Gcache.size = 4G
  • gcs.fc_limit=2048
  • fcs.fc_master_slave=YES
  • wsrep_slave_threads=16

We also have a second Database Cluster setup in order to test things out but it's hard to reproduce because we don't know what exactly leads to the Flow Control issue.

Our wsrep log shows following behaviour during the time when the flow control invokes the pausing

The database connection is handled via maxscale. First there is the log of the master node based on maxscale where all the reads and writes end up:

xx-cluster-zh1 / database2 (idx: 0) / Galera 25.3.18(r3632)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:15:46 P 323  3 Sync 1.4ms    0   0b   8    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:47 P 323  3 Sync 1.4ms    0   0b   9    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:48 P 323  3 Sync 1.4ms    0   0b   9    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:49 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:50 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:51 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:52 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:53 P 323  3 Sync 1.4ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:54 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:55 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:56 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:57 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:58 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:15:59 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:00 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:01 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:02 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:03 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:04 P 323  3 Sync 0.8ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:05 P 323  3 Sync 0.9ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:06 P 323  3 Sync 0.9ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:07 P 323  3 Sync 0.9ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97  11%
14:16:08 P 323  3 Sync 2.0ms   79 785M   0    0   0b   0   0ns   0   0   0 66358  30k  11%
14:16:09 P 323  3 Sync 2.3ms  191 3.8M   0    0   0b   0   0ns   0   0   0 66540  30k  11%
14:16:10 P 323  3 Sync 2.2ms  204 2.2M   0    0   0b   0   0ns   0   0   0 66744  30k  11%
14:16:11 P 323  3 Sync 2.1ms  152 2.5M   0    0   0b   0   0ns   0   0   0 66889  30k  11%
14:16:12 P 323  3 Sync 2.1ms  244 1.4M   0    0   0b   0   0ns   0   0   0 67125  30k  11%
14:16:13 P 323  3 Sync 2.2ms  340 4.2M   0    0   0b   0   0ns   0   0   0 67458  30k  11%
14:16:14 P 323  3 Sync 2.1ms  399 3.2M   0    0   0b   0   0ns   0   0   0 67848  30k  11%
14:16:15 P 323  3 Sync 2.0ms  286 2.4M   0    0   0b   0   0ns   0   0   0 68121  31k  11%
14:16:16 P 323  3 Sync 2.0ms  173 2.2M   5    0   0b   0 344ms   0   0   0 68296  31k  11%
14:16:17 P 323  3 Sync 2.0ms    0   0b   8    0   0b   0 999ms   0   0   0 68296  31k  11%
14:16:18 P 323  3 Sync 2.0ms    0   0b   9    0   0b   0 1.00s   0   0   0 68296  31k  11%
14:16:19 P 323  3 Sync 2.0ms    0   0b   9    0   0b   0 1.00s   0   0   0 68296  31k  11%
14:16:20 P 323  3 Sync 2.0ms    0   0b  10    0   0b   0 1.00s   0   0   0 68296  31k  11%

And the wsrep log on the slaves during that time

xx-cluster-zh1 / database1 (idx: 1) / Galera 25.3.18(r3632)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:15:46 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:47 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:48 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:49 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:50 P 323  3 Sync 0.6ms    0   0b   0   10 784M   0   0ns   0   0   0 66279  30k   9%
14:15:51 P 323  3 Sync 0.6ms    0   0b   0    1 306b   0   0ns   0   0   0 66279  30k   9%
14:15:52 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279  30k   9%
14:15:53 P 323  3 Sync 0.6ms    0   0b   0    1 297b   0   0ns   0   0   0 66279  30k   9%
14:15:54 P 323  3 Sync 0.6ms    0   0b   0    1 573b   0   0ns   0   0   0 66279  30k   9%
14:15:55 P 323  3 Sync 0.6ms    0   0b   0    0   0b   0   0ns   0   0   0 66279  30k   9%
14:15:56 P 323  3 Sync 0.6ms    0   0b   0    3 2.1K   0   0ns   0   0   0 66279  30k   9%
14:15:57 P 323  3 Sync 0.6ms    0   0b   0    0   0b   1   0ns   0   0   0 66279  30k   9%
14:15:58 P 323  3 Sync 0.6ms    0   0b   0    0   0b   1   0ns   0   0   0 66279  30k   9%
14:15:59 P 323  3 Sync 0.6ms    0   0b   0    0   0b   2   0ns   0   0   0 66279  30k   9%
14:16:00 P 323  3 Sync 0.6ms    0   0b   0    0   0b   3   0ns   0   0   0 66279  30k   9%
14:16:01 P 323  3 Sync 0.6ms    0   0b   0    0   0b   3   0ns   0   0   0 66279  30k   9%
14:16:02 P 323  3 Sync 0.6ms    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
14:16:03 P 323  3 Sync 0.6ms    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
14:16:04 P 323  3 Sync 0.6ms    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
14:16:05 P 323  3 Sync 0.6ms    0   0b   0    0   0b   5   0ns   0   0   0 66279  30k   9%
14:16:06 P 323  3 Sync 0.6ms    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
14:16:07 P 323  3 Sync 0.6ms    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
14:16:08 P 323  3 Sync 0.6ms    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
14:16:09 P 323  3 Sync 0.6ms    0   0b   0    0   0b 146   0ns   0   0   0 66271  30k   9%
14:16:10 P 323  3 Sync 0.6ms    0   0b   0    0   0b 348   0ns   0   0   0 66271  30k   9%
14:16:11 P 323  3 Sync 0.6ms    0   0b   0    0   0b 523   0ns   0   0   0 66271  30k   9%
14:16:12 P 323  3 Sync 0.6ms    0   0b   0    0   0b 717   0ns   0   0   0 66257  30k   9%
14:16:13 P 323  3 Sync 0.6ms    0   0b   0    0   0b 983   0ns   0   0   0 66257  30k   9%
14:16:14 P 323  3 Sync 0.6ms    0   0b   0    0   0b  1k   0ns   0   0   0 66240  30k   9%
14:16:15 P 323  3 Sync 0.6ms    0   0b   0    0   0b  2k   0ns   0   0   0 66240  30k   9%
14:16:16 P 323  3 Sync 0.6ms    0   0b   0    0   0b  2k   0ns   0   0   0 66228  30k   9%
14:16:17 P 323  3 Sync 0.7ms    0   0b   0    0   0b  2k 771ms   1   0   0 66228  30k   9%
14:16:18 P 323  3 Sync 0.7ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
14:16:19 P 323  3 Sync 0.7ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
14:16:20 P 323  3 Sync 0.7ms    0   0b   0    0   0b  2k 995ms   0   0   0 66228  30k   9%

xx-cluster-zh1 / database3 (idx: 2) / Galera 25.3.18(r3632)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:15:46 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:47 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:48 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:49 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:50 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:51 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:52 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:53 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:54 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:55 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:56 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:57 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:58 P 323  3 Sync 0.0ns    0   0b   0    0   0b   0   0ns   0   0   0 66279   97   9%
14:15:59 P 323  3 Sync 0.0ns    0   0b   0   16 784M   3   0ns   0   0   0 66279  30k   9%
14:16:00 P 323  3 Sync 0.0ns    0   0b   0    0   0b   3   0ns   0   0   0 66279  30k   9%
14:16:01 P 323  3 Sync 0.0ns    0   0b   0    0   0b   3   0ns   0   0   0 66279  30k   9%
14:16:02 P 323  3 Sync 0.0ns    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
14:16:03 P 323  3 Sync 0.0ns    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
14:16:04 P 323  3 Sync 0.0ns    0   0b   0    0   0b   4   0ns   0   0   0 66279  30k   9%
14:16:05 P 323  3 Sync 0.0ns    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
14:16:06 P 323  3 Sync 0.0ns    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
14:16:07 P 323  3 Sync 0.0ns    0   0b   0    0   0b   6   0ns   0   0   0 66279  30k   9%
14:16:08 P 323  3 Sync 0.0ns    0   0b   0    0   0b  55   0ns   0   0   0 66279  30k   9%
14:16:09 P 323  3 Sync 0.0ns    0   0b   0    0   0b 244   0ns   0   0   0 66271  30k   9%
14:16:10 P 323  3 Sync 0.0ns    0   0b   0    0   0b 439   0ns   0   0   0 66271  30k   9%
14:16:11 P 323  3 Sync 0.0ns    0   0b   0    0   0b 597   0ns   0   0   0 66271  30k   9%
14:16:12 P 323  3 Sync 0.0ns    0   0b   0    0   0b 828   0ns   0   0   0 66257  30k   9%
14:16:13 P 323  3 Sync 0.0ns    0   0b   0    0   0b  1k   0ns   0   0   0 66249  30k   9%
14:16:14 P 323  3 Sync 0.0ns    0   0b   0    0   0b  2k   0ns   0   0   0 66240  30k   9%
14:16:15 P 323  3 Sync 0.0ns    0   0b   0    0   0b  2k   0ns   0   0   0 66229  30k   9%
14:16:16 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 261ms   1   0   0 66228  30k   9%
14:16:17 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
14:16:18 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 997ms   0   0   0 66228  30k   9%
14:16:19 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%
14:16:20 P 323  3 Sync 1.0ms    0   0b   0    0   0b  2k 1.00s   0   0   0 66228  30k   9%

What we noticed

  • First the IDX - wsrep_cert_index_size raises until 31k
  • Afterwards the queue on the slaves shoots up until the limit of 2048 is reached, afterwards flow control is invoked to stop the replication.

We're trying to find out why this happens but till now we didn't find a solution.

Is there a way to find out which writesets cause the IDX increase?

An earlier post in the galera group something was posted which sounds like a familiar issue:
https://groups.google.com/forum/#!msg/codership-team/iieFnDSBUNY/6CwsD_3qXxYJ

Feel free to get back to me if i should provide further information. I'm happy to provide more information.



 Comments   
Comment by Jan Lindström (Inactive) [ 2019-05-20 ]

Can you try to reproduce with more recent version of MariaDB and Galera library.

Generated at Thu Feb 08 07:50:31 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.