Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Incomplete
Affects Version/s: 10.1.9
Environment: Galera 25.3.18(r3632), Ubuntu 14.04.5 LTS, mysqld Ver 10.1.19-MariaDB-1~trusty for debian-linux-gnu on x86_64 (mariadb.org binary distribution)
Description
Hi there
We have been running a Galera cluster for over a year now. Until recently we did not see any major issues, but lately we have found that Flow Control regularly pauses replication in our cluster.
After reading and researching, we made the following configuration changes (a quick way to verify them is shown right after the list):
- cert.log_conflicts=ON
- gcache.size=4G
- gcs.fc_limit=2048
- gcs.fc_master_slave=YES
- wsrep_slave_threads=16
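For reference, the first four of these are Galera provider options, i.e. keys inside the single wsrep_provider_options string rather than standalone server variables, while wsrep_slave_threads is an ordinary server variable; whether they actually took effect can be double-checked on each node with plain SHOW statements:

-- The provider options string should list cert.log_conflicts, gcache.size,
-- gcs.fc_limit and gcs.fc_master_slave among the many provider defaults.
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';
SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';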
We also have a second database cluster set up for testing, but the problem is hard to reproduce there because we do not know exactly what triggers the Flow Control issue.
Our wsrep monitoring output shows the following behaviour around the time Flow Control pauses replication.
Database connections are routed through MaxScale. First, here is the output of the node that MaxScale treats as the master, where all reads and writes end up:
xx-cluster-zh1 / database2 (idx: 0) / Galera 25.3.18(r3632)
Cluster Node Outbound Inbound FlowC Conflct Gcache Appl
time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef
14:15:46 P 323 3 Sync 1.4ms 0 0b 8 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:47 P 323 3 Sync 1.4ms 0 0b 9 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:48 P 323 3 Sync 1.4ms 0 0b 9 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:49 P 323 3 Sync 1.4ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:50 P 323 3 Sync 1.4ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:51 P 323 3 Sync 1.4ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:52 P 323 3 Sync 1.4ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:53 P 323 3 Sync 1.4ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:54 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:55 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:56 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:57 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:58 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:15:59 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:00 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:01 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:02 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:03 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:04 P 323 3 Sync 0.8ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:05 P 323 3 Sync 0.9ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:06 P 323 3 Sync 0.9ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:07 P 323 3 Sync 0.9ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 11%
14:16:08 P 323 3 Sync 2.0ms 79 785M 0 0 0b 0 0ns 0 0 0 66358 30k 11%
14:16:09 P 323 3 Sync 2.3ms 191 3.8M 0 0 0b 0 0ns 0 0 0 66540 30k 11%
14:16:10 P 323 3 Sync 2.2ms 204 2.2M 0 0 0b 0 0ns 0 0 0 66744 30k 11%
14:16:11 P 323 3 Sync 2.1ms 152 2.5M 0 0 0b 0 0ns 0 0 0 66889 30k 11%
14:16:12 P 323 3 Sync 2.1ms 244 1.4M 0 0 0b 0 0ns 0 0 0 67125 30k 11%
14:16:13 P 323 3 Sync 2.2ms 340 4.2M 0 0 0b 0 0ns 0 0 0 67458 30k 11%
14:16:14 P 323 3 Sync 2.1ms 399 3.2M 0 0 0b 0 0ns 0 0 0 67848 30k 11%
14:16:15 P 323 3 Sync 2.0ms 286 2.4M 0 0 0b 0 0ns 0 0 0 68121 31k 11%
14:16:16 P 323 3 Sync 2.0ms 173 2.2M 5 0 0b 0 344ms 0 0 0 68296 31k 11%
14:16:17 P 323 3 Sync 2.0ms 0 0b 8 0 0b 0 999ms 0 0 0 68296 31k 11%
14:16:18 P 323 3 Sync 2.0ms 0 0b 9 0 0b 0 1.00s 0 0 0 68296 31k 11%
14:16:19 P 323 3 Sync 2.0ms 0 0b 9 0 0b 0 1.00s 0 0 0 68296 31k 11%
14:16:20 P 323 3 Sync 2.0ms 0 0b 10 0 0b 0 1.00s 0 0 0 68296 31k 11%
And here is the wsrep output on the slave nodes during the same time period:
xx-cluster-zh1 / database1 (idx: 1) / Galera 25.3.18(r3632)
Cluster Node Outbound Inbound FlowC Conflct Gcache Appl
time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef
14:15:46 P 323 3 Sync 0.6ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:47 P 323 3 Sync 0.6ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:48 P 323 3 Sync 0.6ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:49 P 323 3 Sync 0.6ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:50 P 323 3 Sync 0.6ms 0 0b 0 10 784M 0 0ns 0 0 0 66279 30k 9%
14:15:51 P 323 3 Sync 0.6ms 0 0b 0 1 306b 0 0ns 0 0 0 66279 30k 9%
14:15:52 P 323 3 Sync 0.6ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 30k 9%
14:15:53 P 323 3 Sync 0.6ms 0 0b 0 1 297b 0 0ns 0 0 0 66279 30k 9%
14:15:54 P 323 3 Sync 0.6ms 0 0b 0 1 573b 0 0ns 0 0 0 66279 30k 9%
14:15:55 P 323 3 Sync 0.6ms 0 0b 0 0 0b 0 0ns 0 0 0 66279 30k 9%
14:15:56 P 323 3 Sync 0.6ms 0 0b 0 3 2.1K 0 0ns 0 0 0 66279 30k 9%
14:15:57 P 323 3 Sync 0.6ms 0 0b 0 0 0b 1 0ns 0 0 0 66279 30k 9%
14:15:58 P 323 3 Sync 0.6ms 0 0b 0 0 0b 1 0ns 0 0 0 66279 30k 9%
14:15:59 P 323 3 Sync 0.6ms 0 0b 0 0 0b 2 0ns 0 0 0 66279 30k 9%
14:16:00 P 323 3 Sync 0.6ms 0 0b 0 0 0b 3 0ns 0 0 0 66279 30k 9%
14:16:01 P 323 3 Sync 0.6ms 0 0b 0 0 0b 3 0ns 0 0 0 66279 30k 9%
14:16:02 P 323 3 Sync 0.6ms 0 0b 0 0 0b 4 0ns 0 0 0 66279 30k 9%
14:16:03 P 323 3 Sync 0.6ms 0 0b 0 0 0b 4 0ns 0 0 0 66279 30k 9%
14:16:04 P 323 3 Sync 0.6ms 0 0b 0 0 0b 4 0ns 0 0 0 66279 30k 9%
14:16:05 P 323 3 Sync 0.6ms 0 0b 0 0 0b 5 0ns 0 0 0 66279 30k 9%
14:16:06 P 323 3 Sync 0.6ms 0 0b 0 0 0b 6 0ns 0 0 0 66279 30k 9%
14:16:07 P 323 3 Sync 0.6ms 0 0b 0 0 0b 6 0ns 0 0 0 66279 30k 9%
14:16:08 P 323 3 Sync 0.6ms 0 0b 0 0 0b 6 0ns 0 0 0 66279 30k 9%
14:16:09 P 323 3 Sync 0.6ms 0 0b 0 0 0b 146 0ns 0 0 0 66271 30k 9%
14:16:10 P 323 3 Sync 0.6ms 0 0b 0 0 0b 348 0ns 0 0 0 66271 30k 9%
14:16:11 P 323 3 Sync 0.6ms 0 0b 0 0 0b 523 0ns 0 0 0 66271 30k 9%
14:16:12 P 323 3 Sync 0.6ms 0 0b 0 0 0b 717 0ns 0 0 0 66257 30k 9%
14:16:13 P 323 3 Sync 0.6ms 0 0b 0 0 0b 983 0ns 0 0 0 66257 30k 9%
14:16:14 P 323 3 Sync 0.6ms 0 0b 0 0 0b 1k 0ns 0 0 0 66240 30k 9%
14:16:15 P 323 3 Sync 0.6ms 0 0b 0 0 0b 2k 0ns 0 0 0 66240 30k 9%
14:16:16 P 323 3 Sync 0.6ms 0 0b 0 0 0b 2k 0ns 0 0 0 66228 30k 9%
14:16:17 P 323 3 Sync 0.7ms 0 0b 0 0 0b 2k 771ms 1 0 0 66228 30k 9%
14:16:18 P 323 3 Sync 0.7ms 0 0b 0 0 0b 2k 1.00s 0 0 0 66228 30k 9%
14:16:19 P 323 3 Sync 0.7ms 0 0b 0 0 0b 2k 1.00s 0 0 0 66228 30k 9%
14:16:20 P 323 3 Sync 0.7ms 0 0b 0 0 0b 2k 995ms 0 0 0 66228 30k 9%
xx-cluster-zh1 / database3 (idx: 2) / Galera 25.3.18(r3632)
Cluster Node Outbound Inbound FlowC Conflct Gcache Appl
time P cnf # stat laten msgs data que msgs data que pause snt lcf bfa ist idx %ef
14:15:46 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:47 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:48 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:49 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:50 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:51 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:52 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:53 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:54 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:55 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:56 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:57 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:58 P 323 3 Sync 0.0ns 0 0b 0 0 0b 0 0ns 0 0 0 66279 97 9%
14:15:59 P 323 3 Sync 0.0ns 0 0b 0 16 784M 3 0ns 0 0 0 66279 30k 9%
14:16:00 P 323 3 Sync 0.0ns 0 0b 0 0 0b 3 0ns 0 0 0 66279 30k 9%
14:16:01 P 323 3 Sync 0.0ns 0 0b 0 0 0b 3 0ns 0 0 0 66279 30k 9%
14:16:02 P 323 3 Sync 0.0ns 0 0b 0 0 0b 4 0ns 0 0 0 66279 30k 9%
14:16:03 P 323 3 Sync 0.0ns 0 0b 0 0 0b 4 0ns 0 0 0 66279 30k 9%
14:16:04 P 323 3 Sync 0.0ns 0 0b 0 0 0b 4 0ns 0 0 0 66279 30k 9%
14:16:05 P 323 3 Sync 0.0ns 0 0b 0 0 0b 6 0ns 0 0 0 66279 30k 9%
14:16:06 P 323 3 Sync 0.0ns 0 0b 0 0 0b 6 0ns 0 0 0 66279 30k 9%
14:16:07 P 323 3 Sync 0.0ns 0 0b 0 0 0b 6 0ns 0 0 0 66279 30k 9%
14:16:08 P 323 3 Sync 0.0ns 0 0b 0 0 0b 55 0ns 0 0 0 66279 30k 9%
14:16:09 P 323 3 Sync 0.0ns 0 0b 0 0 0b 244 0ns 0 0 0 66271 30k 9%
14:16:10 P 323 3 Sync 0.0ns 0 0b 0 0 0b 439 0ns 0 0 0 66271 30k 9%
14:16:11 P 323 3 Sync 0.0ns 0 0b 0 0 0b 597 0ns 0 0 0 66271 30k 9%
14:16:12 P 323 3 Sync 0.0ns 0 0b 0 0 0b 828 0ns 0 0 0 66257 30k 9%
14:16:13 P 323 3 Sync 0.0ns 0 0b 0 0 0b 1k 0ns 0 0 0 66249 30k 9%
14:16:14 P 323 3 Sync 0.0ns 0 0b 0 0 0b 2k 0ns 0 0 0 66240 30k 9%
14:16:15 P 323 3 Sync 0.0ns 0 0b 0 0 0b 2k 0ns 0 0 0 66229 30k 9%
14:16:16 P 323 3 Sync 1.0ms 0 0b 0 0 0b 2k 261ms 1 0 0 66228 30k 9%
14:16:17 P 323 3 Sync 1.0ms 0 0b 0 0 0b 2k 1.00s 0 0 0 66228 30k 9%
14:16:18 P 323 3 Sync 1.0ms 0 0b 0 0 0b 2k 997ms 0 0 0 66228 30k 9%
14:16:19 P 323 3 Sync 1.0ms 0 0b 0 0 0b 2k 1.00s 0 0 0 66228 30k 9%
14:16:20 P 323 3 Sync 1.0ms 0 0b 0 0 0b 2k 1.00s 0 0 0 66228 30k 9%
What we noticed:
- First, the IDX column (wsrep_cert_index_size) rises to around 31k.
- Afterwards the receive queue on the slave nodes shoots up until the gcs.fc_limit of 2048 is reached, at which point flow control is invoked and pauses replication (the status variables behind these columns can be polled as shown below).
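For reference, the columns in this output appear to map onto the standard wsrep status variables (IDX to wsrep_cert_index_size, the inbound/outbound queues to wsrep_local_recv_queue / wsrep_local_send_queue, and the FlowC figures to wsrep_flow_control_paused / wsrep_flow_control_sent), so the same numbers can be watched on each node with nothing more than SHOW STATUS; a minimal sketch:

-- Poll the counters behind the IDX, que and pause columns on every node.
SHOW GLOBAL STATUS WHERE Variable_name IN
    ('wsrep_cert_index_size',
     'wsrep_cert_deps_distance',
     'wsrep_local_recv_queue',
     'wsrep_local_recv_queue_avg',
     'wsrep_local_send_queue',
     'wsrep_flow_control_paused',
     'wsrep_flow_control_sent');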
We are trying to find out why this happens, but so far we have not found a solution.
Is there a way to find out which writesets cause the IDX increase?
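Our own unverified guess is that the occasional very large writeset visible in the slave output above (roughly 784M) comes from a huge multi-row transaction, and that each modified row contributes certification keys; would looking for such transactions along these lines be a sensible way to narrow it down?

-- Rough idea (our assumption, not a confirmed diagnostic): list open transactions
-- that have already modified the most rows, since their writesets would carry
-- correspondingly many certification keys.
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_modified
FROM information_schema.innodb_trx
ORDER BY trx_rows_modified DESC
LIMIT 10;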
An earlier post in the Galera (codership-team) Google group describes what sounds like a similar issue:
https://groups.google.com/forum/#!msg/codership-team/iieFnDSBUNY/6CwsD_3qXxYJ
Feel free to get back to me if further information is needed; I'm happy to provide it.