We have been trying to test the critical reads enabled by the 'wsrep_sync_wait' parameter, but the results do not seem to be deterministic.
Here is the version info:
{noformat}
MariaDB [(none)]> show variables like 'version%';
+-------------------------+---------------------------------+
| Variable_name | Value |
+-------------------------+---------------------------------+
| version | 10.1.12-MariaDB |
| version_comment | MariaDB Server |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
| version_malloc_library | system jemalloc |
| version_ssl_library | OpenSSL 1.0.1e-fips 11 Feb 2013 |
+-------------------------+---------------------------------+
{noformat}
The errors are more prominent when we run the following Java test, where the connection is persisted, so the queries are issued faster:
https://github.com/sjangra-git/galera-tests
Auto-commit is OFF on the servers in the cluster:
{noformat}
MariaDB [(none)]> show global variables like 'autocommit';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| autocommit | OFF |
+---------------+-------+
{noformat}
A few runs failed and a few passed for the same test:
{noformat}
-bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
val1=5735 val2=5732
syn_wait FAILED
-bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
-bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
-bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
-bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
-bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
{noformat}
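For reference, a minimal sketch of the kind of check the shell and Java tests perform (the actual scripts in the repository above may differ in table and column names; a table t1 with an integer id column is an assumption here): write on one node, then read on another node with wsrep_sync_wait enabled and compare the values.
{noformat}
-- On node 2 (the reader), enable causal reads for this session:
SET SESSION wsrep_sync_wait = 1;

-- On node 1 (the writer), commit a new row:
INSERT INTO t1 (id) VALUES (42);

-- Back on node 2, this SELECT should block until the cluster-wide state
-- at the time of the query has been applied locally, so it must see the
-- row inserted above:
SELECT MAX(id) FROM t1;
{noformat}
If the SELECT on node 2 can return a value older than the one just written on node 1, that is exactly the 'syn_wait FAILED' condition reported by the test.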
Sandeep Jangra added a comment:
Hey Nirbhay Choubey,
Based on my experiments, it seems this issue is easily reproducible when the network latency is low. This is evident from the second test, where I had two instances of MariaDB running locally.
Also, it would be interesting to run the unit test mentioned in the description for 5,000-10,000 iterations.
Daniel Black added a comment:
sandeep, with wsrep_sync_wait not set (default 0), can I get you to test the following with your scripts? I'm wondering if this is suitable as a synchronisation method.
On both nodes:
set global wsrep_gtid_mode=1
Connection 1, after the insert:
select @@last_gtid
Connection 2:
select MASTER_GTID_WAIT($last_gtid, 0.5), MAX(id) FROM t1
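A hedged sketch of the GTID-based synchronisation suggested above, assuming a table t1 with an integer id column; how the GTID value is passed from the writer to the reader is up to the test harness:
{noformat}
-- On both nodes:
SET GLOBAL wsrep_gtid_mode = 1;

-- Connection 1 (writer), immediately after the INSERT:
SELECT @@last_gtid;   -- e.g. returns '0-1-1234' (value here is illustrative)

-- Connection 2 (reader), using the GTID obtained above:
-- MASTER_GTID_WAIT() blocks until this node has applied that GTID,
-- or returns -1 if the 0.5 second timeout expires first.
SELECT MASTER_GTID_WAIT('0-1-1234', 0.5), MAX(id) FROM t1;
{noformat}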
Sandeep Jangra added a comment:
Daniel, thanks for providing the workaround.
Here is the new test:
https://github.com/sjangra-git/galera-tests/blob/master/scripts/test-2.sh#L36
I originally tried with a timeout of 0.5 and it did not seem to work; I did see the replica node running behind the source.
I then increased the timeout to 5 and am still running it, but that does not make much sense because effectively I am slowing the writes down.
Would you mind running test.sh on your installation and confirming whether you see these errors too? I am going to update the documentation so it becomes easy to install/run these tests.
Nirbhay Choubey added a comment:
sandeep: Check out MDEV-10161. This issue has now been fixed. But until you get a fixed version, a workaround would be to also set wsrep_causal_reads to ON.
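A minimal illustration of the suggested workaround (shown at session scope; it can also be set globally or in the server configuration):
{noformat}
SET SESSION wsrep_causal_reads = ON;
SHOW SESSION VARIABLES LIKE 'wsrep_causal_reads';
{noformat}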
Sandeep Jangra added a comment:
Hey Nirbhay,
MDEV-10161 talks about the wsrep_sync_wait variable not being set from the config file.
In my test I am setting it at the session level, so I doubt that these two issues are the same. Also, in my test I do check the wsrep_causal_reads flag and make sure that it is ON. Here: https://github.com/sjangra-git/galera-tests/blob/master/src/main/java/syncWrites/simplified.java#L162
But let me run these tests on 10.1.15 and verify.
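For context, a minimal sketch of the session-level setting and the checks involved; the note about the two variables mirroring each other reflects my understanding of the wsrep implementation, so please verify it on your build:
{noformat}
SET SESSION wsrep_sync_wait = 1;
SHOW SESSION VARIABLES LIKE 'wsrep_sync_wait';     -- expect 1
SHOW SESSION VARIABLES LIKE 'wsrep_causal_reads';  -- expect ON (tracks bit 0 of wsrep_sync_wait)
{noformat}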
Sandeep Jangra added a comment:
Nirbhay,
I ran the same tests on 10.1.15 and I am still getting errors. The problem here is not that the variable 'wsrep_sync_wait' is not getting set to 1: I am setting it at the session level, and the Java test code also checks that it is set before running. The problem is that even when wsrep_sync_wait is set to 1, we sometimes read stale data from other nodes in the cluster (which had not yet received the write).
If you look at the output captured below, the test sometimes fails at counter 2, 4340, 1366, etc., so it looks like some kind of race condition.
{noformat}
MariaDB [test]> select @@version;
+-----------------+
| @@version |
+-----------------+
| 10.1.15-MariaDB |
+-----------------+
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
val1=4 val2=2
syn_wait FAILED
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
val1=4340 val2=4338
syn_wait FAILED
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102^C
[app@cb-node1 ~]$ vi test.sh
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
val1=1366 val2=1364
syn_wait FAILED
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
val1=4 val2=2
syn_wait FAILED
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
[app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
val1=236 val2=234
syn_wait FAILED
{noformat}
Nirbhay Choubey added a comment:
sandeep: @@global.autocommit=OFF could possibly be the culprit here, but I cannot say for sure, as I am not able to reproduce the failure of test.sh on my end.
See, with autocommit=OFF, all SELECTs (except the very first) over a single connection become part of a single large transaction, and thus no wait occurs for those SELECTs.
However, in the case of test.sh, since all SELECTs are executed over a new connection (unlike your Java test), they are always the first statement, so I think the wait must happen irrespective of the autocommit value.
So, I would suggest you rerun the tests after applying the following patches:
https://gist.github.com/nirbhayc/f5634f3b88bb7324ecee6705d319eb63 (MDEV-10314.sh.patch)
https://gist.github.com/nirbhayc/af53be2cf318ceb96fdc4ec7041a6b3d (MDEV-10314.java.patch)
Additionally, you can apply the following patch to 10.1 HEAD before running your tests, so that mysqld aborts in case a SELECT fails to wait:
https://gist.github.com/nirbhayc/5cc515b5a0fa1cd96eb165ab4f57b293
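To illustrate the autocommit=OFF point above (a hedged sketch of the behaviour being described, not an excerpt from the patches): once a transaction is open on a connection, further SELECTs on that connection do not trigger a new sync wait until the transaction ends.
{noformat}
SET SESSION autocommit = OFF;
SET SESSION wsrep_sync_wait = 1;

SELECT MAX(id) FROM t1;  -- first statement opens a transaction; the sync wait applies
SELECT MAX(id) FROM t1;  -- same open transaction; no new sync wait
COMMIT;                  -- close the transaction
SELECT MAX(id) FROM t1;  -- new transaction; the sync wait applies again
{noformat}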
Sandeep Jangra added a comment:
Nirbhay - First off, thanks for taking a look at these tests and the patches.
The explanation that with autocommit=OFF all SELECTs run as part of the same transaction makes sense, so let's focus on the shell test to begin with.
I ran the updated test.sh as per your patch (https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh) and still see some failures. I increased the number of iterations from 5,000 to 50,000 to let it run for longer. Here is the observation.
On a local 2-node cluster (running on Vagrant on my laptop), the test sometimes fails at counter 2, or around 22K or 43K:
{noformat}
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=22512 val2=22510
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=2496 val2=2494
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=9556 val2=9554
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=4 val2=2
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=42590 val2=42588
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=42590 val2=42588
syn_wait FAILED
[app@cb-node1 ~] $ vi test.sh
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=14498 val2=14496
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=3486 val2=3484
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=8426 val2=8424
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=43942 val2=43940
syn_wait FAILED
{noformat}
On a 3 node cluster running in our private cloud:
{noformat}
-bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
val1=4743 val2=4741
syn_wait FAILED
-bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
-bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
^C
-bash-4.1$ vi test.sh
-bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
ERROR 1050 (42S01) at line 1: Table 't1' already exists
val1=43943 val2=43941
syn_wait FAILED
-bash-4.1$ vi test.sh
-bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
val1=43943 val2=43941
syn_wait FAILED
-bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
val1=43943 val2=43941
syn_wait FAILED
-bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
val1=43943 val2=43941
syn_wait FAILED
{noformat}
To summarize, I would suggest trying the following in your environment to see this test fail:
1. Increase the number of iterations that this test runs: https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh#L29
2. Run test.sh on one of the MariaDB server nodes instead of running it from a separate client machine, just so we can rule out network latency.
3. Give it a couple of runs; it passes for me too sometimes.
I will continue with the Java test but don't want to distract you. The shell test seems simple, so at least that should pass for me. Thanks again!
Sandeep Jangra added a comment:
Here is my node config for my local VMs:
{noformat}
MariaDB [(none)]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.1.15-MariaDB |
+-----------------+
{noformat}
I am attaching the my.cnf file to this ticket (my.cnf).
I am also running 10.1.12 with the same configuration in a different cluster.
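The attached my.cnf is not reproduced here; purely as a hypothetical illustration, these are the kinds of Galera-related settings worth comparing across nodes for this ticket (all values below are placeholders, not the actual attachment):
{noformat}
[mysqld]
# Placeholder values - not the attached configuration.
wsrep_on                 = ON
wsrep_provider           = /usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address    = gcomm://192.168.42.101,192.168.42.102
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
# Variables under discussion in this ticket:
# wsrep_sync_wait / wsrep_causal_reads, autocommit, query_cache_type
{noformat}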
Nirbhay Choubey added a comment:
sandeep, I did try with 50K iterations, with no failures. Can you try it again with the query cache turned off, just to be sure you are not hitting https://github.com/codership/mysql-wsrep/issues/201?
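A minimal sketch of disabling the query cache for this check (global scope shown; setting it in my.cnf and restarting also works):
{noformat}
SET GLOBAL query_cache_type = OFF;
SET GLOBAL query_cache_size = 0;
SHOW GLOBAL VARIABLES LIKE 'query_cache%';
{noformat}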
Sandeep Jangra added a comment:
Nirbhay,
I tried with the query cache turned OFF on both nodes in my 2-node cluster.
{noformat}
MariaDB [(none)]> select @@global.query_cache_type;
+---------------------------+
| @@global.query_cache_type |
+---------------------------+
| OFF                       |
+---------------------------+
1 row in set (0.00 sec)
{noformat}
I updated test.sh to disable the cache on each run: https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh#L21
I still see errors at random values of the counter.
{noformat}
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=8 val2=6
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=3234 val2=3232
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=688 val2=686
syn_wait FAILED
[app@cb-node1 ~] $ ./test.sh 192.168.42.101 192.168.42.102
val1=9852 val2=9850
{noformat}
I will see if I can create a VM image of the environment and attach it to this JIRA so we can look at the same environment.
By the way, I did see the issue move to 'Confirmed'; just curious whether it failed for you too.
Failures are easily reproducible when I have two instances of MariaDB Server running locally in VMs.
{noformat}
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
val1=8791 val2=8789
syn_wait FAILED
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
val1=2917 val2=2915
syn_wait FAILED
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
val1=1087 val2=1085
syn_wait FAILED
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
val1=2371 val2=2369
syn_wait FAILED
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
val1=6287 val2=6285
syn_wait FAILED
{noformat}