MariaDB Server
MDEV-10314

wsrep_sync_wait does not seem to be working

Details

    Description

      We have been trying to test the critical reads enabled by the 'wsrep_sync_wait' parameter, but the results do not seem to be deterministic.

      Here is the test script:
      https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh

      This test script is based on the following test case:
      https://github.com/MariaDB/server/blob/2783fc7d14bc8ad16acfeb509d3b19615023f47a/mysql-test/suite/galera/t/mysql-wsrep%23201.test#L5
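
      For context, the core pattern such a test exercises looks roughly like this (an editorial sketch, not the actual script; host arguments and table name are illustrative):

```shell
#!/bin/sh
# Sketch of a wsrep_sync_wait "critical read" check (illustrative only;
# assumes table test.t1 exists and passwordless access to both nodes).
NODE1=$1   # node that receives the write
NODE2=$2   # node that is read back

# Write on node 1.
mysql -h "$NODE1" test -e "INSERT INTO t1 VALUES (1);"

# Read on node 2 with wsrep_sync_wait=1: the SELECT should block until
# node 2 has applied all write-sets that preceded it, so the row must
# already be visible here.
mysql -h "$NODE2" test -e "SET SESSION wsrep_sync_wait=1; SELECT COUNT(*) FROM t1;"
```

      If the second query can return a count that lags the writer, the causal-read guarantee is being violated.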

      Here is the version info:

      MariaDB [(none)]> show variables like 'version%';
      +-------------------------+---------------------------------+
      | Variable_name           | Value                           |
      +-------------------------+---------------------------------+
      | version                 | 10.1.12-MariaDB                 |
      | version_comment         | MariaDB Server                  |
      | version_compile_machine | x86_64                          |
      | version_compile_os      | Linux                           |
      | version_malloc_library  | system jemalloc                 |
      | version_ssl_library     | OpenSSL 1.0.1e-fips 11 Feb 2013 |
      +-------------------------+---------------------------------+
      

      The errors are more prominent when we run the following Java test, where the connection is persistent, so queries are issued faster:
      https://github.com/sjangra-git/galera-tests

      Auto-commit is OFF on the servers in the cluster:

      MariaDB [(none)]> show global variables like 'autocommit';
      +---------------+-------+
      | Variable_name | Value |
      +---------------+-------+
      | autocommit    | OFF   |
      +---------------+-------+
      

      Some runs failed and some passed for the same test:

      -bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
      val1=5735 val2=5732
      syn_wait FAILED
      -bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
      -bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
      -bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
      -bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
      -bash-4.1$ ./test.sh 10.226.76.38 10.226.76.32
      

      Attachments

        Issue Links

          Activity

            sandeep Sandeep Jangra created issue -

            Failures are easily reproducible when I have two instances of mariadb server running locally in VMs.

            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            val1=8791 val2=8789
            syn_wait FAILED
            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            val1=2917 val2=2915
            syn_wait FAILED
            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            val1=1087 val2=1085
            syn_wait FAILED
            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            val1=2371 val2=2369
            syn_wait FAILED
            [app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
            val1=6287 val2=6285
            syn_wait FAILED

            sandeep Sandeep Jangra added a comment
            elenst Elena Stepanova made changes -
            Description reformatted ({noformat} blocks added around console output)
            elenst Elena Stepanova made changes -
            Assignee Nirbhay Choubey [ nirbhay_c ]

            Hey Nirbhay Choubey,

            Based on my experiments it seems like this issue is easily reproducible when the network latency is low. This is evident from the second test when I had two instances of mariadb running locally.

            Also, it would be interesting to run the unit test that is mentioned in the description for 5000-10,000 iterations.

            sandeep Sandeep Jangra added a comment
            danblack Daniel Black added a comment - - edited

            sandeep, with wsrep_sync_wait not set (it defaults to 0), can I get you to test the following with your scripts? I'm wondering if this is suitable as a synchronisation method.

            on both:

            set global wsrep_gtid_mode=1
            

            connection 1:

            select @@last_gtid
            

            after the insert. And on connection 2:

            select MASTER_GTID_WAIT($last_gtid, 0.5), MAX(id) FROM t1
            

            danblack Daniel Black added a comment - edited
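
            Putting the suggestion above together, the flow would be roughly as follows (an editorial sketch; host variables and the GTID capture are illustrative):

```shell
# Enable GTID tracking on both nodes, as suggested above.
mysql -h "$NODE1" -e "SET GLOBAL wsrep_gtid_mode=1;"
mysql -h "$NODE2" -e "SET GLOBAL wsrep_gtid_mode=1;"

# Connection 1: perform the write, then capture its GTID.
LAST_GTID=$(mysql -h "$NODE1" -N test -e "INSERT INTO t1 VALUES (1); SELECT @@last_gtid;")

# Connection 2: wait up to 0.5s for that GTID to be applied before reading.
mysql -h "$NODE2" test -e "SELECT MASTER_GTID_WAIT('$LAST_GTID', 0.5), MAX(id) FROM t1;"
```

            MASTER_GTID_WAIT returns 0 once the GTID has been applied, or -1 on timeout, so the reader can detect whether it is ahead of replication.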

            Daniel, thanks for providing the workaround.

            Here is the new test:
            https://github.com/sjangra-git/galera-tests/blob/master/scripts/test-2.sh#L36

            I originally tried a timeout of 0.5, and it did not seem to work; I did see the replica node running behind the source. Then I increased the timeout to 5 and am still running it. But that does not make much sense, because effectively I am slowing the writes down.

            Would you mind running the test.sh on your installation and confirm if you see these errors too? I am going to update the documentation so it becomes easy to install/run these tests.

            sandeep Sandeep Jangra added a comment
            nirbhay_c Nirbhay Choubey (Inactive) added a comment - - edited

            sandeep: Check out MDEV-10161. This issue has now been fixed. But until you get a fixed version, a workaround would be to also set wsrep_causal_reads to ON.

            nirbhay_c Nirbhay Choubey (Inactive) added a comment - edited
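
            The interim workaround would look like this in a session (a sketch; table name illustrative):

```sql
-- Workaround suggested above: set both variables for the session before
-- issuing the read. (wsrep_causal_reads was later deprecated in favour of
-- wsrep_sync_wait.)
SET SESSION wsrep_causal_reads = ON;
SET SESSION wsrep_sync_wait = 1;
SELECT MAX(id) FROM t1;
```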
            nirbhay_c Nirbhay Choubey (Inactive) made changes -
            Component/s Galera [ 10124 ]
            Component/s wsrep [ 11500 ]
            Fix Version/s 10.1.15 [ 22018 ]
            Resolution Duplicate [ 3 ]
            Status Open [ 1 ] Closed [ 6 ]
            sandeep Sandeep Jangra added a comment - - edited

            Hey Nirbhay,

            MDEV-10161 talks about wsrep_sync_wait variable not being set from the config file.

            In my test I am setting it at the session level, so I doubt that these two issues are the same. Also, my test checks the wsrep_causal_reads flag and makes sure that it is ON. Here: https://github.com/sjangra-git/galera-tests/blob/master/src/main/java/syncWrites/simplified.java#L162

            But let me run these tests on 10.1.15 and verify.

            sandeep Sandeep Jangra added a comment - edited
            sandeep Sandeep Jangra added a comment - - edited

            Nirbhay,

            I ran the same tests on 10.1.15 and I am still getting errors. The problem is not that 'wsrep_sync_wait' fails to get set to 1: I am setting it at the session level, and the Java test code also checks that it is set before running. The problem is that even with wsrep_sync_wait set to 1, we sometimes read stale data from other nodes in the cluster (nodes that have not yet applied the write).

            If you observe the output captured below, the test sometimes fails at counter values like 2, 4340, or 1366, so it seems like some kind of race condition.

            MariaDB [test]> select @@version;
            +-----------------+
            | @@version       |
            +-----------------+
            | 10.1.15-MariaDB |
            +-----------------+
             
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=4 val2=2
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=4340 val2=4338
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102^C
            [app@cb-node1 ~]$ vi test.sh 
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=1366 val2=1364
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=4 val2=2
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=236 val2=234
            syn_wait FAILED
            

            sandeep Sandeep Jangra added a comment - edited

            sandeep You are right. It's not the same as MDEV-10161.

            nirbhay_c Nirbhay Choubey (Inactive) added a comment

            Reopening for further investigation.

            nirbhay_c Nirbhay Choubey (Inactive) added a comment
            nirbhay_c Nirbhay Choubey (Inactive) made changes -
            Resolution Duplicate [ 3 ]
            Status Closed [ 6 ] Stalled [ 10000 ]

            sandeep: @@global.autocommit=OFF could possibly be the culprit here, but I cannot say for
            sure, as I am not able to reproduce the failure of test.sh on my end.

            See, with autocommit=OFF, all SELECTs (except the very first) over a single connection become
            part of a single large transaction, and thus no wait occurs for those SELECTs.

            However, in the case of test.sh, all SELECTs are executed over a new connection (unlike
            your Java test), so each is the first statement of its transaction, and I think it must wait
            irrespective of the autocommit value.

            So, I would suggest you rerun the tests after applying the following patches:
            https://gist.github.com/nirbhayc/f5634f3b88bb7324ecee6705d319eb63 (MDEV-10314.sh.patch)
            https://gist.github.com/nirbhayc/af53be2cf318ceb96fdc4ec7041a6b3d (MDEV-10314.java.patch)

            Additionally, you can apply the following patch to 10.1 HEAD before running your tests, so that
            mysqld aborts in case a SELECT fails to wait:
            https://gist.github.com/nirbhayc/5cc515b5a0fa1cd96eb165ab4f57b293

            nirbhay_c Nirbhay Choubey (Inactive) added a comment
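
            The autocommit=OFF effect described above can be illustrated with a single-connection session (an editorial sketch; table name illustrative):

```sql
-- With autocommit=OFF, the first statement implicitly opens a transaction
-- that stays open until an explicit COMMIT/ROLLBACK.
SET SESSION wsrep_sync_wait = 1;
SELECT MAX(id) FROM t1;  -- first SELECT of the transaction: sync wait applies
SELECT MAX(id) FROM t1;  -- same open transaction: no new wait, may read stale data
COMMIT;
SELECT MAX(id) FROM t1;  -- new transaction: sync wait applies again
```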
            serg Sergei Golubchik made changes -
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.1.15 [ 22018 ]
            sandeep Sandeep Jangra added a comment - - edited

            Nirbhay - First off, thanks for taking a look at these tests and for the patches.

            Your point that with autocommit=OFF all SELECTs run as part of the same transaction makes sense. So let's focus on the shell test to begin with.

            I ran the updated test.sh as per your patch (https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh) and still see some failures. I increased the number of iterations from 5000 to 50000 to let it run longer. Here are the observations:

            On a local 2 node cluster (running on vagrant on my laptop):
            The test fails sometimes at 2, or 43K or 22K.
            --------------------------------------------------------------------------------------------------------
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=22512 val2=22510
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=2496 val2=2494
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=9556 val2=9554
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=4 val2=2
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=42590 val2=42588
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=42590 val2=42588
            syn_wait FAILED
            [app@cb-node1 ~]$ vi test.sh
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=14498 val2=14496
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=3486 val2=3484
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=8426 val2=8424
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=43942 val2=43940
            syn_wait FAILED

            On a 3 node cluster running in our private cloud:
            --------------------------------------------------------------------------------------------------------
            -bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
            val1=4743 val2=4741
            syn_wait FAILED
            -bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
            -bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
            ^C
            -bash-4.1$ vi test.sh
            -bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
            ERROR 1050 (42S01) at line 1: Table 't1' already exists
            val1=43943 val2=43941
            syn_wait FAILED
            -bash-4.1$ vi test.sh
            -bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
            val1=43943 val2=43941
            syn_wait FAILED
            -bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
            val1=43943 val2=43941
            syn_wait FAILED
            -bash-4.1$ ./test.sh 10.65.227.132 10.65.227.159
            val1=43943 val2=43941
            syn_wait FAILED

            --------------------------------------------------------------------------------------------------------
            To summarize, try these in your environment to see this test fail:

            - Increase the number of iterations that this test runs: https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh#L29
            - Run test.sh on one of the MariaDB server nodes instead of a separate client machine, just to rule out network latency.
            - Give it a couple of runs; it passes for me too sometimes.

            I will continue with the Java test but don't want to distract you. The shell test seems simple, so at least that should pass for me. Thanks again!

            sandeep Sandeep Jangra added a comment - edited

            Hi sandeep, please share your node configuration. What is the server version?

            nirbhay_c Nirbhay Choubey (Inactive) added a comment
            sandeep Sandeep Jangra made changes -
            Attachment my.cnf [ 42291 ]
            sandeep Sandeep Jangra added a comment - - edited

            Here is my node config for my local VMs:

            MariaDB [(none)]> select @@version;
            +-----------------+
            | @@version       |
            +-----------------+
            | 10.1.15-MariaDB |
            +-----------------+

            Attaching the my.cnf file to this ticket: my.cnf

            I am also running 10.1.12 with the same configuration in a different cluster.

            sandeep Sandeep Jangra added a comment - edited

            sandeep I did try with 50K iterations with no failures. Can you try it again with QC turned off just
            to be sure you are not hitting https://github.com/codership/mysql-wsrep/issues/201?

            nirbhay_c Nirbhay Choubey (Inactive) added a comment
            nirbhay_c Nirbhay Choubey (Inactive) made changes -
            Status Stalled [ 10000 ] Confirmed [ 10101 ]

            Nirbhay,

            I tried with cache turned OFF on both nodes in my 2 node cluster.

            MariaDB [(none)]> select @@global.query_cache_type;
            +---------------------------+
            | @@global.query_cache_type |
            +---------------------------+
            | OFF                       |
            +---------------------------+
            1 row in set (0.00 sec)

            Updated the test.sh to disable the cache on each run. https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh#L21

            I still see errors at random values of the counter:
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=8 val2=6
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=3234 val2=3232
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=688 val2=686
            syn_wait FAILED
            [app@cb-node1 ~]$ ./test.sh 192.168.42.101 192.168.42.102
            val1=9852 val2=9850

            I will see if I can create a VM image of the environment and attach it to this JIRA, so we can look at the same environment.

            Btw, I did see the issue move to 'Confirmed'; just curious if it failed for you too.

            sandeep Sandeep Jangra added a comment

            It has been confirmed. It's due to the thread pool (thread_handling=pool-of-threads).

            nirbhay_c Nirbhay Choubey (Inactive) added a comment
            nirbhay_c Nirbhay Choubey (Inactive) made changes -
            Fix Version/s 10.1.17 [ 22102 ]
            Fix Version/s 10.0.27-galera [ 22104 ]
            Fix Version/s 10.1 [ 16100 ]
            Resolution Fixed [ 1 ]
            Status Confirmed [ 10101 ] Closed [ 6 ]
            nirbhay_c Nirbhay Choubey (Inactive) made changes -
            nirbhay_c Nirbhay Choubey (Inactive) made changes -
            Fix Version/s 5.5.51-galera [ 22103 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 76300 ] MariaDB v4 [ 150571 ]

            People

              nirbhay_c Nirbhay Choubey (Inactive)
              sandeep Sandeep Jangra
              Votes: 1
              Watchers: 7

