[MDEV-10314] wsrep_sync_wait does not seem to be working Created: 2016-07-01 Updated: 2016-08-12 Resolved: 2016-07-26 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, wsrep |
| Affects Version/s: | 10.1.12 |
| Fix Version/s: | 10.1.17, 5.5.51-galera, 10.0.27-galera |
| Type: | Bug | Priority: | Major |
| Reporter: | Sandeep Jangra | Assignee: | Nirbhay Choubey (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
We have been testing the critical reads enabled by the 'wsrep_sync_wait' parameter, but the results do not seem to be deterministic. Here is the test script: This test script is based on the following test case: Here is the version info:
The errors are more prominent when we run the following Java test, where the connection is persistent and the queries are therefore attempted faster: Auto-commit is OFF on the servers in the cluster:
A few runs failed and a few passed for the same test:
|
| Comments |
| Comment by Sandeep Jangra [ 2016-07-01 ] |
|
Failures are easily reproducible when I have two instances of MariaDB server running locally in VMs.
[app@cb-node2 ~]$ ./test.sh 192.168.42.102 192.168.42.101
| Comment by Sandeep Jangra [ 2016-07-08 ] |
|
Hey Nirbhay Choubey, based on my experiments, this issue seems to be easily reproducible when the network latency is low. This is evident from the second test, where I had two instances of MariaDB running locally. Also, it would be interesting to run the unit test mentioned in the description for 5,000-10,000 iterations.
| Comment by Daniel Black [ 2016-07-14 ] |
|
sandeep, with wsrep_sync_wait not set (it defaults to 0), can I get you to test the following with your scripts? I'm wondering if this is suitable as a synchronisation method. On both:
On connection 1:
after the insert, and on connection 2:
| Comment by Sandeep Jangra [ 2016-07-14 ] |
|
Daniel, thanks for providing the workaround. Here is the new test: I originally tried with a timeout of 0.5, and it did not seem to work. I did see the replica node running behind the source. Would you mind running test.sh on your installation and confirming whether you see these errors too? I am going to update the documentation so it becomes easy to install and run these tests.
| Comment by Nirbhay Choubey (Inactive) [ 2016-07-14 ] |
|
sandeep: Check out
| Comment by Sandeep Jangra [ 2016-07-14 ] |
|
Hey Nirbhay, MDEV-10161 is about the wsrep_sync_wait variable not being set from the config file. In my test I am setting it at the session level, so I doubt that these two issues are related. Also, in my test I do check the wsrep_causal_reads flag and make sure that it is ON. Here: https://github.com/sjangra-git/galera-tests/blob/master/src/main/java/syncWrites/simplified.java#L162 But let me run these tests on 10.1.15 and verify.
| Comment by Sandeep Jangra [ 2016-07-15 ] |
|
Nirbhay, I ran the same tests on 10.1.15 and I am still getting errors. The problem here is not that the variable 'wsrep_sync_wait' fails to get set to 1. I am setting this variable at the session level, and the Java test code also checks that it is set before running. The problem is that even when wsrep_sync_wait is set to 1, we sometimes read stale data from other nodes in the cluster (nodes that have not yet received the write). If you look at the output captured below, the test fails at different counter values (2, 4340, 1366, etc.), so it looks like some kind of race condition.
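For readers without access to the original scripts, the kind of write-then-read check described in this comment can be sketched roughly as follows. This is only an illustration, not the reporter's test.sh: the table `test.t`, the function names, the `MYSQL` override variable, and the default of 5000 iterations are all assumptions made for the example. The only parts taken from the ticket are the two-host invocation and the session-level `wsrep_sync_wait=1`.

```shell
#!/bin/sh
# Sketch of a critical-read check between two Galera nodes.
# Assumptions (not from the ticket): a table test.t(id INT PRIMARY KEY)
# exists, credentials come from the client's option file, and MYSQL can be
# overridden to point at a different client binary.

MYSQL="${MYSQL:-mysql}"

# Insert one row on the writer node.
write_row() {   # usage: write_row <writer_host> <id>
  "$MYSQL" -h "$1" -N -B -e "INSERT INTO test.t (id) VALUES ($2)"
}

# Count matching rows on the reader node, with the causality check enabled
# for this session only (bit 1 of wsrep_sync_wait covers reads).
read_count() {  # usage: read_count <reader_host> <id>
  "$MYSQL" -h "$1" -N -B \
    -e "SET SESSION wsrep_sync_wait=1; SELECT COUNT(*) FROM test.t WHERE id=$2"
}

# Run write-then-read rounds and stop at the first stale read.
run_check() {   # usage: run_check <writer_host> <reader_host> <iterations>
  i=1
  while [ "$i" -le "$3" ]; do
    write_row "$1" "$i" || return 2
    seen=$(read_count "$2" "$i") || return 2
    if [ "$seen" != "1" ]; then
      echo "stale read at counter $i"
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $3 reads were consistent"
}

# Only contact a cluster when hosts are given on the command line.
if [ "$#" -ge 2 ]; then
  run_check "$1" "$2" "${3:-5000}"
fi
```

Each `mysql` invocation opens a new connection, so every read is the first statement of its session after the `SET`. A stale read therefore shows up as a count of 0 for a row that was just inserted on the other node.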
| Comment by Nirbhay Choubey (Inactive) [ 2016-07-15 ] |
|
sandeep, you are right. It's not the same as
| Comment by Nirbhay Choubey (Inactive) [ 2016-07-15 ] |
|
Reopening for further investigation.
| Comment by Nirbhay Choubey (Inactive) [ 2016-07-15 ] |
|
sandeep: @@global.autocommit=OFF could possibly be the culprit here, but I cannot say for
See, with autocommit=OFF all SELECTs (except the very first) over a single connection become
However, in the case of test.sh, since all SELECTs are always executed over a new connection
So, I would suggest you rerun the tests after applying the following patch:
Additionally, you can apply the following patch to 10.1 HEAD before running your tests, so that
| Comment by Sandeep Jangra [ 2016-07-19 ] |
|
Nirbhay, first off, thanks for taking a look at these tests and for the patches. The point that with autocommit=OFF all SELECTs run as part of the same transaction makes sense, so let's focus on the shell test to begin with. I ran the updated test.sh as per your patch: https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh and still see some failures. I increased the number of iterations from 5000 to 50000 to let it run for longer. Here is the observation: On a local 2-node cluster (running on Vagrant on my laptop): On a 3-node cluster running in our private cloud:
I will continue with the Java test but don't want to distract you. The shell test seems simple, so at least that should pass for me. Thanks again!
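The autocommit point raised in this exchange can be illustrated with a small sketch. When autocommit is OFF and a connection is reused, ending the transaction after each read makes every SELECT start a fresh one. The helper below only generates the SQL for such a session; it is not the patch referred to in the thread (whose contents are not shown here), and the function name and the table `test.t` are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: SQL for one persistent reader session on a server running with
# autocommit=OFF. A COMMIT after each SELECT ends the implicit transaction,
# so the next SELECT does not run inside the previous read's transaction.
# Assumption (not from the ticket): table test.t(id INT PRIMARY KEY).

build_read_session() {  # usage: build_read_session <id> <number_of_reads>
  echo "SET SESSION wsrep_sync_wait=1;"
  n=0
  while [ "$n" -lt "$2" ]; do
    echo "SELECT COUNT(*) FROM test.t WHERE id=$1;"
    echo "COMMIT;"
    n=$((n + 1))
  done
}
```

The generated statements can be piped into a single client connection, for example `build_read_session 42 3 | mysql -h <reader_host> -N -B`. An alternative with the same intent would be to issue `SET SESSION autocommit=1;` once at the start of the session.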
| Comment by Nirbhay Choubey (Inactive) [ 2016-07-19 ] |
|
Hi sandeep, please share your node configurations. What is the server version?
| Comment by Sandeep Jangra [ 2016-07-19 ] |
|
Here is my node config for my local VMs:
MariaDB [(none)]> select @@version;
Attaching the my.cnf file to this ticket: my.cnf
I am also running 10.1.12 with the same configuration in a different cluster.
| Comment by Nirbhay Choubey (Inactive) [ 2016-07-19 ] |
|
sandeep, I did try with 50K iterations with no failures. Can you try it again with QC turned off just
| Comment by Sandeep Jangra [ 2016-07-19 ] |
|
Nirbhay, I tried with the query cache turned OFF on both nodes in my 2-node cluster.
MariaDB [(none)]> select @@global.query_cache_type;
I updated test.sh to disable the cache on each run: https://github.com/sjangra-git/galera-tests/blob/master/scripts/test.sh#L21 I still see errors at random values of the counter. I will see if I can create a VM image of the environment and attach it to this Jira so we can look at the same environment. Btw, I did see the issue move to 'confirmed'; just curious whether it failed for you too.
| Comment by Nirbhay Choubey (Inactive) [ 2016-07-19 ] |
|
It has been confirmed. It's due to the thread pool (thread_handling=pool-of-threads).
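Since the cause was traced to the thread pool, a quick way to see whether a given node is running with it is to query `@@global.thread_handling`. The helper below is a minimal sketch; the function name, the `MYSQL` override variable, and the wording of the messages are assumptions made for the example. Whether a node is actually affected also depends on its server version (the ticket lists 10.1.17, 5.5.51-galera and 10.0.27-galera as fix versions).

```shell
#!/bin/sh
# Sketch: report whether a node uses the thread pool.
# Assumption (not from the ticket): credentials come from the client's
# option file, and MYSQL can be overridden to point at another client.

MYSQL="${MYSQL:-mysql}"

check_thread_handling() {  # usage: check_thread_handling <host>
  mode=$("$MYSQL" -h "$1" -N -B -e "SELECT @@global.thread_handling") || return 2
  if [ "$mode" = "pool-of-threads" ]; then
    echo "$1: thread pool enabled; critical reads may be affected on unfixed versions"
    return 1
  fi
  echo "$1: thread_handling=$mode"
}
```

Calling `check_thread_handling 192.168.42.101` would print the mode for that node and return non-zero when the thread pool is in use.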