[MDEV-14423] Galera Cluster behaves asynchronously Created: 2017-11-16 Updated: 2017-11-23 Resolved: 2017-11-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.2.10 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Konstantin Vasserman | Assignee: | Andrii Nikitin (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
3-node MariaDB Galera cluster, running on Ubuntu 16.04 VMs. |
||
| Description |
|
We've noticed something really troubling with our cluster: DELETEs, INSERTs, and UPDATEs don't seem to be applied synchronously across Galera nodes. To test this, I wrote a simple bash script that inserts records into a table on one node and immediately reads the row count from a different node. Depending on the latency between servers, I get the wrong row count from the second node in up to about 39% of attempts. If I understand the nature of Galera Cluster correctly, this should never happen: when I update one node, it shouldn't confirm the update until all nodes have received the data.

This is what I've done:

1. I created a database called 'Testing' with a single table:
CREATE TABLE `Table1` (

2. I wrote a bash script that does the following:
sql='DELETE FROM Table1;';
sql="INSERT INTO Table1(Text) VALUES('Text1'), ('Text2'), ('Text3'), ('Text4');";
sql='SELECT COUNT
As you can see, DELETE and INSERT are sent to node1, while SELECT is issued on node2.

3. Some percentage of the time, depending on the nodes selected and/or timing/network lag, SELECT returns 0 rows (between 0.1% and 40% of the time).

4. If I add a 1-second delay between INSERT and SELECT, I always get the correct number of rows.

5. There are no errors on the MariaDB nodes that I can see, and replication seems to be working as far as I can tell.

I'm attaching the test script I wrote. You have to modify it to set your username and password, and call it with two parameters for the hostnames of the nodes, like so:
./test_galera.sh mynode1.domain.com mynode2.domain.com

Please help. Unless I'm missing something obvious, this is a critical issue. Let me know if I can provide any additional info to solve this situation. Thank you. |
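For readers without the attachment, the write-on-one-node, read-on-another check described above might look roughly like the sketch below. It is not the reporter's script: the function names, the `DB_USER`/`DB_PASS` placeholders, and the `SELECT COUNT(*)` form of the count query are assumptions made for illustration. Only the database name, the table name, and the DELETE and INSERT statements come from the report.

```shell
# Sketch only: helper names and credentials are placeholders, not the original script.
DB_USER="${DB_USER:-user}"
DB_PASS="${DB_PASS:-password}"
DB_NAME="Testing"
EXPECTED_ROWS=4

# Run one SQL statement on a host.
# -N drops the column-name header, -B gives plain tab-separated output.
run_sql() {
    local host="$1" sql="$2"
    mysql -h "$host" -u "$DB_USER" -p"$DB_PASS" -N -B -e "$sql" "$DB_NAME"
}

# Write on the first node, read on the second.
# Returns 0 only when the read sees all inserted rows.
check_once() {
    local writer="$1" reader="$2" count
    run_sql "$writer" "DELETE FROM Table1;"
    run_sql "$writer" "INSERT INTO Table1(Text) VALUES('Text1'), ('Text2'), ('Text3'), ('Text4');"
    # Assumed form of the count query; the original statement is cut off in the report.
    count="$(run_sql "$reader" "SELECT COUNT(*) FROM Table1;")"
    [ "$count" = "$EXPECTED_ROWS" ]
}

# Repeat the check and print how many reads came back with the wrong count.
stale_reads() {
    local writer="$1" reader="$2" attempts="$3"
    local stale=0 i=0
    while [ "$i" -lt "$attempts" ]; do
        check_once "$writer" "$reader" || stale=$((stale + 1))
        i=$((i + 1))
    done
    echo "$stale"
}
```

A call such as `stale_reads mynode1.domain.com mynode2.domain.com 1000` would then print the number of stale reads out of 1000 attempts.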
| Comments |
| Comment by Andrii Nikitin (Inactive) [ 2017-11-17 ] |
|
I assume wsrep_sync_wait had default value here? Could you re-try the test with nodes restarted with wsrep_sync_wait=4 ? https://mariadb.com/kb/en/library/galera-cluster-system-variables/#wsrep_sync_wait |
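As far as I know, wsrep_sync_wait is a dynamic variable with both global and session scope, so it can also be inspected and changed without restarting the nodes. The helpers below are a hedged sketch: the function names and the `DB_USER`/`DB_PASS` placeholders are hypothetical, and the account is assumed to have the privilege needed for `SET GLOBAL`.

```shell
# Hypothetical helpers; DB_USER and DB_PASS are placeholders.
DB_USER="${DB_USER:-user}"
DB_PASS="${DB_PASS:-password}"

# Print the current global wsrep_sync_wait value of one node.
get_sync_wait() {
    local host="$1"
    mysql -h "$host" -u "$DB_USER" -p"$DB_PASS" -N -B -e "SELECT @@GLOBAL.wsrep_sync_wait;"
}

# Set the global value on one node.
# Existing sessions keep their own value; only new connections pick up the change.
set_sync_wait() {
    local host="$1" value="$2"
    # Accept only a plain non-negative integer, so nothing else reaches the SQL text.
    case "$value" in
        ''|*[!0-9]*) echo "value must be a non-negative integer" >&2; return 1 ;;
    esac
    mysql -h "$host" -u "$DB_USER" -p"$DB_PASS" -e "SET GLOBAL wsrep_sync_wait = $value;"
}
```

A value set this way does not survive a server restart; to make it permanent it would also have to go into the server configuration file.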
| Comment by Konstantin Vasserman [ 2017-11-17 ] |
|
Thank you for the quick response, Andrii. I just tested setting wsrep_sync_wait to various values. Setting it to 4 didn't make any difference, but setting it to 1 (or 3) did fix the issue.

BTW, the Galera people have a much better explanation of what this setting does (http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait). However, I'm not sure why you thought that setting it to 4 would make a difference in this case. 4 seems to cover only INSERTs, and we were basically getting "dirty reads" on SELECT.

In any case, it looks like we should be setting wsrep_sync_wait to some value other than zero if we want "true" synchronous behavior. One thing I don't completely understand about this flag is whether setting it to 3 (checks made on READ, UPDATE and DELETE statements) will also cover the case of "INSERT INTO x

Thank you for your help. |
| Comment by Daniel Black [ 2017-11-22 ] |
|
And there is an 8 bit too, added in some unspecified version. greenman, the documentation is rather ambiguous about whether this is a bitmask or a plain value (both upstream and the KB). |
| Comment by Andrii Nikitin (Inactive) [ 2017-11-22 ] |
|
kvasserman, thank you for the confirmation. Indeed, I should have suggested "7"; I'm not sure where "4" came from. |
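Since the thread treats wsrep_sync_wait as a bitmask, a small decoder may help when reading values such as 3, 4, or 7. The bit meanings used here (1 = READ, 2 = UPDATE/DELETE, 4 = INSERT/REPLACE, 8 = SHOW) reflect my understanding of the MariaDB and Galera documentation and should be checked against the docs for the version in use. The function name is hypothetical.

```shell
# Decode a wsrep_sync_wait value into the statement classes it covers.
# Assumed bit meanings: 1 = READ, 2 = UPDATE/DELETE, 4 = INSERT/REPLACE, 8 = SHOW.
decode_sync_wait() {
    local value="$1" out=""
    if [ $((value & 1)) -ne 0 ]; then out="$out READ"; fi
    if [ $((value & 2)) -ne 0 ]; then out="$out UPDATE/DELETE"; fi
    if [ $((value & 4)) -ne 0 ]; then out="$out INSERT/REPLACE"; fi
    if [ $((value & 8)) -ne 0 ]; then out="$out SHOW"; fi
    if [ -z "$out" ]; then
        echo "none"
    else
        # Strip the leading space left by the first append.
        echo "${out# }"
    fi
}

decode_sync_wait 4   # prints: INSERT/REPLACE
decode_sync_wait 3   # prints: READ UPDATE/DELETE
decode_sync_wait 7   # prints: READ UPDATE/DELETE INSERT/REPLACE
```

Read this way, 4 leaves plain SELECTs unchecked, which matches the observation that only values with the 1 bit set (1, 3, 7) fixed the stale reads.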
| Comment by Ian Gilfillan [ 2017-11-23 ] |
|
Thanks danblack, I've updated the docs. |