[MDEV-13322] Cluster hangs from highly concurrent sysbench session to multiple nodes Created: 2017-07-13 Updated: 2017-09-07 Resolved: 2017-09-07 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Fix Version/s: | N/A |
| Type: | Task | Priority: | Blocker |
| Reporter: | Michaël de groot | Assignee: | Andrii Nikitin (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
Hi, for this exercise I used VirtualBox with 3 nodes, plus MaxScale on 1 node. I do not think MaxScale has anything to do with it, but it is an easy way to run the test against multiple nodes. I used the training VMs from https://downloads.mariadb.com/training/courses/mariadb-enterprise-cluster/OS-Images/. Steps to reproduce:
In most cases the cluster locked up after 10 seconds; in another case it took a couple of minutes.
The sysbench command used:
In the processlist of galera2 you will find a wsrep thread that is locked by something:
On the other nodes you see threads waiting on COMMIT.
I am still waiting for the error logs. Is there any other information you need? Thanks, |
| Comments |
| Comment by Andrii Nikitin (Inactive) [ 2017-07-14 ] |
|
So far I have not been able to reproduce the problem with "the same" test on a single machine (i.e. without VirtualBox). Could you confirm the exact MaxScale and sysbench versions, just in case? Do you have a chance to try the commands below to set up the same cluster on a single machine and confirm whether that setup shows any problem for you?
Now replace the server address in maxscale.cnf as shown below and run the test. You can monitor the server's load with
Meanwhile, I will try the same on VirtualBox. |
| Comment by Andrii Nikitin (Inactive) [ 2017-07-14 ] |
|
As discussed on Slack, I cannot reproduce this with VirtualBox either. If you see the problem again, please also capture:
|
| Comment by Seppo Jaakola [ 2017-07-17 ] |
|
Node 2's slave thread is blocked and it has fallen 33 writes behind in replication. This triggered flow control and other nodes are blocked because of that. |
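The flow-control explanation above can be checked against each node's wsrep status counters. The sketch below is illustrative only: it assumes the status values were collected separately (for example with `SHOW GLOBAL STATUS LIKE 'wsrep_%'`) and are passed in as plain dictionaries, and it assumes a `gcs.fc_limit` of 16, which should be verified against the provider options actually in use. The function names and all sample values except the 33-write backlog are hypothetical.

```python
from typing import List, Mapping

# Assumption: gcs.fc_limit is 16; verify against the wsrep_provider_options in use.
ASSUMED_FC_LIMIT = 16


def exceeds_fc_limit(status: Mapping[str, str], fc_limit: int = ASSUMED_FC_LIMIT) -> bool:
    """Return True if the node's receive queue is longer than the flow-control limit."""
    # wsrep_local_recv_queue is the number of write-sets waiting to be applied.
    recv_queue = int(status.get("wsrep_local_recv_queue", 0))
    return recv_queue > fc_limit


def lagging_nodes(nodes: Mapping[str, Mapping[str, str]],
                  fc_limit: int = ASSUMED_FC_LIMIT) -> List[str]:
    """Return the names of nodes whose receive queue exceeds the limit, sorted by name."""
    return sorted(name for name, status in nodes.items()
                  if exceeds_fc_limit(status, fc_limit))


if __name__ == "__main__":
    # Hypothetical snapshot; only the backlog of 33 on node 2 comes from the ticket.
    snapshot = {
        "galera1": {"wsrep_local_recv_queue": "0"},
        "galera2": {"wsrep_local_recv_queue": "33"},
        "galera3": {"wsrep_local_recv_queue": "2"},
    }
    print(lagging_nodes(snapshot))
```

A node listed here is a likely source of flow-control pauses. The `wsrep_flow_control_paused` and `wsrep_flow_control_sent` counters on each node can help confirm whether replication was actually being throttled.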