[MDEV-13322] Cluster hangs from highly concurrent sysbench session to multiple nodes Created: 2017-07-13  Updated: 2017-09-07  Resolved: 2017-09-07

Status: Closed
Project: MariaDB Server
Component/s: Galera
Fix Version/s: N/A

Type: Task Priority: Blocker
Reporter: Michaël de groot Assignee: Andrii Nikitin (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: global_vars_status_galera1.txt, global_vars_status_galera2.txt, global_vars_status_galera3.txt, maxscale.cnf.txt, my.cnf_galera1.txt, my.cnf_galera2.txt, my.cnf_galera3.txt, processlist_galera1.txt, processlist_galera2.txt, processlist_galera3.txt

 Description   

Hi,

For this exercise I used VirtualBox, 3 nodes, and MaxScale on one node. I do not think MaxScale is related to the problem, but it is an easy way to spread the load across multiple nodes. I used the training VMs from https://downloads.mariadb.com/training/courses/mariadb-enterprise-cluster/OS-Images/.

Steps to reproduce:

  • Install a cluster. I used a 3-node cluster.
  • Install MaxScale using the attached configuration. I used a readconnroute router that only routes to the slaves. This means that traffic is round-robined across the nodes that do not have wsrep_local_index set to 1.
  • Install sysbench.
  • Run sysbench against MaxScale.
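
For reference, the relevant part of the attached maxscale.cnf looks roughly like the sketch below (server names, credentials, and section names here are placeholders; the attached file is authoritative). router_options=slave makes the readconnroute service send traffic only to the nodes the Galera monitor does not consider the master:

[Read-Only-Service]
type=service
router=readconnroute
router_options=slave
servers=galera1,galera2,galera3
user=maxuser
passwd=maxpwd

[Read-Only-Listener]
type=listener
service=Read-Only-Service
protocol=MySQLClient
port=4006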

In most cases the cluster locked up after about 10 seconds; in one case it took a couple of minutes. The sysbench command used:
sysbench --db-driver=mysql --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-user=galera --mysql-password=galera --mysql-db=test --oltp-table-size=25000 --report-interval=5 --max-requests=0 --time=300 --num-threads=5 run --mysql-host=galera4 --mysql-port=4006

In the processlist of galera2 you will find a wsrep thread that is locked by something:
2 | system user | | NULL | Sleep | 3317 | Update_rows_log_event::find_row(129056) | UPDATE sbtest1 SET c='21925224570-87741851440-47847350341-42585377753-88509468277-36281519091-414733 | 0.000 |

On the other nodes you see threads waiting on COMMIT.
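
To capture this state on all nodes at once, a sketch like the following can be used (host names and credentials are taken from the sysbench command above and may need adjusting):

for node in galera1 galera2 galera3; do
  echo "== $node =="
  mysql -h $node -u galera -pgalera -e "SHOW FULL PROCESSLIST;"
done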

I am still waiting for the error logs, is there any other information you need?

Thanks,
Michaël



 Comments   
Comment by Andrii Nikitin (Inactive) [ 2017-07-14 ]

So far I wasn't able to reproduce the problem with "the same" test on a single machine (i.e. without VirtualBox). Could you confirm the exact MaxScale and sysbench versions, just in case?

Do you have a chance to try the commands below to set up the same cluster on a single machine and confirm whether such a setup shows the problem for you?

export MDB_VER=10.2.6
git clone http://github.com/AndriiNikitin/mariadb-environs
cd mariadb-environs
./get_plugin.sh galera
_template/plant_cluster.sh cluster1
echo m1 > cluster1/nodes.lst
echo m2 >> cluster1/nodes.lst
echo m3 >> cluster1/nodes.lst
cluster1/replant.sh ${MDB_VER}
 
# this will download and unpack 10.2.6 tar
./build_or_download.sh m1
 
# workaround MDEV-13283
sed -i "s/Distrib 10.1/Distrib 10/g" _depot/m-tar/${MDB_VER}/bin/wsrep_sst_mysqldump
 
cluster1/gen_cnf.sh
cluster1/install_db.sh
cluster1/galera_setup_acl.sh
cluster1/galera_start_new.sh
 
sleep 45
# confirm that cluster size is 3 on every node
cluster1/galera_cluster_size.sh
 
cluster1/sql.sh set global innodb_flush_log_at_trx_commit=2

Now replace server address in maxscale.cnf like below and run the test.

address=192.168.56.201
port=3306
with
address=127.0.0.1
port=3307

address=192.168.56.202
port=3306
with
address=127.0.0.1
port=3308

address=192.168.56.203
port=3306
with
address=127.0.0.1
port=3309

You can monitor the servers' load with:

cluster1/status.sh

Meanwhile I will try the same on VirtualBox.

Comment by Andrii Nikitin (Inactive) [ 2017-07-14 ]

As discussed on Slack, I cannot reproduce this with VirtualBox either. If you see the problem again, please capture the following in addition:

Comment by Seppo Jaakola [ 2017-07-17 ]

Node 2's slave thread is blocked, and it has fallen 33 writes behind in replication. This triggered flow control, and the other nodes are blocked because of that.
I could not reproduce this with a MariaDB 10.1 installation; maybe node 2 had a resource problem, e.g. blocked disk I/O?
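
If this happens again, the flow-control theory could be checked with a sketch like the one below (node names are assumptions from the attachments). A large wsrep_local_recv_queue on node 2 together with rising wsrep_flow_control_paused on the other nodes would support it:

for node in galera1 galera2 galera3; do
  echo "== $node =="
  mysql -h $node -u galera -pgalera -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_local_recv_queue', 'wsrep_flow_control_paused', 'wsrep_flow_control_sent', 'wsrep_local_state_comment');"
done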

Generated at Thu Feb 08 08:04:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.