MXS-2603: MaxScale causes connections to break in Percona PXC Cluster

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 2.3.9
    • Fix Version: 2.4.1
    • Components: galeramon, readwritesplit
    • Labels: None
    • Environment: CentOS 7, Percona XtraDB Cluster 5.7.26, MaxScale 2.3.9

    Description

      Markus, we have a potential customer who is having problems with MaxScale galeramon and the readwritesplit router. It looks as though transaction_replay is not functioning as expected in a Percona XtraDB Cluster (Galera 3).

      [MaxScale]
      threads=2
      admin_host=0.0.0.0
       
      [GaleraMonitor]
      type=monitor
      module=galeramon
      servers=db1,db2,db3
      user=maxscale
      password=demo_password
      available_when_donor=false
      monitor_interval=100
       
      [Splitter]
      type=service
      router=readwritesplit
      servers=db1,db2,db3
      user=maxscale
      password=demo_password
      transaction_replay=true
       
      [SplitterListener]
      type=listener
      service=Splitter
      protocol=MariaDBClient
      port=3306
       
      [db1]
      type=server
      address=10.10.10.10
      port=3306
      protocol=MariaDBBackend
       
      [db2]
      type=server
      address=10.10.10.11
      port=3306
      protocol=MariaDBBackend
       
      [db3]
      type=server
      address=10.10.10.12
      port=3306
      protocol=MariaDBBackend
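
      (Not part of the original report: one way to double-check that the router actually picked up transaction_replay is to inspect the service at runtime; the exact output format varies by MaxScale version.)

      maxctrl show service Splitter | grep -i transaction_replay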
      

      I ran the following benchmark test:

      sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-host=127.0.0.1 --mysql-user=dba --mysql-password=demo_password --mysql-db=sbtest --oltp-tables-count=2 --oltp-table-size=500000 --report-interval=5 --oltp-skip-trx=on --oltp-read-only=off --mysql-ignore-errors=1062 --rand-init=on --max-requests=0 --time=300 --threads=100 run
      

      maxctrl list servers

      ┌────────┬─────────────┬──────┬─────────────┬─────────────────────────┬──────┐
      │ Server │ Address     │ Port │ Connections │ State                   │ GTID │
      ├────────┼─────────────┼──────┼─────────────┼─────────────────────────┼──────┤
      │ db1    │ 10.10.10.10 │ 3306 │ 100         │ Slave, Synced, Running  │      │
      ├────────┼─────────────┼──────┼─────────────┼─────────────────────────┼──────┤
      │ db2    │ 10.10.10.11 │ 3306 │ 100         │ Master, Synced, Running │      │
      ├────────┼─────────────┼──────┼─────────────┼─────────────────────────┼──────┤
      │ db3    │ 10.10.10.12 │ 3306 │ 100         │ Slave, Synced, Running  │      │
      └────────┴─────────────┴──────┴─────────────┴─────────────────────────┴──────┘
      

      Then I stop the mysql process on node 2 to simulate a server crash:

      systemctl stop mysql

      All connections from MaxScale to all nodes are dropped.

      Then I get a fatal error from sysbench:

      FATAL: `thread_run' function failed: /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua:103: db_query() failed
      Error in my_thread_global_end(): 21 threads didn't exit
      

      ┌────────┬─────────────┬──────┬─────────────┬─────────────────────────┬──────┐
      │ Server │ Address     │ Port │ Connections │ State                   │ GTID │
      ├────────┼─────────────┼──────┼─────────────┼─────────────────────────┼──────┤
      │ db1    │ 10.10.10.10 │ 3306 │ 0           │ Master, Synced, Running │      │
      ├────────┼─────────────┼──────┼─────────────┼─────────────────────────┼──────┤
      │ db2    │ 10.10.10.11 │ 3306 │ 0           │ Down                    │      │
      ├────────┼─────────────┼──────┼─────────────┼─────────────────────────┼──────┤
      │ db3    │ 10.10.10.12 │ 3306 │ 0           │ Slave, Synced, Running  │      │
      └────────┴─────────────┴──────┴─────────────┴─────────────────────────┴──────┘
      

      I get a couple of errors like this:

      2019-07-16 04:01:54   error  : Failed to execute query on server 'db2' ([10.10.10.11]:3306): Can't connect to MySQL server on '10.10.10.11' (115)
      

      in the maxscale.log but not much else.

      I tried this test with mysqlslap and got the same results. The application should not be aware that any node in the cluster went down, and it certainly should not have its connection dropped.

      Any ideas?


        Activity

          markus makela added a comment (edited)

          One case where this can happen is if prepared statements are used and the connector only sends the COM_STMT_EXECUTE payload metadata on the first execution. MXS-2521 relates to this, and as it was fixed in 2.3.9, MaxScale should log an error if that is what is happening. This does not happen if prepared statements are disabled with the --db-ps-mode=disable option for sysbench.
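
          For example, a sketch reusing the exact benchmark command from the description, with only the prepared-statement option added:

          sysbench /usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-host=127.0.0.1 --mysql-user=dba --mysql-password=demo_password --mysql-db=sbtest --oltp-tables-count=2 --oltp-table-size=500000 --report-interval=5 --oltp-skip-trx=on --oltp-read-only=off --mysql-ignore-errors=1062 --rand-init=on --max-requests=0 --time=300 --threads=100 --db-ps-mode=disable run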

          Another case is a controlled shutdown: wsrep is disabled before the shutdown completes, which causes the following errors:

          FATAL: mysql_drv_query() returned error 1047 (WSREP has not yet prepared node for application use) for query 'INSERT INTO sbtest1 (id, k, c, pad) VALUES (4248, 5648, '66569344993-86514191216-39037804709-66944715838-08180357694-42559023759-35109164815-25396788447-42272109027-40306479926', '94884917449-23500970647-48004592049-64279775421-04171639299')'
          FATAL: `thread_run' function failed: ./oltp_common.lua:488: SQL error, errno = 1047, state = '08S01': WSREP has not yet prepared node for application use
          FATAL: mysql_drv_query() returned error 1047 (WSREP has not yet prepared node for application use) for query 'UPDATE sbtest1 SET k=k+1 WHERE id=4987'
          FATAL: `thread_run' function failed: ./oltp_common.lua:458: SQL error, errno = 1047, state = '08S01': WSREP has not yet prepared node for application use
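
          (As a rough way to observe this state from the outside, not part of the original test: the node reports wsrep_ready = OFF while it is in this state, which can be checked with the monitor credentials from the configuration above.)

          mysql -u maxscale -pdemo_password -h 10.10.10.11 -P 3306 -e "SHOW STATUS LIKE 'wsrep_ready'"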
          

          Killing the process with SIGKILL seems to produce the expected result: the client receives no errors when the node labeled as the master fails.
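
          A minimal sketch of that hard-kill test, assuming mysqld runs under systemd and is not configured to restart automatically:

          # on the node currently labeled Master (db2 in the original report)
          sudo kill -9 $(pidof mysqld)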

          markus makela added a comment

          Combined the code that detects transaction rollbacks and triggers transaction replays into common code that retries queries when an ignorable error is received. Extended the error messages to show when a WSREP error is received because the node is not yet ready for SQL statements.


          People

            Assignee: markus makela
            Reporter: Todd Stoffel (Inactive)
            Votes: 0
            Watchers: 3

