[MXS-3472] Transaction Replay: transactions not replayed after Xpand group change Created: 2021-03-30  Updated: 2021-05-06  Resolved: 2021-04-14

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: None
Fix Version/s: 2.5.11

Type: Bug Priority: Major
Reporter: Rahul Joshi (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

MariaDB MaxScale 2.5.9 (Commit: f4a081d82c5b00a33a71f7ba209501ee9ab223d4) and
Xpand Glassbutte 5.3.15
Both using CentOS 7


Attachments: Text File maxscale-MXS-3472-500-retries.log     Text File maxscale-MXS-3472-drop session node working.log     Text File maxscale-MXS-3472-working.log    
Issue Links:
Issue split
split to MXS-3490 Xpand monitor should detect and handl... Closed
PartOf
includes MXS-3481 Complete test plan for transaction re... Closed

 Description   

1) What works: Transaction replay works fine with Xpand direct as long as there is NO activity on outstanding transactions during the group change. A transaction that begins before the group change is initiated and stays idle during it is able to continue happily after the group change completes. This includes the case where the transaction in question sits on a session connected to a departing node: MaxScale moves the session to the surviving node successfully.

This has been tested in two scenarios:
a) Very fast group change. The monitor does not have a chance to detect the node departure, but the splitter gets a rollback. The splitter reacts properly and completes the session transfer and replay after the group change finishes.
b) Long-running group change. The monitor detects the group change, spins around waiting for it to complete, and abandons the node, yet keeps monitoring the situation. No rollback seems to register in the traces. As long as the transaction in question does not attempt any activity before the group change completes, all is fine. Once the GC completes, the transaction can proceed, and there will be a successful session transfer (if needed) and a replay.

2) What does not work: Should the transaction in progress attempt to execute anything while the group change is in progress, the monitor loses track of its nodes. It attempts to engage a node long gone from the cluster: a node that had been taken down and subsequently restored under a different number in previous tests.

Below is a relevant fragment of the trace. Scenario:

• There are three nodes – 3, 15, and 22 up.
• We open a session, and it goes to node 15. We start a transaction on it.
• We kill node 22 (10.2.13.175).

The group change is detected all right. At 23:01:48 we attempt an insert, and transaction migration from node 15 begins. Except it goes to the wrong node (node-20, one killed in a previous experiment). What exactly happens is not obvious from the traces: node 20 used to be on 10.2.13.175, where node 22 was a second ago, but it is not clear which IP MaxScale is attempting to use, or with which credentials. It fails hard enough for MaxScale to abandon all hope and report "session closed" to the client.

Comment: The splitter attempts session transfer and replay the moment the insert is detected, even though the monitor is still aware that a group change is in progress. Not only that: there appears to be a gallant but ultimately unsuccessful attempt to find an Xpand bootstrap node from one of the nodes used during the previous run of MaxScale. Would it be better to just wait until the group change is over?

The fact of the matter is that no matter how hard one tries, there will not be a suitable node to replay on while this group change is going on. Of course we can keep trying, circling the cluster over and over; just that if there is a good number of outstanding transactions all in this same state, it may get very hairy very fast. Is there a way to coordinate with the splitter and have it pause everything until the group change is in fact over?
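One way to picture the suggested "pause until the group change is over" behavior is to treat Xpand's group-change error as a signal to wait and retry, rather than to immediately migrate the transaction. This is purely an illustrative sketch: the error code 16388 comes from the traces below, but the helper names, exception class, and timing defaults are invented and are not MaxScale's implementation.

```python
import time

# Xpand error code seen in the traces when a group change is in progress.
GROUP_CHANGE_IN_PROGRESS = 16388

class XpandError(Exception):
    """Hypothetical wrapper carrying an Xpand/MySQL error code."""
    def __init__(self, code, msg=""):
        super().__init__(msg)
        self.code = code

def run_with_group_change_wait(op, max_wait=360.0, interval=1.0, sleep=time.sleep):
    """Run op(); on 'group change in progress', pause and retry in place
    instead of circling the cluster looking for a node to replay on."""
    waited = 0.0
    while True:
        try:
            return op()
        except XpandError as e:
            if e.code != GROUP_CHANGE_IN_PROGRESS or waited >= max_wait:
                raise  # not a group-change stall, or we waited long enough
            sleep(interval)
            waited += interval
```

Any other error (for example the 1045 access-denied seen in the trace) still propagates immediately; only the group-change stall is absorbed, bounded by max_wait.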

Trace fragment.

2021-04-02 23:01:45   info   : [xpandmon] Deleted Xpand node 3 from bookkeeping.
2021-04-02 23:01:45   info   : [xpandmon] Deleted Xpand node 15 from bookkeeping.
2021-04-02 23:01:45   info   : [xpandmon] Deleted Xpand node 22 from bookkeeping.
2021-04-02 23:01:47   error  : [xpandmon] Xpand: Could not execute 'SELECT status FROM system.membership WHERE nid = gtmnid()' on 10.2.12.248 via TCP/IP: [16388] Group change during GTM operation: group change in progress, try restarting transaction
2021-04-02 23:01:47   error  : [xpandmon] Xpand: Could either not ping or create connection to 10.2.12.248:3306: [16388] Group change during GTM operation: group change in progress, try restarting transaction
2021-04-02 23:01:47   error  : [xpandmon] Xpand: Could either not ping or create connection to 10.2.15.59:3306: [16388] Group change during GTM operation: group change in progress, try restarting transaction
2021-04-02 23:01:47   error  : [xpandmon] Xpand: Could either not ping or create connection to 10.2.13.175:3306: Can't connect to MySQL server on '10.2.13.175' (115)
2021-04-02 23:01:47   error  : [xpandmon] Xpand: Could either not ping or create connection to karma114:3306: [16388] Group change during GTM operation: group change in progress, try restarting transaction
2021-04-02 23:01:47   notice : [xpandmon] Attempting to find a Xpand bootstrap node from one of the nodes used during the previous run of MaxScale.
2021-04-02 23:01:47   error  : [xpandmon] Xpand: Could not connect to any server or no server that could be connected to was part of the quorum.
2021-04-02 23:01:48   info   : (7) (Splitter-Service) > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 30, type: QUERY_TYPE_WRITE, stmt: insert into t1 values (3)
2021-04-02 23:01:48   info   : (7) [readwritesplit] (Splitter-Service) Starting transaction migration to '@@Xpand:node-20'
2021-04-02 23:01:48   info   : (7) [readwritesplit] (Splitter-Service) Starting transaction replay 1
2021-04-02 23:01:48   info   : (7) [readwritesplit] (Splitter-Service) Replaying: begin
2021-04-02 23:01:49   info   : > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 10, type: QUERY_TYPE_BEGIN_TRX, stmt: begin
2021-04-02 23:01:49   info   : [readwritesplit] Replacing old master '@@Xpand:node-15' with new master '@@Xpand:node-20'
2021-04-02 23:01:49   info   : Execute sescmd #1 on '@@Xpand:node-20': [COM_INIT_DB]
2021-04-02 23:01:49   info   : [readwritesplit] Connected to '@@Xpand:node-20'
2021-04-02 23:01:49   info   : [readwritesplit] Queuing query until '@@Xpand:node-20' completes session command
2021-04-02 23:01:49   info   : (7) Connected to '@@Xpand:node-20' with thread id 1025
2021-04-02 23:01:49   error  : (7) Invalid authentication message from backend '@@Xpand:node-20'. Error code: 1045, Msg : #HY000: [39936] Access denied: for user 'maxscale'@'' (using password: YES)
2021-04-02 23:01:49   info   : (7) [readwritesplit] (Splitter-Service) Master '@@Xpand:node-20' failed: #HY000: Lost connection to backend server: connection closed by peer (@@Xpand:node-20)
2021-04-02 23:01:49   error  : (7) [readwritesplit] (Splitter-Service) Lost connection to the master server, closing session. Lost connection to master server while waiting for a result. Connection has been idle for 0 seconds. Error caused by: #HY000: Lost connection to backend server: connection closed by peer (@@Xpand:node-20). Last close reason: <none>. Last error:
2021-04-02 23:01:49   info   : Stopped Splitter-Service client session [7]

Original Description
Start trx.
Do a group change on Xpand.
While the group is yet to be formed try to insert another row.
After the group is formed try another insert.
Expected: the whole trx should now be retried.
Actual: trx replay fails.

mysql> set session autocommit=false;
Query OK, 0 rows affected (0.01 sec)
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into t1 values (1);
Query OK, 1 row affected (0.01 sec)
-- here, run /opt/clustrix/bin/clx dbrestart on one of the xpand nodes and before the cluster is formed, execute this:
mysql> insert into t1 values (2);
ERROR 1927 (HY000): Lost connection to backend server: network error (@@Xpand:node-1: 111, Connection refused)
mysql> insert into t1 values (3);
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    2
Current database: db1
Query OK, 1 row affected (6.35 sec)
mysql> commit;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from t1;
+------+
| i    |
+------+
|    3 |
+------+
1 row in set (0.00 sec)
mysql>

Config file:

[maxscale]
log_info=1
threads=auto
logdir=/data/clustrix/log
[Bootstrap1-karma114]
type=server
address=karma114
port=3306
protocol=mariadbbackend
[Xpand]
type=monitor
module=xpandmon
servers=Bootstrap1-karma114
user=maxscale
password=maxscale_pw
#cluster_monitor_interval=10s
[Splitter-Service]
type=service
router=readwritesplit
user=maxscale
password=maxscale_pw
cluster=Xpand
transaction_replay=true
slave_selection_criteria=LEAST_GLOBAL_CONNECTIONS
delayed_retry_timeout=360s
transaction_replay_attempts=500
transaction_replay_retry_on_deadlock=true
[Splitter-Listener]
type=listener
service=Splitter-Service
protocol=MariaDBClient
address=0.0.0.0
port=3306

Logs:

2021-03-29 23:55:44   info   : (1) (Splitter-Service) > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 10, type: QUERY_TYPE_BEGIN_TRX, stmt: begin
2021-03-29 23:55:44   info   : (1) [readwritesplit] (Splitter-Service) Route query to master: @@Xpand:node-2 <
2021-03-29 23:55:44   info   : (1) [readwritesplit] (Splitter-Service) Transaction starting on '@@Xpand:node-2'
2021-03-29 23:55:44   info   : (1) [readwritesplit] (Splitter-Service) Adding to trx: begin
2021-03-29 23:55:44   info   : (1) [readwritesplit] (Splitter-Service) Reply complete, last reply from @@Xpand:node-2
2021-03-29 23:55:51   info   : (1) (Splitter-Service) > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 30, type: QUERY_TYPE_WRITE, stmt: insert into t1 values (1)
2021-03-29 23:55:51   info   : (1) [readwritesplit] (Splitter-Service) Route query to master: @@Xpand:node-2 <
2021-03-29 23:55:51   info   : (1) [readwritesplit] (Splitter-Service) Adding to trx: insert into t1 values (1)
2021-03-29 23:55:51   info   : (1) [readwritesplit] (Splitter-Service) Reply complete, last reply from @@Xpand:node-2
2021-03-29 23:56:18   info   : (1) [readwritesplit] (Splitter-Service) Master '@@Xpand:node-2' failed: #HY000: Lost connection to backend server: connection closed by peer (@@Xpand:node-2)
2021-03-29 23:56:18   info   : (1) [readwritesplit] (Splitter-Service) Starting transaction replay 1
2021-03-29 23:56:18   info   : (1) [readwritesplit] (Splitter-Service) Replaying: begin
2021-03-29 23:56:19   info   : > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 10, type: QUERY_TYPE_BEGIN_TRX, stmt: begin
2021-03-29 23:56:19   info   : [readwritesplit] Replacing old master '@@Xpand:node-2' with new master '@@Xpand:node-1'
2021-03-29 23:56:19   info   : Execute sescmd #1 on '@@Xpand:node-1': [COM_INIT_DB]
2021-03-29 23:56:19   info   : [readwritesplit] Connected to '@@Xpand:node-1'
2021-03-29 23:56:19   info   : [readwritesplit] Queuing query until '@@Xpand:node-1' completes session command
2021-03-29 23:56:19   info   : (1) [readwritesplit] (Splitter-Service) Master '@@Xpand:node-1' failed: #HY000: Lost connection to backend server: network error (@@Xpand:node-1: 111, Connection refused)
2021-03-29 23:56:19   error  : (1) [readwritesplit] (Splitter-Service) Lost connection to the master server, closing session. Lost connection to master server while waiting for a result. Connection has been idle for 0 seconds. Error caused by: #HY000: Lost connection to backend server: network error (@@Xpand:node-1: 111, Connection refused). Last close reason: <none>. Last error:
2021-03-29 23:56:19   info   : Stopped Splitter-Service client session [1]

[root@karma076 ~]# maxctrl list servers
┌─────────────────────┬─────────────┬──────┬─────────────┬─────────────────┬──────┐
│ Server              │ Address     │ Port │ Connections │ State           │ GTID │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ @@Xpand:node-3      │ 10.2.12.248 │ 3306 │ 0           │ Master, Running │      │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ @@Xpand:node-1      │ 10.2.13.175 │ 3306 │ 0           │ Master, Running │      │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ @@Xpand:node-2      │ 10.2.15.59  │ 3306 │ 1           │ Master, Running │      │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ Bootstrap1-karma114 │ karma114    │ 3306 │ 0           │ Down            │      │
└─────────────────────┴─────────────┴──────┴─────────────┴─────────────────┴──────┘



 Comments   
Comment by Rahul Joshi (Inactive) [ 2021-03-31 ]

Tried with Flex up: added another node to the Xpand cluster while the trx was going on.
The error is different from the one in the description, but replay still does not work.

mysql> set session autocommit=false;
Query OK, 0 rows affected (0.00 sec)
 
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
 
mysql> insert into t1 values (1);
Query OK, 1 row affected (0.01 sec)
 
-- here, run  "alter cluster add karma156" on one of the xpand nodes and before the cluster is formed, execute this:
 
mysql> insert into t1 values (2);
ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> insert into t1 values (3);
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    10
Current database: db1
 
Query OK, 1 row affected (0.38 sec)
 
mysql> commit;
Query OK, 0 rows affected (0.00 sec)
 
mysql> select * from t1;
+------+
| i    |
+------+
|    3 |
+------+
1 row in set (0.01 sec)

Conf file same as the one in the description.
Logs:

2021-03-31 19:42:30   info   : (9) (Splitter-Service) Execute sescmd #3 on '@@Xpand:node-2': [COM_QUERY] set session autocommit=false
2021-03-31 19:42:30   info   : (9) [readwritesplit] (Splitter-Service) Route query to master: @@Xpand:node-2
2021-03-31 19:42:30   info   : (9) [readwritesplit] (Splitter-Service) Will return response from '@@Xpand:node-2' to the client
2021-03-31 19:42:30   info   : (9) [readwritesplit] (Splitter-Service) Reply complete, last reply from @@Xpand:node-2
2021-03-31 19:42:36   info   : (9) (Splitter-Service) > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 10, type: QUERY_TYPE_BEGIN_TRX, stmt: begin
2021-03-31 19:42:36   info   : (9) [readwritesplit] (Splitter-Service) Route query to master: @@Xpand:node-2 <
2021-03-31 19:42:36   info   : (9) [readwritesplit] (Splitter-Service) Adding to trx: begin
2021-03-31 19:42:36   info   : (9) [readwritesplit] (Splitter-Service) Reply complete, last reply from @@Xpand:node-2
2021-03-31 19:42:40   info   : (9) (Splitter-Service) > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 30, type: QUERY_TYPE_WRITE, stmt: insert into t1 values (1)
2021-03-31 19:42:40   info   : (9) [readwritesplit] (Splitter-Service) Route query to master: @@Xpand:node-2 <
2021-03-31 19:42:40   info   : (9) [readwritesplit] (Splitter-Service) Adding to trx: insert into t1 values (1)
2021-03-31 19:42:40   info   : (9) [readwritesplit] (Splitter-Service) Reply complete, last reply from @@Xpand:node-2
2021-03-31 19:42:59   info   : [xpandmon] Deleted Xpand node 2 from bookkeeping.
2021-03-31 19:42:59   info   : [xpandmon] Deleted Xpand node 3 from bookkeeping.
2021-03-31 19:42:59   info   : (9) (Splitter-Service) > Autocommit: [disabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 30, type: QUERY_TYPE_WRITE, stmt: insert into t1 values (2)
2021-03-31 19:42:59   error  : (9) [readwritesplit] (Splitter-Service) Could not find valid server for target type TARGET_MASTER (COM_QUERY: insert into t1 values (2)), closing connection.
 
name: [@@Xpand:node-2] status: [Down] state: [IN_USE] last opened at: [Wed Mar 31 19:40:43 2021] last closed at: [not closed] last close reason: [] num sescmd: [0]
name: [@@Xpand:node-1] status: [Down] state: [NOT_IN_USE] last opened at: [not opened] last closed at: [not closed] last close reason: [] num sescmd: [0]
name: [@@Xpand:node-3] status: [Down] state: [NOT_IN_USE] last opened at: [not opened] last closed at: [not closed] last close reason: [] num sescmd: [0]
name: [@@Xpand:node-4] status: [Down] state: [NOT_IN_USE] last opened at: [not opened] last closed at: [not closed] last close reason: [] num sescmd: [0]
name: [@@Xpand:node-5] status: [Down] state: [NOT_IN_USE] last opened at: [not opened] last closed at: [not closed] last close reason: [] num sescmd: [0]
2021-03-31 19:42:59   error  : (9) [mariadbclient] Routing the query failed. Session will be closed.
2021-03-31 19:42:59   info   : Stopped Splitter-Service client session [9]
2021-03-31 19:43:01   notice : Created server '@@Xpand:node-6' at 10.2.13.175:3306
2021-03-31 19:43:01   info   : [xpandmon] Updated Xpand node in bookkeeping: 6, '10.2.13.175', 3306, 3581.
2021-03-31 19:43:01   info   : [xpandmon] Updated Xpand node in bookkeeping: 2, '10.2.15.59', 3306, 3581.
2021-03-31 19:43:01   info   : [xpandmon] Updated Xpand node in bookkeeping: 3, '10.2.12.248', 3306, 3581.
2021-03-31 19:43:05   info   : (10) Found matching user 'maxscale'@'%' for client 'maxscale'@'10.2.16.29' with sufficient privileges.

Comment by markus makela [ 2021-04-04 ]

I think I've managed to reproduce at least a similar problem to this. I've created a patch that's on the 2.5-markusjm branch on GitHub if you want to test and see if it solves this as well.

Comment by Rahul Joshi (Inactive) [ 2021-04-06 ]

Hi johan.wikman,

I pinged the health port in a loop for all nodes while a node was being dropped from the cluster.
Please let me know if my way of checking it is incorrect.

The node being dropped fails the health check until it has been dropped; then it comes up as a single node and its health check starts to succeed again.
The other nodes always return success on the health check, even during the group change.
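The probe in the transcript below boils down to a reachability check against the health port. A minimal sketch of the same check, assuming the health port 3581 from the transcript (the function name is invented):

```python
import urllib.error
import urllib.request

HEALTH_PORT = 3581  # Xpand health-check port, as used in the transcript

def node_healthy(host, port=HEALTH_PORT, timeout=2.0):
    """True if the node's health endpoint answers at all.

    In the experiment, only the node being dropped refuses connections;
    surviving nodes keep answering even while the group change is in
    progress."""
    try:
        with urllib.request.urlopen(f"http://{host}:{port}/", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False
```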

[root@karma114 ~]# clx s;mysql -A  -e "alter cluster drop 29";date; for i in {1..20}; do clx s; date; clx s | grep "Cluster Status:   OK"; if [ $? -eq 0 ] ; then echo "cluster up";break; fi;for node in karma114 karma146 karma156; do echo "$node:"; curl http://$node:3581/; if [ $? -eq 0 ] ; then echo "success" ; fi; done;sleep 1;done
Cluster Name:    cl5251201ed0b7ef50
Cluster Version: 5.3.15
Cluster Status:   OK
Cluster Size:    3 nodes - 16 CPUs per Node
Current Node:    karma114 - nid 15
 
nid |  Hostname | Status |  IP Address  | TPS |       Used      |  Total
----+-----------+--------+--------------+-----+-----------------+--------
  3 |  karma146 |    OK  |  10.2.12.248 |   0 |  271.9M (0.03%) |  761.9G
 15 |  karma114 |    OK  |   10.2.15.59 |   0 |  283.5M (0.04%) |  761.9G
 29 |  karma156 |    OK  |  10.2.13.175 |   0 |   69.6M (0.01%) |  761.9G
----+-----------+--------+--------------+-----+-----------------+--------
                                            0 |  625.0M (0.03%) |    2.2T
Tue Apr  6 22:03:42 UTC 2021
Error connecting to database via /data/clustrix/mysql.sock: (1, '[16388] Group change during GTM operation: group change in progress, try restarting transaction')
== Using cached node details ==
Cluster Name:    cl5251201ed0b7ef50
Cluster Version: 5.3.15
Cluster Status:   UNKW
Cluster Size:    3 nodes - 16 CPUs per Node
 
nid |  Hostname |  Status |  IP Address  | Zone | TPS |    Used    | Total
----+-----------+---------+--------------+------+-----+------------+------
  3 |  karma146 |   UNKW  |  10.2.12.248 |    0 |   0 |  0 (0.00%) |     0
 15 |  karma114 |   UNKW  |   10.2.15.59 |    0 |   0 |  0 (0.00%) |     0
 29 |  karma156 |   UNKW  |  10.2.13.175 |    0 |   0 |  0 (0.00%) |     0
----+-----------+---------+--------------+------+-----+------------+------
                                                    0 |  0 (0.00%) |     0
Tue Apr  6 22:03:42 UTC 2021
karma114:
success
karma146:
success
karma156:
curl: (7) Failed connect to karma156:3581; Connection refused
[... four similar polling iterations elided (22:03:43 to 22:03:47): karma114 and karma146 health checks succeed, karma156 still refuses the connection, cluster status UNKW ...]
Error connecting to database via /data/clustrix/mysql.sock: (1, '[16388] Group change during GTM operation: group change in progress, try restarting transaction')
== Using cached node details ==
Cluster Name:    cl5251201ed0b7ef50
Cluster Version: 5.3.15
Cluster Status:   UNKW
Cluster Size:    3 nodes - 16 CPUs per Node
 
nid |  Hostname |  Status |  IP Address  | Zone | TPS |    Used    | Total
----+-----------+---------+--------------+------+-----+------------+------
  3 |  karma146 |   UNKW  |  10.2.12.248 |    0 |   0 |  0 (0.00%) |     0
 15 |  karma114 |   UNKW  |   10.2.15.59 |    0 |   0 |  0 (0.00%) |     0
 29 |  karma156 |   UNKW  |  10.2.13.175 |    0 |   0 |  0 (0.00%) |     0
----+-----------+---------+--------------+------+-----+------------+------
                                                    0 |  0 (0.00%) |     0
Tue Apr  6 22:03:48 UTC 2021
karma114:
success
karma146:
success
karma156:
success
[... seven similar polling iterations elided (22:03:50 to 22:03:57): health checks on all three nodes succeed while the group change is still in progress, cluster status UNKW ...]
Cluster Name:    cl5251201ed0b7ef50
Cluster Version: 5.3.15
Cluster Status:   OK
Cluster Size:    2 nodes - 16 CPUs per Node
Current Node:    karma114 - nid 15
 
nid |  Hostname | Status |  IP Address  | TPS |       Used      |  Total
----+-----------+--------+--------------+-----+-----------------+--------
  3 |  karma146 |    OK  |  10.2.12.248 |   0 |  276.2M (0.04%) |  761.9G
 15 |  karma114 |    OK  |   10.2.15.59 |   0 |  286.0M (0.04%) |  761.9G
----+-----------+--------+--------------+-----+-----------------+--------
                                            0 |  562.1M (0.04%) |    1.5T
Tue Apr  6 22:03:58 UTC 2021
Cluster Status:   OK
cluster up
[root@karma114 ~]#

Comment by Rahul Joshi (Inactive) [ 2021-04-09 ]

gdorman Yes, that is correct. It works when the session is on a surviving node. If the session is connected to the node being dropped, it tries to replay the set number of times (500 in my case) and then errors out.

MySQL [db1]>  insert into t1 values (3);
ERROR 1927 (HY000): Lost connection to backend server: connection closed by peer (@@Xpand:node-33)

More details in maxscale-MXS-3472-500-retries.log

Comment by markus makela [ 2021-04-12 ]

Fixed a bug where the destruction of a server did not clear its status bits. In 2.5.11, destroyed servers will now behave as if they were down. This should help prevent the use of servers that are no longer monitored by MaxScale.
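The effect of the fix can be pictured with a toy status-bitmask model. The class, flag names, and values here are invented for illustration and are not MaxScale's actual implementation:

```python
# Illustrative status bits, loosely modeled on the states shown by
# 'maxctrl list servers' (e.g. "Master, Running"); not real flag values.
STATUS_RUNNING = 0x01
STATUS_MASTER = 0x02

class Server:
    def __init__(self, name):
        self.name = name
        self.status = 0

    def set_status(self, bits):
        self.status |= bits

    def destroy(self):
        # The fix: clear all status bits on destruction, so a destroyed
        # server reads as Down instead of retaining "Master, Running"
        # and being picked as a replay target (like node-20 above).
        self.status = 0

    def is_usable(self):
        return bool(self.status & STATUS_RUNNING)
```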

Comment by Gregory Dorman (Inactive) [ 2021-04-14 ]

We should probably do a full test as in the linked ticket?

Comment by Gregory Dorman (Inactive) [ 2021-04-14 ]

Yes, it behaves correctly in one test (Flex down). Let's put this into IN TESTING first. msnijjar, johan.wikman: can you coordinate a full test of MaxScale-Xpand replays?

Comment by Gregory Dorman (Inactive) [ 2021-04-14 ]

As in MXS-3481...

Comment by Manjinder Nijjar [ 2021-04-14 ]

We will close it as soon as we have completed our testing. We will open new Jiras for any new issues and won't overload this ticket. I have also updated MXS-3481, which Rahul is working on.

Comment by Manjinder Nijjar [ 2021-04-14 ]

markus makela Can you please move this to "in testing" / fixed? I cannot change state for this ticket.

Comment by markus makela [ 2021-04-14 ]

I closed it; there's no In Testing state for the MXS project. Submitting new issues for any new bugs is better: this way we'll have a more accurate record of what the problem was and what was fixed.

Comment by Manjinder Nijjar [ 2021-04-16 ]

Rahul and I tested MaxScale in various transaction replay scenarios, and it seems to be working fine. We have documented all scenarios here.

Generated at Thu Feb 08 04:21:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.