[MXS-3481] Complete test plan for transaction replay against XPAND Direct Created: 2021-04-05  Updated: 2021-04-19  Resolved: 2021-04-19

Status: Closed
Project: MariaDB MaxScale
Component/s: xpandmon
Affects Version/s: None
Fix Version/s: 6.0.0

Type: Task Priority: Major
Reporter: Gregory Dorman (Inactive) Assignee: Rahul Joshi (Inactive)
Resolution: Done Votes: 0
Labels: None

Issue Links:
PartOf
is part of MXS-3472 Transaction Replay: transactions not ... Closed

 Description   

Prologue
There are multiple dimensions in the process, with many possible permutations of events that may occur at different times. A group change may be very fast or rather slow. Sessions may be idle or have open transactions. Transactions may be quiet or running a statement. A group change can be initiated by flex-down or flex-up, or by a node crash. Crashed node can return into the cluster without causing group change (10 minutes), or be considered dead (>10 min). There may be more.

It is safe to assume that by now MaxScale has these algorithms fully debugged for normal read-write splits. The new and unusual thing about XPAND is that a group change makes the entire cluster inaccessible in a special way. It is worth validating the behaviors empirically.

Given that the number of possible permutations is close to infinite, in this project we will select only a limited number of scenarios, trying to achieve the highest likelihood of overall success.

Plan
We concentrate on long group changes only.

Part 1. Actions on the departing node

                                                         | Flex Down | Crash and restore in 1 minute | Crash and no restore, first 1 minute (before GC)
existing idle connections                                |           |                               |
existing connections with outstanding idle transactions  |           |                               |
existing connections with active transactions            |           |                               |

Part 2. Actions on the nodes not affected by group change

                                                         | Flex Down | Crash and restore in 1 minute | Crash and no restore, first 1 minute (before GC) | Flex Up
existing idle connections                                |           |                               |                                                  |
existing connections with outstanding idle transactions  |           |                               |                                                  |
existing connections with active transactions            |           |                               |                                                  |
new connections and transactions hitting maxscale        |           |                               |                                                  |

Note: Later on we will try to observe what happens to idle and active sessions and transactions living on a node that crashed and is not coming back when the 10-minute interval is about to expire (i.e. when an activity is attempted while XPAND is still hoping for the node to come back, and XPAND then goes into group change). But this is in due time. Let's get the fundamentals verified first.
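The two matrices in Part 1 and Part 2 can be enumerated mechanically so that no cell is skipped during testing. The following is a minimal sketch only: the labels are copied from the tables, while the names `PART1_EVENTS`, `PART2_SESSIONS`, and `build_matrix` are hypothetical and not part of any MaxScale or XPAND tooling.

```python
from itertools import product

# Column labels copied from the Part 1 and Part 2 tables of the plan.
PART1_EVENTS = [
    "Flex Down",
    "Crash and restore in 1 minute",
    "Crash and no restore, first 1 minute (before GC)",
]
PART2_EVENTS = PART1_EVENTS + ["Flex Up"]

# Row labels copied from the same tables.
PART1_SESSIONS = [
    "existing idle connections",
    "existing connections with outstanding idle transactions",
    "existing connections with active transactions",
]
PART2_SESSIONS = PART1_SESSIONS + [
    "new connections and transactions hitting maxscale",
]


def build_matrix():
    """Return every (part, session state, event) cell of the plan."""
    cases = []
    for part, sessions, events in [
        ("departing node", PART1_SESSIONS, PART1_EVENTS),
        ("unaffected nodes", PART2_SESSIONS, PART2_EVENTS),
    ]:
        cases.extend((part, s, e) for s, e in product(sessions, events))
    return cases


if __name__ == "__main__":
    for i, (part, session, event) in enumerate(build_matrix(), start=1):
        print(f"{i:2d}. [{part}] {session} / {event}")
```

Each tuple corresponds to one empty cell in the tables above, so the list can double as a checklist for recording results.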



 Comments   
Comment by Manjinder Nijjar [ 2021-04-14 ]

A few notes:
>>Crashed node can return into the cluster without causing group change (10 minutes), or be considered dead (>10 min).
This statement is not correct. There will be a group change as soon as a node stops responding. A node can return after a few seconds or after 10 minutes; it does not matter. There will be one group change whenever a node leaves (for whatever reason) and another when it rejoins at a later time. The 10-minute interval is rebalance_reprotect_queue_interval, i.e. the reprotection (of the replicas on the departing node) starts only after a node has been gone for that amount of time, which is 10 minutes by default. There are no wait times in a group change; it is always immediate.
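To make the timing concrete, here is a small illustrative model of the sequence described above. It is a sketch under stated assumptions, not XPAND behavior verified in code: the function `cluster_events` and the constant `REPROTECT_INTERVAL_S` are hypothetical names, the 600-second value is simply the 10-minute default mentioned above, and a rejoin exactly at the interval boundary is treated as arriving before reprotection starts.

```python
# Assumption: the 10-minute default from the comment above, in seconds.
REPROTECT_INTERVAL_S = 600


def cluster_events(leave_at, rejoin_at=None,
                   reprotect_interval_s=REPROTECT_INTERVAL_S):
    """List (time, event) pairs for a single node departure.

    leave_at and rejoin_at are in seconds; rejoin_at=None means the node
    never returns. The group change on departure is always immediate.
    """
    events = [(leave_at, "group change (node left)")]
    reprotect_at = leave_at + reprotect_interval_s
    # Reprotection starts only if the node is still gone after the interval.
    if rejoin_at is None or rejoin_at > reprotect_at:
        events.append((reprotect_at, "reprotection starts"))
    # A returning node causes a second group change, whenever it returns.
    if rejoin_at is not None:
        events.append((rejoin_at, "group change (node rejoined)"))
    return sorted(events)


if __name__ == "__main__":
    for label, rejoin in [("back after 30 s", 30),
                          ("back after 15 min", 900),
                          ("never back", None)]:
        print(label, cluster_events(0, rejoin))
```

In all three cases the first event is a group change at the moment the node leaves; only the presence of the reprotection step and of the second group change differs.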

Flexdown Scenario:
Flex down is a planned downing of a node, and the node has to be soft-failed before it can be dropped. Therefore it won't be a random group change. A softfail operation may not complete unless there are transactions going to the node (i.e. the departing node). A node can only be dropped properly when it is done soft-failing. A random drop of a node will exhibit behavior similar to that of a crashed/dead node (one that never rejoins).

Crash and Restore Scenario:
As mentioned above, there will be a group change as soon as a node crashes. It does not matter for this scenario whether the node reappears at a later time or not. The cluster will go into group change and come back up with the remaining nodes (if they can form a quorum).

So, in short, we have these scenarios (for the departing node and for the remaining cluster):

                                                         | Flex Down (Softfail) | Crash (Force Stop) | Flex Down (Force Drop)
existing idle connections                                |                      |                    |
existing connections with outstanding idle transactions  |                      |                    |
existing connections with active transactions            |                      |                    |
Comment by Manjinder Nijjar [ 2021-04-16 ]

All scenarios with and without MaxScale are documented here.

MaxScale is working fine with transaction replay. We did not find any issues with the new version of MaxScale (i.e. 2.5.11).

Comment by Manjinder Nijjar [ 2021-04-19 ]

This task is done and documented.
