[MXS-3695] Causal Consistency with MaxScale's Read/Write Split Router
Created: 2021-07-28  Updated: 2023-04-09  Resolved: 2021-08-02

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 2.5.13
Fix Version/s: 2.5.15

Type: Bug Priority: Major
Reporter: Domas Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

Ubuntu 20


Issue Links:
Duplicate
is duplicated by MXS-3696 Causal Consistency with MaxScale's Re... Closed
Relates
relates to MXS-3508 causal_reads=global results in missin... Closed

 Description   

Problem statement

Stale SELECTs occur with the MaxScale causal reads configuration in place (causal_reads=global in use). In short, we see stale records whenever more than one client connection goes through MaxScale.
Our tests showed that records are delayed by roughly 0.2 seconds.

This report describes the testing technique we used and our current MaxScale configuration.
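The staleness described above can be illustrated with a minimal simulation (Python; all names here are invented for illustration, not part of MaxScale): a write lands on the primary, asynchronous replication applies it to the replica after a delay, and a read routed to the replica before the apply misses the row.

```python
import time
import threading

class Primary:
    def __init__(self):
        self.rows = set()
    def insert(self, row_id):
        self.rows.add(row_id)

class Replica:
    def __init__(self):
        self.rows = set()
    def select(self, row_id):
        return row_id in self.rows

def replicate(primary, replica, delay):
    # Asynchronous replication: apply the primary's rows after `delay` seconds.
    time.sleep(delay)
    replica.rows |= primary.rows

primary, replica = Primary(), Replica()
primary.insert(1)
threading.Thread(target=replicate, args=(primary, replica, 0.2)).start()

stale = not replica.select(1)   # read hits the replica before replication applies
time.sleep(0.3)
fresh = replica.select(1)       # after the ~0.2 s lag the row is visible
print(stale, fresh)
```

This is exactly the window causal_reads is supposed to close: the read should not be answered by the replica until the replica has caught up with the write.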

Testing Technique

1. We run a simple PHP script that performs an INSERT followed by a SELECT over a single connection/thread.

The content of the script is shown below.

<?php
 
// Make mysqli raise exceptions so the catch block below actually fires
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);
 
$conn = mysqli_connect('server', 'username', 'psw', "dbname", 4006);
 
if (!$conn) {
  die("Connection failed: " . mysqli_connect_error());
}
 
echo "Connected successfully\n";
 
$conn->query("TRUNCATE testtable");
 
$i = 0;
while (1){
  $i++;
 
  $conn->begin_transaction();
 
  try {
    $conn->query("INSERT INTO testtable (ID, LastName, FirstName, Age) VALUES ($i, 'Jonas', 'Testas', 44)");
    $conn->commit();
  } catch (mysqli_sql_exception $exception) {
    $conn->rollback();
    // Throwing a plain string is a fatal error in PHP; wrap it in an Exception
    throw new Exception("Err: " . $exception->getMessage());
  }
 
  $result = $conn->query("SELECT id FROM testtable WHERE id={$i}");
  $row = $result->fetch_row();
  if (!isset($row)){
     echo "-------> ID not found => $i\n";
  }
  if ($i % 100 == 0) echo "$i rec inserted.\n";
  usleep(1000);
}
?>

2. Running the script on its own produces the following output, with no or very few stale reads.

1800 rec inserted.
1900 rec inserted.
2000 rec inserted.
-------> ID not found => 2033
2100 rec inserted.
2200 rec inserted.
2300 rec inserted.

3. Once we generate additional traffic through MaxScale, the problem gets drastically worse, with many missing IDs.

sysbench --mysql-db=testas123 --mysql-host=DB --mysql-port=4006 --mysql-user=user --mysql-password=PSW --mysql-dry-run=off --table-size=10000 --tables=16 --db-driver=mysql --time=100 /usr/share/sysbench/oltp_read_write.lua prepare
sysbench --mysql-db=testas123 --mysql-host=DB --mysql-port=4006 --mysql-user=user --mysql-password=PSW --mysql-dry-run=off --table-size=10000 --tables=16 --db-driver=mysql --threads=10 --time=100 /usr/share/sysbench/oltp_read_write.lua run

--threads=1 - problem occurs sometimes
--threads=10 - problem occurs very often

Maxscale versions and config we use

MaxScale 2.5.13 - d1e1cdec7e3ac58566883dd9d2aebc6c69436455
CMake flags: -DBUILD_TESTS=N -DBUILD_MMMON=Y -DBUILD_CDC=Y -DBUILD_GUI=Y -DPACKAGE=Y -DDISTRIB_SUFFIX=ubuntu.focal

Config:

[maxscale]
threads=auto

[mariadb-1]
type=server
address=IP1
port=3306
protocol=MariaDBBackend

[mariadb-2]
type=server
address=IP2
port=3306
protocol=MariaDBBackend

[mariadb-3]
type=server
address=IP3
port=3306
protocol=MariaDBBackend

[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=mariadb-1, mariadb-2, mariadb-3
user=maxuser
password=PSW
monitor_interval=50
enforce_read_only_slaves=true

[Read-Write-Service]
type=service
router=readwritesplit
servers=mariadb-1, mariadb-2, mariadb-3
user=maxuser
password=PSW
master_reconnection=true
master_failure_mode=error_on_write
slave_selection_criteria=ADAPTIVE_ROUTING
master_accept_reads=true
causal_reads=global
transaction_replay=true

[Read-Write-Listener]
type=listener
service=Read-Write-Service
protocol=MariaDBClient
port=4006

MariaDB instances have this setting on:

session_track_system_variables="autocommit,character_set_client,character_set_connection,character_set_results,last_gtid,time_zone"

Demonstration: https://www.youtube.com/watch?v=RMNQMgQBisw



 Comments   
Comment by markus makela [ 2021-07-28 ]

I think this might be the same problem that has been described in MXS-3508 but now with an actual test that proves it. Can you check if causal_reads=local makes the problem go away? If so I think we've narrowed the problem down to the global GTID tracking in MaxScale.

It's possible that the global GTID synchronization for some reason isn't picking up the latest change.

Comment by markus makela [ 2021-07-30 ]

I think I've managed to reproduce the problem and I also think I know why it happens. If two connections each execute a transaction, they both receive a GTID, and the transactions appear in the correct order in the database. The problem is that the responses delivered to MaxScale aren't guaranteed to arrive in order, which means the global latest GTID can end up being set to the lower of the two values. The reason this doesn't happen with causal_reads=local is the lack of parallelism in the traffic.
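The race described here can be sketched as follows (Python, illustrative only — MaxScale itself is implemented in C++, and the GTID triple and function names below are invented for the sketch): if the tracked "latest" GTID is overwritten unconditionally by whichever response arrives last, it can move backwards; only advancing it when the incoming sequence is newer keeps it monotonic.

```python
# Two transactions commit in order, but their OK packets reach the router
# out of order. A GTID is simplified to (domain, server_id, sequence).

def naive_update(current, incoming):
    # Last response wins: can move the tracked GTID backwards.
    return incoming

def monotonic_update(current, incoming):
    # Only advance if the incoming sequence number is newer.
    return incoming if incoming[2] > current[2] else current

commits = [(0, 1, 101), (0, 1, 102)]   # commit order on the database
responses = [commits[1], commits[0]]   # order the responses arrive at the router

latest = (0, 1, 100)
for gtid in responses:
    latest = naive_update(latest, gtid)
print(latest)    # regressed to sequence 101: later reads wait for too little

latest = (0, 1, 100)
for gtid in responses:
    latest = monotonic_update(latest, gtid)
print(latest)    # stays at sequence 102
```

With the regressed value, a subsequent causal read would only wait for GTID 101 and could be answered by a replica that has not yet applied transaction 102 — which matches the stale reads in the report.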

Comment by Marijus Planciunas [ 2021-08-03 ]

Many thanks for fixing our issue, Markus.
We cannot use causal_reads=local: in our case microservices communicate with each other over different database connections, passing record IDs along, so the local value would make this problem even worse.
Did you run the test created by Domas to verify the problem is gone? When do you expect the next release to take place?

Comment by markus makela [ 2021-08-03 ]

I used the test to reproduce the problem and with the fix in place it no longer reports any errors. Unfortunately I don't know when the next release will be.

Comment by Domas [ 2021-08-16 ]

I have completed several tests and can confirm that the issue still exists with the latest 2.5.15, built directly from the repository.

root@maxscale-2:/usr/bin# ./maxscale -V
MaxScale 2.5.15 - 636cf1678fc6459d9c61969e2ecf32e045cb393c
root@maxscale-2:/usr/bin#

I have tested with causal_reads=global.

...
-------> ID not found => 16938
-------> ID not found => 16952
17000 rec inserted.
17100 rec inserted.
17200 rec inserted.
17300 rec inserted.
17400 rec inserted.
17500 rec inserted.
-------> ID not found => 17584
17600 rec inserted.
-------> ID not found => 17623
-------> ID not found => 17699

Comment by markus makela [ 2021-08-16 ]

Hmm, that is strange. The GTID handling should work correctly now. For a sanity check, you could try disabling causal_reads=global and seeing if that makes the test fail a lot faster.

I'll try to reproduce your results locally and see if I can figure out what's going on.

Comment by Marijus Planciunas [ 2021-08-16 ]

Markus, disabling causal_reads or changing it to "local" won't solve the issue. In our particular case we have a microservice infrastructure where modules communicate through RabbitMQ queues and therefore use different DB connections for the same data. As far as we understand, only the "global" setting ensures that all connections see data changes: https://mariadb.com/docs/reference/mxs/module-parameters/causal_reads/.
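The distinction can be modeled roughly as follows (Python; a deliberately simplified sketch with invented names, where GTIDs are plain sequence numbers): with local, each connection only waits for the GTIDs of its own writes, so a connection that never wrote has nothing to wait for; with global, every read waits for the router-wide latest GTID, which covers writes made on other connections.

```python
# Simplified model of causal_reads wait targets.
replica_applied = 5          # highest GTID the replica has applied
router_latest = 7            # router-wide latest GTID (global tracking)

class Connection:
    def __init__(self):
        self.last_written = 0   # highest GTID this connection wrote (local tracking)
    def write(self, gtid):
        self.last_written = gtid

def read_must_wait(conn, mode):
    # A read is held back until the replica has caught up to the wait target.
    target = conn.last_written if mode == "local" else router_latest
    return replica_applied < target

writer, reader = Connection(), Connection()
writer.write(7)   # writer inserts; reader (a different connection) looks up the ID

# local: the reader never wrote, so its wait target is 0 -> read proceeds, stale
print(read_must_wait(reader, "local"))
# global: the wait target is 7, the replica is at 5 -> the read is held back
print(read_must_wait(reader, "global"))
```

This matches the microservice scenario above: the ID is written on one connection and read on another, so only the global target protects the reader.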

Comment by markus makela [ 2021-08-17 ]

I tested again with this modified test script:

<?php
 
// Make mysqli raise exceptions so the catch block below actually fires
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);
 
$conn_write = mysqli_connect("127.0.0.1", "maxuser", "maxpwd", "test", 4006);
 
if (!$conn_write) {
    die("Connection failed: " . mysqli_connect_error() . "\n");
}
 
$conn_read = mysqli_connect("127.0.0.1", "maxuser", "maxpwd", "test", 4006);
 
if (!$conn_read) {
    die("Connection failed: " . mysqli_connect_error() . "\n");
}
 
echo "Connected successfully\n";
 
$res = $conn_write->query("CREATE TABLE IF NOT EXISTS t1(id SERIAL, LastName VARCHAR(255), FirstName VARCHAR(255), Age INT)");
 
if (!$res){
    die($conn_write->error . "\n");
}
 
while (1){
    $conn_write->begin_transaction();
    $id = -1;
 
    try {
        $conn_write->query("INSERT INTO t1 (LastName, FirstName, Age) VALUES ('Jonas', 'Testas', 44)");
        $id = $conn_write->insert_id;
        $conn_write->commit();
    } catch (mysqli_sql_exception $exception) {
        $conn_write->rollback();
        // Throwing a plain string is a fatal error in PHP; wrap it in an Exception
        throw new Exception("Err: " . $exception->getMessage());
    }
 
    if ($id == -1){
        die("ID is -1\n");
    }
 
    $q = "SELECT @@server_id, count(id), @@gtid_slave_pos FROM t1 WHERE id=" . $id;
    $result = $conn_read->query($q);
 
    if (!$result){
        echo "ERROR" . $conn_read->error . "\n";
        break;
    }
 
    $row = $result->fetch_row();
    if (!$row){
        die("Empty result for ${id}\n");
    } else if ($row[1] != 1){
        $server_id = $row[0];
        $rows = $row[1];
        $slave_pos = $row[2];
        $thr_read = $conn_read->thread_id;
        $thr_write = $conn_write->thread_id;
        echo "reader: ${thr_read} writer: ${thr_write} ID not found from ${server_id} ${slave_pos} => $id\n";
        exit(1);
    }
}
?>

Would it be possible for you to test if you still see the problem with this modified script?

Note that it uses two separate connections, one for writing and one for reading, doesn't truncate the table at the start, and uses an auto-increment field so that multiple instances of it can run in parallel. With 50 parallel executions of this script and sysbench with 100 threads, I can't reproduce this with the latest code. Reverting the fix makes the test fail pretty much immediately.

Comment by Alexander Zierhut [ 2023-04-08 ]

This issue is marked as fixed, but @Domas wrote that they were still able to reproduce it even after the fix. Is this actually confirmed as resolved, @markus makela?

Comment by markus makela [ 2023-04-09 ]

alexander-zierhut yes, the original problem that was reproduced was fixed. So far we haven't heard of any causality violations and, as stated in the comments, the results they saw weren't reproducible. If you've observed any problems, please let us know and we can investigate the source of them.

Generated at Thu Feb 08 04:23:17 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.