[MXS-3893] read/write split service incorrectly times out valid sessions on master if timeout happens on replica Created: 2021-11-24  Updated: 2021-12-08  Resolved: 2021-12-08

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 2.5.14
Fix Version/s: 6.2.1

Type: Bug Priority: Major
Reporter: Rick Pizzi Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None


 Description   

The connection_timeout feature does not check whether the connection is truly idle before it kills it. Sessions that are actively performing some operation and are waiting for a result should be left alone.

Original description:


If using read/write split service and connection_timeout is set in service, if one of the connections on replica servers times out, it kills the master connection too, even if it has a valid transaction running and therefore is not idle.

Connect to maxscale with connection_timeout set to 30 seconds and execute the following:

drop database if exists rick;
create database rick;
use rick;
create table rick (id int not null);
start transaction;
insert into rick values(1000); 
select sleep(35);
insert into rick values(1001);
commit;
select * from rick where id=1000;

As you can see this transaction is killed by maxscale every time.

Thanks
Rick



 Comments   
Comment by markus makela [ 2021-11-24 ]

Tested with 2.5.16 and this is what I got:

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1
Server version: 10.6.5-MariaDB-1:10.6.5+maria~focal-log mariadb.org binary distribution
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> create database if not exists rick;
Query OK, 1 row affected (0.022 sec)
 
MariaDB [(none)]> use rick;
Database changed
MariaDB [rick]> create or replace table rick (id int not null);
Query OK, 0 rows affected (0.038 sec)
 
MariaDB [rick]> start transaction;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [rick]> insert into rick values(1000); 
Query OK, 1 row affected (0.003 sec)
 
MariaDB [rick]> select sleep(35);
ERROR 2013 (HY000): Lost connection to MySQL server during query
MariaDB [rick]> insert into rick values(1001);
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    2
Current database: rick
 
Query OK, 1 row affected (0.021 sec)
 
MariaDB [rick]> commit;
Query OK, 0 rows affected (0.001 sec)
 
MariaDB [rick]> select * from rick where id=1000;
Empty set (0.001 sec)

MaxScale logged this:

2021-11-24 15:51:38   warning: Timing out 'maxuser'@'::ffff:127.0.0.1', idle for 31 seconds

What exactly do you mean by transaction will never complete? Do you mean that it never successfully completes or it never completes because it hangs?

Comment by Rick Pizzi [ 2021-11-24 ]

Session on master gets killed, transaction is rolled back.
To be more clear: session on master should NOT get killed because the session on replica times out.

Comment by markus makela [ 2021-11-24 ]

Did MaxScale log the Timing out warning when this happened? If so then it was MaxScale that closed the connection.

Comment by Rick Pizzi [ 2021-11-24 ]

Of course it's maxscale. That's what this bug is about.

Comment by markus makela [ 2021-11-24 ]

OK, now it's clear. The part about replica timing out is what confused me.

This appears to be expected behavior for 2.5. It only uses the network idle time (seconds since last network read/write) as the measurement. There is this warning in the documentation:

Warning: If a connection is idle for longer than the configured connection timeout, it will be forcefully disconnected and a warning will be logged in the MaxScale log file. If you are performing long-running maintenance operations (e.g. ALTER TABLE) either do them with a direct connection to the server or set connection_timeout to zero before executing them.

This should be clarified in 2.5 to explain that neither transactions nor active operations prevent connections from being killed and only the time since the last network read is used.

Fixing this in 2.5 is not a small change but this can be fixed easily for version 6 where the required changes to the internal APIs have already been made.

Comment by Rick Pizzi [ 2021-11-24 ]

markus makela the issue is not what you described above.
the issue is that maxscale is killing client connection while a transaction is running on master just because a connection is idle on a replica.

Comment by markus makela [ 2021-11-24 ]

OK then it's not yet clear to me. Did MaxScale log a message like this when this happened for you?

2021-11-24 15:51:38   warning: Timing out 'maxuser'@'::ffff:127.0.0.1', idle for 31 seconds

Comment by Rick Pizzi [ 2021-11-24 ]

Yes
The issue is, the user is not idle. A transaction is open and running on master.
But since the session is idle on the replica, it times out nonetheless and
the transaction is killed.

Comment by markus makela [ 2021-11-24 ]

OK then it's the same mechanism that I described. MaxScale doesn't check the backend connections at all in 2.5 which is why it times out the client when the client TCP connection is idle longer than the configured connection_timeout value. From MaxScale's point of view, the client has been idle when in fact it is not idle as a query is being executed ( the select sleep(35); part).

As for transactions, I think a connection with an open transaction still respects wait_timeout which means MaxScale should behave the same way when connection_timeout is used.

Comment by Rick Pizzi [ 2021-11-25 ]

markus makela the TCP connection is idle in my example, but it's not the cause of the bug.
This happens identically if the TCP connection is busy sending data, it was first seen with a customer trying to execute a mysqldump with --single-transaction.

Once again: the root cause is the replica connection being idle, not the master TCP connection.

Comment by markus makela [ 2021-11-25 ]

If the client TCP conenction is not idle then it is a separate bug. Any time the warning: Timing out ... message is logged the connection is closed as the client TCP connection has been idle for longer than the configured idle timeout value. The replica or master connections have no bearing on why MaxScale closes the connection. I do understand that this is not the desired behavior which is why I think we should fix this.

As for the case when the client connection is not idle, do you have a way to reproduce that? This sounds more like an unknown bug that would happen because an unrelated connection is somehow closed.

Comment by Rick Pizzi [ 2021-11-25 ]

To reproduce the customer's situation, just create a database with a large table (that takes more than connection_timeout to be dumped) and then run a mysqldump on it through maxscale with the --single-transaction option so it will be sent to master.
At connection_timeout, the dump will abort with:

Error 2013: Lost connection to MySQL server during query when dumping table `rick3` at row: 4976300

Generated at Thu Feb 08 04:24:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.