[MXS-3893] read/write split service incorrectly times out valid sessions on master if timeout happens on replica Created: 2021-11-24 Updated: 2021-12-08 Resolved: 2021-12-08 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | Core |
| Affects Version/s: | 2.5.14 |
| Fix Version/s: | 6.2.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Rick Pizzi | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
The connection_timeout feature does not check whether the connection is truly idle before it kills it. Sessions that are actively performing some operation and are waiting for a result should be left alone. Original description: If using read/write split service and connection_timeout is set in service, if one of the connections on replica servers times out, it kills the master connection too, even if it has a valid transaction running and therefore is not idle. Connect to maxscale with connection_timeout set to 30 seconds and execute the following:
As you can see this transaction is killed by maxscale every time. Thanks |
| Comments |
| Comment by markus makela [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
Tested with 2.5.16 and this is what I got:
MaxScale logged this:
What exactly do you mean by transaction will never complete? Do you mean that it never successfully completes or it never completes because it hangs? | ||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
Session on master gets killed, transaction is rolled back. | ||||||||||||||||||||||||||||||||||||||
| Comment by markus makela [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
Did MaxScale log the Timing out warning when this happened? If so then it was MaxScale that closed the connection. | ||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
Of course it's maxscale. That's what this bug is about. | ||||||||||||||||||||||||||||||||||||||
| Comment by markus makela [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
OK, now it's clear. The part about replica timing out is what confused me. This appears to be expected behavior for 2.5. It only uses the network idle time (seconds since last network read/write) as the measurement. There is this warning in the documentation:
This should be clarified in 2.5 to explain that neither transactions nor active operations prevent connections from being killed and only the time since the last network read is used. Fixing this in 2.5 is not a small change but this can be fixed easily for version 6 where the required changes to the internal APIs have already been made. | ||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
markus makela the issue is not what you described above. | ||||||||||||||||||||||||||||||||||||||
| Comment by markus makela [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
OK then it's not yet clear to me. Did MaxScale log a message like this when this happened for you?
| ||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
Yes | ||||||||||||||||||||||||||||||||||||||
| Comment by markus makela [ 2021-11-24 ] | ||||||||||||||||||||||||||||||||||||||
|
OK then it's the same mechanism that I described. MaxScale doesn't check the backend connections at all in 2.5 which is why it times out the client when the client TCP connection is idle longer than the configured connection_timeout value. From MaxScale's point of view, the client has been idle when in fact it is not idle as a query is being executed ( the select sleep(35); part). As for transactions, I think a connection with an open transaction still respects wait_timeout which means MaxScale should behave the same way when connection_timeout is used. | ||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2021-11-25 ] | ||||||||||||||||||||||||||||||||||||||
|
markus makela the TCP connection is idle in my example, but it's not the cause of the bug. Once again: the root cause is the replica connection being idle, not the master TCP connection. | ||||||||||||||||||||||||||||||||||||||
| Comment by markus makela [ 2021-11-25 ] | ||||||||||||||||||||||||||||||||||||||
|
If the client TCP conenction is not idle then it is a separate bug. Any time the warning: Timing out ... message is logged the connection is closed as the client TCP connection has been idle for longer than the configured idle timeout value. The replica or master connections have no bearing on why MaxScale closes the connection. I do understand that this is not the desired behavior which is why I think we should fix this. As for the case when the client connection is not idle, do you have a way to reproduce that? This sounds more like an unknown bug that would happen because an unrelated connection is somehow closed. | ||||||||||||||||||||||||||||||||||||||
| Comment by Rick Pizzi [ 2021-11-25 ] | ||||||||||||||||||||||||||||||||||||||
|
To reproduce the customer's situation, just create a database with a large table (that takes more than connection_timeout to be dumped) and then run a mysqldump on it through maxscale with the --single-transaction option so it will be sent to master. Error 2013: Lost connection to MySQL server during query when dumping table `rick3` at row: 4976300 |