[MXS-4953] Lost connection to backend server: Network error: 104, Connection reset by peer Created: 2024-01-23  Updated: 2024-01-24

Status: Open
Project: MariaDB MaxScale
Component/s: readconnroute
Affects Version/s: 23.08.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Patrick Vandenbosch Assignee: markus makela
Resolution: Unresolved Votes: 0
Labels: None
Environment: DEV
Attachments: max1.PNG, max2.PNG, max3.PNG
Issue Links:
Relates
relates to MXS-4440 Lost connection to backend server: ne... Closed

 Description   

Hello,

We recently upgraded from MaxScale 6.4 to 23.08.
Since then, we have noticed errors like these in the MaxScale log file:

2024-01-23 11:29:07 error : (2039) (IM46130D-Service); Lost connection to backend server: Network error: 104, Connection reset by peer (IM46130D1, session=2039, conn_id=4531072)
2024-01-23 11:29:26 error : (2062) (IM46130D-Service); Lost connection to backend server: Network error: 104, Connection reset by peer (IM46130D1, session=2062, conn_id=4531091)
2024-01-23 11:29:38 error : (2071) (IM46130D-Service); Lost connection to backend server: Network error: 104, Connection reset by peer (IM46130D1, session=2071, conn_id=4531101)

It seems that whenever a connection reaches interactive_timeout, it is reported as an error in the log file. I can see this clearly by running "list sessions": once a session has been idle for 600s, it is reported as an error. I've tried setting connection_keepalive=30s on the service in the MaxScale config file, but it does not help.

On the backend MariaDB server:
interactive_timeout=600
wait_timeout=600

Service in MaxScale:
[IM46130D-Service]
type=service
router=readconnroute
router_options=master
servers=IM46130D1,IM46130D2
user=xxx
password=xxx
connection_keepalive=30s

I've found a similar issue that is marked as fixed, but I'm still having the problem:
https://jira.mariadb.org/browse/MXS-4440

Could you have a look?

Thank you



 Comments   
Comment by markus makela [ 2024-01-23 ]

Can you confirm from the server error logs that these are indeed idle timeouts? You should find a log entry with a connection id that matches the conn_id=<number> part.

If the client is truly idle, MaxScale won't send a connection keepalive ping unless the force_connection_keepalive parameter is set to true. Can you try if turning that on solves the problem for you?
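
A minimal sketch of that change, reusing the service section from the description. Only the last parameter is new, and the comment text is an illustration rather than wording from the MaxScale documentation:

```ini
[IM46130D-Service]
type=service
router=readconnroute
router_options=master
servers=IM46130D1,IM46130D2
user=xxx
password=xxx
connection_keepalive=30s
# Send keepalive pings to the backend even when the client itself is idle
force_connection_keepalive=true
```

Note that pinging idle sessions keeps resetting the idle timer on the server, so with this setting sessions can be expected to outlive wait_timeout and interactive_timeout on the backend.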

Comment by Patrick Vandenbosch [ 2024-01-23 ]

Yes I confirm these are idle timeouts. See session id 9995 I highlighted in yellow in attached screenshots.

At the time of doing "list sessions" that session had 30sec left before reaching idle timeout (max1.png), 30sec later it appears in the maxscale log as an error (max2.png).

I tried setting the force_connection_keepalive parameter as you suggested. With that, no errors are reported in the MaxScale log anymore, but when I run "list sessions" I see sessions going above the 600-second timeout allowed by the database. Is this normal? (max3.png)

There seems to be a difference in behavior between 6.4 and 23.08. Is this intended? Shouldn't sessions that reach the idle timeout be terminated without being reported as errors in the MaxScale log?

Comment by markus makela [ 2024-01-23 ]

You most likely have no idle timeouts set in MaxScale, so MaxScale cannot evict the sessions before the server does. When the server kills idle clients, it just closes the socket, which looks like a broken connection to MaxScale and to any other client that connects to the database. It is therefore not possible to be certain whether a connection timed out or was broken. It is usually a timeout if the idle time is above a certain threshold, but right now MaxScale just reports all of them as errors.

The reason why MaxScale 23.08 behaves differently is that older 6.4 releases had a bug (MXS-4139, MXS-4720) that kept pinging sessions every 300 seconds (the default for connection_keepalive), and the actual idleness of the client was not correctly taken into account. This effectively extended the values of wait_timeout and interactive_timeout in the server to infinite values.

If you want to get rid of idle connections in MaxScale, you can use the (unfortunately named) connection_timeout service parameter. By default there are no idle timeouts in MaxScale, and my usual recommendation is to set it below any idle timeouts in the database. This downgrades the errors to warnings and lets you know which clients are idle.
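
As a sketch of that recommendation, assuming the 600-second wait_timeout and interactive_timeout from the description: the value 590s below is an illustrative choice, not one taken from the reporter's configuration. In 23.08 the parameter is spelled wait_timeout, since connection_timeout is deprecated there (as noted later in this thread).

```ini
[IM46130D-Service]
type=service
router=readconnroute
router_options=master
servers=IM46130D1,IM46130D2
user=xxx
password=xxx
connection_keepalive=30s
# Illustrative value: anything below the server's wait_timeout and
# interactive_timeout (600 seconds here). Older releases call this
# parameter connection_timeout.
wait_timeout=590s
```

With the MaxScale timeout set lower than the server's, MaxScale should close the idle session first and log it as a warning, instead of seeing the server reset the connection.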

Comment by markus makela [ 2024-01-23 ]

Patrick, can you also fill in the exact 6.4 version you're upgrading from?

Comment by Patrick Vandenbosch [ 2024-01-24 ]

Thanks for the detailed explanation. Indeed, we did not have connection_timeout set in the MaxScale config. I have now set it (or rather "wait_timeout", since "connection_timeout" is deprecated) to a lower value than on the backend servers, and there are no more errors in the MaxScale log. Instead, there is a warning like this:

2024-01-24 08:46:44 warning: (5) Timing out 'xxx'@'ipaddress', idle for 591 seconds

The exact version I'm upgrading from is 6.4.13-1.rhel.8

Generated at Thu Feb 08 04:32:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.