[MXS-4953] Lost connection to backend server: Network error: 104, Connection reset by peer - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Not a Bug
Affects Version/s: 23.08.4
Fix Version/s: N/A
Component/s: N/A
Labels: None
Environment: DEV

Description

Hello,

We recently upgraded from MaxScale 6.4 to 23.08.
Since then, we have been seeing errors like these in the MaxScale log file:

2024-01-23 11:29:07 error : (2039) (IM46130D-Service); Lost connection to backend server: Network error: 104, Connection reset by peer (IM46130D1, session=2039, conn_id=4531072)
2024-01-23 11:29:26 error : (2062) (IM46130D-Service); Lost connection to backend server: Network error: 104, Connection reset by peer (IM46130D1, session=2062, conn_id=4531091)
2024-01-23 11:29:38 error : (2071) (IM46130D-Service); Lost connection to backend server: Network error: 104, Connection reset by peer (IM46130D1, session=2071, conn_id=4531101)

It seems that whenever a connection reaches interactive_timeout, it is reported as an error in the log file. I can see this clearly by running "list sessions": once a session has been idle for 600s, the error is logged. I tried setting connection_keepalive=30s on the service in the MaxScale config file, but it does not help.

On the backend MariaDB server:
interactive_timeout=600
wait_timeout=600

Service definition in MaxScale:
[IM46130D-Service]
type=service
router=readconnroute
router_options=master
servers=IM46130D1,IM46130D2
user=xxx
password=xxx
connection_keepalive=30s

I found a similar issue that is marked as fixed, but I still have the problem:
https://jira.mariadb.org/browse/MXS-4440

Could you have a look?

Thank you

Attachments

max1.PNG (53 kB, 2024-01-23 13:06)
max2.PNG (4 kB, 2024-01-23 13:06)
max3.PNG (27 kB, 2024-01-23 13:06)

Issue Links

relates to: MXS-4440 Lost connection to backend server: network error (server1: 104, Connection reset by peer) (Closed)

Activity

markus makela added a comment - 2024-01-23 11:25

Can you confirm from the server error logs that these are indeed idle timeouts? You should find a log entry with a connection id that matches the conn_id=<number> part.

If the client is truly idle, MaxScale won't send a connection keepalive ping unless the force_connection_keepalive parameter is set to true. Can you check whether turning that on solves the problem for you?
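
As a rough sketch of that suggestion, the service section from the description might look like this with the parameter enabled. The placement at service level next to connection_keepalive is an assumption based on where the reporter set connection_keepalive; the rest is copied from the description.

```ini
[IM46130D-Service]
type=service
router=readconnroute
router_options=master
servers=IM46130D1,IM46130D2
user=xxx
password=xxx
connection_keepalive=30s
# Assumption: set on the service, alongside connection_keepalive.
# With this enabled, idle sessions are pinged too, so they may
# outlive wait_timeout/interactive_timeout on the backend.
force_connection_keepalive=true
```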

Patrick Vandenbosch added a comment - 2024-01-23 13:06

Yes, I confirm these are idle timeouts. See session id 9995, highlighted in yellow in the attached screenshots.

When I ran "list sessions", that session had 30 seconds left before reaching the idle timeout (max1.png); 30 seconds later it appeared in the MaxScale log as an error (max2.png).

I tried setting the force_connection_keepalive parameter as you suggested. With it, no more errors are reported in the MaxScale log, but when I run "list sessions" I see sessions exceeding the 600-second timeout allowed by the database. Is this normal? (max3.png)

There seems to be a difference in behavior between 6.4 and 23.08. Is this intended? Shouldn't sessions that reach the idle timeout be terminated without being reported as errors in the MaxScale log?

markus makela added a comment - 2024-01-23 16:00 - edited

You most likely have no idle timeouts set in MaxScale and thus MaxScale is not able to evict them before the server evicts them. When the server kills idle clients, it just closes the socket which appears like a broken connection in MaxScale and any other "client" that connects to the database. Thus it is not possible to be certain whether a connection timed out or the connection was broken. Usually it's a timeout if it's above a certain threshold but right now MaxScale just reports them as errors.

MaxScale 23.08 behaves differently because older 6.4 releases had a bug (MXS-4139, MXS-4720) that kept pinging sessions every 300 seconds (the default for connection_keepalive), and the actual idleness of the client was not correctly taken into account. This effectively extended the values of wait_timeout and interactive_timeout in the server to infinite values.

If you want to get rid of idle connections in MaxScale, you can use the (unfortunately named) connection_timeout service parameter. By default there are no idle timeouts in MaxScale, and the usual recommendation I give is to set it below any idle timeouts in the database. This downgrades the errors into warnings and lets you know which clients are idle.

markus makela added a comment - 2024-01-23 19:27

Patrick, can you also fill in the exact 6.4 version you're upgrading from?

Patrick Vandenbosch added a comment - 2024-01-24 07:52

Thanks for the detailed explanation. Indeed, we did not have connection_timeout set in the MaxScale config. I have now set it (actually "wait_timeout", since "connection_timeout" is deprecated) to a lower value than on the backend servers, and there are no more errors in the MaxScale log. Instead there is a warning like this:

2024-01-24 08:46:44 warning: (5) Timing out 'xxx'@'ipaddress', idle for 591 seconds

The exact version I'm upgrading from is 6.4.13-1.rhel.8
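
For reference, a minimal sketch of the service section with this change applied might look as follows. The 590s value is hypothetical: the thread only says the value is lower than the backend's 600-second wait_timeout and interactive_timeout, and the "idle for 591 seconds" warning merely suggests something close to it. Any keepalive settings still in use are not stated in the thread, so they are left out here.

```ini
[IM46130D-Service]
type=service
router=readconnroute
router_options=master
servers=IM46130D1,IM46130D2
user=xxx
password=xxx
# Hypothetical value: anything below the backend's
# wait_timeout=600 / interactive_timeout=600 should let MaxScale
# time out idle sessions before the server resets the connection.
wait_timeout=590s
```

With the MaxScale timeout below the server's, idle sessions are closed by MaxScale first and show up as "Timing out" warnings rather than "Connection reset by peer" errors.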

markus makela added a comment - 2024-02-12 11:49

I'll close this as Not a Bug as it looks like it's behaving as expected.
