[MXS-4440] Lost connection to backend server: network error (server1: 104, Connection reset by peer) Created: 2022-12-12  Updated: 2024-01-23  Resolved: 2022-12-14

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 6.4.2, 6.4.4, 22.08.3
Fix Version/s: 2.5.24, 6.4.5, 22.08.4

Type: Bug Priority: Major
Reporter: Allen Lee (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 2
Labels: None

Attachments: PNG File image-2022-12-12-14-19-55-932.png     PNG File image-2022-12-12-14-22-35-422.png     File maxscale.cnf     Text File maxscale.log     File select_long_running.jmx    
Issue Links:
Problem/Incident
is caused by MXS-4139 connection_keepalive sends pings even... Closed
Relates
relates to MXS-4953 Lost connection to backend server: Ne... Open

 Description   

A query that runs longer than the wait_timeout value on the backend server is disconnected on the client side by MaxScale while the query is still running on the backend server.

Observations:
1) The client connects to the master and the slave.

2) Once the master's session reached wait_timeout (120s), the client connection was disconnected from both the master and the slave. However, the session on the slave was not terminated and kept running.

In the above, user=allen is the client user connecting through MaxScale.
MaxScale threw the following error:

2022-12-12 14:20:29   info   : (10) [readwritesplit] (ReadWriteSplitService); Master 'server1' failed: #HY000: Lost connection to backend server: network error (server1: 104, Connection reset by peer)
2022-12-12 14:20:29   error  : (10) [readwritesplit] (ReadWriteSplitService); Lost connection to the master server, closing session. Lost connection to master server while connection was idle. Connection has been idle for 120 seconds. Error caused by: #HY000: Lost connection to backend server: network error (server1: 104, Connection reset by peer). Last close reason: <none>. Last error:
2022-12-12 14:20:29   info   : (10) Stopped ReadWriteSplitService client session [10]

At the same time, the MariaDB master's SQL error log reported the following error. The slave did not report it, as that session was still alive on the slave server.

2022-12-11 21:20:29 allen[allen] @  [192.168.254.29] ERROR 1159: Got timeout reading communication packets : (null)

The following are the timeout-related settings from the MariaDB server:

innodb_flush_log_at_timeout=1
innodb_lock_wait_timeout=180
innodb_rollback_on_timeout=OFF
interactive_timeout=28800
lock_wait_timeout=10800
net_read_timeout=30
net_write_timeout=60
rpl_semi_sync_master_timeout=10000
rpl_semi_sync_slave_kill_conn_timeout=5
slave_net_timeout=10
thread_pool_idle_timeout=60
wait_timeout=3600
idle_readonly_transaction_timeout=0
idle_transaction_timeout=0
idle_write_transaction_timeout=0
delayed_insert_timeout=300
deadlock_timeout_long=50000000
deadlock_timeout_short=10000
connect_timeout=10

Interestingly, the user confirmed that this does not happen in 2.5.20, and I could confirm the same.



 Comments   
Comment by markus makela [ 2022-12-12 ]

If wait_timeout on the master server is less than the value of connection_keepalive then this is an expected result as MaxScale won't send the first keepalive ping until the connection has been idle for 300 seconds. If you add connection_keepalive=30s to the service, the problem should go away.
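The rule in this comment can be written down as a small check. This is only a sketch of the stated condition (wait_timeout less than connection_keepalive), not MaxScale code: the function names and the duration suffix table are assumptions made for illustration, and the check says nothing about whether MaxScale actually sends the ping.

```python
import re

# Suffix table is an assumption of this sketch; it is not MaxScale's parser.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Convert a duration string such as '30s' or '300' into seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|m|h)?\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    # A bare number is treated as seconds (assumption of this sketch).
    return int(value) * _UNIT_SECONDS[unit or "s"]


def idle_disconnect_expected(connection_keepalive: str, wait_timeout_s: int) -> bool:
    """True when the server's wait_timeout expires before the first keepalive ping."""
    return wait_timeout_s < parse_duration(connection_keepalive)


# Values taken from the ticket: a 120s wait_timeout against the 300s
# keepalive mentioned above, and against the suggested 30s.
print(idle_disconnect_expected("300s", 120))  # True
print(idle_disconnect_expected("30s", 120))   # False
```

With the 300-second interval the server closes the idle connection first; with connection_keepalive=30s the ping would arrive well inside the 120-second window.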

Comment by markus makela [ 2022-12-12 ]

We'll need some way of reproducing the problem. So far this seems like expected behavior if wait_timeout is less than connection_keepalive.

Comment by markus makela [ 2022-12-12 ]

I think this might be a regression of sorts caused by MXS-4139 where idle clients aren't pinged.

Generated at Thu Feb 08 04:28:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.