[MXS-2151] MaxScale does not log any info when "Connection killed by MaxScale: Router could not recover from connection errors" Created: 2018-11-08  Updated: 2020-08-25  Resolved: 2018-11-22

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 2.2.15
Fix Version/s: 2.2.17

Type: Bug Priority: Blocker
Reporter: Claudio Nanni Assignee: markus makela
Resolution: Fixed Votes: 1
Labels: None

Sprint: MXS-SPRINT-70

 Description   

It has been observed in more than one case that the client receives:

"Connection killed by MaxScale: Router could not recover from connection errors"

But nothing is logged in MaxScale's log, not even with log_info=1.

This makes it impossible to debug the problem.



 Comments   
Comment by markus makela [ 2018-11-09 ]

It appears that this can happen when the network connection to the master is lost without it actually going down. To simulate this, the monitor_interval can be set to a very high value and the master can be killed. This causes the network socket to close without the monitor noticing it and no error is logged.

Given that this is a semi-transient problem, logging an error is an option but ideally the client should also get a more descriptive error. I'll start by adding an error that is always logged when a connection failure to the master causes the session to be closed.
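The reproduction steps above can be sketched as a monitor configuration fragment. Only monitor_interval comes from the comment; the section name, module, server list, and credentials are placeholder assumptions:

```ini
# Hypothetical monitor section; names and credentials are placeholders.
# Setting monitor_interval very high (~1 hour, in milliseconds) keeps
# the monitor from noticing that the master's network socket has closed
# after the master process is killed, reproducing the unlogged failure.
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=server1
user=maxuser
password=maxpwd
monitor_interval=3600000
```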

Comment by Vidmantas Šablinskas [ 2018-11-09 ]

Markus's comment does not explain why the active connections are not dropped afterwards; they were stuck yet continued to work without reconnecting. Also, connections to other databases from the same client keep working during this time. I would assume that in this case all connections should stop working.
One note: on MaxScale 2.1.13 the JDBC driver stopped working "forever" in this situation; on 2.2.15 it looks like this:
One data source stops working (from the client side we get an "unable to get managed connection" exception). After 2 minutes we get "Connection killed by MaxScale: Router could not recover from connection errors" and the data source starts working again. Keep in mind that connections from other pools (to other databases on the same cluster) keep working during this time.

Comment by markus makela [ 2018-11-09 ]

Adding retain_last_statements=5 and dump_last_statements=on_error under the [maxscale] section in the configuration will allow you to see which statement was in progress when the connection is closed. I would suggest adding these and seeing if it is some particular statement or set of statements that causes the problems.
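The two diagnostic parameters suggested above would look like this in the configuration file ([maxscale] is the global section; both parameter values come from the comment):

```ini
# With these settings, when an error closes a session, the last 5
# statements executed on that session are dumped to the MaxScale log,
# showing what was in progress when the connection was lost.
[maxscale]
retain_last_statements=5
dump_last_statements=on_error
```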

Comment by Vidmantas Šablinskas [ 2018-11-10 ]

In our experience it happens on different data sources. We have more than 10 different databases on the same cluster, and the issue happens randomly on different data sources, more often on those with heavier traffic. So I don't think it is related to a particular statement. We think it is more likely related to the connection timeout on the database side: if a connection is idle for a long period and is dropped by the database, it looks like MaxScale does not always drop it, and such a connection gets stuck when a client requests it.

Comment by markus makela [ 2018-11-22 ]

Adding connection_keepalive=30 to the service fixes the problem. An error message is now logged if a master connection is lost due to a network error.
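A sketch of the fix as a service section; only connection_keepalive=30 comes from the comment, while the section name, router, server list, and credentials are placeholder assumptions:

```ini
# Hypothetical service section. With connection_keepalive=30, MaxScale
# pings backend connections that have been idle for 30 seconds, so the
# server's idle timeout no longer drops them silently behind MaxScale's
# back.
[Read-Write-Service]
type=service
router=readwritesplit
servers=server1
user=maxuser
password=maxpwd
connection_keepalive=30
```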

Generated at Thu Feb 08 04:12:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.