[MXS-487] lost connection to backend server Created: 2015-11-24  Updated: 2016-02-10  Resolved: 2016-02-10

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 1.2.1
Fix Version/s: 1.3.0

Type: Bug Priority: Minor
Reporter: Stephane Q. Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

centos 6.7 2.6.32-573.el6.x86_64, mariadb 5.5.41-1.el6 (yum repo) / maxscale 1.2.1


Attachments: File data.sql     File old.php    
Issue Links:
Relates
relates to MDEV-9195 Segmentation fault when using the emb... Closed

 Description   

Hello,

I'm using MaxScale 1.2.1 with MariaDB 5.5 in a readwritesplit environment, and I notice that every 5 or 10 minutes (it's random) I receive errors from our PHP application:
(2003) Lost connection to backend server (the most frequent one)
(2013) Lost connection to MySQL server during query

It is very similar to what someone else describes here:
http://stackoverflow.com/questions/33416078/maxscale-lost-connection

The maxadmin show session output shows several 'invalid state' entries.

The strange thing is that if I activate log_trace, the number of errors jumps a lot: instead of 3 or 4 each hour, I get one error every minute.

We have MaxScale configured in front of 4 different groups of MariaDB servers;
MaxScale is listening on 4 different ports.

[maxscale]
threads=8
auth_connect_timeout=20
auth_read_timeout=20
auth_write_timeout=20
log_trace=0

Example for one of the 4 clusters:
[MySQL Monitor VHD]
type=monitor
module=mysqlmon
servers=dbvhd1,dbvhd2
user=max
passwd=hidden
monitor_interval=10000
disable_master_failback=1
detect_replication_lag=1

[fetch]
type=filter
module=regexfilter
match=fetch
replace=select

[hint]
type=filter
module=hintfilter

[VHDLISTENER]
type=listener
service=RWVHD
protocol=MySQLClient
port=3310
socket=/tmp/ClusterMaster1

[RWVHD]
type=service
router=readwritesplit
servers=dbvhd1, dbvhd2
user=max
passwd=hidden
max_slave_replication_lag=0

(etc...)

Not sure if it is involved, but here are some of the sysctl parameters used on the servers:
net.ipv4.tcp_keepalive_time = 7200
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 10
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_generic_timeout = 300
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 9

Could it be linked to some sort of timeout on the keep-alive connections?
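As a rough aid to the question above, here is a minimal sketch of what the keepalive values listed imply. It assumes standard Linux TCP keepalive behaviour (first probe after tcp_keepalive_time seconds of idleness, one probe every tcp_keepalive_intvl seconds, connection dropped after tcp_keepalive_probes unanswered probes) and assumes keepalive is actually enabled on the sockets involved, which this report does not confirm. The function name is hypothetical.

```python
# Values copied from the sysctl settings above (seconds, except the probe count).
TCP_KEEPALIVE_TIME = 7200
TCP_KEEPALIVE_INTVL = 15
TCP_KEEPALIVE_PROBES = 9
NF_CONNTRACK_TCP_TIMEOUT_ESTABLISHED = 432000


def dead_peer_detection_seconds(idle, interval, probes):
    """Time for the kernel to give up on an idle connection whose peer is gone.

    Assumes every probe goes unanswered: idle wait first, then `probes`
    probes spaced `interval` seconds apart.
    """
    return idle + interval * probes


detection = dead_peer_detection_seconds(
    TCP_KEEPALIVE_TIME, TCP_KEEPALIVE_INTVL, TCP_KEEPALIVE_PROBES
)
conntrack_days = NF_CONNTRACK_TCP_TIMEOUT_ESTABLISHED / 86400

print(f"keepalive gives up on a dead idle peer after {detection} s")
print(f"conntrack keeps established flows for {conntrack_days:g} days")
```

With these settings the first keepalive probe is only sent after two hours of idleness, so on their own they seem unlikely to explain errors that appear every few minutes on busy connections; this is an observation from the numbers, not a diagnosis.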

As I said, the error rate increases a lot if we activate log_trace, so we tried to reduce the number of requests handled by the MaxScale server. We managed to halve it, but the errors are still there.

We have around 280 req/s on each node, except one with 2500 req/s.

Last point: we are using an old application (MySQL/MyISAM, PHP and the old mysql extension). Not sure if it is linked; maybe mysqli could help?



 Comments   
Comment by Stephane Q. [ 2015-11-26 ]

We made some modifications to our PHP application; here is what we found:

When we connect to a server, mysql_connect works; immediately after, we call mysql_select_db, and this is where we get the 'lost connection to backend server' error. If we receive that error, we wait 0.5 s and retry mysql_connect and mysql_select_db. For the past 12 hours: no more errors.

We also get the same error on mysql_query. There we only tried this: if mysql_query fails with 'lost connection to backend server', we wait 0.5 s and retry the same query with the same db handle. No more errors for the past 12 hours.

So it looks like the connection between the PHP application and MaxScale works well, but for some reason MaxScale loses the connection to the backend servers and manages to reopen it on its own, yet it still sends an error to the clients.
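The wait-and-retry workaround described in this comment can be sketched roughly as follows. This is an illustration only, written in Python rather than with the PHP mysql extension the application uses; `LostConnectionError`, `run_with_retry` and `flaky_query` are hypothetical names, and the failing query is simulated rather than sent to a real server.

```python
import time


class LostConnectionError(Exception):
    """Stand-in for the client error 'Lost connection to backend server'."""


def run_with_retry(operation, retries=1, delay=0.5, sleep=time.sleep):
    """Call operation(); on LostConnectionError wait `delay` seconds and retry.

    `retries` is the number of extra attempts after the first failure.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except LostConnectionError:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            sleep(delay)


# Simulated query that fails once and then succeeds, to exercise the retry path.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] == 1:
        raise LostConnectionError("(2003) Lost connection to backend server")
    return "ok"


# The sleep is stubbed out here so the example runs instantly.
result = run_with_retry(flaky_query, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")
```

A blind retry like this is generally only safe for statements that can be repeated without side effects; for writes, the client cannot tell from this error alone whether the first attempt reached the backend.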

Comment by markus makela [ 2015-11-26 ]

A test with the attached test script and data doesn't seem to reproduce the errors. I'll continue testing with a different environment.

Comment by markus makela [ 2015-12-31 ]

Can you retest this with the 1.3.0-beta version of MaxScale?

The binaries can be found here: http://maxscale-jenkins.mariadb.com/ci-repository/1.3.0-beta-debug/mariadb-maxscale/

Comment by Stephane Q. [ 2016-01-06 ]

OK, I'll give it a try in the coming days and let you know.

Comment by markus makela [ 2016-02-04 ]

Stephane, any update on this issue?

Comment by Stephane Q. [ 2016-02-04 ]

Very sorry for the delay; I had extra work to do and couldn't try it earlier.
I installed 1.3.0-1 today; I will monitor it and let you know within 24 hours.
It's installed on a live server with a lot of traffic.

Comment by Stephane Q. [ 2016-02-10 ]

It has been running for 5 days now, and everything looks much more stable. I don't see the lost connection problem anymore.

Comment by markus makela [ 2016-02-10 ]

I'm closing this as fixed in 1.3.0. If it happens again, please reopen this issue.

Generated at Thu Feb 08 03:59:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.