[MXS-388] maxscale proxy hang Created: 2015-09-28  Updated: 2015-12-08  Resolved: 2015-12-08

Status: Closed
Project: MariaDB MaxScale
Component/s: maxadmin
Affects Version/s: 1.2.0, 1.2.1
Fix Version/s: 1.3.0

Type: Bug Priority: Blocker
Reporter: cai sunny Assignee: Johan Wikman
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Linux 2.6.32-504.23.4.el6.x86_64 #1 SMP


Attachments: File maxscale.cnf    
Issue Links:
Relates
relates to MXS-413 MaxAdmin hangs with show session Closed

 Description   

After the server has been running for several days, the MaxScale proxy may hang.
We check the proxy status with maxadmin every 5 minutes.
From the log, the number of connections is not high before MaxScale hangs.
Please help us find out why MaxScale hangs.
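The 5-minute check described above could, for instance, be scripted so that a hang is detected automatically when maxadmin stops responding. A minimal sketch, assuming the default 1.x maxadmin credentials as placeholders (they are not taken from this report):

```shell
# Hypothetical hang check, meant to be run from cron every 5 minutes:
# if `maxadmin list services` does not complete within 10 seconds,
# treat MaxScale as hung and log a warning via syslog.
# User/password are placeholder defaults, not from this report.
if ! timeout 10 maxadmin -uadmin -pmariadb list services >/dev/null 2>&1; then
    logger -t maxscale-check "maxadmin hung or failed"
fi
```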



 Comments   
Comment by markus makela [ 2015-09-29 ]

Can you provide the maxscale.cnf you used and the backend server types?

Comment by cai sunny [ 2015-10-07 ]

Please check the attached maxscale.cnf.
Yesterday I upgraded to MaxScale 1.2.1; the proxy hung again after the upgrade.

Comment by cai sunny [ 2015-10-07 ]

The backend servers are Galera MySQL servers.
Server version: 5.6.23-log MySQL Community Server (GPL), wsrep_25.10

Comment by cai sunny [ 2015-10-13 ]

I checked; the database can still be reached through the proxy.
But maxadmin hangs when I run: list services

Comment by cai sunny [ 2015-10-13 ]

Using netstat -at, a lot of connections show "CLOSE_WAIT".

Comment by cai sunny [ 2015-10-13 ]

What should I do when maxadmin hangs?

Comment by markus makela [ 2015-10-13 ]

Is the process using 100% of the CPU when it hangs? Are there any error messages in the error log? When maxadmin hangs, are you able to connect via the MySQL client, or does that also hang?

If there are no messages in the error log, it would be a good idea to pinpoint which of the services is causing the hang. So if possible, test with each combination of services. This way we will know whether some of the services work and whether the hanging problem only happens with a certain combination of modules.

Usually when maxadmin hangs there is something wrong with spinlocks and how they are released. The only situation I've encountered where maxadmin hangs is when MaxScale is consuming 100% of the CPU and there is a deadlock.

Also, if the maxscale process is hanging, attaching a debugger to it would show where it is hanging. To attach a debugger to MaxScale, install GDB and issue the following command:
gdb --batch --pid=$(pgrep maxscale) -ex 'thr appl all bt full' /usr/bin/maxscale
This should print out a large amount of information about what each thread is doing.

Comment by cai sunny [ 2015-10-13 ]

1. top output:
top - 14:36:10 up 32 days, 12:23, 3 users, load average: 7.19, 7.13, 6.88
Tasks: 163 total, 1 running, 162 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.3%us, 0.4%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 3924328k total, 3209024k used, 715304k free, 475056k buffers
Swap: 4194300k total, 1444k used, 4192856k free, 1968220k cached

2. No other error messages in the log except:
2015-10-12 12:17:44 Error : Unable to write to backend due to authentication failure.
2015-10-12 12:27:45 Error : Unable to write to backend due to authentication failure.
2015-10-12 12:45:38 Error : Unable to write to backend due to authentication failure.
2015-10-12 13:29:14 Error : Unable to write to backend due to authentication failure.
2015-10-12 13:41:39 Error : Unable to write to backend due to authentication failure.

3. I can connect to the MySQL DB through this MaxScale proxy.

Comment by cai sunny [ 2015-10-13 ]

It is a production environment; I cannot install gdb right now.

Comment by cai sunny [ 2015-10-13 ]

netstat -at|grep 6603
tcp 0 0 *:6603 *:* LISTEN
tcp 1 0 localhost:6603 localhost:59031 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:58656 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:58242 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:57406 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:57403 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:57404 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:58989 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:57407 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:57405 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:59148 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:60553 CLOSE_WAIT
tcp 1 0 localhost:6603 localhost:60546 CLOSE_WAIT

Comment by cai sunny [ 2015-10-13 ]

netstat -at|wc
1758 10550 156426
netstat -at|grep CLOSE_WAIT|wc
1558 9348 138662
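The raw counts above can be broken down per TCP state with a small awk pipeline. A self-contained sketch, run here against a few sample lines standing in for netstat output (in production you would pipe `netstat -at` into the awk stage instead):

```shell
# Count connections per TCP state; the state is the last field of
# each netstat line. The printf feeds in sample lines so the sketch
# runs anywhere.
printf '%s\n' \
  'tcp        1      0 localhost:6603   localhost:59031  CLOSE_WAIT' \
  'tcp        0      0 *:6603           *:*              LISTEN' \
  'tcp        1      0 localhost:6603   localhost:58656  CLOSE_WAIT' \
| awk '{count[$NF]++} END {for (s in count) print s, count[s]}' | sort
```

This prints one line per state, e.g. `CLOSE_WAIT 2` and `LISTEN 1` for the sample input, which makes a CLOSE_WAIT build-up like the one reported immediately visible.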

Comment by cai sunny [ 2015-10-13 ]

ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 30489
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 16384
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Comment by markus makela [ 2015-10-16 ]

If you can connect to the database through MaxScale, this seems to be a bug in the MaxAdmin client or the module that it connects to.

Comment by markus makela [ 2015-10-16 ]

I've found a bug (MXS-413) in maxadmin where it hangs if show session <address> is executed. Can you confirm that it hangs when you execute that command?

Comment by cai sunny [ 2015-10-17 ]

It hangs when I run "show services".
Several minutes after maxadmin hangs, the MySQL client also cannot connect to the MySQL DB through the proxy; it hangs there as well.
A lot of connections in netstat show 'CLOSE_WAIT'.

Comment by Johan Wikman [ 2015-10-20 ]

caisunny Please try with version 1.2.1.

Comment by Alex Vladulescu [ 2015-10-22 ]

Hello,

I was unlucky enough to end up in the same situation as cai sunny (but on 1.2.1).

I have a Debian 7.9 setup (described in the MXS-415 and MXS-419 tickets), and roughly every 6 to 12 hours the load balancer now hangs (the platform is in production).

I logged into the VM and checked the CPU: all 8 cores are at 100% usage. Under htop, only maxscale is using the CPU (~800%), and no other services are running on the server besides keepalived.

The connection to the maxadmin console works (for a while), so I could type a few commands into the console before I needed to urgently restart the process (after which everything goes back to normal). I should add that if I leave the server in this state (tested) and do not react, hoping the load returns to normal, queries to the DB via MaxScale visibly get slower to complete and eventually come to a full stop.

In the hope of being more useful to you, I have put the log I managed to collect with lsof before restarting, while the service was becoming 100% unresponsive, online at:

http://www.bfproject.ro/maxscale-hanged-lsof.txt

The output of the commands I managed to run in the console is:

MaxScale> list servers
Servers.
--------------------------------------------------------------------
Server | Address | Port | Connections | Status
--------------------------------------------------------------------
db01 | 10.200.100.151 | 3306 | 1 | Slave, Synced, Running
db02 | 10.200.100.152 | 3306 | 2 | Master, Synced, Running
db03 | 10.200.100.153 | 3306 | 4 | Slave, Synced, Running
db04 | 10.200.100.154 | 3306 | 1 | Slave, Synced, Running
--------------------------------------------------------------------
MaxScale> list services
Services.
---------------------------------------------------+--------------
Service Name | Router Module | #Users | Total Sessions
---------------------------------------------------+--------------
Read Connection Router | readconnroute | 16774 | 2889529
Debug Interface | debugcli | 1321 | 1321
CLI | cli | 12 | 1323
---------------------------------------------------+--------------

And from linux shell after process restart:
lbdb02:~# cat /proc/14172/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 63967 63967 processes
Max open files 65535 65535 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 63967 63967 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us

Is there a way we could identify what's causing these errors?
(I don't think it's RAM; there is plenty, 8 GB, of which only ~512 MB is used by the whole system.)

Thanks

Comment by Johan Wikman [ 2015-11-22 ]

A couple of locking issues have been discovered.

  • In the binlog router, some locks were not released under certain conditions.
  • The non-thread-safe localtime was incorrectly used instead of localtime_r. It appears that the use of localtime not only causes a race, but can also cause lockups.

Both of these have been fixed in develop, although it is not certain that they were the cause of the lockups described here.

Comment by markus makela [ 2015-12-08 ]

We're closing this since we haven't been able to reproduce it and no new information has come from the reporter. If this problem still persists in 1.3.0, this bug can be reopened.

Generated at Thu Feb 08 03:58:55 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.