[MXS-217] MaxScale fatal signal 11 Created: 2015-06-23  Updated: 2015-11-30  Resolved: 2015-11-30

Status: Closed
Project: MariaDB MaxScale
Component/s: Core, mariadbbackend, readwritesplit
Affects Version/s: 1.1.1
Fix Version/s: 1.3.0

Type: Bug Priority: Blocker
Reporter: Alex Lee Assignee: Johan Wikman
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

CentOS 5.4(Final)


Attachments: File MaxScale.cnf    
Sprint: 10.1.8-2

 Description   

2015-06-17 11:39:55 Fatal: MaxScale received fatal signal 11. Attempting backtrace.
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale [0x50701e]
2015-06-17 11:39:55 /lib64/libpthread.so.0 [0x3c48c0eca0]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(thd_wait_begin+0x14) [0x6a7b14]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(vio_io_wait+0xa4) [0x5927e4]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(vio_socket_io_wait+0x13) [0x592bb3]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(vio_read+0xdc) [0x592dfc]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(vio_read_buff+0x4e) [0x592ebe]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale [0x5408a4]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(my_net_read_packet+0x184) [0x540ab4]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(cli_safe_read+0x24) [0x534274]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(cli_read_rows+0x20b) [0x53491b]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(mysql_store_result+0x8a) [0x53346a]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale [0x52416d]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(replace_mysql_users+0x49) [0x521524]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(service_refresh_users+0x17f) [0x516357]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/modules/libMySQLBackend.so [0x7f2fe8cb1b93]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale [0x519dcb]
2015-06-17 11:39:55 /home/db/mariadb-maxscale/bin/maxscale(poll_waitevents+0x63d) [0x51969d]
2015-06-17 11:39:55 /lib64/libpthread.so.0 [0x3c48c0683d]
2015-06-17 11:39:55 /lib64/libc.so.6(clone+0x6d) [0x3c484d4f8d]



 Comments   
Comment by Dipti Joshi (Inactive) [ 2015-06-23 ]

Alex Lee, please provide the exact version of the OS.

We will then provide a debug build with which you can produce a core file, and we will analyze that core to address the crash.

Thanks

Comment by Alex Lee [ 2015-06-24 ]

[root@localhost ~]# cat /etc/issue
CentOS release 5.4 (Final)

If you need a coredump file I will provide it.

Comment by Dipti Joshi (Inactive) [ 2015-06-24 ]

Alex Lee, you can download the MaxScale debug build as follows.
Change the repo config in your /etc/yum.repos.d/maxscale.repo:

[maxscale]
name=maxscale
baseurl=http://maxscale-jenkins.mariadb.com/repository/1.1.1-debug/mariadb-maxscale/yum/centos/5/x86_64
enabled=1
gpgcheck=false

Then run the following to reinstall MaxScale:

yum reinstall maxscale

Next, enable core dumps as follows:

ulimit -c unlimited
export DAEMON_COREFILE_LIMIT='unlimited'
echo /tmp/core-%e-%s-%u-%g-%p-%t > /proc/sys/kernel/core_pattern
echo 2 > /proc/sys/fs/suid_dumpable
service maxscale restart

After this, run the traffic that causes fatal signal 11. You should find the core file in the /tmp directory, named core-maxscale-*.
Find the latest such core file and attach it to this Jira item, or upload it somewhere and provide us with a link to the core dump.

Also include all the log files for MaxScale.
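
To pick out the newest core file, a small shell helper along these lines may help. It is only a sketch: the function name latest_core is made up for illustration, and it assumes the core_pattern above is in effect so that files appear in /tmp as core-maxscale-*.

```shell
# latest_core is a hypothetical helper, not part of MaxScale.
# It prints the most recently modified core-maxscale-* file in a directory.
# The directory defaults to /tmp, matching the core_pattern set above.
latest_core() {
    dir="${1:-/tmp}"
    # ls -t sorts newest first; errors are silenced when nothing matches.
    ls -t "$dir"/core-maxscale-* 2>/dev/null | head -n 1
}
```

Calling latest_core with no argument looks in /tmp. It prints nothing if no core file has been written yet.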

Comment by Alex Lee [ 2015-06-25 ]

Thank you for the support.

Comment by Dipti Joshi (Inactive) [ 2015-06-25 ]

Alex Lee, have you been able to produce a core dump?

Comment by Alex Lee [ 2015-06-25 ]

No. I have not.

Fatal signal 11 has not occurred since I reinstalled the MaxScale debug build RPM package.

Comment by Dipti Joshi (Inactive) [ 2015-06-26 ]

Alex Lee, can you please reinstall the debug build? The earlier debug build did not have useful debug symbols.

Just do
yum reinstall maxscale

Then do the following to enable core dumps:

ulimit -c unlimited
export DAEMON_COREFILE_LIMIT='unlimited'
echo /tmp/core-%e-%s-%u-%g-%p-%t > /proc/sys/kernel/core_pattern
echo 2 > /proc/sys/fs/suid_dumpable
service maxscale restart

After this, run the traffic that causes fatal signal 11. You should find the core file in the /tmp directory, named core-maxscale-*.
Find the latest such core file and attach it to this Jira item, or upload it somewhere and provide us with a link to the core dump.
Also include all the log files for MaxScale.
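
For reference, the fields in the core file name follow the core_pattern set above: %e is the executable name, %s the signal number, %u the UID, %g the GID, %p the PID, and %t the time of the dump in seconds since the epoch. The sketch below splits a file name back into those fields. The name describe_core is hypothetical, and the parsing assumes the file was named by exactly this pattern.

```shell
# describe_core is a hypothetical helper, not part of MaxScale.
# It decodes a core file name produced by the pattern
#   /tmp/core-%e-%s-%u-%g-%p-%t
# Fields are peeled off from the right, so a hyphen in the
# executable name does not confuse the parsing.
describe_core() {
    name=$(basename "$1")
    rest=${name#core-}
    ts=${rest##*-};  rest=${rest%-*}
    pid=${rest##*-}; rest=${rest%-*}
    gid=${rest##*-}; rest=${rest%-*}
    uid=${rest##*-}; rest=${rest%-*}
    sig=${rest##*-}; exe=${rest%-*}
    echo "exe=$exe signal=$sig uid=$uid gid=$gid pid=$pid time=$ts"
}
```

A core file from the crash reported here would be expected to show signal=11.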

Comment by Dipti Joshi (Inactive) [ 2015-06-28 ]

Alex Lee, could you also provide us with the version of your backend database, as well as the log files from MaxScale?

Comment by Alex Lee [ 2015-06-29 ]

Sure. After I discuss it with my boss, I will provide them.

Comment by Alex Lee [ 2015-06-30 ]

The new debug build has a problem. The MaxScale daemon suddenly goes down after several hours, and I cannot find any clue in the log files. When the daemon goes down, no core dump file is created.

Comment by Dipti Joshi (Inactive) [ 2015-06-30 ]

Try

yum remove maxscale
yum install maxscale

Then start the maxscale service with core dumps enabled.

Comment by Alex Lee [ 2015-07-03 ]

I reinstalled MaxScale following your directions.
BTW, what do these error log entries mean?

2015-07-03 11:05:01 Client error event handling.
2015-07-03 11:05:01 Client hangup error handling.

Comment by Dipti Joshi (Inactive) [ 2015-07-08 ]

Alex Lee, have you been able to reproduce the crash?

Comment by Dipti Joshi (Inactive) [ 2015-08-16 ]

markus makela, can you analyze the stack trace and pinpoint the location of the crash?

Thanks,
Dipti

Comment by markus makela [ 2015-08-20 ]

This is a crash in the embedded library. The bug should be assigned to the server team for further analysis.

Comment by Dipti Joshi (Inactive) [ 2015-08-20 ]

Even though it is in the embedded library, it is being called from the MySQLBackend protocol module. Can we locate where we are calling the embedded library in this stack trace? johan.wikman

Comment by markus makela [ 2015-08-21 ]

replace_mysql_users calls the getUsers function. This function has two calls to mysql_store_result, at line 1279 and line 1385 in the 1.1.1 version of the source code. The first returns the result for the number of users; the second returns the actual user data.

Comment by Dipti Joshi (Inactive) [ 2015-08-23 ]

ratzpo, please get the connector team to work with Johan and Markus to diagnose this and find a fix.

Comment by Alexey Botchkov [ 2015-08-24 ]

Just to note: it fails in the client-server operation, not the internal embedded-server one. The embedded-server library is supposed to work as the ordinary client library here.
I think the problem is that either current_thd returns something invalid or the PSI structures store a bad THD pointer.

Comment by Dipti Joshi (Inactive) [ 2015-09-08 ]

johan.wikman, holyfoot, has there been any further insight into this?

Comment by Johan Wikman [ 2015-09-09 ]

I analyzed this again and reached the same conclusion as Markus: the crash occurs in one of the two mysql_store_result calls in dbusers.c@getUsers.

A strange thing is that the addresses in the stack trace do not match the binaries for CentOS 5, nor those for CentOS 6 or CentOS 7.

Considering that replace_mysql_users is executed frequently, i.e. it is not a function that would be called only under exceptional circumstances, unless we get core files and/or detailed instructions for how to make the problem appear, it is very hard to do something about it.
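
One way to repeat the address comparison mentioned above is to pull the frame addresses out of the logged backtrace and feed them to addr2line from binutils. The sketch below covers only the extraction step. The name frame_addrs is hypothetical, and the pattern assumes log lines shaped like those in the description. Whether addr2line then reports matching symbols depends on having the exact binary that crashed.

```shell
# frame_addrs is a hypothetical helper, not part of MaxScale.
# It reads a MaxScale error log on stdin and prints the bracketed
# addresses of frames that belong to the bin/maxscale executable.
# Frames from shared libraries are skipped, because their addresses
# depend on where the library was loaded.
frame_addrs() {
    sed -n 's/.*bin\/maxscale[^[]*\[\(0x[0-9a-fA-F]*\)\].*/\1/p'
}

# Possible usage (the log file name is a placeholder; the binary path
# is the one shown in the reported backtrace):
#   frame_addrs < maxscale.log | \
#       xargs addr2line -f -e /home/db/mariadb-maxscale/bin/maxscale
```

If the printed function names disagree with the names in the logged backtrace, the binary being inspected is probably not the one that produced the log.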

Comment by Alexey Botchkov [ 2015-09-09 ]

I have one idea about the possible problem. Can you tell which version of libmysqld MaxScale is compiled against?

Comment by Dipti Joshi (Inactive) [ 2015-09-12 ]

johan.wikman, have we provided the version of libmysqld used to compile MaxScale to Holyfoot?

Comment by Johan Wikman [ 2015-09-14 ]

Yes

Comment by markus makela [ 2015-09-25 ]

I've managed to get this crash in the embedded library on the release-1.2.1 (commit 746dcd4111999aebd67eb7f397720dddae00d706) branch with the following embedded library:
mysqld Ver 10.0.21-MariaDB for Linux on x86_64 (MariaDB Server)

#0  0x00000000005f6ea9 in thd_wait_begin ()
#1  0x000000000057c205 in vio_io_wait ()
#2  0x000000000057c2a8 in vio_socket_io_wait ()
#3  0x000000000057c3b2 in vio_read ()
#4  0x000000000057c475 in vio_read_buff ()
#5  0x0000000000558cda in my_real_read(st_net*, unsigned long*, char) ()
#6  0x0000000000559ab8 in my_net_read_packet ()
#7  0x000000000054c33f in cli_safe_read ()
#8  0x000000000054d627 in cli_read_query_result ()
#9  0x000000000054ef36 in mysql_real_query ()
#10 0x000000000053fac6 in getDatabases (service=0x1b97670, con=0x7fffc8012a98) at /home/markusjm/MaxScale/server/core/dbusers.c:463
#11 0x0000000000542413 in getUsers (service=0x1b97670, users=0x7fffc8017b10) at /home/markusjm/MaxScale/server/core/dbusers.c:1401
#12 0x000000000053f1cc in replace_mysql_users (service=0x1b97670) at /home/markusjm/MaxScale/server/core/dbusers.c:191
#13 0x0000000000532483 in service_refresh_users (service=0x1b97670) at /home/markusjm/MaxScale/server/core/service.c:1432
#14 0x00007fffd8bcf366 in gw_read_backend_event (dcb=0x7fffc8011c30) at /home/markusjm/MaxScale/server/modules/protocol/mysql_backend.c:368
#15 0x0000000000536961 in process_pollq (thread_id=3) at /home/markusjm/MaxScale/server/core/poll.c:870
#16 0x0000000000535ff1 in poll_waitevents (arg=0x3) at /home/markusjm/MaxScale/server/core/poll.c:610
#17 0x00007ffff7063555 in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff58bdb9d in clone () from /lib64/libc.so.6

I was doing an insert of 775353 rows when the crash happened. It seems interrupting the mysql command line client triggers this.
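
A comparable load could be generated with something like the sketch below. It is not the script used for the run above. The function name gen_inserts and the table test.t1 are hypothetical, and the connection options in the usage comment are placeholders for a MaxScale listener.

```shell
# gen_inserts is a hypothetical helper, not part of MaxScale.
# It prints N single-row INSERT statements for a hypothetical
# table test.t1 with one integer column named id.
gen_inserts() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++)
            printf "INSERT INTO test.t1 (id) VALUES (%d);\n", i
    }'
}

# Possible usage (host, port, and user are placeholders):
#   gen_inserts 775353 | mysql -h <maxscale-host> -P <port> -u <user> -p
# then interrupt the mysql client with Ctrl-C while it is running.
```

Interrupting the client is only suspected as the trigger, so this may need many attempts and may not reproduce the crash at all.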

Comment by Johan Wikman [ 2015-11-30 ]

Although it seems that this may occasionally occur, it cannot be reproduced on purpose, so it is very hard to do anything about it.
