[MXS-1009] maxinfo sigsegv in spinlock_release Created: 2016-11-19  Updated: 2016-11-29  Resolved: 2016-11-29

Status: Closed
Project: MariaDB MaxScale
Component/s: maxinfo
Affects Version/s: 1.4.4
Fix Version/s: 1.4.4, 2.0.3

Type: Bug Priority: Blocker
Reporter: Christopher Swingler Assignee: Esa Korhonen
Resolution: Fixed Votes: 0
Labels: None
Environment:

3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux



 Description   

Just saw MaxScale crash with this error after being online for close to a week.

There's little to no MySQL traffic on this system currently; almost all of the transactions it's handling are queries against MaxInfo.

2016-11-19 04:40:56   error  : Fatal: MaxScale 1.4.4 received fatal signal 11. Attempting backtrace.
2016-11-19 04:40:56   error  : Commit ID: f95d31eb44249500b4a261e4adc8aad9a7e11eb6 System name: Linux Release string: Ubuntu 14.04.5 LTS Embedded library version: (null)
2016-11-19 04:40:56   error  :   /usr/bin/maxscale() [0x403c20]
2016-11-19 04:40:56   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fcc5781d330]
2016-11-19 04:40:56   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxinfo.so(+0x314e) [0x7fcc512a914e]
2016-11-19 04:40:56   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libMySQLClient.so(+0x4d20) [0x7fcc35857d20]
2016-11-19 04:40:56   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x3153f) [0x7fcc57cbb53f]
2016-11-19 04:40:56   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x31222) [0x7fcc57cbb222]
2016-11-19 04:40:56   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(dcb_process_zombies+0x230) [0x7fcc57cbb026]
2016-11-19 04:40:56   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(poll_waitevents+0x719) [0x7fcc57ccf2e2]
2016-11-19 04:40:56   error  :   /usr/bin/maxscale(main+0x180c) [0x406c31]
2016-11-19 04:40:56   error  :   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fcc5708df45]
2016-11-19 04:40:56   error  :   /usr/bin/maxscale() [0x4035c9]
 
root@maxscale-cluster-a-rax02:/var/log/maxscale# gdb /usr/bin/maxscale core
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/maxscale...done.
[New LWP 6924]
[New LWP 6925]
[New LWP 6930]
[New LWP 6926]
[New LWP 6928]
[New LWP 6929]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/maxscale --user=maxscale'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fcc512a914e in closeSession (instance=0x159e820, router_session=0xf4c9f00) at /home/vagrant/workspace/server/modules/routing/maxinfo/maxinfo.c:248
248		spinlock_release(&inst->lock);



 Comments   
Comment by Esa Korhonen [ 2016-11-22 ]

Hello, cswingler.
I'm hoping to replicate this crash but would appreciate some additional info.

Running MaxInfo queries tens of thousands of times does not seem to cause any memory leak on the setup I'm testing with (MaxScale 1.4.4, 3 x backend servers). Also, line maxinfo.c:248 is not "spinlock_release" in my copy of the source, so I wonder if I have the correct MaxScale version.

Has this crash happened again? Is there a pattern?
Thank you for any information.

Comment by Christopher Swingler [ 2016-11-22 ]

We did just see this again, this time on the second system configured identically to the first.

Here's a plot of the uptime variable reported by Maxinfo over the past 7 days:

https://snapshot.raintank.io/dashboard/snapshot/DGHMf322b6Rr4Y68tML1tj1l9TeDbH8c

It's worth mentioning that the MaxScale process's memory consumption also climbs steadily with uptime.

Crash today was on a different box, but looks identical:

2016-11-22 11:43:38   error  : Fatal: MaxScale 1.4.4 received fatal signal 11. Attempting backtrace.
2016-11-22 11:43:38   error  : Commit ID: f95d31eb44249500b4a261e4adc8aad9a7e11eb6 System name: Linux Release string: Ubuntu 14.04.5 LTS Embedded library version: (null)
2016-11-22 11:43:38   error  :   /usr/bin/maxscale() [0x403c20]
2016-11-22 11:43:38   error  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fcffe79e330]
2016-11-22 11:43:38   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxinfo.so(+0x314e) [0x7fcff822a14e]
2016-11-22 11:43:38   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libMySQLClient.so(+0x4d20) [0x7fcff44a8d20]
2016-11-22 11:43:38   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x3153f) [0x7fcffec3c53f]
2016-11-22 11:43:38   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x31222) [0x7fcffec3c222]
2016-11-22 11:43:38   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(dcb_process_zombies+0x230) [0x7fcffec3c026]
2016-11-22 11:43:38   error  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(poll_waitevents+0x719) [0x7fcffec502e2]
2016-11-22 11:43:38   error  :   /usr/bin/maxscale(main+0x180c) [0x406c31]
2016-11-22 11:43:38   error  :   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fcffe00ef45]
2016-11-22 11:43:38   error  :   /usr/bin/maxscale() [0x4035c9]

gdb points to:

root@maxscale-cluster-a-rax01:/var/log/maxscale# gdb /usr/bin/maxscale core
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/maxscale...done.
[New LWP 4361]
[New LWP 4362]
[New LWP 4363]
[New LWP 4369]
[New LWP 4370]
[New LWP 4366]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/maxscale --user=maxscale'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fcff822a14e in closeSession (instance=0x1af0820, router_session=0x1ce5a010)
    at /home/vagrant/workspace/server/modules/routing/maxinfo/maxinfo.c:248
248			while (ptr && ptr->next != session)

Regarding the mismatch in line numbers: that's my error. I had the MaxScale 2.0.1 code unpacked on that box when I fired up GDB. Sorry about the goose chase on that one.

If you'd like, I can dig up exactly which HTTP requests and SQL queries we run regularly against this box, so that you get a better idea of what is needed to reproduce this.

Comment by Esa Korhonen [ 2016-11-23 ]

Yes, please. If it's too much to paste in the comment box, you can always send it by email.

Comment by Christopher Swingler [ 2016-11-23 ]

We have two tests we run on a regular basis.

The first one is a health check - once a second, our load balancer calls a
Python app that runs an HTTP listener, which returns a pass/fail based on the
following:

  • Open maxscale.cnf and look for anything with type=listener.
  • Match up listeners with their services and grab the auth info stored therein.
  • Call the /listeners endpoint for MaxInfo HTTPD to see what's actually active compared to the config file.
  • Run some tests:
    • For anything of listener type "HTTPD" (which would be only MaxInfo), perform a GET /
    • For anything of listener type "MySQLClient" (MaxInfo and the actual database services), connect using Python's MySQLdb library and run SHOW VARIABLES;
    • For any other listener type, simply check that we can open and close a TCP socket (this is usually just MaxAdmin via maxscaled)

If all of those pass, the Python script returns a "pass", otherwise a "fail", which takes the MaxScale instance out of our load balancer.
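
The first two steps and the choice of probe per listener could be sketched roughly as follows. This is only an illustration, not the actual health-check script: plan_checks, PROBES and tcp_connect_ok are hypothetical names, and the sketch assumes that maxscale.cnf parses as an INI file and that the credentials sit in the user and passwd keys of the service section. The comparison against the /listeners endpoint and the HTTP and MySQL probes themselves are left out.

```python
import configparser
import socket

# Hypothetical mapping from listener protocol to the kind of probe to run.
# Any protocol not listed here falls back to a plain TCP connect.
PROBES = {"httpd": "http_get", "mysqlclient": "mysql_show_variables"}


def plan_checks(cnf_text):
    """Return one probe description per type=listener section in cnf_text."""
    # interpolation=None keeps '%' characters in passwords from being expanded.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(cnf_text)
    checks = []
    for name in parser.sections():
        section = parser[name]
        if section.get("type", "").strip().lower() != "listener":
            continue
        # Match the listener to its service to pick up the auth info.
        service_name = section.get("service", "").strip()
        service = parser[service_name] if parser.has_section(service_name) else {}
        protocol = section.get("protocol", "").strip()
        checks.append({
            "listener": name,
            "port": section.getint("port"),
            "probe": PROBES.get(protocol.lower(), "tcp_connect"),
            "user": service.get("user"),
            "passwd": service.get("passwd"),
        })
    return checks


def tcp_connect_ok(host, port, timeout=2.0):
    """Open and close a TCP socket; True if the connection succeeded."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A caller would read maxscale.cnf, pass its text to plan_checks, run the probe named in each entry, and report "pass" only if every probe succeeds.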

The second test runs once every 30 seconds as a Collectd plugin that collects and plots metrics from MaxInfo. On each run it calls every documented MaxInfo endpoint:

/variables
/status
/services
/listeners
/modules
/sessions
/servers
/event/times
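
One polling pass over these endpoints might look like the sketch below. It is not the actual Collectd plugin: poll_once and fetch_json are hypothetical names, the base URL is whatever host and port the MaxInfo HTTPD listener is bound to, and the sketch assumes each response body parses as JSON.

```python
import json
import urllib.request

# The endpoints listed above, polled in this order.
ENDPOINTS = ["/variables", "/status", "/services", "/listeners",
             "/modules", "/sessions", "/servers", "/event/times"]


def fetch_json(url, timeout=5.0):
    """GET url and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def poll_once(base_url, fetch=fetch_json):
    """Fetch every endpoint once; map each path to its parsed body or its error."""
    results = {}
    for path in ENDPOINTS:
        try:
            results[path] = fetch(base_url.rstrip("/") + path)
        except (OSError, ValueError) as exc:
            # Connection failures and JSON parse errors are recorded, not raised,
            # so one broken endpoint does not hide the others.
            results[path] = exc
    return results
```

Calling poll_once every 30 seconds from a scheduler approximates the request pattern described here; the fetch parameter exists only so the loop can be exercised without a running MaxScale.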

Hope that helps!

Comment by Esa Korhonen [ 2016-11-24 ]

Thank you for the information. With it I managed to replicate at least some of the memory leak. As a temporary measure, just avoiding the MaxInfo-SQL connection and query seems to stop the leak.

Comment by Esa Korhonen [ 2016-11-25 ]

The leaks are now plugged (at least according to tests and valgrind). I could build a new package for you if you are interested in testing it on your system.

Comment by Christopher Swingler [ 2016-11-28 ]

That won't be necessary; we'll simply put the workaround in place in our monitoring to skip that check and wait for the next release.

Does this issue affect the 2.0 branch as well (in particular, is it present in MaxScale 2.0.2)?

Comment by Esa Korhonen [ 2016-11-29 ]

Some of it certainly does affect the 2.0 branch as well, but the query types affected may not be the same as with 1.4.4.

I have copied the fixes to the next 2.0 release, so it should be safe to use.

Generated at Thu Feb 08 04:03:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.