[MXS-1009] maxinfo sigsegv in spinlock_release Created: 2016-11-19 Updated: 2016-11-29 Resolved: 2016-11-29 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | maxinfo |
| Affects Version/s: | 1.4.4 |
| Fix Version/s: | 1.4.4, 2.0.3 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Christopher Swingler | Assignee: | Esa Korhonen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
3.13.0-58-generic #97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux |
||
| Description |
|
Just saw MaxScale crash with this error after being online for close to a week. There's little to no MySQL traffic on this system at the moment; almost all of the transactions it handles are queries against MaxInfo.
|
| Comments |
| Comment by Esa Korhonen [ 2016-11-22 ] |
|
Hello, cswingler. Running MaxInfo queries tens of thousands of times does not seem to cause any memory leak on my test setup (MaxScale 1.4.4, 3 backend servers). Also, line maxinfo.c:248 is not "spinlock_release" in my editor, so I wonder whether I have the correct MaxScale version. Has this crash happened again? Is there a pattern?
| Comment by Christopher Swingler [ 2016-11-22 ] |
|
We did just see this again, this time on the second system, which is configured identically to the first. Here's a plot of the uptime variable reported by MaxInfo over the past 7 days: https://snapshot.raintank.io/dashboard/snapshot/DGHMf322b6Rr4Y68tML1tj1l9TeDbH8c It's worth mentioning that the MaxScale process's memory consumption also climbs in step with its uptime. Today's crash was on a different box, but looks identical:
gdb points to:
And regarding the mismatch in line numbers: that's my error. I had the MaxScale 2.0.1 code unpacked on that box when I fired up GDB. Sorry about the goose chase on that one. If you'd like, I can dig up exactly which HTTP requests and SQL queries we run on a regular basis on this box, so that you can get a better idea of what is needed to reproduce this.
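The observation that memory use rises along with uptime can be checked numerically by sampling the process's resident set size over time. The following is a minimal sketch in Python, assuming a Linux host where /proc/<pid>/status contains a VmRSS line; the function names are hypothetical and are not part of MaxScale.

```python
import re
from typing import List, Optional, Tuple


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_kib_per_hour(samples: List[Tuple[float, int]]) -> float:
    """Least-squares slope of RSS over time, scaled to kB per hour.

    Each sample is (seconds since first sample, RSS in kB).
    Returns 0.0 when there are too few samples to fit a line.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    return cov / var_t * 3600.0
```

A slope that stays positive over several days of samples, while traffic is flat, would be consistent with a leak rather than normal warm-up growth.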
| Comment by Esa Korhonen [ 2016-11-23 ] |
|
Yes, please. If it's too much to paste into the comment box, you can always send it by email.
| Comment by Christopher Swingler [ 2016-11-23 ] |
|
We have two tests we run on a regular basis. The first one is a health check - once a second, our load balancer calls a
If all of those pass, the Python script returns a "pass", otherwise a "fail", which takes the MaxScale instance out of our load balancer. The second test runs once every 30 seconds as a collectd plugin that collects and plots a number of metrics from MaxInfo. It calls every documented MaxInfo endpoint:
Hope that helps!
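For anyone trying to reproduce this load pattern, the periodic polling described above can be approximated with a short script. This is a minimal sketch in Python, not the reporter's actual plugin: the base URL, port, and endpoint paths are assumptions (the endpoint list from the comment is not preserved here), and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, List

# Assumptions: placeholder address and paths. Substitute the MaxInfo HTTP
# listener and the endpoints that your own monitoring actually polls.
BASE_URL = "http://127.0.0.1:8003"
ENDPOINTS = ["/variables", "/status"]


def http_fetch(path: str, timeout: float = 2.0) -> str:
    """Return the body of one HTTP endpoint as text."""
    with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def poll(fetch: Callable[[str], str], endpoints: Iterable[str]) -> Dict[str, object]:
    """Fetch every endpoint once and decode its JSON body.

    An endpoint maps to None when the request fails or the body is not JSON.
    """
    results: Dict[str, object] = {}
    for path in endpoints:
        try:
            results[path] = json.loads(fetch(path))
        except (OSError, ValueError):
            # URLError and socket timeouts derive from OSError;
            # json.JSONDecodeError derives from ValueError.
            results[path] = None
    return results


def failed(results: Dict[str, object]) -> List[str]:
    """List the endpoints that did not return decodable JSON."""
    return [path for path, payload in results.items() if payload is None]
```

Calling `poll(http_fetch, ENDPOINTS)` in a loop with a 30-second sleep would mimic the second test; passing a different `fetch` callable makes the logic easy to exercise without a running MaxScale.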
| Comment by Esa Korhonen [ 2016-11-24 ] |
|
Thank you for the information. With it, I managed to reproduce at least part of the memory leak. As a temporary measure, simply avoiding the MaxInfo SQL connection and query seems to stop the leak.
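In a monitoring script, this workaround amounts to leaving out the SQL-based check while keeping the others. A minimal sketch in Python, assuming the health checks are plain callables keyed by name; the names `run_checks` and `maxinfo_sql` are hypothetical and do not come from the reporter's script.

```python
from typing import Callable, Dict, Iterable


def run_checks(checks: Dict[str, Callable[[], bool]],
               skip: Iterable[str] = ()) -> str:
    """Run every check not listed in skip; return "pass" or "fail".

    A check fails if it returns a falsy value or raises an exception.
    """
    skipped = set(skip)
    for name, check in checks.items():
        if name in skipped:
            continue  # e.g. the MaxInfo SQL check, until a fixed release is installed
        try:
            if not check():
                return "fail"
        except Exception:
            return "fail"
    return "pass"
```

Keeping the skipped check in the dictionary, rather than deleting it, makes it straightforward to re-enable once the fixed package is deployed.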
| Comment by Esa Korhonen [ 2016-11-25 ] |
|
The leaks are now plugged (at least according to tests and valgrind). I could build a new package for you if you are interested in testing it on your system.
| Comment by Christopher Swingler [ 2016-11-28 ] |
|
That won't be necessary; we'll simply put the workaround in place in our monitoring to skip that check and wait for the next release. Does this issue affect the 2.0 branch as well (in particular, is it present in MaxScale 2.0.2)?
| Comment by Esa Korhonen [ 2016-11-29 ] |
|
Some of it certainly does affect the 2.0 branch as well, but the query types affected may not be the same as with 1.4.4. I have copied the fixes to the next 2.0 release, so it should be safe to use. |