[MXS-358] Crash, Error in `/usr/bin/maxscale': free(): invalid next size (fast) Created: 2015-09-09  Updated: 2015-11-27  Resolved: 2015-11-24

Status: Closed
Project: MariaDB MaxScale
Component/s: Core, readwritesplit
Affects Version/s: 1.2.0
Fix Version/s: 1.3.0

Type: Bug Priority: Blocker
Reporter: Rich Samora Assignee: Johan Wikman
Resolution: Done Votes: 1
Labels: None
Environment:

Debian, x86_64


Attachments: File maxscale.cnf    
Issue Links:
Relates
relates to MXS-200 MaxScale crashes with backtrace Closed

 Description   

Hello,

I'm running a new instance of 1.2.0-develop (from git, updated as of 7th of September), I've observed a crash soon after maxscale was loaded, it is reproducible in my environment and happens a few minutes after maxscale starts to handle query load.

Reading symbols from maxscale...rdone.
(gdb) run -d
Starting program: /usr/bin/maxscale -d
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Info : MaxScale will be run in the terminal process.
Syslog logging is disabled.
[New Thread 0x7ffff256c700 (LWP 9881)]
Configuration file : /etc/maxscale.cnf
Log directory : /var/log/maxscale
Data directory : /var/lib/maxscale/data
Module directory : /usr/lib/x86_64-linux-gnu/maxscale
Service cache : /var/cache/maxscale

[New Thread 0x7ffff1d6b700 (LWP 9882)]
[Thread 0x7ffff1d6b700 (LWP 9882) exited]
[New Thread 0x7ffff1d6b700 (LWP 9883)]
[New Thread 0x7fffdc50b700 (LWP 9884)]
[New Thread 0x7fffdb904700 (LWP 9885)]
[New Thread 0x7fffdb103700 (LWP 9886)]
[New Thread 0x7fffda902700 (LWP 9887)]
[New Thread 0x7fffda101700 (LWP 9888)]
[New Thread 0x7fffd9900700 (LWP 9889)]

      • Error in `/usr/bin/maxscale': free(): invalid next size (fast): 0x00007fffc8001b80 ***

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffd9900700 (LWP 9889)]
0x00007ffff57a0107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb)
(gdb)
(gdb)
(gdb) bt
#0 0x00007ffff57a0107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff57a14e8 in __GI_abort () at abort.c:89
#2 0x00007ffff57de214 in __libc_message (do_abort=do_abort@entry=1,
fmt=fmt@entry=0x7ffff58d1000 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff57e39ee in malloc_printerr (action=1, str=0x7ffff58d11c0 "free(): invalid next size (fast)",
ptr=<optimized out>) at malloc.c:4996
#4 0x00007ffff57e46f6 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3840
#5 0x0000000000549296 in gwbuf_free (buf=0x7fffc80d51c0)
at /usr/local/maxscale/MaxScale/server/core/buffer.c:140
#6 0x0000000000549898 in gwbuf_consume (head=0x7fffc80d51c0, length=88)
at /usr/local/maxscale/MaxScale/server/core/buffer.c:383
#7 0x0000000000551898 in dcb_write (dcb=0x7fffd0116e90, queue=0x7fffc80d51c0)
at /usr/local/maxscale/MaxScale/server/core/dcb.c:1180
#8 0x00007fffd8ef9cfe in gw_send_authentication_to_backend (dbname=0x7fffd01149b5 "dbuser",
user=0x7fffd0114934 "dbuser",
passwd=0x7fffd0114920 "\020\372\216C\213\355\315Y\222'\212f\vM\302ƛL)\317dbuser", conn=0x7fffd01170a0)
at /usr/local/maxscale/MaxScale/server/modules/protocol/mysql_common.c:733
#9 0x00007fffd8ef59ce in gw_read_backend_event (dcb=0x7fffd0116e90)
at /usr/local/maxscale/MaxScale/server/modules/protocol/mysql_backend.c:240
#10 0x0000000000561fc2 in process_pollq (thread_id=3) at /usr/local/maxscale/MaxScale/server/core/poll.c:958
#11 0x000000000056181d in poll_waitevents (arg=0x3) at /usr/local/maxscale/MaxScale/server/core/poll.c:714
#12 0x00007ffff706d0a4 in start_thread (arg=0x7fffd9900700) at pthread_create.c:309
#13 0x00007ffff585107d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)
#0 0x00007ffff57a0107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff57a14e8 in __GI_abort () at abort.c:89
#2 0x00007ffff57de214 in __libc_message (do_abort=do_abort@entry=1,
fmt=fmt@entry=0x7ffff58d1000 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff57e39ee in malloc_printerr (action=1, str=0x7ffff58d11c0 "free(): invalid next size (fast)",
ptr=<optimized out>) at malloc.c:4996
#4 0x00007ffff57e46f6 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3840
#5 0x0000000000549296 in gwbuf_free (buf=0x7fffc80d51c0)
at /usr/local/maxscale/MaxScale/server/core/buffer.c:140
#6 0x0000000000549898 in gwbuf_consume (head=0x7fffc80d51c0, length=88)
at /usr/local/maxscale/MaxScale/server/core/buffer.c:383
#7 0x0000000000551898 in dcb_write (dcb=0x7fffd0116e90, queue=0x7fffc80d51c0)
at /usr/local/maxscale/MaxScale/server/core/dcb.c:1180
#8 0x00007fffd8ef9cfe in gw_send_authentication_to_backend (dbname=0x7fffd01149b5 "dbname",
user=0x7fffd0114934 "dbuser",
passwd=0x7fffd0114920 "\020\372\216C\213\355\315Y\222'\212f\vM\302ƛL)\317dbuser", conn=0x7fffd01170a0)
at /usr/local/maxscale/MaxScale/server/modules/protocol/mysql_common.c:733
#9 0x00007fffd8ef59ce in gw_read_backend_event (dcb=0x7fffd0116e90)
at /usr/local/maxscale/MaxScale/server/modules/protocol/mysql_backend.c:240
#10 0x0000000000561fc2 in process_pollq (thread_id=3) at /usr/local/maxscale/MaxScale/server/core/poll.c:958
#11 0x000000000056181d in poll_waitevents (arg=0x3) at /usr/local/maxscale/MaxScale/server/core/poll.c:714
#12 0x00007ffff706d0a4 in start_thread (arg=0x7fffd9900700) at pthread_create.c:309
#13 0x00007ffff585107d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) bt full
#0 0x00007ffff57a0107 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
resultvar = 0
pid = 9877
selftid = 9889
#1 0x00007ffff57a14e8 in __GI_abort () at abort.c:89
save_stage = 2
act = {__sigaction_handler =

{sa_handler = 0x7fffd98fc890, sa_sigaction = 0x7fffd98fc890}

,
sa_mask = {__val = {140737312052329, 4222451713, 140736843466896, 140736843466896,
140736843466896, 140736843466896, 140736843466980, 140736843466985, 140736843466896,
140736843466985, 140736548831264, 851184, 80, 140736550109968, 0, 140736843482880}},
sa_flags = -176260548, sa_restorer = 0x7fffd98f0000}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007ffff57de214 in __libc_message (do_abort=do_abort@entry=1,
fmt=fmt@entry=0x7ffff58d1000 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
ap = {{gp_offset = 40, fp_offset = 32767, overflow_arg_area = 0x7fffd98fc960,
reg_save_area = 0x7fffd98fc8f0}}
fd = 86
on_2 = <optimized out>
list = <optimized out>
nlist = <optimized out>
cp = <optimized out>
written = <optimized out>
#3 0x00007ffff57e39ee in malloc_printerr (action=1, str=0x7ffff58d11c0 "free(): invalid next size (fast)",
ptr=<optimized out>) at malloc.c:4996
buf = "00007fffc8001b80"
cp = <optimized out>
#4 0x00007ffff57e46f6 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3840
size = <optimized out>
--Type <return> to continue, or q <return> to quit--
fb = <optimized out>
nextchunk = <optimized out>
nextsize = <optimized out>
nextinuse = <optimized out>
prevsize = <optimized out>
bck = <optimized out>
fwd = <optimized out>
errstr = <optimized out>
locked = <optimized out>
_func_ = "_int_free"
#5 0x0000000000549296 in gwbuf_free (buf=0x7fffc80d51c0)
at /usr/local/maxscale/MaxScale/server/core/buffer.c:140
prop = 0x0
bo = 0x5553c3 <gw_write+64>
#6 0x0000000000549898 in gwbuf_consume (head=0x7fffc80d51c0, length=88)
at /usr/local/maxscale/MaxScale/server/core/buffer.c:383
rval = 0x0
#7 0x0000000000551898 in dcb_write (dcb=0x7fffd0116e90, queue=0x7fffc80d51c0)
at /usr/local/maxscale/MaxScale/server/core/dcb.c:1180
written = 88
below_water = 0
_FUNCTION_ = "dcb_write"
#8 0x00007fffd8ef9cfe in gw_send_authentication_to_backend (dbname=0x7fffd01149b5 "dbname",
user=0x7fffd0114934 "dbuser",
passwd=0x7fffd0114920 "\020\372\216C\213\355\315Y\222'\212f\vM\302ƛL)\317dbuser", conn=0x7fffd01170a0)
at /usr/local/maxscale/MaxScale/server/modules/protocol/mysql_common.c:733
compress = 0
rv = 32767
--Type <return> to continue, or q <return> to quit--
payload = 0x7fffc8001bdf ""
payload_start = 0x7fffc8001b80 "T"
bytes = 88
client_scramble = "z\325\a\032K\221B2\201q\217>\005tJ\307\307r\023\232"
client_capabilities = "\r\242\016"
server_capabilities = 0
final_capabilities = 958989
dbpass = "9CD1ACCC1654B46867B02C9F0A8ED5E6A9023E81", '\000' <repeats 88 times>
buffer = 0x7fffc80d51c0
dcb = 0x7fffd0116e90
curr_db = 0x7fffd01149b5 "dbname"
curr_passwd = 0x7fffd0114920 "\020\372\216C\213\355\315Y\222'\212f\vM\302ƛL)\317dbuser"
charset = 8
#9 0x00007fffd8ef59ce in gw_read_backend_event (dcb=0x7fffd0116e90)
at /usr/local/maxscale/MaxScale/server/modules/protocol/mysql_backend.c:240
client_protocol = 0x0
backend_protocol = 0x7fffd01170a0
current_session = 0x7fffd0114920
rc = 0
_FUNCTION_ = "gw_read_backend_event"
#10 0x0000000000561fc2 in process_pollq (thread_id=3) at /usr/local/maxscale/MaxScale/server/core/poll.c:958
dcb = 0x7fffd0116e90
found = 1
ev = 5
qtime = 0
_FUNCTION_ = "process_pollq"
#11 0x000000000056181d in poll_waitevents (arg=0x3) at /usr/local/maxscale/MaxScale/server/core/poll.c:714
events = {{events = 5, data = {ptr = 0x7fffc80ed8a0, fd = -938551136, u32 = 3356416160,
--Type <return> to continue, or q <return> to quit--
u64 = 140736549804192}}, {events = 5, data = {ptr = 0x7fffd003a030, fd = -805068752,
u32 = 3489898544, u64 = 140736683286576}}, {events = 5, data = {ptr = 0x7fffc40e1660,
fd = -1005709728, u32 = 3289257568, u64 = 140736482645600}}, {events = 5, data = {
ptr = 0x7fffc802c270, fd = -939343248, u32 = 3355624048, u64 = 140736549012080}}, {
events = 8197, data = {ptr = 0x1bd82c0, fd = 29196992, u32 = 29196992, u64 = 29196992}}, {
events = 1, data = {ptr = 0x1b12ac0, fd = 28388032, u32 = 28388032, u64 = 28388032}}, {
events = 5, data = {ptr = 0x7fffc80820e0, fd = -938991392, u32 = 3355975904,
u64 = 140736549363936}}, {events = 5, data = {ptr = 0x1b34340, fd = 28525376, u32 = 28525376,
u64 = 28525376}}, {events = 5, data = {ptr = 0x1b34600, fd = 28526080, u32 = 28526080,
u64 = 28526080}}, {events = 5, data = {ptr = 0x1bab2e0, fd = 29012704, u32 = 29012704,
u64 = 29012704}}, {events = 0, data = {ptr = 0x0, fd = 0, u32 = 0,
u64 = 0}} <repeats 990 times>}
i = 2
nfds = 0
timeout_bias = 1
thread_id = 3
poll_spins = 0
_FUNCTION_ = "poll_waitevents"
#12 0x00007ffff706d0a4 in start_thread (arg=0x7fffd9900700) at pthread_create.c:309
__res = <optimized out>
pd = 0x7fffd9900700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf =

{140736843482880, -3750240832955098642, 0, 140737354125408, 0, 140736843482880, 3750307490811337198, 3750260270414551534}

,
mask_was_saved = 0}}, priv = {pad =

{0x0, 0x0, 0x0, 0x0}

, data =

{prev = 0x0, cleanup = 0x0, canceltype = 0}

}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
--Type <return> to continue, or q <return> to quit--
sp = <optimized out>
freesize = <optimized out>
_PRETTY_FUNCTION_ = "start_thread"
#13 0x00007ffff585107d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.
(gdb)



 Comments   
Comment by Dipti Joshi (Inactive) [ 2015-09-09 ]

geoma Which branch did you build from ? If you built from develop branch - this is unstable branch in development. If you are looking for 1.2.0 - you should use release-1.2 branch.

Comment by Rich Samora [ 2015-09-09 ]

Thanks for the quick response.

The reason I built from the develop branch was instructions by Markus from a chat we had on IRC.

The release version would not work at all and would immediately crash as we'd turn traffic on.

The develop version takes a few good minutes for a crash to happen, so that is a definitive improvement.

Comment by Johan Wikman [ 2015-09-10 ]

geoma Could you provide the exact commit you used, your built binary and the core file, so that I could look around?

Comment by Rich Samora [ 2015-09-10 ]

The tree I used to build it is showing commit e93f922e21cbf7196436ac8ff9013cd1db306cf4 as the latest one.

I cannot provide you with a core unfortunately since I ran it under gdb when it happened. I don't have a core for the previous crash and since it's a production environment I prefer not to run it until this crash is hopefully fixed.

I'll run it again if you must have a core, but I assume the binary won't help you unless I provide you with a matching core.

Do you happen to be on IRC? we could chat about this in real time.

Comment by Johan Wikman [ 2015-09-10 ]

No, the binary won't help unless I have a matching core. The core would be great, but I understand if you don't want to upset you production environment. At the moment I'm not on IRC but I'll set things up so that I can be there shortly. Latest tomorrow.

Comment by Johan Wikman [ 2015-09-10 ]

The malloc error message - "free(): invalid next size (fast) - indicates that the likely cause is memory overwrite. The buffer freed is one that is allocated in gw_send_authentication_to_backend, so a likely source for the error is that there is something wrong with the buffer size calculation in that function.

Comment by Rich Samora [ 2015-09-14 ]

maxscale.cnf

Attaching the config at Markus' request.

Comment by Johan Wikman [ 2015-09-14 ]

Thanks, a core file would be most helpful.

Comment by Johan Wikman [ 2015-09-16 ]

This is very peculiar. From the "bt full" stacktrace

payload = 0x7fffc8001bdf ""
payload_start = 0x7fffc8001b80 "T"
bytes = 88

it seems that 95 bytes is written into the buffer, which is 88 bytes long. That is, there appears to be a buffer overrun, which certainly could explain why free aborts. However, there does not seem to be anything wrong with the calculation of bytes and the manipulation of the buffer.

geoma, we are working on a bugfix release. Could you try the branch release-1.2.1 whether this problem appears with that one as well.

Comment by markus makela [ 2015-09-22 ]

geoma Could you retest on the release-1.2.1 branch? This branch has a fix in it that fixed a double free in the readwritesplit router.

Comment by Rich Samora [ 2015-09-22 ]

I've tested it and it didn't crash, so I think this problem is resolved.

I'm waiting for some changes in our system to do more serious testing to
be 100% sure this issue is gone.

I expect to be able to do this in the next few days.

Comment by Johan Wikman [ 2015-09-29 ]

geoma Have you had time to retest this on the release-1.2.1 branch? We are finalizing that one and it'd be nice if we could close this as having been fixed in 1.2.1.

Comment by Johan Wikman [ 2015-11-24 ]

Since the last report we got was that the problem had disappeared, and since MXS-329 fixes many of these issues, I'll close this bug.

If something similar occurs with 1.3 we will have to reopen this or create a new one.

Generated at Thu Feb 08 03:58:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.