[MXS-4090] Fatal signal 11 after slave down Created: 2022-04-12  Updated: 2022-07-06  Resolved: 2022-07-06

Status: Closed
Project: MariaDB MaxScale
Component/s: Monitor
Affects Version/s: 6.2.3, 6.3.0
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Richard Lee Assignee: markus makela
Resolution: Cannot Reproduce Votes: 0
Labels: crash
Environment:

Ubuntu 20.04.4


Attachments: SearchResults_20220512.txt, maxscale_20220512.cnf, maxscale_20220512.log, maxscale_20220512_1.log, maxscale_20220512_info_reduced.txt

 Description   

The MaxScale service seems to crash and restart after detecting that a slave is no longer running.
I'm not sure what exactly this fatal signal 11 refers to.

2022-04-12 12:28:39 error : Monitor timed out when connecting to server g1sql2[192.168.45.32:3306] : 'Can't connect to server on '192.168.45.32' (110)'
2022-04-12 12:28:39 notice : Server changed state: g1sql2[192.168.45.32:3306]: slave_down. [Slave, Running] -> [Down]
2022-04-12 12:28:40 notice : Server changed state: g1sql2[192.168.45.32:3306]: slave_up. [Down] -> [Slave, Running]
2022-04-12 12:28:40 alert : MaxScale 6.2.3 received fatal signal 11. Commit ID: 1ac7a0ddfdb5e7dc2df5d120b4aa9bf5961f0c3e System name: Linux Release string: Ubuntu 20.04.4 LTS
2022-04-12 12:28:40 alert : Statement currently being classified: none/unknown
2022-04-12 12:28:40 alert : DCB: 0x7f8b34042610 Session: 7 Service: Read-Write-Service
2022-04-12 12:28:40 notice : For a more detailed stacktrace, install GDB and add 'debug=gdb-stacktrace' under the [maxscale] section.
2022-04-12 12:28:41 notice : The systemd watchdog is Enabled. Internal timeout = 30s
2022-04-12 12:28:41 notice : Worker message queue size: 1MiB
2022-04-12 12:28:41 notice : Using up to 1.16GiB of memory for query classifier cache
2022-04-12 12:28:41 notice : syslog logging is disabled.
2022-04-12 12:28:41 notice : maxlog logging is enabled.
2022-04-12 12:28:41 notice : Host: 'g1maxscale1' OS: Linux@5.13.0-37-generic, #42~20.04.1-Ubuntu SMP Tue Mar 15 15:44:28 UTC 2022, x86_64 with 4 processor cores.
2022-04-12 12:28:41 notice : Total usable main memory: 7.77GiB.
2022-04-12 12:28:41 notice : MariaDB MaxScale 6.2.3 started (Commit: 1ac7a0ddfdb5e7dc2df5d120b4aa9bf5961f0c3e)
2022-04-12 12:28:41 notice : MaxScale is running in process 97161
2022-04-12 12:28:41 notice : Configuration file: /etc/maxscale.cnf
2022-04-12 12:28:41 notice : Log directory: /var/log/maxscale



 Comments   
Comment by markus makela [ 2022-04-19 ]

Is there a stacktrace for this or a way to reproduce the crash?

Comment by Richard Lee [ 2022-05-12 ]

Hi, how can I extract a stacktrace? I have just encountered the same issue again, but this time when a slave transitioned from Down to Running.

The slave was down because Slave_SQL_Running was OFF due to a replication error, not because the server was actually powered off.
maxscale_20220512.log
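For reference, I had been checking the slave state on the server itself; a minimal check, assuming a standard replication setup:

  -- run on the slave to see why replication is stopped
  SHOW SLAVE STATUS\G
  -- relevant fields: Slave_IO_Running, Slave_SQL_Running,
  -- Last_SQL_Error, Seconds_Behind_Master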

Comment by markus makela [ 2022-05-12 ]

Can you attach your maxscale.cnf? Make sure you remove any passwords, hostnames, IP addresses or other sensitive information before uploading it.

Comment by Richard Lee [ 2022-05-12 ]

I can replicate it by manually stopping and starting replication on the slave with the "STOP SLAVE" and "START SLAVE" commands (see the snippet below).
When MaxScale detects the slave transitioning from stopped to running, it hits the fatal signal error and restarts itself,
which results in undesirable connection errors on the application side while MaxScale is temporarily unreachable.
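In SQL terms, the repro on the slave is just:

  STOP SLAVE;   -- MaxScale drops the [Slave] label from the server
  -- wait for the monitor to pick up the state change, then:
  START SLAVE;  -- the transition back to [Slave, Running] triggers the crash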

Comment by Richard Lee [ 2022-05-12 ]

maxscale_20220512.cnf
Configuration attached.

Comment by markus makela [ 2022-05-12 ]

Can you remove debug=gdb-stacktrace from the config and see if that generates a stacktrace in the logs?
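For reference, the option sits under the [maxscale] section of maxscale.cnf, along these lines (a sketch):

  [maxscale]
  # shells out to gdb for crash stacktraces; removing/commenting it makes
  # MaxScale fall back to its built-in backtrace (which wants binutils for symbols)
  debug=gdb-stacktrace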

Comment by markus makela [ 2022-05-12 ]

I'd also recommend upgrading to 6.3.0, it has some bug fixes that might help.

Comment by Richard Lee [ 2022-05-12 ]

I originally didn't have that option during the first error, but I can certainly remove it again and re-test.
....
After updating the config file, restarting the service, and triggering the fatal error again, I get the following output in the log file (see attached).
Something about binutils not being installed.
maxscale_20220512_1.log

Comment by Richard Lee [ 2022-05-12 ]

Damn... OK, so I've upgraded to 6.3.0, and I still get the same "binutils not installed" errors that I got in the last logs I uploaded.

Comment by markus makela [ 2022-05-12 ]

The new log has the stacktraces for some reason:

2022-05-12 16:57:39   notice : Server changed state: g1sql2[192.168.45.32:3306]: new_slave. [Running] -> [Slave, Running]
free(): invalid pointer
alert  : MaxScale 6.2.3 received fatal signal 6. Commit ID: 1ac7a0ddfdb5e7dc2df5d120b4aa9bf5961f0c3e System name: Linux Release string: Ubuntu 20.04.4 LTS
 
 
2022-05-12 16:57:39   alert  : (23) (Read-Write-Service); MaxScale 6.2.3 received fatal signal 6. Commit ID: 1ac7a0ddfdb5e7dc2df5d120b4aa9bf5961f0c3e System name: Linux Release string: Ubuntu 20.04.4 LTS
2022-05-12 16:57:39   alert  : (23) (Read-Write-Service); Statement currently being classified: none/unknown
2022-05-12 16:57:39   alert  : (23) (Read-Write-Service); DCB: 0x7f0e48166c20 Session: 23 Service: Read-Write-Service
30	../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f0e6213049e in epoll_wait (epfd=5, events=events@entry=0x7ffda212bf30, maxevents=1000, timeout=993) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
 
Thread 18 (Thread 0x7f0e4cff9700 (LWP 273250)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f0e40001818) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f0e400017c8, cond=0x7f0e400017f0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f0e400017f0, mutex=0x7f0e400017c8) at pthread_cond_wait.c:638
#3  0x00007f0e62455e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0e62a76a0f in std::condition_variable::wait<maxbase::ThreadPool::Thread::main()::<lambda()> > (__p=..., __lock=..., this=0x7f0e400017f0) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/threadpool.cc:62
#5  maxbase::ThreadPool::Thread::main (this=0x7f0e40001770) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/threadpool.cc:65
#6  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 17 (Thread 0x7f0e4d7fa700 (LWP 273249)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f0e400013a8) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f0e40001358, cond=0x7f0e40001380) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f0e40001380, mutex=0x7f0e40001358) at pthread_cond_wait.c:638
#3  0x00007f0e62455e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0e62a76a0f in std::condition_variable::wait<maxbase::ThreadPool::Thread::main()::<lambda()> > (__p=..., __lock=..., this=0x7f0e40001380) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/threadpool.cc:62
#5  maxbase::ThreadPool::Thread::main (this=0x7f0e40001300) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/threadpool.cc:65
#6  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 16 (Thread 0x7f0e4dffb700 (LWP 273248)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f0e40000f38) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f0e40000ee8, cond=0x7f0e40000f10) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f0e40000f10, mutex=0x7f0e40000ee8) at pthread_cond_wait.c:638
#3  0x00007f0e62455e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0e62a76a0f in std::condition_variable::wait<maxbase::ThreadPool::Thread::main()::<lambda()> > (__p=..., __lock=..., this=0x7f0e40000f10) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/threadpool.cc:62
#5  maxbase::ThreadPool::Thread::main (this=0x7f0e40000e90) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/threadpool.cc:65
#6  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 15 (Thread 0x7f0e4e7fc700 (LWP 273247)):
#0  0x00007f0e6213049e in epoll_wait (epfd=29, events=events@entry=0x7f0e4e7f64d0, maxevents=1000, timeout=999) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0e62bc9e71 in maxbase::Worker::poll_waitevents (this=this@entry=0x559fd7d3ce18) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:798
#2  0x00007f0e62bca123 in maxbase::Worker::run (this=0x559fd7d3ce18, pSem=0x7ffda212bd20) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:561
#3  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 14 (Thread 0x7f0e4effd700 (LWP 273246)):
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f0e4effa3f0, clockid=<optimized out>, expected=0, futex_word=0x7f0e62d5d730 <(anonymous namespace)::this_unit+144>) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f0e4effa3f0, clockid=<optimized out>, mutex=0x7f0e62d5d738 <(anonymous namespace)::this_unit+152>, cond=0x7f0e62d5d708 <(anonymous namespace)::this_unit+104>) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=0x7f0e62d5d708 <(anonymous namespace)::this_unit+104>, mutex=0x7f0e62d5d738 <(anonymous namespace)::this_unit+152>, abstime=0x7f0e4effa3f0) at pthread_cond_wait.c:656
#3  0x00007f0e62b74551 in __gthread_cond_timedwait (__abs_timeout=0x7f0e4effa3f0, __mutex=0x7f0e62d5d738 <(anonymous namespace)::this_unit+152>, __cond=0x7f0e62d5d708 <(anonymous namespace)::this_unit+104>) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:872
#4  std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=<synthetic pointer>..., __lock=<synthetic pointer>..., this=<optimized out>) at /usr/include/c++/9/condition_variable:188
#5  std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=<synthetic pointer>..., __lock=<synthetic pointer>..., this=<optimized out>) at /usr/include/c++/9/condition_variable:121
#6  std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> >, HttpSql::ConnectionManager::cleanup_thread_func()::<lambda()> > (__p=..., __atime=..., __lock=..., this=<optimized out>) at /usr/include/c++/9/condition_variable:138
#7  HttpSql::ConnectionManager::cleanup_thread_func (this=0x7f0e62d5d6a0 <(anonymous namespace)::this_unit>) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/sql_conn_manager.cc:163
#8  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 13 (Thread 0x7f0e4f7fe700 (LWP 273245)):
#0  0x00007f0e6213049e in epoll_wait (epfd=35, events=events@entry=0x7f0e4f7fa820, maxevents=maxevents@entry=128, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0e62c92cba in MHD_epoll (daemon=daemon@entry=0x559fd7d44c90, may_block=<optimized out>, may_block@entry=1) at daemon.c:4467
#2  0x00007f0e62c93f5f in MHD_polling_thread (cls=0x559fd7d44c90) at daemon.c:4746
#3  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 12 (Thread 0x7f0e4ffff700 (LWP 273244)):
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f0e4fffc400, clockid=<optimized out>, expected=0, futex_word=0x559fd7d448b4) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f0e4fffc400, clockid=<optimized out>, mutex=0x559fd7d448b8, cond=0x559fd7d44888) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=0x559fd7d44888, mutex=0x559fd7d448b8, abstime=0x7f0e4fffc400) at pthread_cond_wait.c:656
#3  0x00007f0e62bb4aa0 in __gthread_cond_timedwait (__abs_timeout=0x7f0e4fffc400, __mutex=0x559fd7d448b8, __cond=0x559fd7d44888) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:872
#4  std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=<synthetic pointer>..., __lock=<synthetic pointer>..., this=<optimized out>) at /usr/include/c++/9/condition_variable:188
#5  std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (__atime=<synthetic pointer>..., __lock=<synthetic pointer>..., this=<optimized out>) at /usr/include/c++/9/condition_variable:121
#6  std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> >, MariaDBUserManager::updater_thread_function()::<lambda()> > (__p=..., __atime=..., __lock=..., this=<optimized out>) at /usr/include/c++/9/condition_variable:138
#7  MariaDBUserManager::updater_thread_function (this=0x559fd7d44830) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/protocol/MariaDB/user_data.cc:247
#8  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 11 (Thread 0x7f0e5cdf5700 (LWP 273243)):
#0  __GI___libc_read (nbytes=4096, buf=0x7f0e484da680, fd=162) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=162, buf=0x7f0e484da680, nbytes=4096) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f0e620a1bcf in _IO_new_file_underflow (fp=0x7f0e482568b0) at libioP.h:948
#3  0x00007f0e620a32c4 in __GI___underflow (fp=0x7f0e482568b0) at libioP.h:948
#4  __GI___underflow (fp=0x7f0e482568b0) at genops.c:268
#5  __GI__IO_default_xsgetn (fp=0x7f0e482568b0, data=<optimized out>, n=511) at genops.c:443
#6  0x00007f0e62093f13 in __GI__IO_fread (buf=buf@entry=0x7f0e5cdedda0, size=size@entry=1, count=count@entry=511, fp=fp@entry=0x7f0e482568b0) at iofread.c:38
#7  0x0000559fd6d16fcb in fread (__stream=0x7f0e482568b0, __n=511, __size=1, __ptr=0x7f0e5cdedda0) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:297
#8  (anonymous namespace)::get_command_output_cb (cb=cb@entry=0x559fd6d131d0 <<lambda(char const*)>::_FUN(const char *)>, format=0x559fd6d1b2c8 "gdb --pid=%d -batch -nx -iex 'set auto-load off' -iex 'set print thread-events off' -ex 'thr a a bt'", format=0x559fd6d1b2c8 "gdb --pid=%d -batch -nx -iex 'set auto-load off' -iex 'set print thread-events off' -ex 'thr a a bt'") at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/stacktrace.cc:78
#9  0x0000559fd6d1780c in maxbase::dump_gdb_stacktrace (handler=0x559fd6d131d0 <<lambda(char const*)>::_FUN(const char *)>) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/stacktrace.cc:230
#10 0x0000559fd6d12edc in sigfatal_handler (i=6) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/gateway.cc:485
#11 <signal handler called>
#12 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#13 0x00007f0e62033859 in __GI_abort () at abort.c:79
#14 0x00007f0e6209e29e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f0e621c8298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#15 0x00007f0e620a632c in malloc_printerr (str=str@entry=0x7f0e621c64c1 "free(): invalid pointer") at malloc.c:5347
#16 0x00007f0e620a7b5c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
#17 0x00007f0e62a80b1c in __gnu_cxx::new_allocator<maxscale::Backend::response_type>::deallocate (this=0x7f0e480536a0, __p=<optimized out>) at /usr/include/c++/9/ext/new_allocator.h:119
#18 std::allocator_traits<std::allocator<maxscale::Backend::response_type> >::deallocate (__a=..., __n=128, __p=<optimized out>) at /usr/include/c++/9/bits/alloc_traits.h:469
#19 std::_Deque_base<maxscale::Backend::response_type, std::allocator<maxscale::Backend::response_type> >::_M_deallocate_node (this=0x7f0e480536a0, __p=<optimized out>) at /usr/include/c++/9/bits/stl_deque.h:627
#20 std::deque<maxscale::Backend::response_type, std::allocator<maxscale::Backend::response_type> >::_M_pop_front_aux (this=0x7f0e480536a0) at /usr/include/c++/9/bits/deque.tcc:576
#21 std::deque<maxscale::Backend::response_type, std::allocator<maxscale::Backend::response_type> >::pop_front (this=0x7f0e480536a0) at /usr/include/c++/9/bits/stl_deque.h:1616
#22 std::deque<maxscale::Backend::response_type, std::allocator<maxscale::Backend::response_type> >::pop_front (this=0x7f0e480536a0) at /usr/include/c++/9/bits/stl_deque.h:1605
#23 maxscale::Backend::ack_write (this=this@entry=0x7f0e48053620) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/backend.cc:124
#24 0x00007f0e5c5d4392 in RWSplitSession::clientReply (this=0x7f0e480698f0, writebuf=0x7f0e482a80d0, down=..., reply=...) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/readwritesplit/rwsplitsession.cc:586
#25 0x00007f0e62b5258c in ServiceEndpoint::clientReply (this=0x7f0e48051500, buffer=0x7f0e482a80d0, down=..., reply=...) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/service.cc:1701
#26 0x00007f0e62b44d82 in ServerEndpoint::clientReply (this=0x7f0e481124b0, buffer=0x7f0e482a80d0, down=..., reply=...) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/server.cc:1073
#27 0x00007f0e62b86a57 in MariaDBBackendConnection::normal_read (this=0x7f0e484865f0) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/protocol/MariaDB/mariadb_backend.cc:759
#28 0x00007f0e62acba2c in DCB::process_events (events=1, this=0x7f0e48166c20) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/dcb.cc:1261
#29 DCB::process_events (this=0x7f0e48166c20, events=1) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/dcb.cc:1195
#30 0x00007f0e62acbb61 in DCB::event_handler (dcb=0x7f0e48166c20, events=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/dcb.cc:1326
#31 0x00007f0e62bc9f81 in maxbase::Worker::poll_waitevents (this=this@entry=0x559fd7d0f320) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:854
#32 0x00007f0e62bca123 in maxbase::Worker::run (this=0x559fd7d0f320, pSem=0x7ffda212ee40) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:561
#33 0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#34 0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#35 0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 10 (Thread 0x7f0e5d5f6700 (LWP 273242)):
#0  0x00007f0e6213049e in epoll_wait (epfd=21, events=events@entry=0x7f0e5d5f04d0, maxevents=1000, timeout=763) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0e62bc9e71 in maxbase::Worker::poll_waitevents (this=this@entry=0x559fd7d0ea50) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:798
#2  0x00007f0e62bca123 in maxbase::Worker::run (this=0x559fd7d0ea50, pSem=0x7ffda212ee40) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:561
#3  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 9 (Thread 0x7f0e5ddf7700 (LWP 273241)):
#0  0x00007f0e6213049e in epoll_wait (epfd=17, events=events@entry=0x7f0e5ddf14d0, maxevents=1000, timeout=743) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0e62bc9e71 in maxbase::Worker::poll_waitevents (this=this@entry=0x559fd7d0e280) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:798
#2  0x00007f0e62bca123 in maxbase::Worker::run (this=0x559fd7d0e280, pSem=0x7ffda212ee40) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:561
#3  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 8 (Thread 0x7f0e5e5f8700 (LWP 273240)):
#0  0x00007f0e6213049e in epoll_wait (epfd=13, events=events@entry=0x7f0e5e5f24d0, maxevents=1000, timeout=550) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0e62bc9e71 in maxbase::Worker::poll_waitevents (this=this@entry=0x559fd7d0db80) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:798
#2  0x00007f0e62bca123 in maxbase::Worker::run (this=0x559fd7d0db80, pSem=0x7ffda212ee40) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:561
#3  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 7 (Thread 0x7f0e5edf9700 (LWP 273239)):
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f0e5edf6400, clockid=<optimized out>, expected=0, futex_word=0x559fd7d0f9b4) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f0e5edf6400, clockid=<optimized out>, mutex=0x559fd7d0f958, cond=0x559fd7d0f988) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x559fd7d0f988, mutex=0x559fd7d0f958, abstime=abstime@entry=0x7f0e5edf6400) at pthread_cond_wait.c:656
#3  0x00007f0e62bc8605 in maxbase::ConditionVariable::wait_for (d=..., guard=..., this=0x559fd7d0f980) at /usr/include/c++/9/bits/std_mutex.h:123
#4  maxbase::WatchdogNotifier::Dependent::Ticker::run (this=0x559fd7d0f940) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/watchdognotifier.cc:95
#5  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 6 (Thread 0x7f0e5f5fa700 (LWP 273238)):
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f0e5f5f7400, clockid=<optimized out>, expected=0, futex_word=0x559fd7d0f0e0) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f0e5f5f7400, clockid=<optimized out>, mutex=0x559fd7d0f088, cond=0x559fd7d0f0b8) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x559fd7d0f0b8, mutex=0x559fd7d0f088, abstime=abstime@entry=0x7f0e5f5f7400) at pthread_cond_wait.c:656
#3  0x00007f0e62bc8605 in maxbase::ConditionVariable::wait_for (d=..., guard=..., this=0x559fd7d0f0b0) at /usr/include/c++/9/bits/std_mutex.h:123
#4  maxbase::WatchdogNotifier::Dependent::Ticker::run (this=0x559fd7d0f070) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/watchdognotifier.cc:95
#5  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 5 (Thread 0x7f0e5fdfb700 (LWP 273237)):
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f0e5fdf8400, clockid=<optimized out>, expected=0, futex_word=0x559fd7d0e844) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f0e5fdf8400, clockid=<optimized out>, mutex=0x559fd7d0e7e8, cond=0x559fd7d0e818) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x559fd7d0e818, mutex=0x559fd7d0e7e8, abstime=abstime@entry=0x7f0e5fdf8400) at pthread_cond_wait.c:656
#3  0x00007f0e62bc8605 in maxbase::ConditionVariable::wait_for (d=..., guard=..., this=0x559fd7d0e810) at /usr/include/c++/9/bits/std_mutex.h:123
#4  maxbase::WatchdogNotifier::Dependent::Ticker::run (this=0x559fd7d0e7d0) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/watchdognotifier.cc:95
#5  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 4 (Thread 0x7f0e605fc700 (LWP 273236)):
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f0e605f9400, clockid=<optimized out>, expected=0, futex_word=0x559fd7d0bbd0) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f0e605f9400, clockid=<optimized out>, mutex=0x559fd7d0bb78, cond=0x559fd7d0bba8) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x559fd7d0bba8, mutex=0x559fd7d0bb78, abstime=abstime@entry=0x7f0e605f9400) at pthread_cond_wait.c:656
#3  0x00007f0e62bc8605 in maxbase::ConditionVariable::wait_for (d=..., guard=..., this=0x559fd7d0bba0) at /usr/include/c++/9/bits/std_mutex.h:123
#4  maxbase::WatchdogNotifier::Dependent::Ticker::run (this=0x559fd7d0bb60) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/watchdognotifier.cc:95
#5  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 3 (Thread 0x7f0e60dfd700 (LWP 273235)):
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f0e60dfa400, clockid=<optimized out>, expected=0, futex_word=0x7ffda212f0f8) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f0e60dfa400, clockid=<optimized out>, mutex=0x7ffda212f0a0, cond=0x7ffda212f0d0) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x7ffda212f0d0, mutex=0x7ffda212f0a0, abstime=abstime@entry=0x7f0e60dfa400) at pthread_cond_wait.c:656
#3  0x00007f0e62bc7e1a in maxbase::ConditionVariable::wait_for (d=..., guard=..., this=0x7ffda212f0c8) at /usr/include/c++/9/bits/std_mutex.h:123
#4  maxbase::WatchdogNotifier::run (this=0x7ffda212f090) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/watchdognotifier.cc:215
#5  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 2 (Thread 0x7f0e616b5700 (LWP 273234)):
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7f0e616b2400, clockid=<optimized out>, expected=0, futex_word=0x559fd7cea490) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7f0e616b2400, clockid=<optimized out>, mutex=0x559fd7cea438, cond=0x559fd7cea468) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x559fd7cea468, mutex=0x559fd7cea438, abstime=abstime@entry=0x7f0e616b2400) at pthread_cond_wait.c:656
#3  0x00007f0e62bc8605 in maxbase::ConditionVariable::wait_for (d=..., guard=..., this=0x559fd7cea460) at /usr/include/c++/9/bits/std_mutex.h:123
#4  maxbase::WatchdogNotifier::Dependent::Ticker::run (this=0x559fd7cea420) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/watchdognotifier.cc:95
#5  0x00007f0e6245bde4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f0e62226609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f0e62130163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 1 (Thread 0x7f0e616d21c0 (LWP 273233)):
#0  0x00007f0e6213049e in epoll_wait (epfd=5, events=events@entry=0x7ffda212bf30, maxevents=1000, timeout=993) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f0e62bc9e71 in maxbase::Worker::poll_waitevents (this=this@entry=0x7ffda212f440) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:798
#2  0x00007f0e62bca123 in maxbase::Worker::run (this=0x7ffda212f440, pSem=0x0) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:561
#3  0x0000559fd6d10c95 in maxbase::Worker::run (this=0x7ffda212f440) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/include/maxbase/worker.hh:447
#4  main (argc=<optimized out>, argv=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/gateway.cc:2239
[Inferior 1 (process 273233) detached]
alert  : Writing core dump.

and the second one:

alert  : MaxScale 6.2.3 received fatal signal 11. Commit ID: 1ac7a0ddfdb5e7dc2df5d120b4aa9bf5961f0c3e System name: Linux Release string: Ubuntu 20.04.4 LTS
 
 
2022-05-12 17:28:49   alert  : (Read-Write-Service); MaxScale 6.2.3 received fatal signal 11. Commit ID: 1ac7a0ddfdb5e7dc2df5d120b4aa9bf5961f0c3e System name: Linux Release string: Ubuntu 20.04.4 LTS
2022-05-12 17:28:49   alert  : (Read-Write-Service); Statement currently being classified: none/unknown
2022-05-12 17:28:49   notice : (Read-Write-Service); For a more detailed stacktrace, install GDB and add 'debug=gdb-stacktrace' under the [maxscale] section.
  /lib/x86_64-linux-gnu/libc.so.6(cfree+0x20) [0x7fdcf909b720]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale7Backend9ack_writeEv+0x5c) [0x7fdcf9a70b1c]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale7Backend5closeENS0_10close_typeE+0x48) [0x7fdcf9a70b98]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSessionD1Ev+0x89) [0x7fdcf45c3b19]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSessionD0Ev+0x16) [0x7fdcf45c4156]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN15ServiceEndpoint5closeEv+0x8f) [0x7fdcf9b427af]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN9ClientDCB8shutdownEv+0x25) [0x7fdcf9abb095]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN3DCB7destroyEv+0x5d) [0x7fdcf9abb47d]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale13RoutingWorker14delete_zombiesEv+0x61) [0x7fdcf9b295b1]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale13RoutingWorker10epoll_tickEv+0x21) [0x7fdcf9b2a3d1]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x2f0) [0x7fdcf9bba010]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x53) [0x7fdcf9bba123]: <binutils not installed>
  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4) [0x7fdcf944bde4]: <binutils not installed>
  /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fdcf9216609]: <binutils not installed>
  /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fdcf9120163]: <binutils not installed>
alert  : Writing core dump.

Comment by markus makela [ 2022-05-12 ]

Can you add log_info=true under the [maxscale] section and try it again? This would give us information about what was happening when the crash happened.
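In maxscale.cnf that's simply:

  [maxscale]
  # log INFO-level messages to capture what led up to the crash
  log_info=true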

Comment by Richard Lee [ 2022-05-12 ]

I have added info logging. It's also interesting to note that it doesn't crash when I STOP/START the slave while the slave is set to "Drained".
I had drained it because it was significantly lagging behind the master and I didn't want client connections going to it.

I have tried stopping and starting the slave a few times now, and it no longer seems to cause the fatal error.
Perhaps the issue is linked to the slave having Seconds_Behind_Master > 0?

EDIT: Very strange. The first restart of the slave showed 0 seconds behind master in both the server's SHOW SLAVE STATUS and MaxScale.
The second restart showed the expected ~90000s behind master, which resulted in the MaxScale fatal error.

The logs are too large to attach now (20MB after zipping). Any suggestions?

Comment by markus makela [ 2022-05-12 ]

You can take the last few hundred lines and attach it here. Once we see those we can determine if more are needed.

Comment by Richard Lee [ 2022-05-25 ]

maxscale_20220512_info_reduced.txt SearchResults_20220512.txt
Sorry for the delay. It was a bit of a pest combing through the results. See attached for a reduced output log (info logging). I've copy-pasted the relevant sections from the main log file, which was 900MB...

The "Search results" file will point you to the useful parts to look at. I've separated the distinct sections with " ---- *BREAK* -----".
There were a few instances of the slave going down and coming back. Some of them caused the fatal error, some didn't. It was, and still is, not clear to me what differed between the situations.

Comment by markus makela [ 2022-05-25 ]

First of all, I noticed that max_sescmd_history=1500 is way too much in your case. You'd probably be OK with the default of 50, since you have prune_sescmd_history enabled. It seems there are about 1500 statements of SET autocommit=1 and SET autocommit=0 that end up being executed every time a reconnection happens. Even the default of 50 is probably overkill in this particular case, as it's only the autocommit state that's being changed, but 50 is still much better than 1500. In config terms, something like the sketch below.
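A sketch, with the section and parameter names taken from the attached config (adjust to match the rest of it):

  [Read-Write-Service]
  router=readwritesplit
  # default is 50; at 1500, roughly 1500 SET autocommit statements
  # were being replayed on every reconnection
  max_sescmd_history=50
  prune_sescmd_history=true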

However, I did notice that the first SQL statement that it executes is this:

set autocommit=1, session_track_schema=1, session_track_system_variables='auto_increment_increment' , sql_mode = concat(@@sql_mode,',STRICT_TRANS_TABLES')

This means that if the history ends up being pruned, this statement would be lost. I think there's room for optimization in MaxScale where it would remove redundant statements from the history. Otherwise you run the risk of losing that original statement that prepares the session.

With this in mind, I'd recommend configuring your connection pool to have a reasonable maximum lifetime for the connections in the pool. This will prevent excessive history accumulation as well as make sure that the initial session state SQL (the one above) is always stored.
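As an illustration only — which pool the applications use is an assumption on my part — a JDBC pool such as HikariCP caps pooled connection lifetime like this:

  # hypothetical HikariCP settings; property names differ for other pools
  maxLifetime=1800000   # recycle pooled connections after 30 minutes
  idleTimeout=600000    # close connections idle for more than 10 minutes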

Comment by markus makela [ 2022-05-25 ]

I created an artificial setup where the backends would stop the response processing for history commands too early, and I was able to reproduce an almost identical stacktrace. This means that if something caused the history processing to stop earlier than it should, it would result in this crash.

Comment by markus makela [ 2022-05-25 ]

If this was caused by a command in the history generating more than one response, the error would be different: it would fail with an "Unexpected result" error instead of crashing where it does.

Comment by markus makela [ 2022-05-25 ]

richNZ99 Just to make sure this is still a problem in 6.3.1, would it be possible for you to upgrade to it and verify that it still happens? In addition, have you found a way to reliably reproduce the crash outside of your current environment?

Comment by markus makela [ 2022-05-25 ]

Also, would you happen to know what sort of connector your applications use? I'm assuming they use some sort of connection pool, so my first guess would be a JDBC driver.

Comment by Richard Lee [ 2022-05-25 ]

Thanks for all the work on this, Markus. I must admit I'm not 100% sure I understand the session command history functionality.
I don't believe there are many (if any) commands coming from the clients that affect session variables, if that's what it's all about.
Most of the queries are pretty straightforward SELECTs.
Is there a way to easily track what sort of statements are being sent that would be affected by this session command history pruning?

Also note that I have already updated to MaxScale 6.3.1. Unfortunately I haven't kept a note of when I did that. I thought these recent logs were from 6.3.1, but it seems they weren't.
I'll try to replicate again.

The connectors use the mariadb.jdbc driver.

Comment by Richard Lee [ 2022-05-25 ]

Also, I noticed some annoying behavior the other day where a slave was lagging and a client read from the slave got a stale result.
I've tried to use the "max_slave_replication_lag" setting, set to 60s, to prevent this. But I noticed that even when a slave exceeds this threshold, it takes a long time to wind down and drain connections, to the point where the setting isn't really very effective.

Is there a way to configure the read-write-service to force reads of specific databases/tables to go to the master instead of a slave?
I was hoping the "SchemaRouter" would be what I wanted for this, but I don't believe it is.
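One option worth checking, assuming a hintfilter is added to the service's filter chain: MaxScale routing hints can pin individual statements to the master, e.g.

  -- assumes a hintfilter instance is listed in the service's filters=
  SELECT * FROM lag_sensitive_table WHERE id = 1; -- maxscale route to master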

Comment by Richard Lee [ 2022-05-25 ]

Ah, I see that my "Read-Write-Service" router diagnostics reported an "avg_sescmd_history_length" of 27 for the last 21 hours.
So changing this to 50 should be fine.

I have updated "max_sescmd_history" to 50 (from 1500) and "connection_timeout" to 300000 (from 900000).

Comment by Richard Lee [ 2022-05-26 ]

I did manage to replicate the fatal error straight after posting my first message, but after changing the settings above, I can no longer replicate it.
So perhaps it is indeed related to the "max_sescmd_history" setting.

Comment by markus makela [ 2022-07-06 ]

richNZ99 any updates? Have you been able to try this with the latest MaxScale release?

Comment by markus makela [ 2022-07-06 ]

I'll close this as Cannot Reproduce. If you can reproduce this again with the latest version 6 release, we'll reopen it.

I added some of the failsafe code back to readwritesplit that was removed in version 6. This should prevent the crash while still providing us with more information once the 6.4.1 release is made.

Comment by Richard Lee [ 2022-07-06 ]

Hi Markus,

I'm on 6.4.0 and I don't seem to have issues anymore. I tried shutting down a slave and powering it up again; the server state changed as expected from [Slave, Running] -> [Down] -> [Slave, Running] without issue.

Cheers

Richard
