[MXS-4172] Hang in RWSplitSession::correct_packet_sequence Created: 2022-06-22  Updated: 2022-08-10  Resolved: 2022-07-11

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 6.3.1, 6.4.0
Fix Version/s: 6.4.1

Type: Bug Priority: Critical
Reporter: Pramod Mahto Assignee: markus makela
Resolution: Fixed Votes: 1
Labels: None

Sprint: MXS-SPRINT-161

 Description   

Maxscale crash occasionally with below crash details Version : 6.3.1 & 6.4.0 :-

 
2022-06-22 00:05:57.105 alert : (sigfatal_handler): MaxScale 6.4.0 received fatal signal 6. Commit ID: 6bbea45dd24f39ccf8effbfaec15adaf295d70a5 System name: Linux Release string: Red Hat Enterprise Linux release 8.1 (Ootpa)
2022-06-22 00:05:57.105 alert : (sigfatal_handler): Statement currently being classified: none/unknown
2022-06-22 00:05:57.105 notice : (sigfatal_handler): For a more detailed stacktrace, install GDB and add 'debug=gdb-stacktrace' under the [maxscale] section.
/lib64/libpthread.so.0(+0x10e56): sem_wait.c:?
/lib64/libpthread.so.0(+0x10f48): sem_wait.c:?
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale13RoutingWorker20execute_concurrentlyERKSt8functionIFvvEE+0x52): maxutils/maxbase/include/maxbase/semaphore.hh:146
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(+0x1b69b3): /usr/include/c++/8/bits/std_function.h:256
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker14handle_messageERNS_12MessageQueueERKNS_19MessageQueueMessageE+0xa9): maxutils/maxbase/src/worker.cc:474
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase12MessageQueue18handle_poll_eventsEPNS_6WorkerEj+0x98): maxutils/maxbase/src/messagequeue.cc:307
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x219): maxutils/maxbase/src/worker.cc:857
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x4f): maxutils/maxbase/src/worker.cc:558
/usr/bin/maxscale(main+0x1f8a): server/core/gateway.cc:2240
/lib64/libc.so.6(__libc_start_main+0xf3): ??:?
/usr/bin/maxscale(_start+0x2e): ??:?
alert : Writing core dump.



 Comments   
Comment by Pramod Mahto [ 2022-06-22 ]

Maxscale log just before crash happen :-

 
2022-06-21 23:19:32.499   warning: [mariadbmon] (monitor_server): Error during monitor update of server 'HOSTNAME_XXX': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'.
2022-06-21 23:19:32.622   warning: [mariadbmon] (operator()): Lost the lock 'maxscale_mariadbmonitor' on server 'HOSTNAME_XXX' without releasing it. The lock is now owned by connection 561932.
2022-06-21 23:19:57.448   warning: (642070) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 30 seconds, which is above the configured limit 3s. 'HOSTNAME_XXX' is excluded from query routing.
2022-06-21 23:20:03.308   warning: (642076) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 0 seconds, which is below the configured limit 3s. 'HOSTNAME_XXX' is returned to query routing.
2022-06-21 23:30:18.195   warning: [mariadbmon] (monitor_server): Error during monitor update of server 'HOSTNAME_XXX': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'.
2022-06-21 23:30:18.314   warning: [mariadbmon] (operator()): Lost the lock 'maxscale_mariadbmonitor' on server 'HOSTNAME_XXX' without releasing it. The lock is now owned by connection 562137.
2022-06-21 23:30:44.774   warning: (643050) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 28 seconds, which is above the configured limit 3s. 'HOSTNAME_XXX' is excluded from query routing.
2022-06-21 23:30:47.964   warning: (643055) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 0 seconds, which is below the configured limit 3s. 'HOSTNAME_XXX' is returned to query routing.
2022-06-21 23:41:26.547   warning: [mariadbmon] (monitor_server): Error during monitor update of server 'HOSTNAME_XXX': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'.
2022-06-21 23:41:26.665   warning: [mariadbmon] (operator()): Lost the lock 'maxscale_mariadbmonitor' on server 'HOSTNAME_XXX' without releasing it. The lock is now owned by connection 562280.
2022-06-21 23:41:52.542   warning: (644051) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 28 seconds, which is above the configured limit 3s. 'HOSTNAME_XXX' is excluded from query routing.
2022-06-21 23:41:55.222   warning: (644055) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 0 seconds, which is below the configured limit 3s. 'HOSTNAME_XXX' is returned to query routing.
2022-06-21 23:49:34.197   warning: (644740) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 3 seconds, which is above the configured limit 3s. 'HOSTNAME_XXX' is excluded from query routing.
2022-06-21 23:50:04.872   warning: (644780) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 0 seconds, which is below the configured limit 3s. 'HOSTNAME_XXX' is returned to query routing.
2022-06-22 00:02:45.011   warning: [mariadbmon] (monitor_server): Error during monitor update of server 'HOSTNAME_XXX': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'.
2022-06-22 00:02:45.134   warning: [mariadbmon] (operator()): Lost the lock 'maxscale_mariadbmonitor' on server 'HOSTNAME_XXX' without releasing it. The lock is now owned by connection 562438.
2022-06-22 00:04:42.218   warning: (646095) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 3 seconds, which is above the configured limit 3s. 'HOSTNAME_XXX' is excluded from query routing.
2022-06-22 00:05:11.073   warning: (646142) (Read-Write-Service); (set_rlag_state): Replication lag of 'HOSTNAME_XXX' is 0 seconds, which is below the configured limit 3s. 'HOSTNAME_XXX' is returned to query routing.
alert  : MaxScale 6.4.0 received fatal signal 6. Commit ID: 6bbea45dd24f39ccf8effbfaec15adaf295d70a5 System name: Linux Release string: Red Hat Enterprise Linux release 8.1 (Ootpa)

Comment by markus makela [ 2022-06-23 ]

We'll need the output from the crash with debug=gdb-stacktrace added to know why this happened.

Comment by Pramod Mahto [ 2022-06-26 ]

markus makela below is the crash detail after adding debug=gdb-stacktrace with in maxscale configuration file [MaxScale 6.4.0]

 
Thread 1 (Thread 0x7f5bca8c93c0 (LWP 3622917)):
#0  0x00007f5bc70115d4 in read () from /lib64/libc.so.6
#1  0x00007f5bc6fa1418 in __GI__IO_file_underflow () from /lib64/libc.so.6
#2  0x00007f5bc6fa2755 in _IO_default_xsgetn () from /lib64/libc.so.6
#3  0x00007f5bc6f955eb in fread () from /lib64/libc.so.6
#4  0x000000000040f0e3 in (anonymous namespace)::get_command_output_cb (cb=cb@entry=0x40b5d0 <<lambda(char const*)>::_FUN(const char *)>, format=0x412b88 "gdb --pid=%d -batch -nx -iex 'set auto-load off' -iex 'set print thread-events off' -ex 'thr a a bt'", format=0x412b88 "gdb --pid=%d -batch -nx -iex 'set auto-load off' -iex 'set print thread-events off' -ex 'thr a a bt'") at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/stacktrace.cc:78
#5  0x000000000040f7e7 in maxbase::dump_gdb_stacktrace (handler=0x40b5d0 <<lambda(char const*)>::_FUN(const char *)>) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/stacktrace.cc:230
#6  0x000000000040b2d1 in sigfatal_handler (i=6) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/gateway.cc:485
#7  <signal handler called>
#8  0x00007f5bc7c9fcd6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#9  0x00007f5bc7c9fdc8 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#10 0x00007f5bca09edb2 in maxbase::Semaphore::wait (signal_approach=<optimized out>, this=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/include/maxbase/semaphore.hh:144
#11 maxbase::Semaphore::wait_n (signal_approach=<optimized out>, n_wait=<optimized out>, this=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/include/maxbase/semaphore.hh:176
#12 maxscale::RoutingWorker::execute_concurrently(std::function<void ()> const&) (func=...) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/routingworker.cc:1190
#13 0x00007f5bca0ad9b3 in maxscale::WorkerGlobal<std::unordered_map<unsigned int, unsigned long, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > > >::assign (t=..., this=0xec8348) at /usr/include/c++/8/new:169
#14 Server::<lambda()>::operator() (__closure=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/server.cc:667
#15 std::_Function_handler<void(), Server::set_gtid_list(const std::vector<std::pair<unsigned int, long unsigned int> >&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/8/bits/std_function.h:297
#16 0x00007f5bca1282b9 in std::function<void ()>::operator()() const (this=0x7f5b7c025398) at /usr/include/c++/8/bits/std_function.h:682
#17 maxbase::Worker::CustomTask::execute (worker=..., this=0x7f5b7c025390) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:488
#18 maxbase::Worker::handle_message (this=<optimized out>, queue=..., msg=...) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:651
#19 0x00007f5bca12b5d8 in maxbase::MessageQueue::handle_poll_events (pWorker=<optimized out>, events=<optimized out>, this=0xe83200) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/messagequeue.cc:307
#20 maxbase::MessageQueue::handle_poll_events (this=0xe83200, pWorker=<optimized out>, events=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/messagequeue.cc:286
#21 0x00007f5bca1294a9 in maxbase::Worker::poll_waitevents (this=this@entry=0x7fffa8ba1190) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:848
#22 0x00007f5bca1296bf in maxbase::Worker::run (this=this@entry=0x7fffa8ba1190, pSem=pSem@entry=0x0) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:554
#23 0x0000000000409c0a in maxbase::Worker::run (this=0x7fffa8ba1190) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/include/maxbase/worker.hh:451
#24 main (argc=<optimized out>, argv=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/gateway.cc:2239
[Inferior 1 (process 3622917) detached]
alert  : Writing core dump.

 

Thread 21 (Thread 0x7f06c235a700 (LWP 4010870)):
#0 0x00007f06e6ea4cd6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
#1 0x00007f06e6ea4dc8 in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x00007f06e932db72 in maxbase::Semaphore::wait (signal_approach=<optimized out>, this=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/include/maxbase/semaphore.hh:144
#3 maxbase::Worker::call(std::function<void ()> const&, maxbase::Worker::execute_mode_t) (this=this@entry=0x7ffc8a4f9410, func=..., mode=mode@entry=maxbase::Worker::EXECUTE_AUTO) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:519
#4 0x00007f06e92972d7 in resource_handle_request (request=...) at /usr/include/c++/8/bits/std_function.h:87
#5 0x00007f06e91f7c54 in Client::process (this=0x7f06a802ffc0, url=..., method=..., upload_data=<optimized out>, upload_size=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/admin.cc:729
#6 0x00007f06e91fa2af in Client::handle (this=0x7f06a802ffc0, url=..., method=..., upload_data=0x0, upload_data_size=0x7f06c2356518) at /usr/include/c++/8/bits/basic_string.h:940
#7 0x00007f06e91fa3cf in (anonymous namespace)::handle_client (cls=<optimized out>, connection=<optimized out>, url=0x7f06a80635c4 "/servers", method=<optimized out>, version=<optimized out>, upload_data=0x0, upload_data_size=0x7f06c2356518, con_cls=0x7f06a803b1a8) at /usr/include/c++/8/bits/basic_string.h:256
#8 0x00007f06e93e1d7f in call_connection_handler (connection=connection@entry=0x7f06a803b150) at connection.c:2221
#9 0x00007f06e93e3880 in MHD_connection_handle_idle (connection=connection@entry=0x7f06a803b150) at connection.c:3401
#10 0x00007f06e93e575e in call_handlers (con=con@entry=0x7f06a803b150, read_ready=<optimized out>, write_ready=<optimized out>, force_close=<optimized out>) at daemon.c:1207
#11 0x00007f06e93ea3d8 in MHD_epoll (daemon=daemon@entry=0x1ffe520, may_block=may_block@entry=1) at daemon.c:4581
#12 0x00007f06e93eb2d7 in MHD_polling_thread (cls=0x1ffe520) at daemon.c:4746
#13 0x00007f06e6e9c17a in start_thread () from /lib64/libpthread.so.0
#14 0x00007f06e6225dc3 in clone () from /lib64/libc.so.6

{ode}
Comment by markus makela [ 2022-06-27 ]

The stacktraces seem to point to the causal_reads resultset code in readwritesplit.

Thread 17 (Thread 0x7f06d899e700 (LWP 4010866)):
#0  gwbuf_link_length (b=0x7f06bf257730, b=0x7f06bf257730) at /home/timofey_turenko_mariadb_com/MaxScale/include/maxscale/buffer.hh:157
#1  gwbuf_copy_data (buffer=0x7f06bf257730, buffer@entry=0x7f06bc119550, offset=9163541, offset@entry=57627426, bytes=bytes@entry=3, dest=dest@entry=0x7f06d8997f6d "\037") at /home/timofey_turenko_mariadb_com/MaxScale/server/core/buffer.cc:736
#2  0x00007f06c2de94ef in RWSplitSession::correct_packet_sequence (this=this@entry=0x7f06bc0fe960, buffer=buffer@entry=0x7f06bc119550) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/readwritesplit/rwsplit_causal_reads.cc:77
#3  0x00007f06c2dea339 in RWSplitSession::handle_causal_read_reply (this=this@entry=0x7f06bc0fe960, writebuf=writebuf@entry=0x7f06bc119550, reply=..., backend=backend@entry=0x7f06bc029e30) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/readwritesplit/rwsplit_causal_reads.cc:113
#4  0x00007f06c2ddc95e in RWSplitSession::clientReply (this=0x7f06bc0fe960, writebuf=0x7f06bc119550, down=..., reply=...) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/routing/readwritesplit/rwsplitsession.cc:495
#5  0x00007f06e92bda57 in ServiceEndpoint::clientReply (this=0x7f06bc100460, buffer=<optimized out>, down=..., reply=...) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/service.cc:1703
#6  0x00007f06e92b2ec6 in ServerEndpoint::clientReply (this=0x7f06bc0c1790, buffer=<optimized out>, down=..., reply=...) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/server.cc:1202
#7  0x00007f06e92edb1e in MariaDBBackendConnection::normal_read (this=0x7f06bc09c050) at /home/timofey_turenko_mariadb_com/MaxScale/server/modules/protocol/MariaDB/mariadb_backend.cc:762
#8  0x00007f06e9242194 in DCB::process_events (events=5, this=0x7f06bc0e4f70) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/dcb.cc:1311
#9  DCB::process_events (this=0x7f06bc0e4f70, events=5) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/dcb.cc:1245
#10 0x00007f06e9242291 in DCB::event_handler (dcb=0x7f06bc0e4f70, events=<optimized out>) at /home/timofey_turenko_mariadb_com/MaxScale/server/core/dcb.cc:1363
#11 0x00007f06e932e4a9 in maxbase::Worker::poll_waitevents (this=this@entry=0x1fa7b20) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:848
#12 0x00007f06e932e6bf in maxbase::Worker::run (this=0x1fa7b20, pSem=0x7ffc8a4f8b50) at /home/timofey_turenko_mariadb_com/MaxScale/maxutils/maxbase/src/worker.cc:554
#13 0x00007f06e81ebba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#14 0x00007f06e6e9c17a in start_thread () from /lib64/libpthread.so.0
#15 0x00007f06e6225dc3 in clone () from /lib64/libc.so.6

This was true for both cases which might mean that there's some inefficiency in the result processing code that doesn't scale. However, this would be evident immediately when the query is executed.

Comment by markus makela [ 2022-07-11 ]

Added code that should prevent the problem from occurring. The true root cause wasn't found but with this we'll know more about the problem.

Generated at Thu Feb 08 04:26:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.