  MariaDB MaxScale / MXS-5063

Maxscale crash - "terminate called after throwing an instance of 'std::bad_alloc'"

Details

    Description

      MaxScale crashed outside of the company's working hours; the outage was handled by the VRRP / standby instance. We don't have GDB or anything else installed on this server, so this is all that's available right now.

      terminate called after throwing an instance of 'std::bad_alloc'
        what():  std::bad_alloc
      2024-04-14 20:37:42   alert  : (35462108) (SS); MaxScale 23.08.5 received fatal signal 6. Commit ID: 4bb6b679d2be75ba84226baea656db7a0f330eb4, System name: Linux, Release string: CentOS Linux 7 (Core), Thread: Worker-09
      2024-04-14 20:37:42   alert  : (35462108) (SS); Statement currently being classified: none/unknown
      2024-04-14 20:37:42   alert  : (35462108) (SS); Session: 35462108 Service: SS
      2024-04-14 20:37:42   notice : (35462108) (SS); For a more detailed stacktrace, install GDB and add 'debug=gdb-stacktrace' under the [maxscale] section.
        /lib64/libc.so.6(gsignal+0x37) [0x7f95046b8387]: <binutils not installed>
        /lib64/libc.so.6(abort+0x148) [0x7f95046b9a78]: <binutils not installed>
        /lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165) [0x7f9505913a95]: <binutils not installed>
        /lib64/libstdc++.so.6(+0x5ea06) [0x7f9505911a06]: <binutils not installed>
        /lib64/libstdc++.so.6(+0x5ea33) [0x7f9505911a33]: <binutils not installed>
        /lib64/libstdc++.so.6(+0x5ec53) [0x7f9505911c53]: <binutils not installed>
        /lib64/libstdc++.so.6(_Znwm+0x7d) [0x7f95059121ed]: <binutils not installed>
        /lib64/libstdc++.so.6(_Znam+0x9) [0x7f9505912289]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN10SHARED_BUFC1Em+0x15) [0x7f95078abc45]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN5GWBUF16prepare_to_writeEm+0x1b8) [0x7f95078abf48]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN5GWBUF6appendEPKhm+0x32) [0x7f95078abfd2]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN5GWBUF10merge_backEOS_+0x34) [0x7f95078ac324]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN3DCB13writeq_appendEO5GWBUF+0x18) [0x7f9507913968]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN23MariaDBClientConnection11clientReplyEO5GWBUFRKN8maxscale10ReplyRouteERKNS2_5ReplyE+0xa8) [0x7f9507a43fe8]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN15ServiceEndpoint13send_upstreamEO5GWBUFRKN8maxscale10ReplyRouteERKNS2_5ReplyE+0x31) [0x7f95079d3911]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN15ServiceEndpoint15ServiceUpstream11clientReplyEO5GWBUFRKN8maxscale10ReplyRouteERKNS3_5ReplyE+0xd) [0x7f95079e637d]: <binutils not installed>
        /usr/lib64/maxscale/libreadwritesplit.so(_ZN14RWSplitSession11clientReplyEO5GWBUFRKN8maxscale10ReplyRouteERKNS2_5ReplyE+0x5a5) [0x7f9507f51db5]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN15ServiceEndpoint11clientReplyEO5GWBUFRKN8maxscale10ReplyRouteERKNS2_5ReplyE+0x99) [0x7f95079d2be9]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN14ServerEndpoint11clientReplyEO5GWBUFRKN8maxscale10ReplyRouteERKNS2_5ReplyE+0xf4) [0x7f95079bddd4]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN24MariaDBBackendConnection11normal_readEv+0xdc9) [0x7f9507a3cd89]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN24MariaDBBackendConnection17ready_for_readingEP3DCB+0x140) [0x7f9507a405f0]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN3DCB14process_eventsEj+0x1f5) [0x7f95079150d5]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN3DCB13event_handlerEj+0x21) [0x7f9507915331]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN3DCB13FakeEventTask7executeERN7maxbase6WorkerE+0xb6) [0x7f9507915996]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker14handle_messageERNS_12MessageQueueERKNS_19MessageQueueMessageE+0xe1) [0x7f9507a8d651]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase17EventMessageQueue18handle_poll_eventsEPNS_6WorkerEjNS_8Pollable7ContextE+0x73) [0x7f9507a930e3]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker14deliver_eventsEPNS_8PollableEjNS1_7ContextE+0x23) [0x7f9507a90113]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x35f) [0x7f9507a9062f]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x53) [0x7f9507a90993]: <binutils not installed>
        /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(+0x4977c0) [0x7f9507b5f7c0]: <binutils not installed>
        /lib64/libpthread.so.0(+0x7ea5) [0x7f950667aea5]: <binutils not installed>
        /lib64/libc.so.6(clone+0x6d) [0x7f9504780b0d]: <binutils not installed>
      2024-04-14 20:37:42   notice : (35462108) (SS); GDB not found, attempting to dump stacktraces from all threads using internal profiler...
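
      For what it's worth, the log above already points at how to get a more useful trace next time: install GDB and binutils (the "<binutils not installed>" markers come from the latter being absent) and enable the gdb-stacktrace debug option. A minimal sketch of the [maxscale] section change, as suggested by the notice line in the log:

        # assumes gdb is installed, e.g. via the distribution's package manager
        [maxscale]
        debug=gdb-stacktrace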
      

      Attachments

        1. Before crash.png (46 kB)
        2. image-2024-04-15-10-56-15-270.png (62 kB)
        3. image-2024-04-15-12-08-50-654.png (44 kB)
        4. maxscale_config.txt (1 kB)
        5. maxscale02.png (57 kB)
        6. memleak.log (137 kB)
        7. memleak-20-mins.log (1.64 MB)
        8. memleak-during-big-query.log (1.63 MB)
        9. memory-spike-after-bypassing-maxscale.png (115 kB)
        10. pmm-memory-free-graph-12hrs.png (45 kB)
        11. screenshot-1.png (55 kB)
        12. screenshot-2.png (71 kB)

        Activity

          Phil Sumner added a comment - edited

          This graph shows the memory spikes before and after I switched one of these table sync queries to bypass MaxScale late on the 14th. One of the two servers is still running these queries and I'll run memleak for the whole duration when it next runs at 08:30 GMT. The spikes are still there but much reduced, so it definitely looks like these queries are the cause.

          This PMM graph shows the same pattern, but a gradual drop in free memory is also clear; at this point it's not at concerning levels and might level off over time.

          I've attached memleak-during-big-query.log, which was running as this query started. I watched the MaxScale server's free memory graph drop as soon as it connected, and for some reason (I think a network issue) the query failed, so the free memory recovered after a few minutes rather than after the usual half hour.
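
          For reference, the attached memleak logs look like output from the bcc-tools memleak script (the "N bytes in M allocations from stack" format in the next comment matches it). A run against the MaxScale process would have been something along these lines; the path, interval and options here are illustrative, not taken from this ticket:

            /usr/share/bcc/tools/memleak -p $(pidof maxscale) 30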

          markus makela added a comment -

          Yes, I think this pretty much confirms that it's a resultset that's being fully read from the network:

          	501612544 bytes in 7654 allocations from stack
          		operator new(unsigned long)+0x1d [libstdc++.so.6.0.19]
          		GWBUF::prepare_to_write(unsigned long)+0x78 [libmaxscale-common.so.1.0.0]
          		DCB::socket_read(unsigned long, DCB::ReadLimit)+0x42 [libmaxscale-common.so.1.0.0]
          		DCB::read_impl(unsigned long, unsigned long, DCB::ReadLimit)+0x41b [libmaxscale-common.so.1.0.0]
          		DCB::read(unsigned long, unsigned long)+0x15 [libmaxscale-common.so.1.0.0]
          		MariaDBBackendConnection::normal_read()+0xa0 [libmaxscale-common.so.1.0.0]
          		MariaDBBackendConnection::ready_for_reading(DCB*)+0x140 [libmaxscale-common.so.1.0.0]
          		DCB::process_events(unsigned int)+0x1f5 [libmaxscale-common.so.1.0.0]
          		DCB::event_handler(unsigned int)+0x21 [libmaxscale-common.so.1.0.0]
          		maxbase::Worker::deliver_events(maxbase::Pollable*, unsigned int, maxbase::Pollable::Context)+0x23 [libmaxscale-common.so.1.0.0]
          		maxbase::Worker::poll_waitevents()+0x35f [libmaxscale-common.so.1.0.0]
          		maxbase::Worker::run(maxbase::Semaphore*)+0x53 [libmaxscale-common.so.1.0.0]
          		execute_native_thread_routine+0x10 [libmaxscale-common.so.1.0.0]
          		[unknown]
          		std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(maxbase::Worker*, maxbase::Semaphore*), maxbase::Worker*, maxbase::Semaphore*> > >::~_State_impl()+0x0 [libmaxscale-common.so.1.0.0]
          		[unknown]
          

          That's just shy of 500MiB of memory and the size of the spike is pretty close to that.
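
          (As a quick sanity check on those numbers: 501612544 bytes across 7654 allocations works out to exactly 65536 bytes, i.e. 64 KiB, per allocation, which lines up with the default writeq_high_water read cap discussed in the later comments.)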

          markus makela added a comment -

          OK, I managed to reproduce it by just running a query that returns a very large resultset and reading it very slowly. This does not always trigger the problem so I suspect there might be an underlying bug that was simply revealed by the scheduling fairness change.
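
          For anyone else trying to reproduce this, a minimal client-side sketch along these lines exercises the same pattern: request a large resultset through MaxScale and drain it very slowly, so MaxScale reads from the backend faster than the client consumes the rows. The host, port, credentials and table name are placeholders, not values from this ticket:

            import time
            import pymysql
            import pymysql.cursors

            # Connect through the MaxScale listener (placeholder address and credentials).
            conn = pymysql.connect(host="maxscale-host", port=4006,
                                   user="app", password="secret", database="test")
            try:
                # SSCursor streams rows instead of buffering the whole resultset client-side.
                with conn.cursor(pymysql.cursors.SSCursor) as cur:
                    cur.execute("SELECT * FROM big_table")  # very large resultset
                    while cur.fetchone() is not None:
                        time.sleep(0.01)                    # drain it very slowly
            finally:
                conn.close()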

          markus makela added a comment -

          I believe that the culprit is indeed this commit:

          commit daa7c34f43523ca642ec2eb625f15d02f1f72592
          Author: Markus Mäkelä <markus.makela@mariadb.com>
          Date:   Wed Jan 10 10:51:54 2024 +0200
           
              MXS-4924: Throttle fake events
              
              Once two fake events have been generated by the same DCB, the third one
              will be sent via the event loop. This will prevent a single DCB from
              repeatedly triggering fake events.
          

          The code in 22.08 and 6.4 never limits the amount of data read from the network and always empties the socket. I do believe that it is less likely for this to happen in 6.4 than it is in 23.02, where the network reads are capped to writeq_high_water, which by default is 64KiB.

          markus makela added a comment - edited

          I think that increasing the value of writeq_high_water to be larger than the TCP network buffer size could help avoid this happening as often. The default 64KiB limit in 22.08 and all newer versions, combined with the limited reads in 23.02, will cause this bug to be hit with all larger SELECTs. Increasing it to the default in 6.4 (writeq_high_water=16MiB) may help avoid it, but it still won't prevent it from happening.
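
          As a concrete sketch of that mitigation, assuming the parameter is set in the global [maxscale] section and using the 6.4 default value quoted above:

            [maxscale]
            # value as quoted in the comment above (the 6.4 default)
            writeq_high_water=16MiB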


          People

            Assignee: markus makela
            Reporter: Phil Sumner
            Votes: 0
            Watchers: 3

