[MXS-4465] Maxscale is killed by SystemD watchdog Created: 2022-12-29  Updated: 2023-03-17  Resolved: 2023-03-17

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: 6.4.4
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Anton Assignee: markus makela
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: Text File dump_output.log     Text File gdb_output.log     Text File mx01_output.log    
Sprint: MXS-SPRINT-174, MXS-SPRINT-175, MXS-SPRINT-176, MXS-SPRINT-177

 Description   

Hello,
Today a MaxScale instance failed; this has happened a few times when the analytics team tried to get data from the database via MaxScale.

It looks like the watchdog killed the service.
The service failed with this error:

2022-12-29 13:18:52   alert  : MaxScale 6.4.4 received fatal signal 6. Commit ID: 1bca350ddb85828ae771b96dc7381cba36e48493 System name: Linux Release string: NAME="CentOS Stream"
2022-12-29 13:18:52   alert  : Statement currently being classified: none/unknown
2022-12-29 13:18:52   notice : For a more detailed stacktrace, install GDB and add 'debug=gdb-stacktrace' under the [maxscale] section.
  /lib64/libpthread.so.0(+0x10bd6): sem_wait.c:?
  /lib64/libpthread.so.0(+0x10cc8): sem_wait.c:?
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale13RoutingWorker20execute_concurrentlyERKSt8functionIFvvEE+0x52): maxutils/maxbase/include/maxbase/semaphore.hh:146
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(+0x1c0bd3): /usr/include/c++/8/bits/std_function.h:256
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker14handle_messageERNS_12MessageQueueERKNS_19MessageQueueMessageE+0xa9): maxutils/maxbase/src/worker.cc:474
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase12MessageQueue18handle_poll_eventsEPNS_6WorkerEj+0x98): maxutils/maxbase/src/messagequeue.cc:307
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x219): maxutils/maxbase/src/worker.cc:857
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x4f): maxutils/maxbase/src/worker.cc:558
  /usr/bin/maxscale(main+0x1f8a): server/core/gateway.cc:2237
  /lib64/libc.so.6(__libc_start_main+0xf3): ??:?
  /usr/bin/maxscale(_start+0x2e): ??:?
alert  : Writing core dump.

I've uploaded MXS-4465.zip to the FTP server.

  • coredump
  • config cnf
  • maxscale.log


 Comments   
Comment by Naresh Chandra [ 2022-12-29 ]

We got the same issue in MaxScale 22.08.3 as well. It crashed with signal 6, and the OS is CentOS 7.

Comment by Johan Wikman [ 2022-12-30 ]

A core file is great for solving this. But could you be a bit more specific about where you have uploaded the zip?

Br,
Johan

Comment by Anton [ 2022-12-30 ]

johan.wikman hello

I uploaded archive to ftp.mariadb.org/private/

Comment by markus makela [ 2023-01-03 ]

What OS are you using? It looks like some RHEL variant (RHEL 8?) but I'd need to know the exact version you use to be able to analyze that coredump.

Comment by markus makela [ 2023-01-03 ]

Would it also be possible for you to capture the output of the following command on the system where the coredump was generated?

gdb --core core.maxscale.989.c9b20302a271406abc2b1027e90b1c4f.3348563.1672319933000000 -batch -ex 'thr a a bt' /usr/bin/maxscale

This would speed up our work on the problem by providing a better stacktrace. Reproducing your exact environment isn't always possible, which leads to less accurate results when we're looking at the coredump.

Comment by markus makela [ 2023-01-03 ]

Nevermind about that core dump, it seems the shell output contains the full stacktrace. However, the GDB output from the command above would still be useful to have.

Comment by markus makela [ 2023-01-03 ]

Is it possible for you to provide the stacktraces from the other coredumps that are still present? Having more stacktraces would help us figure out what might be causing this.

Comment by Anton [ 2023-01-09 ]

Hello markus makela

I've attached the dump output you requested: dump_output.log

Yes, OS is CentOS 8

NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"

Comment by Anton [ 2023-01-09 ]

I've also provided output from another dump: mx01_output.log

Comment by markus makela [ 2023-01-09 ]

Do you have any other coredumps available on this system? Seeing the GDB output from them would be very helpful.

Comment by Anton [ 2023-01-09 ]

markus makela
I have 3 additional core dumps; I've already provided the output from one of them (mx01_output.log).

Do you need the dumps themselves, or just the GDB output (using the command you sent)?

Comment by markus makela [ 2023-01-09 ]

Just the GDB outputs; they'll tell us what the other threads that the main MaxScale thread is waiting for are doing. So far they seem to be in very odd places that should not trigger this behavior.

Is the system under very heavy load or is it inside some virtualized environment?

Comment by Anton [ 2023-01-09 ]

markus makela
MaxScale is deployed in a VM environment (KVM).
All previous crashes happened under heavy load, when analytics tried to get a lot of data from the database via MaxScale.

The last dumps seem to point to a different problem; the GDB output shows a problem with memory access.
On the monitoring graphs I can see that the host had enough free memory at the moment of the crash.

[root@inf-maxscale-alpha-01 ~]# coredumpctl info 3735483
           PID: 3735483 (maxscale)
           UID: 989 (maxscale)
           GID: 986 (maxscale)
        Signal: 6 (ABRT)
     Timestamp: Sat 2023-01-07 10:24:37 UTC (2 days ago)
  Command Line: /usr/bin/maxscale
    Executable: /usr/bin/maxscale
 Control Group: /system.slice/maxscale.service
          Unit: maxscale.service
         Slice: system.slice
       Boot ID: c433f52783584d7eaa1cc40193c1c871
    Machine ID: 24f3e89667754facb05c533589541ed0
      Hostname: inf-maxscale-alpha-01
       Storage: /var/lib/systemd/coredump/core.maxscale.989.c433f52783584d7eaa1cc40193c1c871.3735483.1673087077000000.lz4 (truncated)
       Message: Process 3735483 (maxscale) of user 989 dumped core.
 
                Stack trace of thread 3735483:
                #0  0x00007fa770e88bd6 n/a (n/a)
 
[root@inf-maxscale-alpha-02 ~]# coredumpctl info 3893490
           PID: 3893490 (maxscale)
           UID: 989 (maxscale)
           GID: 986 (maxscale)
        Signal: 6 (ABRT)
     Timestamp: Mon 2023-01-09 00:56:43 UTC (13h ago)
  Command Line: /usr/bin/maxscale
    Executable: /usr/bin/maxscale
 Control Group: /system.slice/maxscale.service
          Unit: maxscale.service
         Slice: system.slice
       Boot ID: c9b20302a271406abc2b1027e90b1c4f
    Machine ID: d7d9b429b68642e0b2e7332c5094a716
      Hostname: inf-maxscale-alpha-02
       Storage: /var/lib/systemd/coredump/core.maxscale.989.c9b20302a271406abc2b1027e90b1c4f.3893490.1673225803000000.lz4 (truncated)
       Message: Process 3893490 (maxscale) of user 989 dumped core.
 
                Stack trace of thread 3893490:
                #0  0x00007f5de871cbd6 n/a (n/a)

The failures came with the following errors:

2023-01-07 10:24:35   alert  : MaxScale 6.4.4 received fatal signal 6. Commit ID: 1bca350ddb85828ae771b96dc7381cba36e48493 System name: Linux Release string: NAME="CentOS Stream"
2023-01-07 10:24:35   alert  : Statement currently being classified: none/unknown
2023-01-07 10:24:35   notice : For a more detailed stacktrace, install GDB and add 'debug=gdb-stacktrace' under the [maxscale] section.
  /lib64/libpthread.so.0(+0x10bd6): sem_wait.c:?
  /lib64/libpthread.so.0(+0x10cc8): sem_wait.c:?
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale13RoutingWorker20execute_concurrentlyERKSt8functionIFvvEE+0x52): maxutils/maxbase/include/maxbase/semaphore.hh:146
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(+0x1c0bd3): /usr/include/c++/8/bits/std_function.h:256
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker14handle_messageERNS_12MessageQueueERKNS_19MessageQueueMessageE+0xa9): maxutils/maxbase/src/worker.cc:474
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase12MessageQueue18handle_poll_eventsEPNS_6WorkerEj+0x98): maxutils/maxbase/src/messagequeue.cc:307
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x219): maxutils/maxbase/src/worker.cc:857
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x4f): maxutils/maxbase/src/worker.cc:558
  /usr/bin/maxscale(main+0x1f8a): server/core/gateway.cc:2237
  /lib64/libc.so.6(__libc_start_main+0xf3): ??:?
  /usr/bin/maxscale(_start+0x2e): ??:?
alert  : Writing core dump.
 
2023-01-09 00:56:41   notice : For a more detailed stacktrace, install GDB and add 'debug=gdb-stacktrace' under the [maxscale] section.
  /lib64/libpthread.so.0(+0x10bd6): sem_wait.c:?
  /lib64/libpthread.so.0(+0x10cc8): sem_wait.c:?
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN8maxscale13RoutingWorker20execute_concurrentlyERKSt8functionIFvvEE+0x52): maxutils/maxbase/include/maxbase/semaphore.hh:146
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(+0x1c0bd3): /usr/include/c++/8/bits/std_function.h:256
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker14handle_messageERNS_12MessageQueueERKNS_19MessageQueueMessageE+0xa9): maxutils/maxbase/src/worker.cc:474
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase12MessageQueue18handle_poll_eventsEPNS_6WorkerEj+0x98): maxutils/maxbase/src/messagequeue.cc:307
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x219): maxutils/maxbase/src/worker.cc:857
  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x4f): maxutils/maxbase/src/worker.cc:558
  /usr/bin/maxscale(main+0x1f8a): server/core/gateway.cc:2237
  /lib64/libc.so.6(__libc_start_main+0xf3): ??:?
  /usr/bin/maxscale(_start+0x2e): ??:?
alert  : Writing core dump.

I've attached GDB output of these dumps gdb_output.log

Comment by markus makela [ 2023-01-09 ]

Those coredumps appear to be truncated (at least based on the output), which explains why they seem corrupt.
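If future cores keep getting truncated, raising systemd-coredump's size limits should help. A minimal sketch of /etc/systemd/coredump.conf, assuming systemd-coredump is handling the dumps (as the Storage: paths above suggest); the 8G values are illustrative, not a recommendation:

```
[Coredump]
# Raise the per-process and on-disk size limits so large MaxScale
# cores are stored whole instead of being truncated.
ProcessSizeMax=8G
ExternalSizeMax=8G
```

A restart of systemd-coredump (or the next dump) picks up the new limits.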

Comment by markus makela [ 2023-01-09 ]

This appears to be an exception where it really does take over 60 seconds for MaxScale to process a batch of results. To verify that this is indeed the case, can you edit /lib/systemd/system/maxscale.service and change the WatchdogSec=60s to WatchdogSec=180s? You'll probably need to restart MaxScale for this to take effect.
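As a side note, the same change can be made as a drop-in override instead of editing the packaged unit file, so a package upgrade doesn't revert it. A minimal sketch:

```
# /etc/systemd/system/maxscale.service.d/watchdog.conf
[Service]
WatchdogSec=180s
```

After creating the file, run `systemctl daemon-reload` and restart MaxScale for the new timeout to take effect.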

Comment by Anton [ 2023-01-10 ]

I changed the watchdog timeout to 3 minutes.
I'll provide more information if a new crash appears.

Comment by Anton [ 2023-01-20 ]

The problem hasn't appeared again.
Ten days have passed since I changed the watchdog timeout.

Comment by Johan Wikman [ 2023-01-23 ]

Thank you, that's quite valuable information.

Comment by Johan Wikman [ 2023-01-23 ]

The message queue used for sending messages between threads is built on top of pipe(2). Up until 6.4.4 the pipe was created with O_DIRECT, which was quite unnecessary, as all our messages are of the same size. We then found that the use of O_DIRECT significantly increases memory consumption and thus dramatically reduces the number of messages that simultaneously fit in the pipe, so in 6.4.5 we dropped it.

It would be quite informative if you could try out 6.4.5 with the original watchdog timeout and see whether the watchdog is still triggered. It's not out of the question that the use of O_DIRECT could have had other effects as well.

Comment by markus makela [ 2023-01-23 ]

Would it be possible for you to include any separate configuration files for the filters that you are using (dbfwfilter, masking, etc.)?

Comment by markus makela [ 2023-01-25 ]

This is possibly caused by the same mechanism that causes the debug assertion in MXS-4461 to be hit. However, MXS-4461 only affects 22.08, so it doesn't explain the behavior in 6.4.

Comment by markus makela [ 2023-02-08 ]

I'll reopen this until we get some feedback.

Comment by markus makela [ 2023-02-09 ]

Is it possible for you to test this without the dbfwfilter and masking filters? It could be that the processing they cause ends up slowing the system down enough to make it halt. It's also possible, if the analytics queries are very large, that any filters using regular expressions might slow down the system long enough for the watchdog to kill the process.

One of the watchdog timeouts happened on a service that has filters=filter_sc_column_masking. Is it possible for you to provide the rule file located at /etc/maxscale.d/filter_sc_column_masking? It might give clues to us as to why this is happening.

Comment by markus makela [ 2023-02-20 ]

Any updates?

Comment by Anton [ 2023-02-20 ]

Hi markus makela

Sorry, I missed your messages; I assumed the ticket had been closed.

I'll try updating to 6.4.5 and restoring the original watchdog timeout.
I'll also try running without the filters.

I'll provide results soon

Comment by markus makela [ 2023-02-20 ]

Thank you very much. It's important to find out the root cause for this as increasing the watchdog timeout is likely to just postpone a future timeout.

Comment by markus makela [ 2023-03-06 ]

I'll move this into the Needs Feedback state.

Comment by Anton [ 2023-03-17 ]

Hi markus makela

I updated to 6.4.5 and returned the watchdog timeout back to 60 seconds.
I tried to reproduce the crash for a few days, but could not.
It seems to be fixed. Thanks!

I should note that I didn't turn off the filters (they were active the whole time).

Comment by markus makela [ 2023-03-17 ]

Since we weren't able to reproduce this, I'll close it as Cannot Reproduce.

Generated at Thu Feb 08 04:28:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.