[MXS-4777] Maxscale crash due to systemd timeout Created: 2023-09-25  Updated: 2023-09-29  Resolved: 2023-09-29

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 6.4.10
Fix Version/s: 6.4.11, 22.08.9, 23.02.5, 23.08.2

Type: Bug Priority: Major
Reporter: Bryan Bancroft (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 2
Labels: triage
Environment:

3 maxscale nodes behind ALB, aws vms



 Description   

This happened during a usage spike, ooms occurred on the underlying galera nodes during this time as well.

maxscale 1 lost sight to all db nodes

2023-09-25 13:19:18.614   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:19:32.314   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:23:00.946   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:23:16.987   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server at 'handshake: reading initial communication packet', system error: 110
2023-09-25 13:23:45.717   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:23:56.736   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:24:11.889   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:24:20.371   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:24:30.386   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:26:20.262   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:26:31.049   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:27:03.119   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:27:22.378   error  : (mxs_mysql_real_connect): Failed to set latin1 character set: Lost connection to server during query
2023-09-25 13:27:41.054   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:28:14.823   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:28:47.975   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:28:58.986   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB03' ([10.225.27.183]:3306): Lost connection to server during query
2023-09-25 13:29:09.953   error  : (mxs_mysql_real_connect): Failed to set latin1 character set: Lost connection to server during query
2023-09-25 13:32:34.385   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB04' ([10.225.27.118]:3306): Lost connection to server during query
2023-09-25 13:41:44.306   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB04' ([10.225.27.118]:3306): Lost connection to server during query
2023-09-25 13:42:05.403   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB04' ([10.225.27.118]:3306): Lost connection to server during query
2023-09-25 13:42:19.746   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB04' ([10.225.27.118]:3306): Lost connection to server during query
2023-09-25 13:42:30.649   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB04' ([10.225.27.118]:3306): Lost connection to server during query
2023-09-25 13:42:44.061   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB04' ([10.225.27.118]:3306): Lost connection to server during query
2023-09-25 13:43:09.442   error  : (mxs_mysql_real_connect): Failed to set latin1 character set: Lost connection to server during query
2023-09-25 13:43:18.010   error  : (mxs_mysql_real_connect): Failed to set latin1 character set: Lost connection to server during query
2023-09-25 13:43:18.027   notice : (log_state_change): Server changed state: L1LPVERGENTDB04[10.225.27.118:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2023-09-25 13:43:18.027   error  : (983789) [readwritesplit] (rwsplit-service); (handleError): Server 'L1LPVERGENTDB04' was lost in the middle of a resultset, cannot continue the session: #HY000: Lost connection to backend server: connection closed by peer (L1LPVERGENTDB04: Generated event)
2023-09-25 13:43:22.377   notice : (log_state_change): Server changed state: L1LPVERGENTDB04[10.225.27.118:3306]: slave_up. [Down] -> [Slave, Synced, Running]
2023-09-25 14:07:29.827   error  : (mon_report_query_error): Failed to execute query on server 'L1LPVERGENTDB04' ([10.225.27.118]:3306): Lost connection to server during query
2023-09-25 14:08:10.003   notice : (log_state_change): Server changed state: L1LPVERGENTDB04[10.225.27.118:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2023-09-25 14:09:06.851   error  : (985612) (L1LPVERGENTDB03); (socket_write): Write to Backend DCB 10.225.27.183 in state DCB::State::POLLING failed: 104, Connection reset by peer
2023-09-25 14:09:30.579   error  : [galeramon] (post_tick): There are no cluster members

Maxscales 2 and 3 became unresponsive and hit a signal 6

2023-09-25 13:24:34.499   notice : (log_state_change): Server changed state: L1LPVERGENTDB03[10.225.27.183:3306]: new_slave. [Running] -> [Slave, Synced, Running]
2023-09-25 13:25:29.198   warning: (notify_systemd_watchdog): Thread 'Worker-01' has not reported back in 30 seconds.
2023-09-25 13:25:59.198   warning: (notify_systemd_watchdog): Thread 'Worker-01' has not reported back in 30 seconds.
alert  : MaxScale 6.4.10 received fatal signal 6. Commit ID: e21972b152407ccc41a5c835c91774578c64c472 System name: Linux Release string: PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
 
 
2023-09-25 13:25:59.268   alert  : (sigfatal_handler): MaxScale 6.4.10 received fatal signal 6. Commit ID: e21972b152407ccc41a5c835c91774578c64c472 System name: Linux Release string: PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
2023-09-25 13:25:59.268   alert  : (sigfatal_handler): Statement currently being classified: none/unknown
2023-09-25 13:25:59.268   notice : (sigfatal_handler): For a more detailed stacktrace, install GDB and add 'debug=gdb-stacktrace' under the [maxscale] section.
  /lib/x86_64-linux-gnu/libc.so.6(epoll_wait+0x56) [0x7f0e0895dd56]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x104) [0x7f0e09412b34]: <binutils not installed>
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x4f) [0x7f0e09412e9f]: <binutils not installed>
  /usr/bin/maxscale(main+0x1c87) [0x55bb71fb9fe7]: <binutils not installed>
  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7f0e08885d0a]: <binutils not installed>
  /usr/bin/maxscale(_start+0x2a) [0x55bb71fba5ea]: <binutils not installed>
alert  : Writing core dump.
 
 
MariaDB MaxScale  /var/log/maxscale/maxscale.log  Mon Sep 25 14:26:07 2023
----------------------------------------------------------------------------
2023-09-25 14:26:07   notice : (WatchdogNotifier): The systemd watchdog is Enabled. Internal timeout = 30s

We are trying to determine why maxscale would become unresponsive after issues on 1 maxscale node and OOMs on the monitored DBs



 Comments   
Comment by Rick Pizzi [ 2023-09-26 ]

see attached

Comment by markus makela [ 2023-09-27 ]

More often than not the reason for a systemd watchdog timeout is that the name servers are slow. Currently the default behavior is that the systemd watchdog is allowed to kill the MaxScale process if the reverse name lookups are taking too long. I think that this isn't ideal and while there is a way to prevent the slowness (skip_name_resolve=true in MaxScale), not using it should not kill the process and should instead just slow the system down.

Generated at Thu Feb 08 04:31:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.