  1. MariaDB MaxScale
  2. MXS-5300

MaxScale aborts when a server is put into maintenance

Details

    • Type: Bug
    • Status: Needs Feedback
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 22.08.8
    • None
    • None
    • Environment:
       - 2 MaxScale 22.08.8 servers, `dbproxy1` and `dbproxy2`, sharing a VIP via `keepalived`, with the binlog router enabled for external replication
       - 3 MariaDB 10.6.12-7 servers, `db1`, `db2`, and `db3`, with replication managed by MaxScale
    • MXS-SPRINT-219

    Description

      After a third server in a cluster is placed into maintenance mode, the MaxScale process is terminated with signal 6 (SIGABRT). The abort does not occur every time and is generally seen on systems with more connections.

      This is a critical bug that the customer's management is well aware of; it has occurred frequently during patching for some time. They would like a custom build of MaxScale with additional debugging for abort signals.

      Configuration and log files from the time of the incidents for MaxScale, MariaDB, and keepalived on all systems are attached in a hidden comment. The MaxScale crashes happen around 2024-09-24 22:48:20.

      Timeline of issue on 2024-09-24 (UTC):

      1. db2 is put into maintenance mode, wait for connections to drain - 21:42:36
      2. db2 has its OS patched and is rebooted - 22:02:03
      3. db2 taken out of maintenance mode - 22:19:54
      4. dbproxy1 has its OS patched and is rebooted - 22:23:39
      5. db3 is put into maintenance mode in preparation for patching - 22:30:20
      6. db1 and db2 start reporting errors reading communication packets - 22:36:40
      7. dbproxy1 aborts with the log entries below; keepalived does not switch over - 22:48:20

      2024-09-24 22:47:50   warning: Thread 'Worker-08' has not reported back in 30 seconds.
      2024-09-24 22:48:20   warning: Thread 'Worker-10' has not reported back in 30 seconds.
      2024-09-24 22:48:20   warning: Thread 'Worker-08' has not reported back in 30 seconds.
      2024-09-24 22:48:20   warning: Thread 'Worker-26' has not reported back in 30 seconds.
      2024-09-24 22:48:20   warning: Thread 'Worker-27' has not reported back in 30 seconds.
      2024-09-24 22:48:20   warning: Thread 'Worker-28' has not reported back in 30 seconds.
      8165099:alert  : MaxScale 22.08.8 received fatal signal 6. Commit ID: 2f16a515391ac530a7280334dff5334f489d884e System name: Linux Release string: Ubuntu 20.04.6 LTS
       
       
      26	../sysdeps/unix/sysv/linux/read.c: No such file or directory.
       
      
      

      Running `systemctl restart maxscale` brings MaxScale back up without issue.
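When comparing incidents, it may help to summarize which worker threads stalled before the abort. The following is a minimal sketch, assuming the warning lines keep exactly the format shown in the log excerpt above. The names `STALL_RE`, `summarize_stalls`, and `SAMPLE` are hypothetical helpers for illustration and are not part of MaxScale.

```python
import re
from collections import defaultdict

# Assumed format, taken from the excerpt in this report:
#   <date> <time>   warning: Thread '<name>' has not reported back in <N> seconds.
STALL_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+warning:\s+"
    r"Thread '(?P<thread>[^']+)' has not reported back in (?P<secs>\d+) seconds\."
)


def summarize_stalls(lines):
    """Map each thread name to the timestamps of its stall warnings."""
    stalls = defaultdict(list)
    for line in lines:
        match = STALL_RE.match(line)
        if match:  # non-matching lines (e.g. the alert line) are skipped
            stalls[match.group("thread")].append(match.group("ts"))
    return dict(stalls)


# Log lines copied verbatim from the excerpt above.
SAMPLE = """\
2024-09-24 22:47:50   warning: Thread 'Worker-08' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-10' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-08' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-26' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-27' has not reported back in 30 seconds.
2024-09-24 22:48:20   warning: Thread 'Worker-28' has not reported back in 30 seconds.
8165099:alert  : MaxScale 22.08.8 received fatal signal 6. Commit ID: 2f16a515391ac530a7280334dff5334f489d884e System name: Linux Release string: Ubuntu 20.04.6 LTS
"""

if __name__ == "__main__":
    summary = summarize_stalls(SAMPLE.splitlines())
    for thread, stamps in sorted(summary.items()):
        print(f"{thread}: {len(stamps)} warning(s), first at {stamps[0]}")
```

To run this against a real log, pass an open file object to `summarize_stalls` instead of `SAMPLE.splitlines()`. The log location depends on the installation, so no path is assumed here.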

            People

              Assignee: markus makela
              Reporter: Paul Rothrock (Paul.rothrock@mariadb.com)