MariaDB Server / MDEV-36204

Galera cluster gets stuck in the --wsrep_recover phase

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version: 11.4.5
    • Fix Version: None
    • Components: Galera, Server
    • Labels: None
    • Environment: Ubuntu 24.04.2 LTS
      Hardware: Pro WS 665-ACE, AMD Ryzen 9 7950X3D, 192Gb RAM

    Description

      One of three Galera nodes doesn't start after a reboot: it gets stuck in the --wsrep_recover phase. If I run the recovery manually and send SIGABRT after some time, I get this log:

      # /usr/sbin/mariadbd --user=mysql --wsrep_recover --disable-log-error
      2025-03-02 18:03:24 0 [Note] slave_connections_needed_for_purge changed to 0 because of Galera. Change it to 1 or higher if this Galera node is also Master in a normal replication setup
      2025-03-02 18:03:24 0 [Note] Starting MariaDB 11.4.5-MariaDB-ubu2404-log source revision 0771110266ff5c04216af4bf1243c65f8c67ccf4 server_uid scRVhtEliqUIegje+r6p56/AWgY= as process 164899
      2025-03-02 18:03:24 0 [Note] InnoDB: Compressed tables use zlib 1.3
      2025-03-02 18:03:24 0 [Note] InnoDB: Number of transaction pools: 1
      2025-03-02 18:03:24 0 [Note] InnoDB: Using AVX512 instructions
      2025-03-02 18:03:24 0 [Note] InnoDB: Using liburing
      2025-03-02 18:03:24 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000GiB, chunk size = 2.000GiB
      2025-03-02 18:03:24 0 [Note] InnoDB: Completed initialization of buffer pool
      2025-03-02 18:03:24 0 [Note] InnoDB: Buffered log writes (block size=4096 bytes)
      2025-03-02 18:03:24 0 [Note] InnoDB: End of log at LSN=4996688328739
      2025-03-02 18:03:24 0 [Note] InnoDB: Resizing redo log from 2.000GiB to 16.000GiB; LSN=4996688328739
      2025-03-02 18:03:24 0 [Note] InnoDB: Buffered log writes (block size=4096 bytes)
      2025-03-02 18:03:24 0 [Note] InnoDB: Opened 3 undo tablespaces
      2025-03-02 18:03:24 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
      2025-03-02 18:03:24 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
      2025-03-02 18:03:24 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
      2025-03-02 18:03:24 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
      2025-03-02 18:03:24 0 [Note] InnoDB: log sequence number 4996688328739; transaction id 4889635920
      2025-03-02 18:03:24 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
      2025-03-02 18:03:24 0 [Note] Plugin 'FEEDBACK' is disabled.
      250302 18:13:51 [ERROR] /usr/sbin/mariadbd got signal 6 ;
      Sorry, we probably made a mistake, and this is a bug.
       
      Your assistance in bug reporting will enable us to fix this for the next release.
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs about how to report
      a bug on https://jira.mariadb.org/.
       
      Please include the information from the server start above, to the end of the
      information below.
       
      Server version: 11.4.5-MariaDB-ubu2404-log source revision: 0771110266ff5c04216af4bf1243c65f8c67ccf4
       
      The information page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/
      contains instructions to obtain a better version of the backtrace below.
      Following these instructions will help MariaDB developers provide a fix quicker.
       
      Attempting backtrace. Include this in the bug report.
      (note: Retrieving this information may fail)
       
      Thread pointer: 0x0
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mariadbd(my_print_stacktrace+0x30)[0x5a36b15b81b0]
      /usr/sbin/mariadbd(handle_fatal_signal+0x2a1)[0x5a36b1151fa1]
      /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x76eefa445330]
      /lib/x86_64-linux-gnu/libc.so.6(+0x98d71)[0x76eefa498d71]
      /lib/x86_64-linux-gnu/libc.so.6(pthread_cond_wait+0x20d)[0x76eefa49b7ed]
      /usr/sbin/mariadbd(psi_cond_wait+0x4b)[0x5a36b0ca48fd]
      /usr/sbin/mariadbd(+0x5f0c75)[0x5a36b0c5bc75]
      /usr/sbin/mariadbd(+0x6b66a8)[0x5a36b0d216a8]
      /usr/sbin/mariadbd(_Z11mysqld_mainiPPc+0xf79)[0x5a36b0d24659]
      /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x76eefa42a1ca]
      /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x76eefa42a28b]
      /usr/sbin/mariadbd(_start+0x25)[0x5a36b0cfe035]
      Writing a core file...
      Working directory at /var/lib/mysql
      Resource Limits (excludes unlimited resources):
      Limit                     Soft Limit           Hard Limit           Units     
      Max stack size            8388608              unlimited            bytes     
      Max core file size        0                    unlimited            bytes     
      Max processes             768922               768922               processes 
      Max open files            33063                33063                files     
      Max locked memory         25222557696          25222557696          bytes     
      Max pending signals       768922               768922               signals   
      Max msgqueue size         819200               819200               bytes     
      Max nice priority         0                    0                    
      Max realtime priority     0                    0                    
      Core pattern: core
       
      Kernel version: Linux version 6.11.0-17-generic (buildd@lcy02-amd64-038) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #17~24.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 20 22:48:29 UTC 2
       
      Aborted
      

      Below is the gdb backtrace taken while it was stuck:

      (gdb) bt
      #0  0x0000705755298d71 in __futex_abstimed_wait_common64 (private=28759, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5c9365566170 <mysql_bin_log+3344>) at ./nptl/futex-internal.c:57
      #1  __futex_abstimed_wait_common (cancel=true, private=28759, abstime=0x0, clockid=0, expected=0, futex_word=0x5c9365566170 <mysql_bin_log+3344>) at ./nptl/futex-internal.c:87
      #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5c9365566170 <mysql_bin_log+3344>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0)
          at ./nptl/futex-internal.c:139
      #3  0x000070575529b7ed in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5c93655660e0 <mysql_bin_log+3200>, cond=0x5c9365566148 <mysql_bin_log+3304>) at ./nptl/pthread_cond_wait.c:503
      #4  ___pthread_cond_wait (cond=0x5c9365566148 <mysql_bin_log+3304>, mutex=0x5c93655660e0 <mysql_bin_log+3200>) at ./nptl/pthread_cond_wait.c:627
      #5  0x00005c93641f78fd in psi_cond_wait ()
      #6  0x00005c93641aec75 in ?? ()
      #7  0x00005c93642746a8 in ?? ()
      #8  0x00005c9364277659 in mysqld_main(int, char**) ()
      #9  0x000070575522a1ca in __libc_start_call_main (main=main@entry=0x5c936420fdc0 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffe0acb8938) at ../sysdeps/nptl/libc_start_call_main.h:58
      #10 0x000070575522a28b in __libc_start_main_impl (main=0x5c936420fdc0 <main>, argc=4, argv=0x7ffe0acb8938, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe0acb8928) at ../csu/libc-start.c:360
      #11 0x00005c9364251035 in _start ()
      

      If I turn off the binlog, the node joins the Galera cluster, but does not accept any client connections:

      # mariadb --connect-timeout=10
      ERROR 2013 (HY000): Lost connection to server at 'handshake: reading initial communication packet', system error: 110
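
      Disabling the binary log for such a test can be done in the server configuration; a minimal sketch, assuming the standard Ubuntu/Debian layout (the file name 99-no-binlog.cnf is hypothetical):

      # /etc/mysql/mariadb.conf.d/99-no-binlog.cnf
      [mariadbd]
      # comment out any existing log_bin line, or disable it explicitly:
      skip-log-bin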
      


        Activity

          Viktor Soroka (crocodylus) added a comment:

          Further investigation revealed that the issue arose because the donor's mysql.gtid_slave_pos table contains 926,146,635 rows. An SST from another node resolved the problem; however, on the problematic node the number of rows in mysql.gtid_slave_pos does not decrease, as the node is not participating in either inbound or outbound replication—only the binary log is enabled.

          MariaDB [mysql]> select count(*) from gtid_slave_pos;
          +-----------+
          | count(*)  |
          +-----------+
          | 926146635 |
          +-----------+
          1 row in set (45.236 sec)
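
          A possible way to inspect and drain the table on a node that never acts as a replication slave (a sketch, not a verified fix; TRUNCATE discards the recorded slave position, so it is only safe when the node does not replicate from a master):

          MariaDB [mysql]> SELECT COUNT(*) FROM mysql.gtid_slave_pos;
          -- Raise the background purge batch size, if available (MariaDB 10.4+):
          MariaDB [mysql]> SET GLOBAL gtid_cleanup_batch_size = 1048576;
          -- Last resort on a node that is not a replica:
          MariaDB [mysql]> TRUNCATE TABLE mysql.gtid_slave_pos;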
          


          People

            Assignee: Unassigned
            Reporter: Viktor Soroka (crocodylus)
            Votes: 0
            Watchers: 1

