MariaDB Server / MDEV-36204

Galera cluster gets stuck in the --wsrep_recover phase

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version: 11.4.5
    • Fix Version: None
    • Components: Galera, Server
    • Labels: None
    • Environment: Ubuntu 24.04.2 LTS
      Hardware: Pro WS 665-ACE, AMD Ryzen 9 7950X3D, 192Gb RAM

    Description

      One of three Galera nodes doesn't start after a reboot: it gets stuck in the --wsrep_recover phase. If I run the recovery manually and send SIGABRT after some time, I get this log:

      # /usr/sbin/mariadbd --user=mysql --wsrep_recover --disable-log-error
      2025-03-02 18:03:24 0 [Note] slave_connections_needed_for_purge changed to 0 because of Galera. Change it to 1 or higher if this Galera node is also Master in a normal replication setup
      2025-03-02 18:03:24 0 [Note] Starting MariaDB 11.4.5-MariaDB-ubu2404-log source revision 0771110266ff5c04216af4bf1243c65f8c67ccf4 server_uid scRVhtEliqUIegje+r6p56/AWgY= as process 164899
      2025-03-02 18:03:24 0 [Note] InnoDB: Compressed tables use zlib 1.3
      2025-03-02 18:03:24 0 [Note] InnoDB: Number of transaction pools: 1
      2025-03-02 18:03:24 0 [Note] InnoDB: Using AVX512 instructions
      2025-03-02 18:03:24 0 [Note] InnoDB: Using liburing
      2025-03-02 18:03:24 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000GiB, chunk size = 2.000GiB
      2025-03-02 18:03:24 0 [Note] InnoDB: Completed initialization of buffer pool
      2025-03-02 18:03:24 0 [Note] InnoDB: Buffered log writes (block size=4096 bytes)
      2025-03-02 18:03:24 0 [Note] InnoDB: End of log at LSN=4996688328739
      2025-03-02 18:03:24 0 [Note] InnoDB: Resizing redo log from 2.000GiB to 16.000GiB; LSN=4996688328739
      2025-03-02 18:03:24 0 [Note] InnoDB: Buffered log writes (block size=4096 bytes)
      2025-03-02 18:03:24 0 [Note] InnoDB: Opened 3 undo tablespaces
      2025-03-02 18:03:24 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
      2025-03-02 18:03:24 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
      2025-03-02 18:03:24 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
      2025-03-02 18:03:24 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
      2025-03-02 18:03:24 0 [Note] InnoDB: log sequence number 4996688328739; transaction id 4889635920
      2025-03-02 18:03:24 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
      2025-03-02 18:03:24 0 [Note] Plugin 'FEEDBACK' is disabled.
      250302 18:13:51 [ERROR] /usr/sbin/mariadbd got signal 6 ;
      Sorry, we probably made a mistake, and this is a bug.
       
      Your assistance in bug reporting will enable us to fix this for the next release.
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs about how to report
      a bug on https://jira.mariadb.org/.
       
      Please include the information from the server start above, to the end of the
      information below.
       
      Server version: 11.4.5-MariaDB-ubu2404-log source revision: 0771110266ff5c04216af4bf1243c65f8c67ccf4
       
      The information page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/
      contains instructions to obtain a better version of the backtrace below.
      Following these instructions will help MariaDB developers provide a fix quicker.
       
      Attempting backtrace. Include this in the bug report.
      (note: Retrieving this information may fail)
       
      Thread pointer: 0x0
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mariadbd(my_print_stacktrace+0x30)[0x5a36b15b81b0]
      /usr/sbin/mariadbd(handle_fatal_signal+0x2a1)[0x5a36b1151fa1]
      /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x76eefa445330]
      /lib/x86_64-linux-gnu/libc.so.6(+0x98d71)[0x76eefa498d71]
      /lib/x86_64-linux-gnu/libc.so.6(pthread_cond_wait+0x20d)[0x76eefa49b7ed]
      /usr/sbin/mariadbd(psi_cond_wait+0x4b)[0x5a36b0ca48fd]
      /usr/sbin/mariadbd(+0x5f0c75)[0x5a36b0c5bc75]
      /usr/sbin/mariadbd(+0x6b66a8)[0x5a36b0d216a8]
      /usr/sbin/mariadbd(_Z11mysqld_mainiPPc+0xf79)[0x5a36b0d24659]
      /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x76eefa42a1ca]
      /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x76eefa42a28b]
      /usr/sbin/mariadbd(_start+0x25)[0x5a36b0cfe035]
      Writing a core file...
      Working directory at /var/lib/mysql
      Resource Limits (excludes unlimited resources):
      Limit                     Soft Limit           Hard Limit           Units     
      Max stack size            8388608              unlimited            bytes     
      Max core file size        0                    unlimited            bytes     
      Max processes             768922               768922               processes 
      Max open files            33063                33063                files     
      Max locked memory         25222557696          25222557696          bytes     
      Max pending signals       768922               768922               signals   
      Max msgqueue size         819200               819200               bytes     
      Max nice priority         0                    0                    
      Max realtime priority     0                    0                    
      Core pattern: core
       
      Kernel version: Linux version 6.11.0-17-generic (buildd@lcy02-amd64-038) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #17~24.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 20 22:48:29 UTC 2
       
      Aborted
      

      Below is the gdb backtrace taken while it was stuck:

      (gdb) bt
      #0  0x0000705755298d71 in __futex_abstimed_wait_common64 (private=28759, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5c9365566170 <mysql_bin_log+3344>) at ./nptl/futex-internal.c:57
      #1  __futex_abstimed_wait_common (cancel=true, private=28759, abstime=0x0, clockid=0, expected=0, futex_word=0x5c9365566170 <mysql_bin_log+3344>) at ./nptl/futex-internal.c:87
      #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5c9365566170 <mysql_bin_log+3344>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0)
          at ./nptl/futex-internal.c:139
      #3  0x000070575529b7ed in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5c93655660e0 <mysql_bin_log+3200>, cond=0x5c9365566148 <mysql_bin_log+3304>) at ./nptl/pthread_cond_wait.c:503
      #4  ___pthread_cond_wait (cond=0x5c9365566148 <mysql_bin_log+3304>, mutex=0x5c93655660e0 <mysql_bin_log+3200>) at ./nptl/pthread_cond_wait.c:627
      #5  0x00005c93641f78fd in psi_cond_wait ()
      #6  0x00005c93641aec75 in ?? ()
      #7  0x00005c93642746a8 in ?? ()
      #8  0x00005c9364277659 in mysqld_main(int, char**) ()
      #9  0x000070575522a1ca in __libc_start_call_main (main=main@entry=0x5c936420fdc0 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffe0acb8938) at ../sysdeps/nptl/libc_start_call_main.h:58
      #10 0x000070575522a28b in __libc_start_main_impl (main=0x5c936420fdc0 <main>, argc=4, argv=0x7ffe0acb8938, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe0acb8928) at ../csu/libc-start.c:360
      #11 0x00005c9364251035 in _start ()
      

      If I turn off the binlog, the node joins the Galera cluster, but does not accept any client connections:

      # mariadb --connect-timeout=10
      ERROR 2013 (HY000): Lost connection to server at 'handshake: reading initial communication packet', system error: 110
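
      Disabling the binary log for such a test can be done in the server configuration; a minimal sketch, assuming the standard Ubuntu/Debian layout (the file name 99-no-binlog.cnf is hypothetical):

      # /etc/mysql/mariadb.conf.d/99-no-binlog.cnf
      [mariadbd]
      # comment out any existing log_bin line, or disable it explicitly:
      skip-log-bin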
      


        Activity

          Viktor Soroka (crocodylus) added a comment:

          Further investigation revealed that the issue arose because the donor's mysql.gtid_slave_pos table contains 926,146,635 rows. An SST from another node resolved the problem; however, on the problematic node the number of rows in mysql.gtid_slave_pos does not decrease, as the node is not participating in either inbound or outbound replication—only the binary log is enabled.

          MariaDB [mysql]> select count(*) from gtid_slave_pos;
          +-----------+
          | count(*)  |
          +-----------+
          | 926146635 |
          +-----------+
          1 row in set (45.236 sec)
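
          A possible way to inspect and drain the table on a node that never acts as a replication slave (a sketch, not a verified fix; TRUNCATE discards the recorded slave position, so it is only safe when the node does not replicate from a master):

          MariaDB [mysql]> SELECT COUNT(*) FROM mysql.gtid_slave_pos;
          -- Raise the background purge batch size, if available (MariaDB 10.4+):
          MariaDB [mysql]> SET GLOBAL gtid_cleanup_batch_size = 1048576;
          -- Last resort on a node that is not a replica:
          MariaDB [mysql]> TRUNCATE TABLE mysql.gtid_slave_pos;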
          


          People

            Assignee: Unassigned
            Reporter: Viktor Soroka (crocodylus)
            Votes: 0
            Watchers: 1

