MDEV-33977

MariaDB hangs on --wsrep-recover phase

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 10.6.17, 10.11.7
    • Fix Version/s: N/A
    • Component/s: Galera
    • Labels: None

    Description

      On an otherwise healthy Galera cluster, a node hangs on restart when it has been running for a few days.
      The hang happens in the galera_recovery/--wsrep-recover phase.
      If one waits for hours until the node eventually restarts, restarting it again right away is fast; the hang seems to occur only after the node has been running for days.

      When recovery is run, it produces the file /tmp/wsrep_recovery.xxxxxxx:

      2024-04-10 11:07:20 0 [Note] Starting MariaDB 10.11.7-MariaDB-log source revision 87e13722a95af5d9378d990caf48cb6874439347 as process 539904
      2024-04-10 11:07:20 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
      2024-04-10 11:07:20 0 [Note] InnoDB: Number of transaction pools: 1
      2024-04-10 11:07:20 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
      2024-04-10 11:07:20 0 [Note] InnoDB: Using Linux native AIO
      2024-04-10 11:07:20 0 [Note] InnoDB: Initializing buffer pool, total size = 12.000GiB, chunk size = 192.000MiB
      2024-04-10 11:07:20 0 [Note] InnoDB: Completed initialization of buffer pool
      2024-04-10 11:07:20 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
      2024-04-10 11:07:20 0 [Note] InnoDB: End of log at LSN=2145632313933
      2024-04-10 11:07:20 0 [Note] InnoDB: 128 rollback segments are active.
      2024-04-10 11:07:20 0 [Note] InnoDB: Setting file '/mysql/data/ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
      2024-04-10 11:07:20 0 [Note] InnoDB: File '/mysql/data/ibtmp1' size is now 12.000MiB.
      2024-04-10 11:07:20 0 [Note] InnoDB: log sequence number 2145632313933; transaction id 2789662810
      2024-04-10 11:07:20 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
      2024-04-10 11:07:20 0 [Note] Plugin 'FEEDBACK' is disabled.
      

      At this point one CPU core is at 100% and no progress is seen.

      The MariaDB error log is not involved in this phase, so it is probably not relevant: nothing is logged there, and no file in the datadir is updated.

      Both the aggregated stack trace and a graphed (flame graph) version are attached.

      Relevant config:
      innodb_buffer_pool_size = 120G
      innodb_log_file_size = 5120M
      innodb_log_buffer_size = 500M

      Attachments

        Issue Links

          Activity

            sysprg Julius Goryavsky added a comment - - edited

            There's also a possibility that it's related to MDEV-34134.

            bnestere Brandon Nesterenko added a comment - - edited

            It looks like this is caused by MDEV-33799 (fixed in 10.11.8), but presents as a hang instead of a segfault.

            bnestere Brandon Nesterenko added a comment - - edited

            With more investigation (I didn't see the flame graph earlier), I think I see what is going on. When MDEV-31413 added the replication of mysql.gtid_slave_pos, it seems no cleanup of records was added (on a regular slave, the records are automatically cleaned up once they are no longer needed). This is problematic because, at recovery time, the entire mysql.gtid_slave_pos table is read into memory (during load_gtid_state_cb), with an additional malloc call for each row during mysql_manager_submit (which builds a linked list with an individual malloc per row of mysql.gtid_slave_pos). On a regular slave this is not a problem, as gtid_slave_pos won't grow too big. However, on a node which replicates gtid_slave_pos, I imagine this can take up a significant amount of memory. On my laptop, I can generate 12MB worth of records in just 5 minutes (in the size of mysql.gtid_slave_pos alone, i.e. not counting the additional metadata/linked list).

            So I would expect that during the hang the server runs out of resources, leading to an effective halt of the system. To confirm: do we have any memory profiling information from the hang, e.g. memory/swap usage?

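            For reference, a minimal sketch of how the growth described above could be checked on an affected node before a restart, assuming the standard mysql.gtid_slave_pos table and information_schema views (the row count and on-disk size give a rough idea of how much state --wsrep-recover will have to load into memory):

            -- Number of accumulated GTID position rows.
            SELECT COUNT(*) AS gtid_rows FROM mysql.gtid_slave_pos;

            -- Approximate on-disk footprint of the table (and any engine-specific
            -- variants created via gtid_pos_auto_engines).
            SELECT table_name,
                   ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
              FROM information_schema.tables
             WHERE table_schema = 'mysql'
               AND table_name LIKE 'gtid_slave_pos%';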
            ryudidoo Anirudh Menon added a comment - - edited

            We are currently blocked from going to production because of this bug. It has been assigned to the original reporter for feedback. What kind of feedback is needed? Are there tests that need to be run, or additional details that need to be provided? We can chip in; anything that helps move this issue forward would be appreciated.


            bnestere Brandon Nesterenko added a comment -

            Thanks ryudidoo for the report. Moving this out of "needs feedback", as Richard has reproduced the issue and confirmed my theory (and filed MDEV-34924 accordingly). Assigning to sysprg of the Galera team for next steps.
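            Until the missing cleanup (apparently tracked in MDEV-34924) is addressed, one possible stop-gap, sketched here as an assumption rather than an official recommendation, is to manually trim superseded rows from mysql.gtid_slave_pos before restarting a node, mimicking the cleanup a regular slave performs by keeping only the newest sub_id per replication domain. Test this carefully on a non-production node first:

            -- Delete rows that have been superseded by a newer position in the
            -- same replication domain; assumes the standard
            -- (domain_id, sub_id, server_id, seq_no) layout of mysql.gtid_slave_pos.
            DELETE t
              FROM mysql.gtid_slave_pos AS t
              JOIN (SELECT domain_id, MAX(sub_id) AS max_sub_id
                      FROM mysql.gtid_slave_pos
                     GROUP BY domain_id) AS newest
                ON t.domain_id = newest.domain_id
             WHERE t.sub_id < newest.max_sub_id;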

            People

              Assignee: janlindstrom Jan Lindström
              Reporter: claudio.nanni Claudio Nanni
              Votes: 14
              Watchers: 15

