[MDEV-35829] galera node crash with race condition

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.11.10
Fix Version/s: 10.11
Component/s: Galera
Labels: None
Environment:
3 nodes behind a load balancer. Hetzner Cloud Server CCX43: 16 CPUs, 64 GB RAM, 200 GB disk. Debian 12, MariaDB 10.11.10, Galera 4 26.4.20. Happens with both the rsync and mariabackup wsrep SST methods.

Description

Race condition with CPU at 100%, always with the same message (only the affected InnoDB table changes from crash to crash):

2025-01-04  5:00:19 0 [ERROR] InnoDB: ibuf cursor restoration fails! ibuf record inserted to page [page id: space=14819, page number=396] in file ./fm@002dcz@002dprod/wp_gdmaq_log_email.ibd

2025-01-04  5:00:19 0 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/

PHYSICAL RECORD: n_fields 6; 1-byte offsets; info bits 0

0: len 4; hex 000039e3; asc   9 ;;

1: len 1; hex 00; asc  ;;

2: len 4; hex 0000018c; asc     ;;

3: len 16; hex 00010101860800088000860800088000; asc                 ;;

4: len 8; hex 00000000000413d1; asc         ;;

5: len 8; hex 00000000000a5853; asc       XS;;

PHYSICAL RECORD: n_fields 6; 1-byte offsets; info bits 0

0: len 4; hex 000039e3; asc   9 ;;

1: len 1; hex 00; asc  ;;

2: len 4; hex 0000018c; asc     ;;

3: len 16; hex 00010101860800088000860800088000; asc                 ;;

4: len 8; hex 00000000000413d1; asc         ;;

5: len 8; hex 00000000000a5853; asc       XS;;

DATA TUPLE: 3 fields;

0: len 4; hex 000039e3; asc   9 ;;

1: len 1; hex 00; asc  ;;

2: len 4; hex 0000018c; asc     ;;
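The record dump can be cross-checked against the page id in the error message. The following is a minimal sketch, assuming that field 0 of the record holds the tablespace id and field 2 holds the page number, both as big-endian integers; that interpretation is an assumption, supported here only by the fact that the decoded values match the `space=14819, page number=396` reported in the log line.

```python
# Hex values copied from the PHYSICAL RECORD dump above.
record_fields = {
    0: "000039e3",  # assumed: tablespace (space) id
    2: "0000018c",  # assumed: page number
}


def decode_be_uint(hex_str: str) -> int:
    """Interpret a hex string as a big-endian unsigned integer."""
    return int.from_bytes(bytes.fromhex(hex_str), byteorder="big")


space_id = decode_be_uint(record_fields[0])
page_no = decode_be_uint(record_fields[2])

# Prints: space=14819, page number=396
print(f"space={space_id}, page number={page_no}")
```

Both printed records and the data tuple carry the same values in these fields, so all three refer to the same page of the same tablespace.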

----------------------------------------------------------
60-galera.cnf
---------------------------------------------------------------------------------------------------------

[mysqld]

wsrep_provider              = /usr/lib/galera/libgalera_smm.so

wsrep_on                    = ON

wsrep_cluster_name          = Forum Prod Galera Cluster

wsrep_cluster_address       = gcomm://10.1.1.2,10.1.1.3,10.1.1.4

wsrep_node_name             = DB03

wsrep_node_address          = 10.1.1.4

wsrep_provider_options      = "gmcast.listen_addr=tcp://10.1.1.4:4567; gcache.size=10G; gcache.page_size=10G"

binlog_format               = ROW

default_storage_engine      = InnoDB

wsrep_slave_threads         = 16

wsrep_load_data_splitting   = ON

innodb_autoinc_lock_mode    = 2

bind-address = 10.0.0.15,10.1.1.4

---------------------------------------------------
50-server.cnf
------------------------------------------------------------------------------------------

# These groups are read by MariaDB server.

# Use it for options that only the server (but not clients) should see

[server]

# this is only for the mysqld standalone daemon

[mysqld]

# * Basic Settings

pid-file                = /run/mysqld/mysqld.pid

basedir                 = /usr

skip-name-resolve

# * Fine Tuning

thread_cache_size = 100

table_cache = 2500

join_buffer_size = 64M

tmp_table_size = 64M

max_heap_table_size = 64M

table_definition_cache = 3500

# * Logging and Replication

# Both location gets rotated by the cronjob.

# Be aware that this log type is a performance killer.

# Recommend only changing this at runtime for short testing periods if needed!

# general_log_file       = /var/log/mysql/mysql.log

# general_log            = 1

# When running under systemd, error logging goes via stdout/stderr to journald

# and when running legacy init error logging goes to syslog due to

# /etc/mysql/conf.d/mariadb.conf.d/50-mysqld_safe.cnf

# Enable this if you want to have error logging into a separate file

log_error = /var/log/mysql/error.log

# Enable the slow query log to see queries with especially long duration

log_slow_query_file    = /var/log/mysql/mariadb-slow.log

log_slow_query_time    = 10

log_slow_verbosity     = query_plan,explain

# log-queries-not-using-indexes

# log_slow_min_examined_row_limit = 1000

# The following can be used as easy to replay backup logs or for replication.

# note: if you are setting up a replication slave, see README.Debian about

#       other settings you may need to change.

expire_logs_days        = 10

# * SSL/TLS

ssl-ca = /etc/mysql/certs/ca-cert.pem

ssl-key = /etc/mysql/certs/server-key.pem

ssl-cert = /etc/mysql/certs/server-cert.pem

proxy-protocol-networks=10.0.0.0/24,localhost

require-secure-transport = off

# * Character sets

character-set-server  = utf8mb4

collation-server      = utf8mb4_general_ci

# * InnoDB

innodb_flush_log_at_trx_commit=0

innodb_flush_method=O_DIRECT

innodb_doublewrite=1

innodb_autoextend_increment=64

innodb_file_per_table=1

innodb_thread_concurrency=0

innodb_buffer_pool_size=46G

innodb_buffer_pool_instances=10

innodb_log_file_size=8G

innodb_io_capacity = 2000

innodb_io_capacity_max = 5000

innodb_checksum_algorithm = crc32

innodb_log_compressed_pages=OFF

innodb_change_buffering=all

innodb_autoinc_lock_mode=2

[mariadb]

key_buffer_size        = 128M

net_buffer_length      = 1048576

max_allowed_packet     = 1024M

query_cache_type=0

query_cache_size=0

performance_schema=on

wait_timeout=300

proxy-protocol-networks=10.0.0.0/24,localhost

# Hetzner recommendations

max_connect_errors    = 4294967295

max_connections       = 100000

Attachments

Issue Links

is duplicated by: MDEV-35508 Race condition between purge and secondary index INSERT or UPDATE (Closed)

Activity


Jan Lindström added a comment - 2025-01-15 12:19 - edited

hriesz Can you please provide the full error log and node configuration? Do you have instructions on how to reproduce this?


Hansjuergen Riess added a comment - 2025-01-16 10:19 - edited

Hello Jan,
what do you mean by full node configuration? The config files are included in the description. The error caused by the race condition is always the same as listed.
It seems to happen when an index becomes corrupted and the change is replicated to another node. Then the other node crashes.
Or the other way round: the index becomes corrupted when it is updated. This happens every few days.

The crashed node cannot be restarted. It hits the race condition again on startup.

Only recreating the node works (delete everything in /var/lib/mysql except /var/lib/mysql/mysql, then run systemctl start mariadb).

It looks a little like https://jira.mariadb.org/browse/MDEV-35508

Is there a configuration parameter to avoid this?


Jan Lindström added a comment - 2025-01-16 11:39

hriesz Thanks for pointing that out. It seems that the fix has not been released yet, so you might need to wait for 10.11.11.


Hansjuergen Riess added a comment - 2025-01-16 14:11 - edited

Thanks for your answer. What I can summarize from the last crash is that the race condition occurs on the synced nodes.
The source node then shows:
[ERROR] Got error 180 when reading table './abc@002defg/wp_wpr_rucss_used_css'
When do you think the fix will be available?
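
The directory names in these messages use the server's file name encoding, in which a character that is not safe in a file name appears as `@` followed by four hex digits. The sketch below decodes only that one form; the helper name `decode_mariadb_filename` is hypothetical, and any other encoded forms the server may use are not handled.

```python
import re


def decode_mariadb_filename(name: str) -> str:
    """Replace each '@' + four hex digits with the character it encodes."""
    return re.sub(
        r"@([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )


# Directory names taken from the two error messages in this report.
print(decode_mariadb_filename("fm@002dcz@002dprod"))  # fm-cz-prod
print(decode_mariadb_filename("abc@002defg"))         # abc-efg
```

Under that assumption, the schemas named in the log lines are `fm-cz-prod` and `abc-efg`.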


Amit Ghosh added a comment - 2025-02-05 16:08

We also encountered this bug. MariaDB 10.11.11 was released yesterday. According to the Release Notes and Changelog, a fix for https://jira.mariadb.org/browse/MDEV-35508 does not seem to be included. However, MDEV-35508 also links to https://jira.mariadb.org/browse/MDEV-35619, which is not mentioned in the Release Notes or Changelog of 10.11.11 either, although the ticket still states that it is fixed in version 10.11.11. Can someone clarify whether we can expect this bug to be fixed in 10.11.11? If not, is it still planned to include the fix in future versions of the 10.11 branch, or do you recommend upgrading to e.g. 11.4.5 to avoid this bug?

