Details
-
Bug
-
Status: Open
-
Critical
-
Resolution: Unresolved
-
11.4.4, 11.4.7, 11.4.9
-
None
-
None
Description
Summary
A MariaDB Galera cluster experiences an IST (Incremental State Transfer) failure when a joiner node attempts to recover from a donor node. The failure is caused by a sequence number (seqno) mismatch: the position the joiner recovers from InnoDB is lower than the position saved in its Galera state, and the IST that follows starts at a seqno the joiner does not expect.
Issue Title
*IST Receiver Failure: Sequence Number Mismatch Between GCache Recovery and InnoDB Recovery*
Severity
*Critical* - Causes node restart loop and cluster instability
Environment
- *MariaDB Version*: 11.4.4; the same issue is seen in 11.4.7 and 11.4.9
- *Galera Version*: 4.21 (rd811a57)
- *Cluster Setup*: 3-node Galera cluster with SST via mariabackup
- *Storage Engine*: InnoDB
- *Deployment*: 3-node Kubernetes cluster, Docker image from the MariaDB.org repository
Problem Description
The three-node cluster is running in a Kubernetes environment.
Root Cause
The IST failure occurs due to a sequence number mismatch between:
1. *GCache Recovery Phase*: Recovers seqno from persistent galera.cache
2. *SST Phase*: Donor sends writesets for a specific seqno range
3. *InnoDB Recovery Phase*: InnoDB recovery completes with a different seqno than expected
4. *IST Application Phase*: IST tries to apply writesets but seqno doesn't match
Detailed Sequence of Events
Phase 1: GCache Recovery (Initial Position)
```
2026-01-16 12:53:55 0 [Note] WSREP: Found saved state: 6e350d19-f301-11f0-bb57-471d08f203bb:6966, safe_to_bootstrap: 0
2026-01-16 12:53:55 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 1-6966
```
- *Initial GCache seqno*: 6966
- *Status*: Node has persistent state up to seqno 6966
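The saved state in the log line above can be cross-checked by hand. The sketch below assumes the state is read from a `grastate.dat` file in the data directory, as is usual for Galera; the sample content is reconstructed from the log line, and the `version` value is an assumption, not taken from the affected node.

```python
from typing import Dict


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the leading '# GALERA saved state' comment.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# Reconstructed from the "Found saved state" log line; the version line is
# an assumption and may differ on the real node.
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    6e350d19-f301-11f0-bb57-471d08f203bb
seqno:   6966
safe_to_bootstrap: 0
"""

state = parse_grastate(SAMPLE)
print(state["uuid"], int(state["seqno"]), state["safe_to_bootstrap"])
```

Comparing this seqno with the position InnoDB reports later in startup shows whether the two sources agree.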
Phase 2: State Transfer Request
```
2026-01-16 12:53:56 2 [Note] WSREP: State transfer required:
Group state: 6e350d19-f301-11f0-bb57-471d08f203bb:6968
Local state: 6e350d19-f301-11f0-bb57-471d08f203bb:6966
```
- *Group state*: 6968 (cluster is at seqno 6968)
- *Local state*: 6966 (joiner is at seqno 6966)
- *Gap*: 2 transactions (6967-6968)
Phase 3: SST via mariabackup
```
WSREP_SST: [INFO] 'xtrabackup_ist' received from donor: Running IST
WSREP_SST: [INFO] Galera co-ords from donor: 6e350d19-f301-11f0-bb57-471d08f203bb:6966 0
```
- *Donor reports*: Galera coordinates at seqno 6966 (the SST script runs in IST mode, so no full backup is transferred)
- *Expected IST range*: 6967-6968
Phase 4: InnoDB Recovery (THE PROBLEM)
```
2026-01-16 12:53:57 0 [Note] InnoDB: log sequence number 31084789; transaction id 5168
2026-01-16 12:53:57 0 [Note] InnoDB: Loading buffer pool(s) from /bitnami/mariadb/data/ib_buffer_pool
2026-01-16 12:54:00 3 [Note] WSREP: Recovered position from storage: 6e350d19-f301-11f0-bb57-471d08f203bb:6666
```
- *InnoDB recovered position*: 6666 (LOWER than expected 6966!)
- *This is the critical issue*: InnoDB recovery completes with seqno 6666, not 6966
Phase 5: IST Application Failure
```
2026-01-16 12:54:00 2 [Note] WSREP: Receiving IST: 302 writesets, seqnos 6667-6968
2026-01-16 12:54:00 0 [Note] WSREP: ####### IST applying starts with 6667
2026-01-16 12:54:00 6 [ERROR] WSREP: Receiving IST failed, node restart required:
IST receiver reported failure: 'IST started with wrong seqno: 6929, expected <= 6667'
```
- *Expected IST start*: 6667 (after InnoDB recovery at 6666)
- *Actual IST start*: 6929 (MISMATCH!)
- *Result*: IST fails, node requires restart
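To compare the positions across the five phases without reading the log by eye, the seqnos can be pulled out with a few regular expressions. This is a minimal sketch: the patterns are written against the exact messages quoted in this report and may need adjusting for other Galera versions, and the dictionary key names are hypothetical.

```python
import re

# Log lines copied from the phases above.
LOG = """\
2026-01-16 12:53:55 0 [Note] WSREP: Found saved state: 6e350d19-f301-11f0-bb57-471d08f203bb:6966, safe_to_bootstrap: 0
2026-01-16 12:53:56 2 [Note] WSREP: State transfer required:
Group state: 6e350d19-f301-11f0-bb57-471d08f203bb:6968
Local state: 6e350d19-f301-11f0-bb57-471d08f203bb:6966
2026-01-16 12:54:00 3 [Note] WSREP: Recovered position from storage: 6e350d19-f301-11f0-bb57-471d08f203bb:6666
2026-01-16 12:54:00 2 [Note] WSREP: Receiving IST: 302 writesets, seqnos 6667-6968
2026-01-16 12:54:00 6 [ERROR] WSREP: Receiving IST failed, node restart required:
IST receiver reported failure: 'IST started with wrong seqno: 6929, expected <= 6667'
"""

# One pattern per position of interest; each captures a single seqno.
PATTERNS = {
    "saved_state": re.compile(r"Found saved state: [0-9a-f-]+:(-?\d+)"),
    "group_state": re.compile(r"Group state: [0-9a-f-]+:(-?\d+)"),
    "recovered": re.compile(r"Recovered position from storage: [0-9a-f-]+:(-?\d+)"),
    "ist_first": re.compile(r"Receiving IST: \d+ writesets, seqnos (\d+)-\d+"),
    "ist_actual_start": re.compile(r"IST started with wrong seqno: (\d+)"),
}


def extract_seqnos(log: str) -> dict:
    """Return the first seqno matched by each pattern, skipping misses."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log)
        if match:
            found[name] = int(match.group(1))
    return found


seqnos = extract_seqnos(LOG)
for name, value in seqnos.items():
    print(f"{name}: {value}")
```

Placed side by side, the values show two separate discrepancies: the saved state (6966) versus the recovered position (6666), and the announced IST start (6667) versus the actual first writeset (6929).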
Root Cause Analysis
The Timing Issue
1. *GCache reports*: Seqno 6966 is safe
2. *Donor coordinates reported by the SST script*: Seqno 6966
3. *InnoDB recovery completes at*: Seqno 6666 (300 transactions behind the saved state)
4. *IST is announced from*: Seqno 6667 (recovered position 6666 + 1), but the first writeset received is 6929
5. *Seqno adjustment happens too late*: After IST has already started with wrong expectations
Why InnoDB Recovery Lags
- InnoDB crash recovery may not replay all transactions from the backup
- Buffer pool loading may cause seqno to be lower than expected
- Galera seqno tracking and InnoDB LSN may be out of sync
- The SST backup seqno (6966) doesn't match the actual InnoDB recovered seqno (6666)
Error Messages
```
[ERROR] WSREP: Receiving IST failed, node restart required:
IST receiver reported failure: 'IST started with wrong seqno: 6929, expected <= 6667'
```
Impact
- *Node Restart Loop*: Node continuously restarts trying to recover
- *Cluster Instability*: Joiner node cannot join the cluster
- *Data Consistency Risk*: Incomplete state transfer
- *Service Disruption*: Affected node is unavailable
Expected Behavior
1. GCache recovery should match InnoDB recovery seqno
2. SST backup seqno should match actual InnoDB recovered seqno
3. IST should start with correct seqno range
4. No seqno mismatch between phases
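The expectations above can be written down as explicit checks. The sketch below is illustrative only: `check_seqnos` and its parameter names are hypothetical, and the IST condition is inferred from the wording of the error message ("expected <= 6667"), not from the Galera source.

```python
from typing import List


def check_seqnos(gcache_last: int, donor_coords: int,
                 innodb_recovered: int, ist_actual_first: int) -> List[str]:
    """Return a list of violated expectations; an empty list means consistent."""
    problems = []
    if innodb_recovered != gcache_last:
        problems.append(
            f"InnoDB recovered {innodb_recovered}, GCache ends at {gcache_last}")
    if innodb_recovered != donor_coords:
        problems.append(
            f"InnoDB recovered {innodb_recovered}, donor reported {donor_coords}")
    # The joiner expects IST to begin no later than the writeset that
    # follows its recovered position.
    expected_first = innodb_recovered + 1
    if ist_actual_first > expected_first:
        problems.append(
            f"IST started at {ist_actual_first}, expected <= {expected_first}")
    return problems


# Values taken from the log excerpts in this report.
failing = check_seqnos(gcache_last=6966, donor_coords=6966,
                       innodb_recovered=6666, ist_actual_first=6929)
for problem in failing:
    print("MISMATCH:", problem)

# A consistent case: every position agrees and IST starts at the next seqno.
healthy = check_seqnos(gcache_last=6966, donor_coords=6966,
                       innodb_recovered=6966, ist_actual_first=6967)
print("healthy case problems:", healthy)
```

With the values from this report, all three checks fail, which matches the observed restart loop.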
Short-term Workaround
A manual restart of the node is required.
Steps to Reproduce
1. Set up 3-node Galera cluster
2. Perform transactions to advance seqno
3. Stop one node (joiner)
4. Continue transactions on other nodes
5. Restart joiner node
6. Observe IST failure with seqno mismatch
Configuration Details
Galera Configuration
```
base_dir = /bitnami/mariadb/data/
base_host = 10.59.47.32
base_port = 4567
cert.log_conflicts = no
cert.optimistic_pa = yes
debug = no
evs.auto_evict = 0
evs.delay_margin = PT1S
evs.delayed_keep_period = PT30S
evs.inactive_check_period = PT0.5S
evs.inactive_timeout = PT15S
evs.join_retrans_period = PT1S
evs.max_install_timeouts = 3
evs.send_window = 4
evs.stats_report_period = PT1M
evs.suspect_timeout = PT5S
evs.user_send_window = 2
evs.view_forget_timeout = PT24H
gcache.dir = /bitnami/mariadb/data/
gcache.keep_pages_size = 0
gcache.mem_size = 0
gcache.name = galera.cache
gcache.page_size = 128M
gcache.recover = yes
gcache.size = 20G
gcomm.thread_prio = (default)
gcs.fc_debug = 0
gcs.fc_factor = 1.0
gcs.fc_limit = 16
gcs.fc_master_slave = no
gcs.fc_single_primary = no
gcs.max_packet_size = 64500
gcs.max_throttle = 0.25
gcs.recv_q_hard_limit = 9223372036854775807
gcs.recv_q_soft_limit = 0.25
gcs.sync_donor = no
gmcast.segment = 0
gmcast.version = 0
```
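When comparing provider options between nodes, it can help to load a dump like the one above into a dictionary and convert the `PT...` duration values to seconds. The following is a rough sketch that assumes the `key = value` layout shown here and handles only hour, minute, and second components; the function names are hypothetical.

```python
import re

# A subset of the option dump above.
DUMP = """\
evs.inactive_check_period = PT0.5S
evs.inactive_timeout = PT15S
evs.stats_report_period = PT1M
evs.view_forget_timeout = PT24H
gcache.recover = yes
gcache.size = 20G
"""


def parse_options(dump: str) -> dict:
    """Split 'key = value' lines into a dict, ignoring lines without '='."""
    options = {}
    for line in dump.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            options[key.strip()] = value.strip()
    return options


_NUM = r"(\d+(?:\.\d+)?)"
_DURATION = re.compile(rf"PT(?:{_NUM}H)?(?:{_NUM}M)?(?:{_NUM}S)?")


def duration_seconds(value: str) -> float:
    """Convert a time-only ISO 8601 duration such as PT0.5S or PT24H to seconds."""
    match = _DURATION.fullmatch(value)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


options = parse_options(DUMP)
print("gcache.recover =", options["gcache.recover"])
print("gcache.size =", options["gcache.size"])
for key, value in options.items():
    if value.startswith("PT"):
        print(key, "=", duration_seconds(value), "seconds")
```

For this report, the relevant settings are `gcache.recover = yes` and `gcache.size = 20G`, which explain why the joiner recovers a gapless GCache on startup.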
InnoDB Configuration
```
innodb_buffer_pool_size = 2.000GiB
innodb_buffer_pool_chunk_size = 32.000MiB
innodb_log_sequence_number = 31084789
innodb_transaction_id = 5168
innodb_undo_tablespaces = 3 (active)
innodb_rollback_segments = 128
innodb_temp_file_size = 12.000MiB
innodb_use_native_aio = yes
innodb_use_avx512 = yes
innodb_compression_algorithm = zlib 1.2.13
```
SST (State Transfer) Configuration
```
sst_method = mariabackup
sst_role = joiner
sst_address = 10.59.47.32
sst_datadir = /bitnami/mariadb/data/
sst_defaults_file = /opt/bitnami/mariadb/conf/my.cnf
sst_parent_pid = 1
sst_progress = 0
sst_binlog = mysql-bin
sst_ssl_mode = DISABLED
sst_ssl_ca = (empty)
sst_ssl_capath = (empty)
sst_ssl_cert = (empty)
sst_ssl_key = (empty)
sst_ssl_encrypt = 0
sst_streamer = socat
sst_stream_port = 4444
sst_socket_info_utility = ss
sst_timeout = 310 seconds (with -k 310 300)
```
Cluster Topology
```
Cluster UUID: 6e350d19-f301-11f0-bb57-471d08f203bb
Cluster Name: DBGalera
Cluster Size: 3 nodes
Cluster State: PRIMARY
Node 0 (Index 0):
UUID: 008c0b68-a846
Address: tcp://10.59.47.39:4567
Status: SYNCED
Hostname: ttd-db-mariadb-0
Node 1 (Index 1):
UUID: 61658c97-8ecd
Address: tcp://10.59.47.64:4567
Status: SYNCED
Hostname: ttd-db-mariadb-0
Node 2 (Index 2) - AFFECTED NODE:
UUID: b53c8051-951c
Address: tcp://10.59.47.32:4567
Status: JOINER (failed to join)
Hostname: ttd-db-mariadb-0
Role: Joiner
Donor: Node 1 (Index 1)
```
GCache State
```
GCache Version: 2
GCache UUID: 6e350d19-f301-11f0-bb57-471d08f203bb
GCache Seqno Range: 1 - 6966
GCache Offset: 1280
GCache Synced: yes
GCache Total Size: 21474836504 bytes (~20GB)
GCache Unused Buffers: 55699528 bytes
GCache Free Space: 21419137320 bytes
GCache Locked Buffers: 2/6968
GCache Recovery Status: Gapless sequence found (1-6966)
```
Protocol Versions
```
GCS Protocol: 5
Replication Protocol: 11
Application Protocol: 4
Galera Capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY,
ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED,
PREORDERED, STREAMING, NBO
```
Additional Information
- *Galera Cache Size*: 20G
- *GCache Page Size*: 128M
- *Cluster UUID*: 6e350d19-f301-11f0-bb57-471d08f203bb
- *Affected Node*: ttd-db-mariadb-0 (index 2)
- *Donor Node*: ttd-db-mariadb-0 (index 1)
- *Cluster State*: PRIMARY (2/3 nodes synced)
- *InnoDB Buffer Pool*: 2GB
- *SST Method*: mariabackup with IST optimization
- *Network*: 10.59.47.0/24 subnet
Related Issues
- Galera seqno tracking inconsistency
- InnoDB recovery vs Galera seqno synchronization
- SST backup seqno validation
- GCache-InnoDB state mismatch
Attachments
- Full MariaDB error log (provided above)
- Galera configuration (detailed above)
- Cluster topology information (detailed above)
- InnoDB configuration (detailed above)