[MDEV-39044] MyRocks corruption after restart during/after ALTER workload: Corruption: truncated record body, .frm mismatch, no crash log, no OOM killer

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.11.16
Fix Version/s: 10.11
Component/s: Server, Storage Engine - RocksDB
Labels:
- crash
- recovery

Bug Category:
Can result in data loss

Description

Summary

We hit a MyRocks corruption scenario in which MariaDB restarts and MyRocks then fails to initialize with:

Corruption: truncated record body

The server itself then continues to start with InnoDB crash recovery and binlog recovery, but MyRocks remains unavailable and a large number of table metadata errors follow:

Incorrect information in file: './pmacontrol/...frm'

This leaves the server partially up, but RocksDB tables are unusable because the .frm files appear out of sync with the MyRocks dictionary/state.

A notable part of this case is that there is no explicit crash record in the MariaDB error log and no OOM-killer event recorded by the kernel.

Observed timeframe

Between 2026-03-05 23:00 and 2026-03-06 01:00:

journalctl -u mariadb is empty for that exact window (no entries for mariadb.service)
no matching journal entries for mariadbd
no OOM-killer event in kernel logs

So there is:

no systemd-recorded stop/start in that period
no kernel-recorded OOM-killer event
no explicit mysqld crash entry in the MariaDB error log for that incident window
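
The journal checks above can be reproduced roughly as follows. This is a minimal sketch: the unit name and time window are the ones from this report, the OOM grep pattern is my assumption about what the kernel would log, and the helper name journalctl_window is only illustrative. It prints the commands rather than running them.

```python
import shlex

# Incident window taken from this report.
SINCE = "2026-03-05 23:00"
UNTIL = "2026-03-06 01:00"


def journalctl_window(*extra, since=SINCE, until=UNTIL):
    """Build a journalctl argv restricted to the incident window (illustrative helper)."""
    return ["journalctl", "--no-pager", "--since", since, "--until", until, *extra]


CHECKS = {
    # Any systemd-recorded stop/start of the service would show up here.
    "service journal": journalctl_window("-u", "mariadb.service"),
    # Kernel ring buffer only; the pattern is an assumption about OOM-killer wording.
    "kernel OOM killer": journalctl_window("-k", "--grep", "out of memory|oom-killer"),
}

for label, argv in CHECKS.items():
    print(f"# {label}")
    print(shlex.join(argv))
```

An empty result from both commands matches what is described above.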

Important evidence from MariaDB error log

The MariaDB error log shows the relevant sequence in error.log.old:

repeated InnoDB: Memory pressure event disregarded messages from 23:18 onward
a burst of aborted pmacontrol connections around 23:57:49-23:57:53
MariaDB startup at 2026-03-05 23:58:35

during that startup, MyRocks fails with:

RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated record body

the server then performs InnoDB crash recovery
the server then performs binlog recovery

immediately after that, many errors appear like:

Incorrect information in file: './pmacontrol/...frm'

Representative references from the error log:

repeated memory pressure warnings starting around error.log.old#L481872
aborted connections around error.log.old#L481903
startup at error.log.old#L481930
MyRocks corruption at error.log.old#L481939
InnoDB crash recovery at error.log.old#L481954
binlog recovery at error.log.old#L481980
.frm metadata errors starting around error.log.old#L482038
continuing later, e.g. error.log.old#L483593

Important negative evidence

There is no corresponding:

mysqld got signal
assertion failure
stack trace
OOM killer log
systemd stop/start record

So the incident looks like a restart followed by MyRocks corruption recovery failure, but without a normal crash signature in either MariaDB logs or kernel logs.
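
For anyone who wants to repeat the negative check on their own log, here is a small sketch of how the error log can be scanned for the usual crash signatures. The patterns are my approximation of what mariadbd prints on a crash (signal line, assertion, backtrace header) and are not exhaustive; the function name is hypothetical.

```python
import re

# Approximate crash markers; treat these as illustrative, not a complete list.
SIGNATURES = {
    "signal": re.compile(r"got signal \d+"),
    "assertion": re.compile(r"assertion.*fail", re.IGNORECASE),
    "backtrace": re.compile(r"Attempting backtrace"),
}


def find_crash_signatures(lines):
    """Yield (line_number, kind, text) for every line matching a crash marker."""
    for number, line in enumerate(lines, start=1):
        for kind, pattern in SIGNATURES.items():
            if pattern.search(line):
                yield number, kind, line.rstrip("\n")
                break


# Two lines copied from the server log excerpt in this report.
sample = [
    "2026-03-06  9:16:18 0 [Note] Starting MariaDB 10.11.16-MariaDB-deb12-log ...",
    "2026-03-06  9:16:20 0 [ERROR] RocksDB: Error opening instance, "
    "Status Code: 2, Status: Corruption: truncated record body",
]
hits = list(find_crash_signatures(sample))
print(hits)
```

On a real log, pass the open file object (for example open("error.log.old", errors="replace")) instead of the sample list.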

Representative server log excerpt

2026-03-06  9:16:18 0 [Note] Starting MariaDB 10.11.16-MariaDB-deb12-log ...
2026-03-06  9:16:18 0 [Note] RocksDB: 2 column families found
2026-03-06  9:16:20 0 [ERROR] RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated record body
2026-03-06  9:16:20 0 [ERROR] Plugin 'ROCKSDB' registration as a STORAGE ENGINE failed.
...
2026-03-06  9:16:20 0 [Note] /usr/sbin/mariadbd: ready for connections.
...
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_digest_text.frm'
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_calculated_double.frm'
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_slave_json.frm'
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_general_int.frm'
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_calculated_int.frm'
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_slave_text.frm'
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_general_text.frm'
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_digest_double.frm'
2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_calculated_text.frm'
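
Because the .frm errors run to many lines, a short script helps to reduce them to a list of affected tables. The sketch below assumes the message format shown in the excerpt above; the function name is hypothetical and the regular expression is mine.

```python
import re

# Matches the error text shown in the excerpt, e.g.
# "Incorrect information in file: './pmacontrol/ts_value_digest_text.frm'"
FRM_ERROR = re.compile(
    r"Incorrect information in file: '\./(?P<schema>[^/']+)/(?P<table>[^/']+)\.frm'"
)


def tables_with_frm_errors(lines):
    """Return the sorted, de-duplicated (schema, table) pairs named in .frm errors."""
    found = set()
    for line in lines:
        match = FRM_ERROR.search(line)
        if match:
            found.add((match["schema"], match["table"]))
    return sorted(found)


# Lines copied from the excerpt above; the first one must be ignored.
sample = [
    "2026-03-06  9:16:20 0 [Note] /usr/sbin/mariadbd: ready for connections.",
    "2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_digest_text.frm'",
    "2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_calculated_double.frm'",
]
affected = tables_with_frm_errors(sample)
for schema, table in affected:
    print(f"{schema}.{table}")
```

Run over the whole error log, this gives a list of tables to verify once MyRocks loads again.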

Allocator / memory environment

The server is running with jemalloc.

This is relevant because the incident happened in a context where MariaDB was logging repeated:

InnoDB: Memory pressure event disregarded

but there was still:

no kernel OOM-killer event
no explicit MariaDB crash signature in the error log

So this does not look like a straightforward kernel OOM kill. The behavior happened under memory pressure, with jemalloc in use, and ended in MyRocks startup corruption plus .frm metadata inconsistency.

Possible trigger

This happened after ALTER activity on RocksDB tables, with concurrent memory pressure visible on the InnoDB side (Memory pressure event disregarded messages), while running with jemalloc. I cannot prove kernel OOM, and there is no OOM-killer entry, but there was clearly memory pressure before the restart/corruption event.

Workaround / recovery

To recover MyRocks enough to start and proceed with repair, the following setting helped:

rocksdb_wal_recovery_mode=2

This appears necessary to get past the corrupted WAL state.
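
For context on what the value 2 selects: as far as I understand, rocksdb_wal_recovery_mode is passed through to RocksDB's WALRecoveryMode option, whose numeric values are sketched below. The Python names are my own transliteration of the RocksDB enum, and the helper is hypothetical.

```python
from enum import IntEnum


class WALRecoveryMode(IntEnum):
    """Numeric values of RocksDB's WALRecoveryMode enum (names transliterated)."""

    TOLERATE_CORRUPTED_TAIL_RECORDS = 0
    ABSOLUTE_CONSISTENCY = 1
    POINT_IN_TIME_RECOVERY = 2
    SKIP_ANY_CORRUPTED_RECORDS = 3


def describe(value):
    """Render a my.cnf value such as '2' together with its mode name."""
    mode = WALRecoveryMode(int(value))
    return f"rocksdb_wal_recovery_mode={mode.value} ({mode.name})"


print(describe("2"))
```

If that mapping is right, mode 2 replays the WAL only up to the first corrupted record, so writes logged after that point may be lost. That is consistent with this ticket being categorized as "Can result in data loss".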

Expected behavior

One of the following should happen instead:

ALTER on RocksDB should fail atomically without leaving MyRocks data dictionary / .frm files inconsistent.
Startup should detect and report the exact metadata mismatch more explicitly.
Recovery should not leave the server in a state where MyRocks is unavailable but the server otherwise appears started and usable.

Actual behavior

MariaDB starts
MyRocks plugin fails
InnoDB crash recovery runs
binlog recovery runs
.frm errors continue for RocksDB tables
affected RocksDB tables in schema pmacontrol are unusable

Environment

MariaDB version: 10.11.16-MariaDB-deb12-log
OS: Debian 12
MyRocks enabled
InnoDB also enabled
jemalloc in use

Relevant my.cnf excerpt

[mysqld]
user = mysql
port = 3306
socket = /var/run/mysqld/mysqld.sock
datadir = /srv/mysql/data
tmpdir = /srv/mysql/tmp
log_error = /srv/mysql/log/error.log
skip-name-resolve
max_connections = 100
connect_timeout = 10
wait_timeout = 600
max_allowed_packet = 256M
thread_cache_size = 128
sort_buffer_size = 32M
tmp_table_size = 768M
max_heap_table_size = 768M
key_buffer_size = 128M
default_storage_engine = InnoDB
innodb_buffer_pool_size = 2G
innodb_buffer_pool_size_auto_min = 1G
innodb_buffer_pool_size_max = 3G
innodb_log_file_size = 948M
innodb_log_buffer_size = 8M
innodb_file_per_table = 1
innodb_open_files = 2000
innodb_io_capacity = 2000
innodb_flush_method = O_DIRECT
innodb_strict_mode = 1
innodb_rollback_on_timeout = 1
slow_query_log = 1
slow_query_log_file = /srv/mysql/log/mariadb-slow.log
long_query_time = 1
log_bin = /srv/mysql/binlog/mariadb-bin
log_bin_index = /srv/mysql/binlog/mariadb-bin.index
binlog_expire_logs_seconds = 3600
max_binlog_size = 100M
sync_binlog = 10000
performance_schema = ON
userstat = ON
query_response_time_stats = ON
event_scheduler = ON
rocksdb_wal_recovery_mode = 2
rocksdb_flush_log_at_trx-commit = 2
server-id = 394663081
report_host = ist-pmacontrol
wsrep_on = OFF
wsrep_cluster_name = 68Koncept

Potentially related existing tickets

This issue looks related in theme, but not identical, to:

MDEV-20406: Rocksdb gets corrupted on OOM during ALTER
~~MDEV-18204~~: RocksDB failed to start due to problems validating data dictionary against .frm files
MDEV-29749: RocksDB does not refuse nopad collation in time, leaves corrupt schema

The difference here is:

no kernel OOM-killer event or other kernel OOM evidence
no explicit crash signature in the MariaDB error log
truncated record body instead of the exact messages in those reports
server continues startup without MyRocks, then emits many .frm metadata errors

Attachments

screenshot-1.png
554 kB
2026-03-12 19:08

Issue Links

relates to

MDEV-25180 Atomic ALTER TABLE

Closed