  1. MariaDB Server
  2. MDEV-39044

MyRocks corruption after restart during/after ALTER workload: Corruption: truncated record body, .frm mismatch, no crash log, no OOM killer

Details

    • Can result in data loss

    Description

      Summary

      We hit a MyRocks corruption scenario in which, after MariaDB restarts, MyRocks fails to initialize with:

      Corruption: truncated record body

      The server then continues startup, running InnoDB crash recovery and binlog recovery, but MyRocks remains unavailable and a large number of table metadata errors follow:

      Incorrect information in file: './pmacontrol/...frm'

      This leaves the server partially up, but RocksDB tables are unusable because the .frm files appear out of sync with the MyRocks dictionary/state.

      A notable part of this case is that there is no explicit crash record in the MariaDB error log and no OOM-killer event recorded by the kernel.

      Observed timeframe

      Between 2026-03-05 23:00 and 2026-03-06 01:00:

      • journalctl -u mariadb returns nothing for that exact window (no entries for mariadb.service)
      • no matching entries for mariadbd
      • no OOM-killer event in kernel logs

      So, for that period, there is:

      • no systemd-recorded stop/start
      • no kernel-recorded OOM or OOM-killer event
      • no explicit mysqld crash entry in the MariaDB error log
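      The checks above can be repeated with a small script. This is a minimal sketch, assuming the unit is named mariadb.service and that kernel messages are stored in the systemd journal; the function names and the list of OOM phrases are illustrative, not part of the original investigation. It uses the _TRANSPORT=kernel match rather than -k, because -k also restricts the query to the current boot.

```python
import re
import subprocess

# Incident window from the report; the unit name is an assumption.
SINCE = "2026-03-05 23:00:00"
UNTIL = "2026-03-06 01:00:00"

# Typical kernel OOM-killer phrases. Illustrative, not exhaustive.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom_reaper", re.IGNORECASE)


def journal_commands(since=SINCE, until=UNTIL, unit="mariadb.service"):
    """Return journalctl argument lists for the unit log and the kernel log."""
    window = ["--since", since, "--until", until, "--no-pager"]
    return {
        "unit": ["journalctl", "-u", unit] + window,
        # Match kernel messages without limiting the query to the current boot.
        "kernel": ["journalctl", "_TRANSPORT=kernel"] + window,
    }


def oom_lines(lines):
    """Keep only lines that look like kernel OOM-killer activity."""
    return [line for line in lines if OOM_PATTERN.search(line)]


def run(cmd):
    """Run one journalctl command and return its output as a list of lines."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.splitlines()


# Example use on the affected host:
#   cmds = journal_commands()
#   print(run(cmds["unit"]))
#   print(oom_lines(run(cmds["kernel"])))
```

      An empty result from both queries would match the findings listed above; it does not by itself rule out a kill that was never journaled.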

      Important evidence from MariaDB error log

      The relevant sequence is in the MariaDB error log, error.log.old:

      • repeated InnoDB: Memory pressure event disregarded messages from 23:18 onward
      • a burst of aborted pmacontrol connections around 23:57:49-23:57:53
      • MariaDB startup at 2026-03-05 23:58:35
      • during that startup, MyRocks fails with:

        RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated record body

      • the server then performs InnoDB crash recovery
      • the server then performs binlog recovery
      • immediately after that, many errors appear like:

        Incorrect information in file: './pmacontrol/...frm'

      Representative references from the error log:

      • repeated memory pressure warnings starting around error.log.old#L481872
      • aborted connections around error.log.old#L481903
      • startup at error.log.old#L481930
      • MyRocks corruption at error.log.old#L481939
      • InnoDB crash recovery at error.log.old#L481954
      • binlog recovery at error.log.old#L481980
      • .frm metadata errors starting around error.log.old#L482038
      • continuing later, e.g. error.log.old#L483593
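      The line references above can be located mechanically. The following is a rough sketch, assuming the log is plain text and that the marker strings quoted in this report appear verbatim; the MARKERS table and the first_hits helper are hypothetical names introduced for illustration.

```python
import re

# Marker strings quoted in this report; the regexes are illustrative.
MARKERS = {
    "memory_pressure": re.compile(r"InnoDB: Memory pressure event disregarded"),
    "startup": re.compile(r"\[Note\] Starting MariaDB"),
    "rocksdb_corruption": re.compile(r"RocksDB: Error opening instance.*Corruption"),
    "frm_error": re.compile(r"Incorrect information in file: '.*\.frm'"),
    "crash_signal": re.compile(r"got signal \d+"),
}


def first_hits(lines):
    """Map each marker name to the 1-based line number of its first match."""
    hits = {}
    for number, line in enumerate(lines, start=1):
        for name, pattern in MARKERS.items():
            if name not in hits and pattern.search(line):
                hits[name] = number
    return hits


# Example use (the file name comes from the report):
#   with open("error.log.old", errors="replace") as fh:
#       print(first_hits(fh))
```

      On a long log this reports the first occurrence in the whole file, so it would make sense to pass only the slice of lines around the incident. A missing crash_signal key corresponds to the absence of a mysqld got signal entry.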

      Important negative evidence

      There is no corresponding:

      • mysqld got signal
      • assertion failure
      • stack trace
      • OOM killer log
      • systemd stop/start record

      So the incident looks like a restart followed by MyRocks failing to recover from corruption at startup, but without a normal crash signature in either the MariaDB logs or the kernel logs.

      Representative server log excerpt (from a later startup on 2026-03-06 at 09:16, showing the same errors)

      2026-03-06  9:16:18 0 [Note] Starting MariaDB 10.11.16-MariaDB-deb12-log ...
      2026-03-06  9:16:18 0 [Note] RocksDB: 2 column families found
      2026-03-06  9:16:20 0 [ERROR] RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated record body
      2026-03-06  9:16:20 0 [ERROR] Plugin 'ROCKSDB' registration as a STORAGE ENGINE failed.
      ...
      2026-03-06  9:16:20 0 [Note] /usr/sbin/mariadbd: ready for connections.
      ...
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_digest_text.frm'
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_calculated_double.frm'
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_slave_json.frm'
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_general_int.frm'
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_calculated_int.frm'
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_slave_text.frm'
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_general_text.frm'
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_digest_double.frm'
      2026-03-06  9:16:20 9 [ERROR] mariadbd: Incorrect information in file: './pmacontrol/ts_value_calculated_text.frm'
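      To get a de-duplicated list of affected tables out of a log like the excerpt above, something along these lines may help. It is a sketch that assumes the error text has exactly the form shown in the excerpt; FRM_ERROR and affected_tables are hypothetical names.

```python
import re

# Matches the .frm error format shown in the excerpt: './schema/table.frm'
FRM_ERROR = re.compile(
    r"Incorrect information in file: '\./([^/']+)/([^/']+)\.frm'"
)


def affected_tables(lines):
    """Return sorted, unique (schema, table) pairs named in .frm errors."""
    found = set()
    for line in lines:
        match = FRM_ERROR.search(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return sorted(found)


# Example use (the file name comes from the report):
#   with open("error.log.old", errors="replace") as fh:
#       for schema, table in affected_tables(fh):
#           print(f"{schema}.{table}")
```

      The same table is usually reported many times, so collecting the pairs in a set keeps the output short enough to compare against the list of RocksDB tables in the schema.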

      Allocator / memory environment

      The server is running with jemalloc.

      This is relevant because the incident happened in a context where MariaDB was logging repeated:

      InnoDB: Memory pressure event disregarded

      but there was still:

      • no kernel OOM-killer event
      • no explicit MariaDB crash signature in the error log

      So this does not look like a straightforward kernel OOM kill. The behavior happened under memory pressure, with jemalloc in use, and ended in MyRocks startup corruption plus .frm metadata inconsistency.

      Possible trigger

      This happened after ALTER activity on RocksDB tables, with jemalloc in use and concurrent memory pressure visible on the InnoDB side (Memory pressure event disregarded messages). I cannot prove a kernel OOM and there is no OOM-killer entry, but there was clearly memory pressure before the restart/corruption event.

      Workaround / recovery

      To recover MyRocks enough to start and proceed with repair, the following setting helped:

      rocksdb_wal_recovery_mode=2

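      This appears necessary to get past the corrupted WAL state. In RocksDB, mode 2 is kPointInTimeRecovery: WAL replay stops at the first corrupted record, so writes logged after that point are not recovered.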

      Expected behavior

      At least one of the following should happen instead:

      1. ALTER on RocksDB should fail atomically without leaving MyRocks data dictionary / .frm files inconsistent.
      2. Startup should detect and report the exact metadata mismatch more explicitly.
      3. Recovery should not leave the server in a state where MyRocks is unavailable but the server otherwise appears started and usable.

      Actual behavior

      • MariaDB starts
      • MyRocks plugin fails
      • InnoDB crash recovery runs
      • binlog recovery runs
      • .frm errors continue for RocksDB tables
      • affected RocksDB tables in schema pmacontrol are unusable

      Environment

      • MariaDB version: 10.11.16-MariaDB-deb12-log
      • OS: Debian 12
      • MyRocks enabled
      • InnoDB also enabled
      • jemalloc in use

      Relevant my.cnf excerpt

      [mysqld]
      user = mysql
      port = 3306
      socket = /var/run/mysqld/mysqld.sock
      datadir = /srv/mysql/data
      tmpdir = /srv/mysql/tmp
      log_error = /srv/mysql/log/error.log
      skip-name-resolve
       
      max_connections = 100
      connect_timeout = 10
      wait_timeout = 600
      max_allowed_packet = 256M
      thread_cache_size = 128
       
      sort_buffer_size = 32M
      tmp_table_size = 768M
      max_heap_table_size = 768M
      key_buffer_size = 128M
       
      default_storage_engine = InnoDB
      innodb_buffer_pool_size = 2G
      innodb_buffer_pool_size_auto_min = 1G
      innodb_buffer_pool_size_max = 3G
      innodb_log_file_size = 948M
      innodb_log_buffer_size = 8M
      innodb_file_per_table = 1
      innodb_open_files = 2000
      innodb_io_capacity = 2000
      innodb_flush_method = O_DIRECT
      innodb_strict_mode = 1
      innodb_rollback_on_timeout = 1
       
      slow_query_log = 1
      slow_query_log_file = /srv/mysql/log/mariadb-slow.log
      long_query_time = 1
       
      log_bin = /srv/mysql/binlog/mariadb-bin
      log_bin_index = /srv/mysql/binlog/mariadb-bin.index
      binlog_expire_logs_seconds = 3600
      max_binlog_size = 100M
      sync_binlog = 10000
       
      performance_schema = ON
      userstat = ON
      query_response_time_stats = ON
      event_scheduler = ON
       
      rocksdb_wal_recovery_mode = 2
      rocksdb_flush_log_at_trx_commit = 2
       
      server-id = 394663081
      report_host = ist-pmacontrol
       
      wsrep_on = OFF
      wsrep_cluster_name = 68Koncept

      Potentially related existing tickets

      This issue looks thematically related, but not identical, to:

      • MDEV-20406: Rocksdb gets corrupted on OOM during ALTER
      • MDEV-18204: RocksDB failed to start due to problems validating data dictionary against .frm files
      • MDEV-29749: RocksDB does not refuse nopad collation in time, leaves corrupt schema

      The difference here is:

      • no kernel OOM or OOM-killer evidence
      • no explicit crash signature in the MariaDB error log
      • the error is Corruption: truncated record body rather than the exact messages in those reports
      • the server continues startup without MyRocks, then emits many .frm metadata errors

            People

              Assignee: Unassigned
              Reporter: Aurélien LEQUOY (Aurelien_LEQUOY)