[MDEV-21160] Rocksdb gets corrupted and stops the server from running Created: 2019-11-27  Updated: 2019-12-17

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Affects Version/s: 10.4.10
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Philip orleans Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: crash
Environment:

Linux



 Description   

I get this error message very often, on several machines.
The server stops running and waits until a human being erases that file manually.
This is a disaster. Mission-critical machines need to be converted back to InnoDB, which is very inefficient in terms of storage.
The right design is that Rocksdb fixes itself, erases any corrupt file and continues.

[ERROR] RocksDB: The server will exit normally and stop restart attempts. Remove ./#rocksdb/ROCKSDB_CORRUPTED file from data directory and start mysqld manually.



 Comments   
Comment by Sergei Petrunia [ 2019-12-03 ]

> The right design is that Rocksdb fixes itself, erases any corrupt file and continues.

I am looking at the code, and the logic around the ROCKSDB_CORRUPTED file was put there intentionally. Maybe there's some kind of error that requires user intervention?

I see that above that line, you should have got this text:

        "RocksDB: There was a corruption detected in RockDB files. "
        "Check error log emitted earlier for more details.");

Do you have it? Is there anything above that text that would give a clue about why RocksDB stopped?
Another place to check is the $datadir/#rocksdb/LOG file
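
The log check described above can be sketched in Python. This is only an illustrative helper, not part of MariaDB or RocksDB: the function name find_corruption_lines is hypothetical, and matching on the substring "corrupt" is an assumption about how the relevant messages are worded. Pass the paths of the server error log and the $datadir/#rocksdb/LOG file on the command line.

```python
import re
import sys

# Assumption: corruption-related messages contain the substring "corrupt"
# (e.g. "Corruption", "corrupted"); matched case-insensitively.
CORRUPTION_PATTERN = re.compile(r"corrupt", re.IGNORECASE)


def find_corruption_lines(log_path):
    """Return (line_number, text) pairs for lines that mention corruption."""
    hits = []
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with open(log_path, "r", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if CORRUPTION_PATTERN.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python scan_logs.py <error log> <datadir>/#rocksdb/LOG
    for path in sys.argv[1:]:
        for number, text in find_corruption_lines(path):
            print(f"{path}:{number}: {text}")
```

The lines printed just before the first match are usually the interesting ones, since they should show which operation returned the corruption error.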

Comment by Sergei Petrunia [ 2019-12-03 ]

Looking at the code, I see that the ROCKSDB_CORRUPTED file is created when an operation over RocksDB returns a data corruption error. This should not normally happen (e.g. a server process crash or a machine power-off is not expected to cause this).

We need to figure out what is causing the data corruption error.

  • The first step is to check the error log and RocksDB's LOG file.
  • Then, one could use the sst_dump utility to check the data directory for errors.
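
As a rough sketch of the second step, the script below runs sst_dump over every .sst file in the RocksDB directory. Several things here are assumptions rather than facts from this report: that the binary is available on PATH under the name sst_dump, that it is invoked as --file=<path> --command=check --verify_checksum, and that a problem shows up as a non-zero exit code or as "corrupt" in the output. The helper names are hypothetical. It is safest to run this while the server is stopped.

```python
import subprocess
import sys
from pathlib import Path


def find_sst_files(rocksdb_dir):
    """Return the .sst files directly inside the RocksDB directory, sorted."""
    return sorted(Path(rocksdb_dir).glob("*.sst"))


def build_check_command(sst_path, sst_dump="sst_dump"):
    """Build the argument list for checking one file.

    Assumption: the installed sst_dump accepts these options.
    """
    return [sst_dump, f"--file={sst_path}", "--command=check", "--verify_checksum"]


def check_all(rocksdb_dir, sst_dump="sst_dump"):
    """Run the check on every .sst file and return the suspicious ones."""
    suspicious = []
    for sst in find_sst_files(rocksdb_dir):
        result = subprocess.run(
            build_check_command(sst, sst_dump),
            capture_output=True,
            text=True,
        )
        output = result.stdout + result.stderr
        # Assumption: errors appear as a non-zero exit code or in the text.
        if result.returncode != 0 or "corrupt" in output.lower():
            suspicious.append((sst, output.strip()))
    return suspicious


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_sst.py <datadir>/#rocksdb
    for sst, output in check_all(sys.argv[1]):
        print(f"{sst}:\n{output}\n")
```

Any file reported here is a candidate for the source of the corruption error and is worth attaching to the bug report together with the log excerpts.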
Comment by Philip orleans [ 2019-12-17 ]

I just saw this. The issue is that in no event may the server stop and wait
for a user to manually delete the file. If the fix is merely deleting a
file, please have the software or the watchdog process do it. We use
MariaDB for business.


Generated at Thu Feb 08 09:05:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.