[MDEV-20406] Rocksdb gets corrupted on OOM during ALTER Created: 2019-08-22 Updated: 2023-06-22 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - RocksDB |
| Affects Version/s: | 10.4.7 |
| Fix Version/s: | 10.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Philip orleans | Assignee: | Sergei Petrunia |
| Resolution: | Unresolved | Votes: | 3 |
| Labels: | upstream | ||
| Environment: |
Linux |
||
| Description |
|
I was adding a field on a very large table
Note: The box
|
| Comments |
| Comment by Roel Van de Paar [ 2021-09-14 ] | |||||||||||||||||||||||||||||||||||||
|
philip_38 Hi! Would you have the table definition for npadata, please? A full error log would also help, if you still have it from 2019. Thanks. | |||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2021-09-14 ] | |||||||||||||||||||||||||||||||||||||
|
A similar outcome (Status Code: 2, Status: Corruption: truncated header) was seen in MDEV-17777 with OOS (out of disk space). | |||||||||||||||||||||||||||||||||||||
| Comment by Philip orleans [ 2021-09-14 ] | |||||||||||||||||||||||||||||||||||||
|
I don't have the error log but I can give you access to the box, and you can try to replicate it. | |||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2021-09-15 ] | |||||||||||||||||||||||||||||||||||||
|
philip_38 Thank you. No need for box access. Can you please clarify the data size (number of table entries/records)? Can you also clarify the machine specs at the time of the corruption, especially memory and disk size? | |||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2021-09-15 ] | |||||||||||||||||||||||||||||||||||||
|
In terms of avoiding the OOM/OOS and the subsequent crash recovery in the first place, this is not a bug in MariaDB but purely a RocksDB configuration issue, as far as I can tell. Preventing this issue would involve things like:
For this, I would propose that any person running into this takes a sample of their dataset, for example 10% of the data, and tests the ALTER on a test environment. This way the disk and memory required can be monitored. Whilst multiplying those numbers by 10 may be a bit too simplistic a calculation, it will at least give an idea/an indication of the approximate required memory and disk space. | |||||||||||||||||||||||||||||||||||||
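The sampling approach above can be sketched as a small calculation. This is only an illustration: the function name, the `safety_factor` parameter, and the sample measurements are hypothetical, and linear scaling is (as noted above) a simplification.

```python
def extrapolate_requirements(sample_fraction, sample_peak_mem_gb,
                             sample_peak_disk_gb, safety_factor=1.5):
    """Linearly scale peak usage measured on a data sample to the full dataset.

    safety_factor is a hypothetical margin added because linear scaling
    is only a rough approximation.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    scale = 1.0 / sample_fraction
    return {
        "mem_gb": sample_peak_mem_gb * scale * safety_factor,
        "disk_gb": sample_peak_disk_gb * scale * safety_factor,
    }


if __name__ == "__main__":
    # Made-up measurements from an ALTER on a 10% sample.
    estimate = extrapolate_requirements(0.10, sample_peak_mem_gb=4,
                                        sample_peak_disk_gb=50)
    print(estimate)
```

The result is an approximate lower bound for planning, not a guarantee that the full ALTER will fit.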
| Comment by Roel Van de Paar [ 2021-09-15 ] | |||||||||||||||||||||||||||||||||||||
|
In terms of the failing crash recovery/failed RocksDB open (Status Code: 2, Status: Corruption: truncated header), this is a known Facebook RocksDB issue https://github.com/facebook/mysql-5.6/issues/814 - as such this bug was marked upstream. | |||||||||||||||||||||||||||||||||||||
| Comment by Philip orleans [ 2021-09-15 ] | |||||||||||||||||||||||||||||||||||||
|
The table has 2 billion records. | |||||||||||||||||||||||||||||||||||||
| Comment by Mark Callaghan [ 2021-09-16 ] | |||||||||||||||||||||||||||||||||||||
|
1) Thank you for a nice bug report, and sorry you encountered this | |||||||||||||||||||||||||||||||||||||
| Comment by Yoshinori Matsunobu [ 2021-09-16 ] | |||||||||||||||||||||||||||||||||||||
|
> [ERROR] RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated header

This may happen when RocksDB crashes while operating with rocksdb_wal_recovery_mode=1 (kAbsoluteConsistency). We recommend using rocksdb_wal_recovery_mode=2 (kPointInTimeRecovery). kPointInTimeRecovery is the default in RocksDB, and MyRocks (FB MyRocks) has also switched its default from 1 to 2. Could you try setting it to 2 if it is currently 1?

With 2 (kPointInTimeRecovery), recovery stops when it hits a corrupted WAL entry and RocksDB is opened. The instance starts with some data lost (it is recovered up to the last valid WAL entry), but the replication state is consistent, so the instance can be recovered from a primary instance. | |||||||||||||||||||||||||||||||||||||
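To check whether a configuration file still sets mode 1, a minimal sketch along these lines may help. The function names and the sample config text are hypothetical, and the parser is deliberately simple: it only looks for a plain `rocksdb_wal_recovery_mode=N` line (underscore or dash spelling) and ignores include directives and prefixes.

```python
import re

# Matches lines such as "rocksdb_wal_recovery_mode=1" or
# "rocksdb-wal-recovery-mode = 2  # comment".
WAL_MODE_RE = re.compile(
    r"^\s*rocksdb[-_]wal[-_]recovery[-_]mode\s*=\s*(\d+)\s*(?:#.*)?$",
    re.MULTILINE,
)


def wal_recovery_mode(cnf_text):
    """Return the last rocksdb_wal_recovery_mode set in cnf_text, or None."""
    matches = WAL_MODE_RE.findall(cnf_text)
    return int(matches[-1]) if matches else None


def uses_absolute_consistency(cnf_text):
    """True if the config explicitly selects mode 1 (kAbsoluteConsistency)."""
    return wal_recovery_mode(cnf_text) == 1


if __name__ == "__main__":
    sample = "[mysqld]\nrocksdb_wal_recovery_mode=1\n"  # illustrative only
    print(uses_absolute_consistency(sample))  # True
```

If the option is absent from the file, the sketch returns `None`; the effective value then depends on the server's built-in default and can be checked with `SHOW GLOBAL VARIABLES LIKE 'rocksdb_wal_recovery_mode'`.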
| Comment by Philip orleans [ 2021-09-16 ] | |||||||||||||||||||||||||||||||||||||
|
root@scrubber58:/usr/src# free -g

the box has 64 cores

df -H

I am uploading the log as a file, it is 97K long.

I am trying to stop using zlib, can you please upload the new configuration line that uses zstd instead of zlib? | |||||||||||||||||||||||||||||||||||||
| Comment by Mark Callaghan [ 2021-09-16 ] | |||||||||||||||||||||||||||||||||||||
|
A blog post from me with advice on my.cnf for MyRocks is here. You need to determine whether the mysqld (mariadbd?) binary includes support for ZStd. The linked blog post shows how to do that (grep for "Compression algorithms supported"). If it is supported, replace kZlibCompression with the name for ZStd listed in that section. | |||||||||||||||||||||||||||||||||||||
| Comment by Mark Callaghan [ 2021-09-16 ] | |||||||||||||||||||||||||||||||||||||
|
Do you know how big the mysqld (mariadbd) process is when it dies? From the "free -g" output this host has ~500G of RAM and the my.cnf has a 62G block cache and ~4G (16 x 256M) write buffer. From the LOG you uploaded I see this, so kZSTD can replace kZlibCompression
The only odd things I see in the RocksDB LOG file are:
And a huge size difference between L0 and the next level
| |||||||||||||||||||||||||||||||||||||
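The block cache and write buffer arithmetic mentioned in this comment can be written out as a quick check. This is a sketch only: the function and parameter names are hypothetical, and it counts just the two consumers named above, not other allocations made by the server process.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def configured_rocksdb_memory(block_cache_bytes, write_buffer_bytes,
                              write_buffer_count):
    """Sum the block cache and the total write buffer budget, in bytes."""
    return block_cache_bytes + write_buffer_bytes * write_buffer_count


if __name__ == "__main__":
    # Figures quoted in the comment: 62G block cache, 16 x 256M write buffers.
    total = configured_rocksdb_memory(62 * GIB, 256 * MIB, 16)
    print(total / GIB)  # 66.0
```

Comparing this figure with the observed process size at the time of the OOM would show how much memory is being used outside these two settings.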
| Comment by Philip orleans [ 2021-09-16 ] | |||||||||||||||||||||||||||||||||||||
|
This case is from 2019. I have no records of any kind. | |||||||||||||||||||||||||||||||||||||
| Comment by Mark Callaghan [ 2021-09-16 ] | |||||||||||||||||||||||||||||||||||||
|
OK, thanks for the report. | |||||||||||||||||||||||||||||||||||||
| Comment by Philip orleans [ 2021-09-17 ] | |||||||||||||||||||||||||||||||||||||
|
Question: | |||||||||||||||||||||||||||||||||||||
| Comment by Mark Callaghan [ 2021-09-17 ] | |||||||||||||||||||||||||||||||||||||
|
Something like: sed s/kZlibCompression/kZSTD/g | |||||||||||||||||||||||||||||||||||||
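The same substitution can be sketched in Python for cases where the option string is handled in a script. The function name is hypothetical, and the option string below is an illustrative example, not the reporter's actual configuration.

```python
def switch_zlib_to_zstd(options):
    """Replace every kZlibCompression token with kZSTD, like the sed command."""
    return options.replace("kZlibCompression", "kZSTD")


if __name__ == "__main__":
    # Illustrative option string only.
    before = ("compression_per_level=kNoCompression:kZlibCompression;"
              "bottommost_compression=kZlibCompression")
    print(switch_zlib_to_zstd(before))
```

As noted earlier in the thread, this only makes sense once the binary has been confirmed to support ZStd.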
| Comment by Sergei Golubchik [ 2021-09-27 ] | |||||||||||||||||||||||||||||||||||||
|
psergei, shall we switch rocksdb_wal_recovery_mode to 2 (kPointInTimeRecovery) by default, as upstream did? | |||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-09-27 ] | |||||||||||||||||||||||||||||||||||||
|
Also, we can force or recommend jemalloc, and force or recommend transparent_huge_pages=never, if it'll make any difference. | |||||||||||||||||||||||||||||||||||||
| Comment by Julien Muchembled [ 2023-06-22 ] | |||||||||||||||||||||||||||||||||||||
|
We just got the same issue after an OOM. If I understand correctly, the only risk of rocksdb_wal_recovery_mode=2 is losing the Durability property of ACID in case of unreliable hardware: we'd need https://github.com/facebook/rocksdb/issues/6288 to force the user to first solve IO issues before actually dropping committed transactions. |