[MDEV-20406] Rocksdb gets corrupted on OOM during ALTER Created: 2019-08-22  Updated: 2023-06-22

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Affects Version/s: 10.4.7
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Philip orleans Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 3
Labels: upstream
Environment:

Linux


Attachments: Zip Archive rocksdb-log.zip    
Issue Links:
Blocks
is blocked by MDEV-30610 Update RocksDB to the latest upstream... Open

 Description   

I was adding a field on a very large table
ALTER TABLE npadata ADD COLUMN name1 VARCHAR(64) NOT NULL DEFAULT '';
ERROR 2013 (HY000): Lost connection to MySQL server during query
Note: there is plenty of space available in the box

tail /var/lib/mysql/*.err

Version: '10.4.7-MariaDB-1:10.4.7+maria~bionic'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  mariadb.org binary distribution
2019-08-19 15:32:00 419 [Note] Zerofilling moved table:  './mysql/proc'
2019-08-20 14:19:50 901 [Note] mysqld: O_TMPFILE is not supported on /temp (disabling future attempts)
2019-08-22  5:53:01 0 [Note] RocksDB: 2 column families found
2019-08-22  5:53:01 0 [Note] RocksDB: Column Families at start:
2019-08-22  5:53:01 0 [Note]   cf=default
2019-08-22  5:53:01 0 [Note]     write_buffer_size=268435456
2019-08-22  5:53:01 0 [Note]     target_file_size_base=67108864
2019-08-22  5:53:01 0 [Note]   cf=__system__
2019-08-22  5:53:01 0 [Note]     write_buffer_size=268435456
2019-08-22  5:53:01 0 [Note]     target_file_size_base=67108864
2019-08-22  5:53:44 0 [ERROR] RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated header
2019-08-22  5:53:44 0 [ERROR] Plugin 'ROCKSDB' init function returned error.
2019-08-22  5:53:44 0 [ERROR] Plugin 'ROCKSDB' registration as a STORAGE ENGINE failed.
2019-08-22  5:53:44 0 [ERROR] Unknown/unsupported storage engine: rocksdb
2019-08-22  5:53:44 0 [ERROR] Aborting

Note:
If the table is very large, adding a fields takes literally days. But also, which is worse, if you need to add many fields at once, there is no way to do it in one single event. It has to copy the entire table many times. In my case, the table has 1.92BN records. One additional field took 3 days to add. That was las week. Now I need to add 5 fields. This is not acceptable even in case that this corruption can be resolved.

The box

df -H

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       1.1T  649G  425G  61% /
none            504k     0  504k   0% /dev
tmpfs           127G     0  127G   0% /dev/shm
tmpfs           127G  1.0M  127G   1% /run
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs           127G     0  127G   0% /sys/fs/cgroup
tmpfs            26G     0   26G   0% /run/user/0

free -g

              total        used        free      shared  buff/cache   available
Mem:            236          30         196           0           8         203
Swap:             7           4           3

my.cnf

[mysqld]
open_files_limit=65535
log_warnings= 0
datadir=/var/lib/mysql
tmpdir=/temp
slave_load_tmpdir=/data/temp
plugin-load-add=ha_rocksdb.so
port =3306 
table_open_cache=3000
table_open_cache_instances=24
default-storage-engine = rocksdb
skip_innodb=1
max_allowed-packet=1G
large_pages=1
log_error=/var/lib/mysql/mariadb.err
general_log=0
general_log_file=mariadb.log
log_output='FILE'
lower_case_table_names=1
max_heap_table_size=20G
memlock=1
query_cache_size=0
query_cache_type=0
skip_name_resolve=1
sort_buffer_size=8M
symbolic_links=0
thread_cache_size=256
thread_handling=pool-of-threads
thread-pool-size=64
thread_stack=192K
tmp_table_size=64M
general-log=0
general-log-file=queries.log
log-output=file
join_buffer_size=8M
event_scheduler = 1
max_connections=1250
extra_max_connections=10
extra_port=9433
connect_timeout=200
interactive_timeout=3000000
net_read_timeout=2000
net_write_timeout=2000
slave_net_timeout=3600
wait_timeout=6000
rocksdb_commit_in_the_middle=1
rocksdb_allow_concurrent_memtable_write=1
rocksdb_blind_delete_primary_key=1
rocksdb_tmpdir=/temp
rocksdb_bulk_load_size=10000
rocksdb_max_total_wal_size=14G
rocksdb_table_cache_numshardbits=19
rocksdb_default_cf_options=write_buffer_size=256m;target_file_size_base=64m;max_bytes_for_level_base=512m;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=256;level0_stop_writes_trigger=256;max_write_buffer_number=16;compression_per_level=kNoCompression:kNoCompression:kNoCompression:kZlibCompression:kZlibCompression:kZlibCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true
rocksdb_override_cf_options=system={memtable=skip_list:16}
rocksdb_write_disable_wal=0
rocksdb_flush_log_at_trx_commit=1
rocksdb_strict_collation_check=off
sync_binlog=0
join_buffer_size=256K
sort_buffer_size=256K
 ##innodb_spin_wait_delay=96
rocksdb_max_background_jobs=24
rocksdb_compaction_sequential_deletes=199999
rocksdb_compaction_sequential_deletes_window=200000
rocksdb_max_subcompactions=16
rocksdb_compaction_readahead_size=16m
rocksdb_use_direct_reads=ON
rocksdb_use_direct_io_for_flush_and_compaction=ON
rocksdb_max_row_locks=1073741824
rocksdb-table-stats-sampling-pct=15
 
rocksdb_max_open_files=-1
rocksdb_block_size=16384
rocksdb_block_cache_size=62G
 
rocksdb_bytes_per_sync=4194304
rocksdb_wal_bytes_per_sync=4194304
rocksdb_rate_limiter_bytes_per_sec=204857600 
 
rocksdb_compaction_sequential_deletes_count_sd=1
rocksdb_compaction_sequential_deletes=199999



 Comments   
Comment by Roel Van de Paar [ 2021-09-14 ]

philip_38 Hi! Would you have table definition for npadata please? Also, a full error log would be of assistance, if you still have it from 2019. Thanks.

Comment by Roel Van de Paar [ 2021-09-14 ]

Similar outcome (Status Code: 2, Status: Corruption: truncated header) in MDEV-17777 with OOS.
Similar outcome (Status Code: 2, Status: Corruption: truncated header) in https://github.com/facebook/mysql-5.6/issues/814 with OOM (and some extra info!).

Comment by Philip orleans [ 2021-09-14 ]

 CREATE TABLE `npadata` (
  `state` varchar(2) NOT NULL DEFAULT '??',
  `company` varchar(30) DEFAULT NULL,
  `ocn` varchar(4) NOT NULL DEFAULT 'N/A',
  `prefix_type` varchar(10) DEFAULT NULL,
  `ratecenter` varchar(40) DEFAULT NULL,
  `clli` varchar(20) DEFAULT NULL,
  `lata` varchar(5) DEFAULT NULL,
  `country` varchar(2) DEFAULT NULL,
  `wireless` tinyint(1) unsigned NOT NULL DEFAULT 0,
  `did` bigint(20) unsigned NOT NULL,
  `lrn` bigint(20) unsigned DEFAULT NULL,
  `port_type` tinyint(4) unsigned DEFAULT NULL,
  `dnc` tinyint(4) unsigned DEFAULT NULL,
  `reachable` tinyint(4) unsigned DEFAULT NULL,
  `npa` varchar(3) DEFAULT NULL,
  `nxx` varchar(3) DEFAULT NULL,
  `xxxx` varchar(4) DEFAULT NULL,
  `tested` tinyint(1) DEFAULT 0,
  `lastchange` date NOT NULL DEFAULT curdate(),
  `residential` tinyint(4) unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (`did`),
  KEY `IDX_npadata` (`wireless`,`ocn`,`company`),
  KEY `IDX_npadata_state` (`state`,`wireless`,`country`),
  KEY `IDX_npadata_lrn` (`lrn`),
  KEY `IDX_npadata3` (`reachable`,`tested`),
  KEY `IDX_npadata_COMPANY` (`company`,`wireless`),
  KEY `IDX_npadata_dnc` (`dnc`),
  KEY `IDX_npadata_lastchange` (`lastchange`),
  KEY `IDX_npadata_tested` (`tested`),
  KEY `IDX_npadata2` (`npa`,`lrn`),
  KEY `NUK_npadata` (`npa`,`dnc`,`did`,`lrn`,`ocn`,`port_type`),
  KEY `IDX_npadata_residential` (`residential`),
  KEY `IDX_npadata4` (`country`,`npa`),
  KEY `IDX_npadata_country` (`country`,`npa`,`wireless`),
  KEY `IDX_npadata5` (`npa`,`nxx`)
) ENGINE=ROCKSDB DEFAULT CHARSET=latin1 

I don't have the error log but I can give you access to the box, and you can try to replicate it.

Comment by Roel Van de Paar [ 2021-09-15 ]

philip_38 Thank you. No need for box access. Can you please clarify data size (number of table entries/records)? Can you also clarify machine specs at the time of the corruption especially memory and disk size.

Comment by Roel Van de Paar [ 2021-09-15 ]

In terms of avoiding the OOM/OOS and subsequent crash recovery to start with, this is not a bug in MariaDB, and a pure RocksDB configuration issue as far as I can tell.

Preventing this issue would involve things like:

  • Ensuring sufficient memory and disk space is available for the requested operation.
  • Ensuring rocksdb_max_row_locks is set correctly.
  • Ensuring RocksDB is tuned correctly.

For this, I would propose that any person running into this takes a sample of their dataset, for example 10% of the data, and tests the ALTER on a test environment. This way the disk and memory required can be monitored. Whilst multiplying those numbers by 10 may be a bit too simplistic a calculation, it will at least give an idea/an indication of the approximate required memory and disk space.

Comment by Roel Van de Paar [ 2021-09-15 ]

In terms of the failing crash recovery/failed RocksDB open (Status Code: 2, Status: Corruption: truncated header), this is a known Facebook RocksDB issue https://github.com/facebook/mysql-5.6/issues/814 - as such this bug was marked upstream.

Comment by Philip orleans [ 2021-09-15 ]

The table has 2BN records.
I can give you access via SQL
the data is not important.

Comment by Mark Callaghan [ 2021-09-16 ]

1) Thank you for a nice bug report and sorry you encounter this
2) Can you share the RocksDB LOG file? I am rusty but think it is at $mysql-data-dir/.rocksdb/LOG. If the file is too big then the prefix and suffix of it will be useful.
3) How much RAM does your server have? With a 62G block cache and 16 x 256M write buffers, that can be ~66G of RAM. Regardless, corruption after OOM isn't a reasonable outcome.
4) I don't see any problems from the my.cnf except for rocksdb_max_open_files=-1. Although it isn't clear to me that an open file would use too much memory given my.cnf has: cache_index_and_filter_blocks=1
5) If supported, performance should improve by using zstd rather than zlib – less CPU overhead from compression means there will be fewer write stalls

Comment by Yoshinori Matsunobu [ 2021-09-16 ]

> [ERROR] RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated header

This may happen when operating with rocksdb_wal_recovery_mode=1 (kAbsoluteConsistency) and when RocksDB crashes. We recommend using rocksdb_wal_recovery_mode=2 (kPointInTimeRecovery). kPointInTimeRecovery is actually a default in RocksDB and MyRocks (FB MyRocks) also switched default from 1 to 2 before. Could you try setting to 2 if it is 1? 2 (kPointInTimeRecovery) means recovery stops and opens RocksDB when hitting a corrupted wal entry. It starts the instance with losing some data (up to the last valid wal entry), but replication state is consistent, so the instance can be recovered from a primary instance.

Comment by Philip orleans [ 2021-09-16 ]

root@scrubber58:/usr/src# free -g
total used free shared buff/cache available
Mem: 503 100 336 0 67 385
Swap: 7 0 7

the box has 64 cores

df -H
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 1.1T 626G 448G 59% /
none 504k 4.1k 500k 1% /dev
/dev/sda2 53G 21G 29G 42% /usr/lib/modules
none 271G 0 271G 0% /dev/shm
tmpfs 55G 1.1M 55G 1% /run
tmpfs 5.3M 0 5.3M 0% /run/lock
tmpfs 271G 0 271G 0% /sys/fs/cgroup
tmpfs 55G 0 55G 0% /run/user/0

I am uploading the log as a file, it is 97K long

I am trying to stop using zlib, can you please upload the new configuration line that uses zstd instead of zlib?

Comment by Mark Callaghan [ 2021-09-16 ]

A blog post from me with advice on my.cnf for MyRocks is here. You need to determine whether the mysqld (mariad?) binary includes support for ZStd. The linked blog post shows how to do that (grep for "Compression algorithms supported"), and if supported you replace kZlibCompression with the name for ZStd listed in that section.

Comment by Mark Callaghan [ 2021-09-16 ]

Do you know how big the mysqld (mariad) process is when it dies? From the "df -g" output this host has ~500G of RAM and the my.cnf has a 62G block cache and ~4G (16 x 256M) write buffer.

From the LOG you uploaded I see this, so kZSTD can replace kZlibCompression

2021/09/15-22:14:46.852318 7f87d8ed7800 Compression algorithms supported:
2021/09/15-22:14:46.852320 7f87d8ed7800 	kZSTDNotFinalCompression supported: 1
2021/09/15-22:14:46.852322 7f87d8ed7800 	kZSTD supported: 1
2021/09/15-22:14:46.852323 7f87d8ed7800 	kXpressCompression supported: 0
2021/09/15-22:14:46.852325 7f87d8ed7800 	kLZ4HCCompression supported: 1
2021/09/15-22:14:46.852326 7f87d8ed7800 	kLZ4Compression supported: 1
2021/09/15-22:14:46.852327 7f87d8ed7800 	kBZip2Compression supported: 0
2021/09/15-22:14:46.852328 7f87d8ed7800 	kZlibCompression supported: 1
2021/09/15-22:14:46.852335 7f87d8ed7800 	kSnappyCompression supported: 1

The only odd things I see in the RocksDB LOG file are:

2021/09/15-22:14:47.647397 7f87d8ed7800 [version_set.cc:3289] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.

And a huge size difference between L0 and the next level

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      4/0   68.13 MB   1.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
  L6   1689/0   102.60 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Sum   1693/0   102.67 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0

Comment by Philip orleans [ 2021-09-16 ]

This case is from 2019. I have no records of any kind.

Comment by Mark Callaghan [ 2021-09-16 ]

OK, thanks for the report.

Comment by Philip orleans [ 2021-09-17 ]

Question:
to replace zlib for zstd, can do a global "sed" on my.cnf and change
kzlib for kzstd ?
I looked at your blog and couldn't figure it out.

Comment by Mark Callaghan [ 2021-09-17 ]

Something like: sed s/kZlibCompression/kZSTD/g
Although I didn't test that sed command.

Comment by Sergei Golubchik [ 2021-09-27 ]

psergei, shall we switch rocksdb_wal_recovery_mode to be 2 (kPointInTimeRecovery) by default, like the upstream did?

Comment by Sergei Golubchik [ 2021-09-27 ]

also we can force jemalloc or recomment jemalloc, and force transparent_huge_pages=never or recommend it. if it'll make any difference.

Comment by Julien Muchembled [ 2023-06-22 ]

We just got the same issue after a OOM. If I understand correctly, the risk of rocksdb_wal_recovery_mode=2 is only to lose the Durable property of ACID, in case of unreliable hardware: we'd need https://github.com/facebook/rocksdb/issues/6288 to force the user to first solve IO issues before actually dropping committed transactions.

Generated at Thu Feb 08 08:59:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.