[MDEV-19395] Server fails to crash recover after being kill -9'd with RocksDB and binlog Created: 2019-05-05  Updated: 2019-06-21  Resolved: 2019-06-21

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Affects Version/s: 10.2.21, 10.3.13
Fix Version/s: 10.2.26, 10.3.17, 10.3.18

Type: Bug Priority: Major
Reporter: Anton Polyakov Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
is blocked by MDEV-19397 rocksdb is not crash safe with binlog on Closed
is blocked by MDEV-19795 Merge upstream MyRocks: June 2019 Closed
Duplicate
is duplicated by MDEV-18246 MyRocks ColumnFamilySet(): Assertion ... Closed

 Description   

Steps to reproduce:

  1. Setup vanilla 10.2.21 and run it with bin-log and RocksDB enabled, i.e add to my.cnf

    [mysqld]
    plugin-load-add = 'ha_rocksdb'
    log-bin = '/rdsdbdata/log/binlog/mysql-bin-changelog'
    

  2. create database benchdb
  3. start sysbench
    1. first prepare

       sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua --mysql-port=3306 --db-driver=mysql --mysql-table-engine=rocksdb --oltp-tables-count=10 --oltp-table-size=250000 --num-threads=250 --mysql-user=root --mysql-db=benchdb  --mysql-host=127.0.0.1 run
      

    2. then load

      sysbench --db-driver=mysql --mysql-user=root --mysql-db=benchdb --mysql-table-engine=rocksdb --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua --mysql-host=127.0.0.1 --mysql-port=3306 --oltp-tables-count=10 --oltp-table-size=25000 --max-requests=0 --max-time=2000 --forced-shutdown --report-interval=1 --oltp_simple_ranges=0 --oltp-distinct-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-point-selects=0 --rand-type=uniform --num-threads=20  --oltp-skip-trx=0  --report-interval=10 run
      

  4. wait 10-20 sec
  5. kill -9 the server:

    pkill -9 mysqld
    

  6. try starting the server again

Actual result:
Server won't come up failing at RocksDB crash recovery. I observed at least 3 different error patterns in mysql error logs:

pthread lock: Invalid argument
190430 11:10:35 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.3.13-MariaDB-log
key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=0
max_threads=1259
thread_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2945179 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x40000
/rdsdbbin/mysql/bin/mysqld(my_print_stacktrace+0x29)[0x55e144b2c8c9]
/rdsdbbin/mysql/bin/mysqld(handle_fatal_signal+0x57d)[0x55e1445f407d]
sigaction.c:0(__restore_rt)[0x2b500fc59100]
:0(__GI_raise)[0x2b500fe9b5f7]
:0(__GI_abort)[0x2b500fe9cce8]
/rdsdbbin/mariadb-10.3.13.R1/lib/plugin/ha_rocksdb.so(+0x186bf9)[0x2b5012986bf9]
/rdsdbbin/mariadb-10.3.13.R1/lib/plugin/ha_rocksdb.so(+0x4d4d85)[0x2b5012cd4d85]
/rdsdbbin/mariadb-10.3.13.R1/lib/plugin/ha_rocksdb.so(_ZN7rocksdb18GenericRateLimiterD1Ev+0x2b)[0x2b5012c7df8b]
/rdsdbbin/mariadb-10.3.13.R1/lib/plugin/ha_rocksdb.so(_ZN7rocksdb18GenericRateLimiterD0Ev+0x9)[0x2b5012c7e189]
/rdsdbbin/mariadb-10.3.13.R1/lib/plugin/ha_rocksdb.so(_ZN7rocksdb6DBImplD1Ev+0x460)[0x2b5012b03e20]
/rdsdbbin/mariadb-10.3.13.R1/lib/plugin/ha_rocksdb.so(_ZN7rocksdb6DBImplD0Ev+0x9)[0x2b5012b041e9]
/rdsdbbin/mariadb-10.3.13.R1/lib/plugin/ha_rocksdb.so(_ZN7rocksdb13TransactionDB4OpenERKNS_9DBOptionsERKNS_20TransactionDBOptionsERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_22ColumnFamilyDescriptorESaISG_EEPSF_IPNS_18ColumnFamilyHandleESaISM_EEPPS0_+0xfd8)[0x2b5012ca93b8]
/rdsdbbin/mariadb-10.3.13.R1/lib/plugin/ha_rocksdb.so(+0x29824d)[0x2b5012a9824d]
/rdsdbbin/mysql/bin/mysqld(_Z24ha_initialize_handlertonP13st_plugin_int+0x67)[0x55e1445f64f7]
sql/handler.cc:523(ha_initialize_handlerton(st_plugin_int*))[0x55e144437beb]
sql/sql_plugin.cc:1432(plugin_initialize)[0x55e144438d3a]
sql/sql_plugin.cc:1714(plugin_init(int*, char**, int))[0x55e144374081]
sql/mysqld.cc:5385(init_server_components)[0x55e14437a0e6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b500fe87b15]
/rdsdbbin/mysql/bin/mysqld(+0x5b5545)[0x55e14436d545]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

or

2019-05-04 21:01:06 140088851875648 [Note] RocksDB: Column Families at start:
2019-05-04 21:01:06 140088851875648 [Note]   cf=default
2019-05-04 21:01:06 140088851875648 [Note]     write_buffer_size=67108864
2019-05-04 21:01:06 140088851875648 [Note]     target_file_size_base=67108864
2019-05-04 21:01:06 140088851875648 [Note]   cf=__system__
2019-05-04 21:01:06 140088851875648 [Note]     write_buffer_size=67108864
2019-05-04 21:01:06 140088851875648 [Note]     target_file_size_base=67108864
190504 21:01:09 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.2.21-MariaDB-log
key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=0
max_threads=626
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1471373 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x40000
addr2line: 'rdsdbbin/mariadb-10.2.21.R1/bin/mysqld': No such file
rdsdbbin/mariadb-10.2.21.R1/bin/mysqld(my_print_stacktrace+0x29)[0x55e5e2c822c9]
rdsdbbin/mariadb-10.2.21.R1/bin/mysqld(handle_fatal_signal+0x57b)[0x55e5e272c28b]
sigaction.c:0(__restore_rt)[0x7f68f927b5d0]
[0x55e5e4ec7a9e]

2019-05-01  6:05:26 0 [Note]     write_buffer_size=67108864
2019-05-01  6:05:26 0 [Note]     target_file_size_base=67108864
2019-05-01  6:05:26 0 [Note]   cf=__system__
2019-05-01  6:05:26 0 [Note]     write_buffer_size=67108864
2019-05-01  6:05:26 0 [Note]     target_file_size_base=67108864
pthread lock: Invalid argument
190501  6:05:29 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.3.13-MariaDB-log
key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=0
max_threads=1259
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2945179 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x40000
/rdsdbbin/mysql/bin/mysqld(my_print_stacktrace+0x29)[0x55b6f333f8c9]
mysys/stacktrace.c:270(my_print_stacktrace)[0x55b6f2e0707d]
sigaction.c:0(__restore_rt)[0x2b1df6a5c100]
:0(__GI_raise)[0x2b1df6c9e5f7]
:0(__GI_abort)[0x2b1df6c9fce8]
port_posix.cc:0(rocksdb::port::PthreadCall(char const*, int) [clone .part.1])[0x2b1df9986bf9]
port/port_posix.cc:65(rocksdb::port::Mutex::Lock())[0x2b1df9cd4d85]
bits/stl_deque.h:356(std::_Deque_iterator<rocksdb::GenericRateLimiter::Req*, rocksdb::GenericRateLimiter::Req*&, rocksdb::GenericRateLimiter::Req**>::difference_type std::operator-<rocksdb::GenericRateLimiter::Req*, rocksdb::GenericRateL
imiter::Req*&, rocksdb::GenericRateLimiter::Req**>(std::_Deque_iterator<rocksdb::GenericRateLimiter::Req*, rocksdb::GenericRateLimiter::Req*&, rocksdb::GenericRateLimiter::Req**> const&, std::_Deque_iterator<rocksdb::GenericRateLimiter::
Req*, rocksdb::GenericRateLimiter::Reutil/rate_limiter.cc:90(rocksdb::GenericRateLimiter::~GenericRateLimiter())[0x2b1df9c7e189]
bits/unique_ptr.h:267(_ZNSt10unique_ptrIN7rocksdb11RateLimiterESt14default_deleteIS1_EED4Ev)[0x2b1df9b03e20]
db/db_impl.cc:460(rocksdb::DBImpl::~DBImpl())[0x2b1df9b041e9]
transactions/pessimistic_transaction_db.cc:219(rocksdb::TransactionDB::Open(rocksdb::DBOptions const&, rocksdb::TransactionDBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vec
tor<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::TransactionDB**))[0x2b1df9ca93b8]
rocksdb/ha_rocksdb.cc:5130(rocksdb_init_func)[0x2b1df9a9824d]
/rdsdbbin/mysql/bin/mysqld(_Z24ha_initialize_handlertonP13st_plugin_int+0x67)[0x55b6f2e094f7]
sql/handler.cc:523(ha_initialize_handlerton(st_plugin_int*))[0x55b6f2c4abeb]
sql/sql_plugin.cc:1432(plugin_initialize)[0x55b6f2c4bd3a]
sql/sql_plugin.cc:1714(plugin_init(int*, char**, int))[0x55b6f2b87081]
sql/mysqld.cc:5385(init_server_components)[0x55b6f2b8d0e6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1df6c8ab15]
/rdsdbbin/mysql/bin/mysqld(+0x5b5545)[0x55b6f2b80545]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
2019-05-01  6:06:42 0 [Note] RocksDB: 2 column families found
2019-05-01  6:06:42 0 [Note] RocksDB: Column Families at start:
2019-05-01  6:06:42 0 [Note]   cf=default
2019-05-01  6:06:42 0 [Note]     write_buffer_size=67108864
2019-05-01  6:06:42 0 [Note]     target_file_size_base=67108864
2019-05-01  6:06:42 0 [Note]   cf=__system__
2019-05-01  6:06:42 0 [Note]     write_buffer_size=67108864
2019-05-01  6:06:42 0 [Note]     target_file_size_base=67108864
pthread lock: Invalid argument

Removing "log-bin" from my.cnf makes issue go away. Verified on both 10.2.21 and 10.3.13



 Comments   
Comment by Sergei Petrunia [ 2019-06-03 ]

Results so far:

It's easily reproducible on my machine

The crash (assert) occurs when RocksDB tries to de-initialize itself after a
failed recovery. I think this is a bug in RocksDB, it should not hit an assert
in this scenario. But the fix that we need is that recovery should not fail.

The recovery failure happens in PessimisticTransactionDB::Initialize, in the

for (auto it = rtrxs.begin(); it != rtrxs.end()

loop.
We get Status::kTimedOut when we try to re-build one of the transactions we are
recovering.

The mentioned loop basically does this:

for each transaction we are recovering {
  create a transaction real_trx;
  for each change made by the original transaction {
    real_trx->do_the_same_change(); // Put, Delete, Merge, etc
  }
}

We fail in "real_trx->do_the_same_change" step, when we try to apply a Merge
operation on table with index_num=9 (See Rdb_key_def::AUTO_INC=9).

Comment by Sergei Petrunia [ 2019-06-03 ]

Fairly easy to reproduce with just two transactions:

  • Start two transactions
  • run an INSERT INTO table_with_auto_increment
  • Stop right after m_rocksdb_tx->Prepare() call
  • Call rdb->FlushWAL(true)
  • Crash the server

On recovery, both transactions will try to apply the merge operation to the AUTO_INC key, and will hit an error. Stack trace:

(gdb) wher
  #0  rocksdb::Status::operator= (this=0x7fffffffbf10, s=...) at /home/psergey/dev-git/10.2/storage/rocksdb/rocksdb/include/rocksdb/status.h:333
  #1  0x00007fffe3789015 in rocksdb::WriteBatch::Iterate (this=0x555557d8a140, handler=0x7fffffffc090) at /home/psergey/dev-git/10.2/storage/rocksdb/rocksdb/db/write_batch.cc:485
  #2  0x00007fffe3a48e4b in rocksdb::TransactionBaseImpl::RebuildFromWriteBatch (this=0x555557e258a0, src_batch=0x555557d8a140) at /home/psergey/dev-git/10.2/storage/rocksdb/rocksdb/utilities/transactions/transaction_base.cc:728
  #3  0x00007fffe391fecc in rocksdb::PessimisticTransactionDB::Initialize (this=0x555557df84e0, compaction_enabled_cf_indices=std::vector of length 0, capacity 0, handles=std::vector of length 2, capacity 2 = {...}) at /home/psergey/dev-git/10.2/storage/rocksdb/rocksdb/utilities/transactions/pessimistic_transaction_db.cc:148
  #4  0x00007fffe39209c7 in rocksdb::TransactionDB::WrapDB (db=0x555557cf5600, txn_db_options=..., compaction_enabled_cf_indices=std::vector of length 0, capacity 0, handles=std::vector of length 2, capacity 2 = {...}, dbptr=0x7fffe3fd6060 <myrocks::rdb>) at /home/psergey/dev-git/10.2/storage/rocksdb/rocksdb/utilities/transactions/pessimistic_transaction_db.cc:288
  #5  0x00007fffe39205ea in rocksdb::TransactionDB::Open (db_options=..., txn_db_options=..., dbname="./#rocksdb", column_families=std::vector of length 2, capacity 2 = {...}, handles=0x7fffffffc9b0, dbptr=0x7fffe3fd6060 <myrocks::rdb>) at /home/psergey/dev-git/10.2/storage/rocksdb/rocksdb/utilities/transactions/pessimistic_transaction_db.cc:231
  #6  0x00007fffe353d48a in myrocks::rocksdb_init_func (p=0x555557b24d60) at /home/psergey/dev-git/10.2/storage/rocksdb/ha_rocksdb.cc:5130

Comment by Sergei Petrunia [ 2019-06-03 ]

The upstream doesn't hit this, because the have this patch:

https://github.com/facebook/rocksdb/commit/3f5282268fc81200c305eff974f4bf912cd0de2a

it allows a transaction to disable concurrency control . rocksdb::PessimisticTransaction::TryLock has this at the start :

  Status s;
  if (UNLIKELY(skip_concurrency_control_)) {
    return s;
  }

and this is how it avoids a lock conflict

Comment by Sergei Petrunia [ 2019-06-03 ]

... and that patch was fixing exactly the problem we are having:

    // This would help avoiding deadlock for keys that although exist in the WAL
    // did not go through concurrency control. This includes the merge that
    // MyRocks uses for auto-inc columns. It is safe to do so, since (i) if
    // there is a conflict between the keys of two transactions that must be
    // avoided, it is already avoided by the application, MyRocks, before the
    // restart (ii) application, MyRocks, guarntees to rollback/commit the
    // recovered transactions before new transactions start.
    t_options.skip_concurrency_control = true;

Comment by Sergei Petrunia [ 2019-06-03 ]

Ok, so one way to fix this is to do a merge MyRocks upstream. This will include updating the RocksDB to a newer version which includes the fix. Such merges usually require some manual effort, though.

A possible easier solution is to just update the RocksDB to a newer version. This might not work (RocksDB API is not 100% stable), but typically it works.

Comment by Sergei Petrunia [ 2019-06-21 ]

I've merged upstream MyRocks and RocksDB into MariaDB. The merge commit is:
https://github.com/MariaDB/server/commit/821b866b5502511fe6fcba9503c5eb5790c393a9
at the moment it is only in MariaDB-10.2, but it will be merged to 10.3 and 10.4, before their next releases.

Generated at Thu Feb 08 08:51:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.