[MDEV-16428] Simple concurrent DML on RocksDB tables makes optimistic parallel replication abort Created: 2018-06-07  Updated: 2023-04-27

Status: Confirmed
Project: MariaDB Server
Component/s: Replication, Storage Engine - RocksDB
Affects Version/s: 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-16242 MyRocks: parallel slave on a table wi... Confirmed
relates to MDEV-24401 Deadlocks on Rocksdb when there shoul... Open

 Description   

Note: run the test case below with --mysqld=--slave_parallel_mode=optimistic --mysqld=--slave-parallel-threads=2 --mysqld=--plugin-load-add=ha_rocksdb. It usually fails on the first attempt for me, but it is a race condition, so it can occasionally pass. Run with --repeat if it doesn't fail right away.

--source include/have_binlog_format_row.inc
--source include/master-slave.inc
 
CREATE TABLE t1 (pk INT AUTO_INCREMENT PRIMARY KEY, a INT, KEY(a)) ENGINE=RocksDB;
CREATE TABLE t2 (pk INT AUTO_INCREMENT PRIMARY KEY, b INT) ENGINE=RocksDB;
 
INSERT INTO t1 (pk) VALUES (NULL);
 
--connect (con1,localhost,root,,test)
INSERT INTO t1 (pk) VALUES (NULL);
 
--connection master1
BEGIN;
 
--connection con1
INSERT INTO t1 (pk) VALUES (NULL);
INSERT INTO t1 (pk) VALUES (NULL);
 
--connection master1
INSERT INTO t2 (pk) VALUES (NULL),(NULL);
--send
  UPDATE t1 SET a = 1;
 
--connection master
--send
  DELETE FROM t1;
 
--connection master1
--reap
COMMIT;
 
--connection master
--reap
 
--sync_slave_with_master
 
# Cleanup
--disconnect con1
--connection master
DROP TABLE t1, t2;
--source include/rpl_end.inc

10.2 a31e99a89c

2018-06-08  0:12:38 140524761007872 [Note] Slave SQL thread initialized, starting replication in log 
'master-bin.000001' at position 4, relay log './slave-relay-bin.000001' position: 4
2018-06-08  0:12:49 140524760401664 [ERROR] Slave worker thread retried transaction 10 time(s) in
 vain, giving up. Consider raising the value of the slave_transaction_retries variable.
2018-06-08  0:12:49 140524760401664 [ERROR] Slave SQL: Lock wait timeout exceeded; try 
restarting transaction, Gtid 0-1-7, Internal MariaDB error code: 1205
2018-06-08  0:12:49 140524760401664 [Warning] Slave: Lock wait timeout exceeded; try restarting 
transaction Error_code: 1205
2018-06-08  0:12:49 140524760401664 [ERROR] Error running query, slave SQL thread aborted. 
Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at
 log 'master-bin.000001' position 1553
2018-06-08  0:12:49 140524761007872 [Note] Error reading relay log event: slave SQL thread was killed
2018-06-08  0:12:49 140524760704768 [ERROR] Slave (additional info): Commit failed due to failure
 of an earlier commit on which this one depends Error_code: 1964
2018-06-08  0:12:49 140524760704768 [Warning] Slave: Commit failed due to failure of an earlier 
commit on which this one depends Error_code: 1964
2018-06-08  0:12:49 140524760704768 [ERROR] Error running query, slave SQL thread aborted. 
Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log
 'master-bin.000001' position 1553
2018-06-08  0:12:49 140524761007872 [Note] Slave SQL thread exiting, replication stopped 
in log 'master-bin.000001' at position 1553

Increasing rocksdb_lock_wait_timeout or slave_transaction_retries doesn't help; it just makes the test run longer.
Doesn't fail with InnoDB.



 Comments   
Comment by Sergei Petrunia [ 2018-06-18 ]

Full text of MTR failure: https://gist.github.com/spetrunia/a1f2a41144d0667a55d3afd7ad554cad

Comment by Sergei Petrunia [ 2018-06-18 ]

Re-starting the analysis.

Debugging log:

# Transaction 0-1-7 starts. According to annotate events in the relay log,
# this is INSERT INTO t2 (pk) VALUES (NULL),(NULL) 
AAA 140473353197312 DBImpl::WriteImpl wrote a batch with sequence=20
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7

# According to relay log, 0-1-8 is DELETE FROM t1
AAA 140473352894208 Transaction got snapshot 22 
AAA 140473352894208: ha_rockdb::delete_row gtid is 0-1-8
AAA 140473352894208: ha_rockdb::delete_row gtid is 0-1-8
AAA 140473352894208: ha_rockdb::delete_row gtid is 0-1-8
AAA 140473352894208: ha_rockdb::delete_row gtid is 0-1-8

Then, 0-1-7 is back?

AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7

It re-tries 10 times:

AAA 140473353197312 Transaction got snapshot 22
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
 
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
 
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
 
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
 
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
 
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
 
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
 
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
 
AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7

and then finally fails:

AAA 140473353197312 Transaction got snapshot 22 
AAA 140473353197312: ha_rockdb::write_row gtid is 0-1-7
2018-06-18 22:00:16 140473353197312 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
2018-06-18 22:00:16 140473353197312 [ERROR] Slave SQL: Lock wait timeout exceeded; try restarting transaction, Gtid 0-1-7, Internal MariaDB error code: 1205
2018-06-18 22:00:16 140473353197312 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
2018-06-18 22:00:16 140473353197312 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 1553

Comment by Sergei Petrunia [ 2018-06-19 ]

Yet another attempt to debug.

Binlog events are:

slave-relay-bin.000002  1640    Gtid    1       1382    BEGIN GTID 0-1-6
slave-relay-bin.000002  1682    Annotate_rows   1       1438    INSERT INTO t1 (pk) VALUES (NULL)
slave-relay-bin.000002  1738    Table_map       1       1484    table_id: 30 (test.t1)
slave-relay-bin.000002  1784    Write_rows_v1   1       1522    table_id: 30 flags: STMT_END_F
slave-relay-bin.000002  1822    Xid     1       1553    COMMIT /* xid=140 */
 
slave-relay-bin.000002  1853    Gtid    1       1595    BEGIN GTID 0-1-7
slave-relay-bin.000002  1895    Annotate_rows   1       1658    INSERT INTO t2 (pk) VALUES (NULL),(NULL)
slave-relay-bin.000002  1958    Table_map       1       1704    table_id: 31 (test.t2)
slave-relay-bin.000002  2004    Write_rows_v1   1       1747    table_id: 31 flags: STMT_END_F
slave-relay-bin.000002  2047    Annotate_rows   1       1789    UPDATE t1 SET a = 1
slave-relay-bin.000002  2089    Table_map       1       1835    table_id: 30 (test.t1)
slave-relay-bin.000002  2135    Update_rows_v1  1       1925    table_id: 30 flags: STMT_END_F
slave-relay-bin.000002  2225    Xid     1       1956    COMMIT /* xid=141 */
 
slave-relay-bin.000002  2256    Gtid    1       1998    BEGIN GTID 0-1-8
slave-relay-bin.000002  2298    Annotate_rows   1       2035    DELETE FROM t1
slave-relay-bin.000002  2335    Table_map       1       2081    table_id: 30 (test.t1)
slave-relay-bin.000002  2381    Delete_rows_v1  1       2150    table_id: 30 flags: STMT_END_F
slave-relay-bin.000002  2450    Xid     1       2181    COMMIT /* xid=143 */

Now, reading the debug log.

Thread *04 executes the event with GTID 0-1-6:

AAA 140154515343104: ha_rockdb::write_row gtid is 0-1-6
AAA 140154515343104 Transaction got snapshot 19 
AAA 140154515343104: ha_rockdb::write_row retval2=0
AAA 140154515343104: rocksdb_prepare() OK
AAA 140154515343104 DBImpl::WriteImpl wrote a batch with sequence=20

Thread *04 starts to execute 0-1-7. It writes the first row into t2:

AAA 140154515343104 Transaction got snapshot 22 
AAA 140154515343104: ha_rockdb::write_row gtid is 0-1-7
AAA 140154515343104 Transaction got snapshot 22 
AAA 140154515343104: ha_rockdb::write_row retval2=0

Thread *00 starts to execute something (before this, it ran 0-1-3). We'll see a bit later that it's 0-1-8:

AAA 140154515040000 Transaction got snapshot 22

*04 continues, starts to write second row into t2:

AAA 140154515343104: ha_rockdb::write_row gtid is 0-1-7

Now we can see that thread *00 is executing 0-1-8. It deletes the first row of t1. Note that this happens while 0-1-7 is still executing; this is how optimistic parallel replication works.

AAA 140154515040000: ha_rockdb::delete_row gtid is 0-1-8
AAA 140154515040000: ha_rockdb::delete_row OK

Thread *04 finishes writing the second row into t2:

AAA 140154515343104: ha_rockdb::write_row retval2=0

Thread *00 continues to delete rows from t1, and reaches the prepare stage:

AAA 140154515040000: ha_rockdb::delete_row gtid is 0-1-8
AAA 140154515040000: ha_rockdb::delete_row OK
AAA 140154515040000: ha_rockdb::delete_row gtid is 0-1-8
AAA 140154515040000: ha_rockdb::delete_row OK
AAA 140154515040000: ha_rockdb::delete_row gtid is 0-1-8
AAA 140154515040000: ha_rockdb::delete_row OK
AAA 140154515040000: rocksdb_prepare() OK

Now, thread *04 is unable to get a lock on the row of t1 that it needs to modify
(error 146 is HA_ERR_LOCK_WAIT_TIMEOUT):

AAA                  Update_rows_log_event::do_exec_row: find_row error 146
AAA                  apply_event_and_update_pos_apply: failed!
AAA 140154515343104: retry_event_group() starting retry
AAA 140154515343104: ha_rockdb::write_row gtid is 0-1-7
AAA 140154515343104 Transaction got snapshot 22

It starts to retry, but that will not succeed, as *00 is holding the locks on all rows in t1.

Re-reading the text at https://mariadb.com/kb/en/library/parallel-replication/:

Any transactional DML (INSERT/UPDATE/DELETE) is allowed to run in parallel, up to the limit of @@slave_domain_parallel_threads. This may cause conflicts on the slave, eg. if two transactions try to modify the same row. Any such conflict is detected, and the latter of the two transactions is rolled back, allowing the former to proceed. The latter transaction is then re-tried once the former has completed.
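
The failure above is what happens when that conflict detection is missing: the engine times out instead of reporting the wait, so the earlier transaction can only retry against locks that will never be released. A minimal Python sketch of the livelock (a toy model with names of my own choosing, not server code):

```python
# Toy model of the livelock: 0-1-7 (earlier in commit order) needs a t1 row
# lock held by 0-1-8 (later in commit order). In-order commit means 0-1-8
# may not commit before 0-1-7, and because the engine never reports the
# wait to the SQL layer, nobody rolls 0-1-8 back -- 0-1-7 just times out
# and retries against the same held locks.

def replay_without_wait_report(max_retries=10):
    t8_holds_t1_locks = True          # DELETE FROM t1 already locked every row
    for attempt in range(1, max_retries + 1):
        if t8_holds_t1_locks:
            # UPDATE t1 SET a = 1 gets HA_ERR_LOCK_WAIT_TIMEOUT (146),
            # and retry_event_group() starts another attempt.
            continue
        return "0-1-7 committed on attempt %d" % attempt
    return "ERROR 1205 after %d retries in vain" % max_retries

print(replay_without_wait_report())
```

Raising slave_transaction_retries only increases max_retries in this model, which matches the observation in the description that it just makes the test run longer.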

Comment by Sergei Petrunia [ 2018-06-20 ]

There is this function, thd_rpl_deadlock_check

../storage/innobase/lock/lock0lock.cc|302| /** Set if thd_rpl_deadlock_check() should be called for waits. */
../storage/innobase/lock/lock0lock.cc|7517| thd_rpl_deadlock_check(m_start->mysql_thd,

which a storage engine needs to call to inform the SQL layer that transaction X is about to wait on transaction Y.
(See also thd_need_wait_reports, although that one is not mandatory.)
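
A toy Python lock manager (a sketch with hypothetical names, not RocksDB or server code) shows where such a callback would hook in and what the SQL layer's policy could look like:

```python
# Sketch of the hook point: every time a transaction is about to block on a
# row lock, the engine reports (waiter, holder) to the SQL layer via a
# thd_rpl_deadlock_check-style callback, which may decide the holder must
# be killed so that the waiter (earlier in binlog commit order) can proceed.

class LockManager:
    def __init__(self, deadlock_check):
        self.owners = {}                  # row key -> owning txn id
        self.deadlock_check = deadlock_check
        self.killed = set()

    def lock(self, txn, key):
        holder = self.owners.get(key)
        if holder is not None and holder != txn:
            # About to wait: report the wait instead of silently timing out.
            if self.deadlock_check(waiter=txn, holder=holder):
                self.killed.add(holder)   # SQL layer asked us to kill holder
                self.owners[key] = txn    # holder rolls back; waiter gets lock
                return "granted-after-kill"
            return "must-wait"
        self.owners[key] = txn
        return "granted"

# SQL-layer policy: a replication worker later in commit order must yield
# to an earlier one (here txn ids double as binlog commit order).
def rpl_deadlock_check(waiter, holder):
    return waiter < holder

lm = LockManager(rpl_deadlock_check)
lm.lock(8, "t1.pk=1")                    # 0-1-8 takes the row lock first
print(lm.lock(7, "t1.pk=1"))             # 0-1-7 reports the wait; 8 is killed
```

With this policy in place, 0-1-8 would be rolled back and retried after 0-1-7 commits, which is exactly the behavior the parallel-replication KB text describes.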

Comment by Sergei Petrunia [ 2018-06-25 ]

DRAFT Patch: https://github.com/MariaDB/server/commits/bb-10.2-mdev16428
It includes RocksDB part: https://github.com/spetrunia/rocksdb/tree/mdev-16428

Comment by Sergei Petrunia [ 2018-06-25 ]

The thd_rpl_deadlock_check() function has a comment describing why the callback is needed and how it works: https://github.com/MariaDB/server/blob/10.3/sql/sql_class.cc#L4925

Comment by Sergei Petrunia [ 2018-07-02 ]

The test tree is at: https://github.com/MariaDB/server/commits/bb-10.2-mdev16428 . Note that it uses a new submodule: rocksdb now refers to https://github.com/spetrunia/rocksdb/tree/mdev-16428 .

Comment by Andrei Elkin [ 2020-10-01 ]

The suggestion from my earlier comment, to check the table type and then degrade to conservative mode, still stands. If psergey does not have a better idea, let's go with that.
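
As a sketch of how that suggestion reads to me (hypothetical names and policy, not actual server code), the degrade-to-conservative check might look like:

```python
# Fall back from optimistic to conservative parallel replication when a
# transaction touches an engine that cannot report lock waits to the SQL
# layer. ENGINES_REPORTING_WAITS and pick_parallel_mode are my own names;
# the membership of the set is an assumption for illustration.

ENGINES_REPORTING_WAITS = {"InnoDB"}     # engines calling thd_rpl_deadlock_check

def pick_parallel_mode(table_engines, configured_mode="optimistic"):
    if configured_mode == "optimistic" and \
       any(e not in ENGINES_REPORTING_WAITS for e in table_engines):
        # Only run in parallel where the binlog group commit proves safety.
        return "conservative"
    return configured_mode

print(pick_parallel_mode(["InnoDB"]))             # optimistic
print(pick_parallel_mode(["ROCKSDB", "InnoDB"]))  # conservative
```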

Generated at Thu Feb 08 08:28:51 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.