[MDEV-16787] optimistic parallel replication fails on spider Created: 2018-07-20  Updated: 2020-08-25  Resolved: 2019-02-05

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - Spider
Affects Version/s: 10.3
Fix Version/s: 10.4.3, 10.3.13

Type: Bug Priority: Critical
Reporter: Mattias Jonsson Assignee: Kentoku Shiba (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

linux



 Description   

Parallel optimitic replication to a spider engine can fail on timeout.

(XA is enabled on the spider head node, spider_internal_xa)

When two transactions from the binary log (1 and 2) are running in parallel on tables using spider engine, the second transaction may succeed before the first, resulting in being in a XA prepared state holding locks that the first transaction needs. But since the second transaction cannot commit until after the first is committed and the first cannot get locks on the row (innodb row lock on the datanode) to complete its changes the replication fails with error 1205 'Lock wait timeout exceeded; try restarting transaction'.

Since there is an issue the first transaction (lock wait timeout), why is not the second transaction rolled back before retrying?

Binary log consists of:

# at 1590965
BEGIN
/*!*/;
# at 1591178
### UPDATE `db`.`t1`
### WHERE
###   @1=313
###   ...
### SET
###   @1=313
###   ...
# Number of rows: 1
# at 1592269
### UPDATE `db`.`t2`
### WHERE
###   @1=313
###   ...
### SET
###   @1=313
###   ...
# Number of rows: 1
# at 1592980
### UPDATE `db`.`t3`
### WHERE
###   @1=313
###   @2=7502
###   ...
### SET
###   @1=313
###   @2=7502
###   ...
# Number of rows: 1
# at 1594723
### UPDATE `db`.`t3`
### WHERE
###   @1=313
###   @2=7502
###   ...
### SET
###   @1=313
###   @2=7502
###   ...
# Number of rows: 1
# at 1596345
COMMIT/*!*/;
# at 1596376
BEGIN
/*!*/;
# at 1596537
### UPDATE `db`.`t2`
### WHERE
###   @1=313
### ...
### SET
###   @1=313
### ...
# Number of rows: 1
# at 1597073
COMMIT/*!*/;

General log including spider entries (478743 matches the first transaction in the binary log, 478744 matches the second/later transaction in the binary log):

180717 17:50:12	478743 Query	BEGIN
		478744 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478744 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653138,0x62353663393630,1
		478744 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478744 Query	mysql datanode4.example.com update `db`.`t2` set `id` = 1017,...,`t1_id` = 313,... where `id` = 1017 and `t1_id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
		478744 Query	COMMIT /* implicit, from Xid_log_event */
		478744 Query	mysql datanode4.example.com xa end 0x3734653138,0x62353663393630,1
		478744 Query	mysql datanode4.example.com xa prepare 0x3734653138,0x62353663393630,1
180717 17:51:03	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:51:54	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:52:45	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:53:36	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:54:27	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:55:18	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:56:09	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:57:00	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:57:51	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:58:42	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478743 Query	BEGIN
		478743 Query	mysql datanode4.example.com set session transaction isolation level repeatable read;set session autocommit = 1;xa start 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com select `id`, ... from `db`.`t1` where `id` = 313 for update
		478743 Query	mysql datanode4.example.com update `db`.`t1` set `id` = 313 ... where `id` = 313 and ... limit 1
		478743 Query	mysql datanode4.example.com select `id`,...,`t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
180717 17:59:33	478743 Query	mysql datanode4.example.com xa end 0x3734653137,0x62353663393630,1
		478743 Query	mysql datanode4.example.com xa rollback 0x3734653137,0x62353663393630,1
		478744 Query	mysql datanode4.example.com xa rollback 0x3734653138,0x62353663393630,1

error log on spider head node (slave):

20180717 17:51:03 [SEND SPIDER SQL] from 478743 to [datanode4.example.com] 206940404:  sql: select `id``t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
20180717 17:51:03 [ERROR SPIDER RESULT] to 478743: 1205 Lock wait timeout exceeded; try restarting transaction
...
20180717 17:51:54 [SEND SPIDER SQL] from 478743 to [datanode4.example.com] 206940581:  sql: select `id``t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
20180717 17:51:54 [ERROR SPIDER RESULT] to 478743: 1205 Lock wait timeout exceeded; try restarting transaction
...
20180717 17:59:33 [SEND SPIDER SQL] from 478743 to [datanode4.example.com] 206942165:  sql: select `id``t1_id`,... from `db`.`t2` where `t1_id` = 313 for update
20180717 17:59:33 [ERROR SPIDER RESULT] to 478743: 1205 Lock wait timeout exceeded; try restarting transaction
2018-07-17 17:59:33 478743 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
2018-07-17 17:59:33 478743 [ERROR] Slave SQL: Lock wait timeout exceeded; try restarting transaction, Gtid 0-190237024-207448702, Internal MariaDB error code: 1205
2018-07-17 17:59:33 478743 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
2018-07-17 17:59:33 478743 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.003265' position 1590965
2018-07-17 17:59:33 478744 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.003265' position 1590965
2018-07-17 17:59:33 478743 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.003265' position 1590965
2018-07-17 17:59:33 478744 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.003265' position 1590965
2018-07-17 17:59:33 478743 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.003265' position 1590965
2018-07-17 17:59:33 478744 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.003265' position 1590965
2018-07-17 17:59:33 478743 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
...
2018-07-17 17:59:33 478744 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478744 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.003265' position 1590
965
2018-07-17 17:59:33 478743 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
...
2018-07-17 17:59:33 478743 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2018-07-17 17:59:33 478743 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.003265' position 1590
965
2018-07-17 17:59:33 478742 [Note] Slave SQL thread exiting, replication stopped in log 'binlog.003265' at position 1590965

Also why so many duplicated lines?



 Comments   
Comment by Jacob Mathew (Inactive) [ 2018-10-01 ]

Reassigning to Sachin to evaluate as a replication bug.

Comment by Mattias Jonsson [ 2018-11-14 ]

Maybe port the facebook mysql-5.6 feature mts_dependency_replication ? Which may help increasing the parallelism on the slave needed to compensate for the added response time of spider engine due to network etc.

Comment by Kristian Nielsen [ 2018-11-25 ]

So I'm not very familiar with Spider (is it a storage engine?)
But maybe the problem here is that spider needs to implement thd_rpl_deadlock_check() similar as done in InnoDB and TokuDB?
This callback is needed from the storage engine so that parallel replication can know about the conflict and roll back the second transaction.
Not sure if distributed nature of Spider will make this hard to implement... alternatively, if thd_rpl_deadlock_check() cannot be supported, then spider transactions will need to be marked as not safe for optimistic parallel replication (there is a bit for this in the GTID event).

Comment by Kentoku Shiba (Inactive) [ 2018-12-12 ]

Normally, it is ok to use read committed for slave SQL threads even if the master is repeatable read. Because a lot of slaves are read only and tables are InnoDB.
So I added a system variable for Spider. Please try to use "spider_slave_trx_isolation=1" on the slave Spider nodes.

  • spider_slave_trx_isolation
    The transaction isolation level when Spider table is used by slave SQL thread.
    -1 : OFF
    0 : READ UNCOMMITTED
    1 : READ COMMITTED
    2 : REPEATABLE READ
    3 : SERIALIZABLE
    The default value is -1

Change for 10.3 83bc6de
Change for 10.4 5e5aecf

Comment by Kentoku Shiba (Inactive) [ 2018-12-24 ]

Sachin, please review this.

Comment by Sachin Setiya (Inactive) [ 2019-02-03 ]

Okay to push

Comment by Kentoku Shiba (Inactive) [ 2019-02-05 ]

Merged to 10.3 and 10.4 tree.

Generated at Thu Feb 08 08:31:33 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.