[MDEV-10523] Deadlock between INSERT and DROP on parallel replication Created: 2016-08-09  Updated: 2020-08-25  Resolved: 2018-05-31

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.19
Fix Version/s: 10.0.21

Type: Bug Priority: Critical
Reporter: Hartmut Holzgraefe Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Environment:

Linux



 Description   

An INSERT INTO SELECT, followed by a DROP of the table being selected from, sometimes leads to a deadlock as both queries are executed in parallel on a secondary slave, and both end up waiting forever. The INSERT thread is waiting on metadata lock, while the DROP thread is waiting on prior transaction commit. Replication is using binlog_format=STATEMENT

PROCESSLIST output:

346 system user NULL Connect 847613 Waiting for master to send event NULL 0.000
387 system user some_host Connect 34328 Waiting for table metadata lock INSERT IGNORE INTO some_table select * from other_temp_table 0.000
392 system user NULL Connect 34328 Waiting for prior transaction to commit NULL 0.000
395 system user NULL Connect 34328 Waiting for prior transaction to commit NULL 0.000
402 system user NULL Connect 34328 Waiting for prior transaction to commit NULL 0.000
415 system user some_host Connect 34328 Waiting for prior transaction to commit DROP TABLE IF EXISTS `other_table`  0.000
427 system user NULL Connect 34359 Waiting for room in worker thread event queue NULL 0.000

Related Binlog events (filtered to only show event headers with GTID and the two locking queries):

[...]
#160808 23:59:33 server id 187214013 end_log_pos 11324685 GTID 0-187214013-3384686406 cid=533036010
#160808 23:59:33 server id 187214013 end_log_pos 11325461 GTID 0-187214013-3384686407 cid=533036010
INSERT IGNORE INTO some_table select * from other_table
#160808 23:59:33 server id 187214013 end_log_pos 11325720 GTID 0-187214013-3384686408 cid=533036010
#160808 23:59:33 server id 187214013 end_log_pos 11326182 GTID 0-187214013-3384686409 cid=533036010
#160808 23:59:33 server id 187214013 end_log_pos 11326949 GTID 0-187214013-3384686410 cid=533036010
#160808 23:59:33 server id 187214013 end_log_pos 11327246 GTID 0-187214013-3384686411
DROP TABLE IF EXISTS `other_table`
[...]

@@gtid_slave_pos shows 0-187214013-3384686406, the event right before the INSERT



 Comments   
Comment by Hartmut Holzgraefe [ 2016-08-09 ]

Backtrace of the thread running the INSERT...SELECT:

Thread 5 (Thread 0x7f1cb44f0700 (LWP 23527)):
#0  0x00007f6a5f63ba5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000006a25d2 in MDL_wait::timed_wait(MDL_context_owner*, timespec*, bool, PSI_stage_info_v1 const*) ()
#2  0x00000000006a3640 in MDL_context::acquire_lock(MDL_request*, unsigned long) ()
#3  0x000000000058ebd6 in open_table_get_mdl_lock(THD*, Open_table_context*, MDL_request*, unsigned int, MDL_ticket**) ()
#4  0x00000000005961ec in open_table(THD*, TABLE_LIST*, st_mem_root*, Open_table_context*) ()
#5  0x0000000000597c87 in open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) ()
#6  0x0000000000597ee4 in open_and_lock_tables(THD*, TABLE_LIST*, bool, unsigned int, Prelocking_strategy*) ()
#7  0x00000000005d81a7 in mysql_execute_command(THD*) ()
#8  0x00000000005da5e7 in mysql_parse ()
#9  0x00000000007e7a4a in Query_log_event::do_apply_event(rpl_group_info*, char const*, unsigned int) ()
#10 0x0000000000564dcd in apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*, rpl_parallel_thread*) ()
#11 0x00000000006d51b7 in rpt_handle_event(rpl_parallel_thread::queued_event*, rpl_parallel_thread*) ()
#12 0x00000000006d6998 in retry_event_group(rpl_group_info*, rpl_parallel_thread*, rpl_parallel_thread::queued_event*) ()
#13 0x00000000006d7f88 in handle_rpl_parallel_thread ()
#14 0x0000000000a715fd in pfs_spawn_thread ()
#15 0x00007f6a5f637aa1 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f6a5dd5093d in clone () from /lib64/libc.so.6

Comment by Hartmut Holzgraefe [ 2016-08-09 ]

Backtrace of the thread running the DROP TABLE:

Thread 76 (Thread 0x7f1cb4060700 (LWP 23456)):
#0  0x00007f6a5f63b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005ae663 in wait_for_commit::wait_for_prior_commit2(THD*) ()
#2  0x00000000007d7821 in MYSQL_BIN_LOG::write(Log_event*, char*) ()
#3  0x00000000005adbed in THD::binlog_query(THD::enum_binlog_query_type, char const*, unsigned long, bool, bool, bool, int) ()
#4  0x000000000064c6b0 in mysql_rm_table_no_locks ()
#5  0x000000000064f617 in mysql_rm_table(THD*, TABLE_LIST*, char, char) ()
#6  0x00000000005d8284 in mysql_execute_command(THD*) ()
#7  0x00000000005da5e7 in mysql_parse ()
#8  0x00000000007e7a4a in Query_log_event::do_apply_event(rpl_group_info*, char const*, unsigned int) ()
#9  0x0000000000564dcd in apply_event_and_update_pos(Log_event*, THD*, rpl_group_info*, rpl_parallel_thread*) ()
#10 0x00000000006d7f19 in handle_rpl_parallel_thread ()
#11 0x0000000000a715fd in pfs_spawn_thread ()
#12 0x00007f6a5f637aa1 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f6a5dd5093d in clone () from /lib64/libc.so.6

Comment by Hartmut Holzgraefe [ 2016-08-09 ]

InnoDB transaction entries of both queries:

---TRANSACTION 35769736996, not started
MySQL thread id 8205021, OS thread handle 0x7f1cb4060700, query id 5499216913 Waiting for prior transaction to commit
DROP TABLE IF EXISTS `other table`

---TRANSACTION 35769737366, not started
MySQL thread id 8205092, OS thread handle 0x7f1cb44f0700, query id 5499216915 Waiting for table metadata lock
INSERT IGNORE INTO some_table select * from other_table

Comment by Hartmut Holzgraefe [ 2016-08-10 ]

My current theory about the deadlock, based on seeing retry_event_group() in the INSERT threads stack trace, is that this is caused by the INSERT transaction failing to commit, and so being retried by the related slave applier thread:

  • INSERT INTO ... SELECT gets started, holds MDL (metadata lock)
    on the table selected from
  • DROP event comes up in parallel, but has to wait for the MDL
    (so far, so good)
  • now the replicated INSERT transaction somehow fails ...
  • ... which releases the MDL ...
  • INSERT transaction is retried via retry_event_group() ...
  • ... but now the MDL is already held by the thread trying to
    execute the DROP
  • so the INSERT is blocked by the DROP holding the MDL on the table
  • but at the same time the DROP later on in the mysql_rm_table()
    code flow tries to execute wait_for_prior_commit() ...
  • and this prior commit is that of the INSERT transaction which can't
    proceed as it can't get the MDL

=> classic deadlock

Comment by Hartmut Holzgraefe [ 2016-08-12 ]

Most likely already fixed in 10.0.21 by the fix for MDEV-8302, according to Monty and Kristian:

"I talked with Kristian about this issue and he told me that there has been some recent fixes in parallel replication that may fix this issue. Especially MDEV-8302 for MariaDB 10.0.21.
This is not only about duplicate key, but it's about re-scheduling the parallel threads when it comes to rollback, which is directly related to this issue."

Comment by Elena Stepanova [ 2018-05-31 ]

Closing based on hholzgra's comment above.

Generated at Thu Feb 08 07:42:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.