[MDEV-24622] Replication does not support bulk insert into empty table - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.6
Fix Version/s: 10.6.18, 10.11.8, 11.0.6, 11.1.5
Component/s: Replication, Storage Engine - InnoDB
Labels:
- regression-10.6

Description

~~MDEV-515~~ will replace row-level locking with table-level locking in the special case that an INSERT is performed into an empty table or partition.

This causes a few replication tests to fail. We have adjusted some tests by inserting an extra record into the table, so that the normal row-level locking and undo logging will be used. But, some tests will require deeper studying to assess whether such a work-around is acceptable or whether some improvements to the replication logic will be needed. Some failures are nondeterministic:

multi_source.gtid_ignore_duplicates 'innodb' [ pass ]   5774

multi_source.gtid_ignore_duplicates 'innodb' [ 2 pass ]   4660

multi_source.gtid_ignore_duplicates 'innodb' [ 3 pass ]   5678

multi_source.gtid_ignore_duplicates 'innodb' [ 4 fail ]  Found warnings/errors in server log file!

        Test ended at 2021-01-19 09:33:32

line

2021-01-19  9:33:28 92 [Warning] Slave: Connection was killed Error_code: 1927

2021-01-19  9:33:28 92 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213

2021-01-19  9:33:28 92 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

2021-01-19  9:33:28 89 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213

2021-01-19  9:33:28 89 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213

2021-01-19  9:33:28 89 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

Attachments

Issue Links

is caused by

MDEV-515 innodb bulk insert

Closed

MDEV-33475 --gtid-ignore-duplicate can double-apply event in case of parallel replication retry

Closed

relates to

MDEV-27751 InnoDB: Failing assertion: !cursor->index->is_committed() in row0ins.cc line 221 (10.7+)

Closed

MDEV-33350 Bulk InnoDB LOAD DATA breaks replication upon hitting duplicate key

Confirmed

Activity

Ascending order - Click to sort in descending order

View 5 older comments

Marko Mäkelä added a comment - 2021-01-22 10:54

Based on seppo’s suggestion in ~~MDEV-24623~~, I disabled the ~~MDEV-515~~ logic on replicas (by introducing an accessor thd_is_slave(), which should be removed as part of fixing this).

Marko Mäkelä added a comment - 2021-01-22 10:54 Based on seppo ’s suggestion in MDEV-24623 , I disabled the MDEV-515 logic on replicas (by introducing an accessor thd_is_slave() , which should be removed as part of fixing this).

Kristian Nielsen added a comment - 2024-02-13 20:34

Please remove the check for thd_is_slave(), it is not a correct fix. It is
only hiding the problem, not solving it.

The real issue here is a bug in replication for --gtid-ignore-duplicates
with parallel replication, unrelated to any InnoDB code. If a GTID gets a
temporary error and is retried, another multi-source connection can
errorneously try to apply the same GTID in parallel, resulting in duplicate
apply of the transaction.

The test case multi_source.gtid_ignore_duplicates originally did not get any
deadlock or conflicts, so this error got unnoticed. With ~~MDEV-515~~, the
INSERT into an empty table takes a table lock, which can cause a deadlock
and retry, triggering the replication bug. That is why ~~MDEV-515~~ makes the
test start failing, even though it is unrelated to the root cause of the
bug.

There may be a few other test cases that expect a specific number of retries
to be needed, those might need adjusting for ~~MDEV-515~~, eg. by inserting an
extra row at the start or something. But after ~~MDEV-24818~~, the ~~MDEV-515~~
optimisation is only enabled when unique_checks=0 and foreign_key_checks=0,
so will no longer affect replication tests anyway.

The problem is not longer repeatable using mtr due to ~~MDEV-24818~~. But I
verified that multi_source.gtid_ignore_duplicates fails as described on the
original commit 3cef4f8f0fc88ae5bfae4603d8d600ec84cc70a9 of ~~MDEV-515~~, and
that this patch fixes the test failure:

diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc

index 72347d93ad1..f4c2e8b061b 100644

--- a/sql/rpl_parallel.cc

+++ b/sql/rpl_parallel.cc

@@ -781,7 +781,7 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt,

       DEBUG_SYNC(thd, "rpl_parallel_simulate_wait_at_retry");

});

-  rgi->cleanup_context(thd, 1);

+  rgi->cleanup_context(thd, 1, 1);

   wait_for_pending_deadlock_kill(thd, rgi);

   thd->reset_killed();

   thd->clear_error();

diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc

index 8b61a3a708b..0f50bf9004f 100644

--- a/sql/rpl_rli.cc

+++ b/sql/rpl_rli.cc

@@ -2251,7 +2251,7 @@ delete_or_keep_event_post_apply(rpl_group_info *rgi,

-void rpl_group_info::cleanup_context(THD *thd, bool error)

+void rpl_group_info::cleanup_context(THD *thd, bool error, bool retry)

   DBUG_ENTER("rpl_group_info::cleanup_context");

   DBUG_PRINT("enter", ("error: %d", (int) error));

@@ -2308,7 +2308,7 @@ void rpl_group_info::cleanup_context(THD *thd, bool error)

       Ensure we always release the domain for others to process, when using

       --gtid-ignore-duplicates.

*/

-    if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL)

+    if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL && !retry)

       rpl_global_gtid_slave_state->release_domain_owner(this);

diff --git a/sql/rpl_rli.h b/sql/rpl_rli.h

index e3b59aa615c..9e9d98433fe 100644

--- a/sql/rpl_rli.h

+++ b/sql/rpl_rli.h

@@ -917,7 +917,7 @@ struct rpl_group_info

   void clear_tables_to_lock();

-  void cleanup_context(THD *, bool);

+  void cleanup_context(THD *, bool, bool retry=false);

   void slave_close_thread_tables(THD *);

   void mark_start_commit_no_lock();

   void mark_start_commit();

(But I think maybe a different patch would be a better fix, calling
check_duplicate_gtid() inside retry_event_group(); also a test case is
needed.)

Kristian Nielsen added a comment - 2024-02-13 20:34 Please remove the check for thd_is_slave(), it is not a correct fix. It is only hiding the problem, not solving it. The real issue here is a bug in replication for --gtid-ignore-duplicates with parallel replication, unrelated to any InnoDB code. If a GTID gets a temporary error and is retried, another multi-source connection can errorneously try to apply the same GTID in parallel, resulting in duplicate apply of the transaction. The test case multi_source.gtid_ignore_duplicates originally did not get any deadlock or conflicts, so this error got unnoticed. With MDEV-515 , the INSERT into an empty table takes a table lock, which can cause a deadlock and retry, triggering the replication bug. That is why MDEV-515 makes the test start failing, even though it is unrelated to the root cause of the bug. There may be a few other test cases that expect a specific number of retries to be needed, those might need adjusting for MDEV-515 , eg. by inserting an extra row at the start or something. But after MDEV-24818 , the MDEV-515 optimisation is only enabled when unique_checks=0 and foreign_key_checks=0, so will no longer affect replication tests anyway. The problem is not longer repeatable using mtr due to MDEV-24818 . But I verified that multi_source.gtid_ignore_duplicates fails as described on the original commit 3cef4f8f0fc88ae5bfae4603d8d600ec84cc70a9 of MDEV-515 , and that this patch fixes the test failure: diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc index 72347d93ad1..f4c2e8b061b 100644 --- a/sql/rpl_parallel.cc +++ b/sql/rpl_parallel.cc @@ -781,7 +781,7 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt, DEBUG_SYNC(thd, "rpl_parallel_simulate_wait_at_retry"); }); - rgi->cleanup_context(thd, 1); + rgi->cleanup_context(thd, 1, 1); wait_for_pending_deadlock_kill(thd, rgi); thd->reset_killed(); thd->clear_error(); diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc index 8b61a3a708b..0f50bf9004f 100644 --- a/sql/rpl_rli.cc +++ b/sql/rpl_rli.cc @@ -2251,7 +2251,7 @@ delete_or_keep_event_post_apply(rpl_group_info *rgi, } -void rpl_group_info::cleanup_context(THD *thd, bool error) +void rpl_group_info::cleanup_context(THD *thd, bool error, bool retry) { DBUG_ENTER("rpl_group_info::cleanup_context"); DBUG_PRINT("enter", ("error: %d", (int) error)); @@ -2308,7 +2308,7 @@ void rpl_group_info::cleanup_context(THD *thd, bool error) Ensure we always release the domain for others to process, when using --gtid-ignore-duplicates. */ - if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL) + if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL && !retry) rpl_global_gtid_slave_state->release_domain_owner(this); } diff --git a/sql/rpl_rli.h b/sql/rpl_rli.h index e3b59aa615c..9e9d98433fe 100644 --- a/sql/rpl_rli.h +++ b/sql/rpl_rli.h @@ -917,7 +917,7 @@ struct rpl_group_info } void clear_tables_to_lock(); - void cleanup_context(THD *, bool); + void cleanup_context(THD *, bool, bool retry=false); void slave_close_thread_tables(THD *); void mark_start_commit_no_lock(); void mark_start_commit(); (But I think maybe a different patch would be a better fix, calling check_duplicate_gtid() inside retry_event_group(); also a test case is needed.)

Kristian Nielsen added a comment - 2024-02-16 11:12

I have filed a different bug ~~MDEV-33475~~ for the real issue, which needs to be fixed in 10.4.
This bug then is only about removing the thd_is_slave() check in 10.6.

Kristian Nielsen added a comment - 2024-02-16 11:12 I have filed a different bug MDEV-33475 for the real issue, which needs to be fixed in 10.4. This bug then is only about removing the thd_is_slave() check in 10.6.

Kristian Nielsen added a comment - 2024-03-08 21:24

I have made a fix for the underlying root cause, ~~MDEV-33475~~.

I can take this and remove the work-around in InnoDB.

Kristian Nielsen added a comment - 2024-03-08 21:24 I have made a fix for the underlying root cause, MDEV-33475 . I can take this and remove the work-around in InnoDB.

Kristian Nielsen added a comment - 2024-03-15 16:00

Removed the workaround code in InnoDB and pushed to 10.6.

Kristian Nielsen added a comment - 2024-03-15 16:00 Removed the workaround code in InnoDB and pushed to 10.6.

MariaDB Server

Replication does not support bulk insert into empty table

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration