Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24622

Replication does not support bulk insert into empty table

Details

    Description

      MDEV-515 will replace row-level locking with table-level locking in the special case that an INSERT is performed into an empty table or partition.

      This causes a few replication tests to fail. We have adjusted some tests by inserting an extra record into the table, so that the normal row-level locking and undo logging will be used. But, some tests will require deeper studying to assess whether such a work-around is acceptable or whether some improvements to the replication logic will be needed. Some failures are nondeterministic:

      multi_source.gtid_ignore_duplicates 'innodb' [ pass ]   5774
      multi_source.gtid_ignore_duplicates 'innodb' [ 2 pass ]   4660
      multi_source.gtid_ignore_duplicates 'innodb' [ 3 pass ]   5678
      multi_source.gtid_ignore_duplicates 'innodb' [ 4 fail ]  Found warnings/errors in server log file!
              Test ended at 2021-01-19 09:33:32
      line
      2021-01-19  9:33:28 92 [Warning] Slave: Connection was killed Error_code: 1927
      2021-01-19  9:33:28 92 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      2021-01-19  9:33:28 92 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
      2021-01-19  9:33:28 89 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      2021-01-19  9:33:28 89 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
      2021-01-19  9:33:28 89 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
      

      Attachments

        Issue Links

          Activity

            Based on seppo’s suggestion in MDEV-24623, I disabled the MDEV-515 logic on replicas (by introducing an accessor thd_is_slave(), which should be removed as part of fixing this).

            marko Marko Mäkelä added a comment - Based on seppo ’s suggestion in MDEV-24623 , I disabled the MDEV-515 logic on replicas (by introducing an accessor thd_is_slave() , which should be removed as part of fixing this).

            Please remove the check for thd_is_slave(), it is not a correct fix. It is
            only hiding the problem, not solving it.

            The real issue here is a bug in replication for --gtid-ignore-duplicates
            with parallel replication, unrelated to any InnoDB code. If a GTID gets a
            temporary error and is retried, another multi-source connection can
            errorneously try to apply the same GTID in parallel, resulting in duplicate
            apply of the transaction.

            The test case multi_source.gtid_ignore_duplicates originally did not get any
            deadlock or conflicts, so this error got unnoticed. With MDEV-515, the
            INSERT into an empty table takes a table lock, which can cause a deadlock
            and retry, triggering the replication bug. That is why MDEV-515 makes the
            test start failing, even though it is unrelated to the root cause of the
            bug.

            There may be a few other test cases that expect a specific number of retries
            to be needed, those might need adjusting for MDEV-515, eg. by inserting an
            extra row at the start or something. But after MDEV-24818, the MDEV-515
            optimisation is only enabled when unique_checks=0 and foreign_key_checks=0,
            so will no longer affect replication tests anyway.

            The problem is not longer repeatable using mtr due to MDEV-24818. But I
            verified that multi_source.gtid_ignore_duplicates fails as described on the
            original commit 3cef4f8f0fc88ae5bfae4603d8d600ec84cc70a9 of MDEV-515, and
            that this patch fixes the test failure:

            diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
            index 72347d93ad1..f4c2e8b061b 100644
            --- a/sql/rpl_parallel.cc
            +++ b/sql/rpl_parallel.cc
            @@ -781,7 +781,7 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt,
                   DEBUG_SYNC(thd, "rpl_parallel_simulate_wait_at_retry");
                 });
             
            -  rgi->cleanup_context(thd, 1);
            +  rgi->cleanup_context(thd, 1, 1);
               wait_for_pending_deadlock_kill(thd, rgi);
               thd->reset_killed();
               thd->clear_error();
            diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc
            index 8b61a3a708b..0f50bf9004f 100644
            --- a/sql/rpl_rli.cc
            +++ b/sql/rpl_rli.cc
            @@ -2251,7 +2251,7 @@ delete_or_keep_event_post_apply(rpl_group_info *rgi,
             }
             
             
            -void rpl_group_info::cleanup_context(THD *thd, bool error)
            +void rpl_group_info::cleanup_context(THD *thd, bool error, bool retry)
             {
               DBUG_ENTER("rpl_group_info::cleanup_context");
               DBUG_PRINT("enter", ("error: %d", (int) error));
            @@ -2308,7 +2308,7 @@ void rpl_group_info::cleanup_context(THD *thd, bool error)
                   Ensure we always release the domain for others to process, when using
                   --gtid-ignore-duplicates.
                 */
            -    if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL)
            +    if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL && !retry)
                   rpl_global_gtid_slave_state->release_domain_owner(this);
               }
             
            diff --git a/sql/rpl_rli.h b/sql/rpl_rli.h
            index e3b59aa615c..9e9d98433fe 100644
            --- a/sql/rpl_rli.h
            +++ b/sql/rpl_rli.h
            @@ -917,7 +917,7 @@ struct rpl_group_info
               }
             
               void clear_tables_to_lock();
            -  void cleanup_context(THD *, bool);
            +  void cleanup_context(THD *, bool, bool retry=false);
               void slave_close_thread_tables(THD *);
               void mark_start_commit_no_lock();
               void mark_start_commit();
            

            (But I think maybe a different patch would be a better fix, calling
            check_duplicate_gtid() inside retry_event_group(); also a test case is
            needed.)

            knielsen Kristian Nielsen added a comment - Please remove the check for thd_is_slave(), it is not a correct fix. It is only hiding the problem, not solving it. The real issue here is a bug in replication for --gtid-ignore-duplicates with parallel replication, unrelated to any InnoDB code. If a GTID gets a temporary error and is retried, another multi-source connection can errorneously try to apply the same GTID in parallel, resulting in duplicate apply of the transaction. The test case multi_source.gtid_ignore_duplicates originally did not get any deadlock or conflicts, so this error got unnoticed. With MDEV-515 , the INSERT into an empty table takes a table lock, which can cause a deadlock and retry, triggering the replication bug. That is why MDEV-515 makes the test start failing, even though it is unrelated to the root cause of the bug. There may be a few other test cases that expect a specific number of retries to be needed, those might need adjusting for MDEV-515 , eg. by inserting an extra row at the start or something. But after MDEV-24818 , the MDEV-515 optimisation is only enabled when unique_checks=0 and foreign_key_checks=0, so will no longer affect replication tests anyway. The problem is not longer repeatable using mtr due to MDEV-24818 . But I verified that multi_source.gtid_ignore_duplicates fails as described on the original commit 3cef4f8f0fc88ae5bfae4603d8d600ec84cc70a9 of MDEV-515 , and that this patch fixes the test failure: diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc index 72347d93ad1..f4c2e8b061b 100644 --- a/sql/rpl_parallel.cc +++ b/sql/rpl_parallel.cc @@ -781,7 +781,7 @@ retry_event_group(rpl_group_info *rgi, rpl_parallel_thread *rpt, DEBUG_SYNC(thd, "rpl_parallel_simulate_wait_at_retry"); }); - rgi->cleanup_context(thd, 1); + rgi->cleanup_context(thd, 1, 1); wait_for_pending_deadlock_kill(thd, rgi); thd->reset_killed(); thd->clear_error(); diff --git a/sql/rpl_rli.cc b/sql/rpl_rli.cc index 8b61a3a708b..0f50bf9004f 100644 --- a/sql/rpl_rli.cc +++ b/sql/rpl_rli.cc @@ -2251,7 +2251,7 @@ delete_or_keep_event_post_apply(rpl_group_info *rgi, } -void rpl_group_info::cleanup_context(THD *thd, bool error) +void rpl_group_info::cleanup_context(THD *thd, bool error, bool retry) { DBUG_ENTER("rpl_group_info::cleanup_context"); DBUG_PRINT("enter", ("error: %d", (int) error)); @@ -2308,7 +2308,7 @@ void rpl_group_info::cleanup_context(THD *thd, bool error) Ensure we always release the domain for others to process, when using --gtid-ignore-duplicates. */ - if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL) + if (gtid_ignore_duplicate_state != GTID_DUPLICATE_NULL && !retry) rpl_global_gtid_slave_state->release_domain_owner(this); } diff --git a/sql/rpl_rli.h b/sql/rpl_rli.h index e3b59aa615c..9e9d98433fe 100644 --- a/sql/rpl_rli.h +++ b/sql/rpl_rli.h @@ -917,7 +917,7 @@ struct rpl_group_info } void clear_tables_to_lock(); - void cleanup_context(THD *, bool); + void cleanup_context(THD *, bool, bool retry=false); void slave_close_thread_tables(THD *); void mark_start_commit_no_lock(); void mark_start_commit(); (But I think maybe a different patch would be a better fix, calling check_duplicate_gtid() inside retry_event_group(); also a test case is needed.)

            I have filed a different bug MDEV-33475 for the real issue, which needs to be fixed in 10.4.
            This bug then is only about removing the thd_is_slave() check in 10.6.

            knielsen Kristian Nielsen added a comment - I have filed a different bug MDEV-33475 for the real issue, which needs to be fixed in 10.4. This bug then is only about removing the thd_is_slave() check in 10.6.

            I have made a fix for the underlying root cause, MDEV-33475.

            I can take this and remove the work-around in InnoDB.

            knielsen Kristian Nielsen added a comment - I have made a fix for the underlying root cause, MDEV-33475 . I can take this and remove the work-around in InnoDB.

            Removed the workaround code in InnoDB and pushed to 10.6.

            knielsen Kristian Nielsen added a comment - Removed the workaround code in InnoDB and pushed to 10.6.

            People

              knielsen Kristian Nielsen
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.