Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-17073

INSERT…ON DUPLICATE KEY UPDATE became more deadlock-prone

Details

    Description

      MySQL 5.7.4 changed the behaviour of INSERT…ON DUPLICATE KEY UPDATE in the InnoDB storage engine. Upon encountering a duplicate key, it would no longer directly fall back to INSERT, but instead it would proceed to acquire an exclusive lock on every index record for the row on which the UPDATE failed.

      The extra locking was motivated by a public bug report: MySQL Bug#50413 insert on duplicate key update sometimes writes binlog position incorrectly (Oracle internal BUG#11758237). The fix was followed up by a couple of regression fixes. For one user, reverting these changes significantly reduces the deadlock rate of INSERT…ON DUPLICATE KEY UPDATE.

      There also is a related MySQL Bug #52020 InnoDB can still deadlock on just INSERT...ON DUPLICATE KEY. One of the factors was that when the pluggable storage engine interface was created in MySQL 5.1, the function innobase_query_is_update() was removed without replacement, and MySQL Bug #7975 (which lacked a test case) was reintroduced.

      In a comment in MySQL Bug #52020 I anticipated that the deadlocks would be caused in a scenario where the INSERT phase fails, then some other transaction locks some of the index records, causing the ON DUPLICATE KEY UPDATE phase to wait for those locks or to deadlock. Acquiring the locks for all index records already in the INSERT phase would make the UPDATE phase wait-free, but it could cause more conflicts with other accesses, as the hold time of the locks is extended.

      valerii posted some insightful comments on Bug #52020. I hope he can construct a test case that demonstrates the increased deadlock rate, so that we can see what can be improved here.

      Sven Sandberg suggested in MySQL Bug #50413 that the INSERT phase should have acquired a gap lock, so that conflicting INSERT with that key would be prevented. I assume that he meant the PRIMARY key, because his example involves two unique keys: PRIMARY KEY(a), UNIQUE KEY(b). He also filed MySQL Bug #58637 Mark INSERT...ON DUPLICATE KEY UPDATE unsafe when there is more than one key.

      Apparently the nondeterminism that the extra locking is trying to prevent is caused by the ambiguity of the ON DUPLICATE KEY syntax. It does not specify the key! An unambiguous syntax would be something like:

      INSERT INTO t1 VALUES(1,2,3) ON DUPLICATE KEY(PRIMARY) UPDATE …;
      INSERT INTO t1 VALUES(1,2,3) ON DUPLICATE KEY(u) UPDATE …;
      

      Statement-based replication is obviously affected by this ambiguity.
      I hope that Elkin and seppo can comment on whether row-based replication and parallel forms of replication (including Galera Cluster and MySQL 5.7 group replication) are affected, and how exactly the operations would be logged by the master and applied on the slave.

      Note: Comments in MySQL Bug #50413 suggest that innodb_autoinc_lock_mode settings 0 and 1 are equivalent in this respect. I’d also like to know whether this parameter is at all relevant outside statement-based replication (that is, when innodb_autoinc_lock_mode=2 could be safe to use). With the setting 2, InnoDB does not acquire any auto-increment lock within the transaction. With the settings 0 or 1, InnoDB will hold a lock until the end of the current statement. This would suggest that the setting only matters in statement-based replication.

      Attachments

        Issue Links

          Activity

            seppo Seppo Jaakola added a comment -

            Galera replication exercises optimistic concurrency control, whatever happens during the transaction processing, in master node, does not really matter. At commit time, Galera populates a replication write set which will contain binlog events for the transaction and key information for modified rows (primary keys, unique keys and foreign keys). If insert succeeds, there should be write rows events, and if insert execution deviated for updating, there should be update rows events in the binlog events set.

            In the slave node side, the write set is applied directly by using primary keys in the respective binlog events, so this is rather straightforward operation, and does not involve excessive locking.

            However, there will be some harm for replication performance, in multi-master topologies, if transaction locks more rows than what will be needed during the write set applying phase in slave node. In such "asymmetric locking situation", the INSERT...ON DUPLICATE... execution, which has advanced in replication phase, may still end up as victim for earlier replicated transaction, and it has to rollback. If the conflict happened over such locks, which are not used in applying phase, Galera will still abort the transaction and immediately replay. This rollback-replay cycle, in master node, will slow down the master node somewhat, especially if such non related conflicts are frequent.

            seppo Seppo Jaakola added a comment - Galera replication exercises optimistic concurrency control, whatever happens during the transaction processing, in master node, does not really matter. At commit time, Galera populates a replication write set which will contain binlog events for the transaction and key information for modified rows (primary keys, unique keys and foreign keys). If insert succeeds, there should be write rows events, and if insert execution deviated for updating, there should be update rows events in the binlog events set. In the slave node side, the write set is applied directly by using primary keys in the respective binlog events, so this is rather straightforward operation, and does not involve excessive locking. However, there will be some harm for replication performance, in multi-master topologies, if transaction locks more rows than what will be needed during the write set applying phase in slave node. In such "asymmetric locking situation", the INSERT...ON DUPLICATE... execution, which has advanced in replication phase, may still end up as victim for earlier replicated transaction, and it has to rollback. If the conflict happened over such locks, which are not used in applying phase, Galera will still abort the transaction and immediately replay. This rollback-replay cycle, in master node, will slow down the master node somewhat, especially if such non related conflicts are frequent.
            Elkin Andrei Elkin added a comment -

            marko, thanks for a well-conducted analysis and references. Indeed the unsafety of IODKU is caused
            by shying out to specify an actual index. There's nothing to worry about Row-format incl on the parallel slave as execution is deterministic in such case. As to the Statement format, whatever execution branch of IODKU is selected
            on master the query won't be binlogged in one group with one that it found a conflict with (which decided the exec branch). So it must be safe to run in parallel with its binlog-group neighbours.

            Elkin Andrei Elkin added a comment - marko , thanks for a well-conducted analysis and references. Indeed the unsafety of IODKU is caused by shying out to specify an actual index. There's nothing to worry about Row-format incl on the parallel slave as execution is deterministic in such case. As to the Statement format, whatever execution branch of IODKU is selected on master the query won't be binlogged in one group with one that it found a conflict with (which decided the exec branch). So it must be safe to run in parallel with its binlog-group neighbours.

            Andrei, I hope you can run some tests and suggest what should be done. I see two possible outcomes, but I may be wrong:

            1. We can avoid the extra locking inside InnoDB, except on replication masters and slaves when using statement-based replication.
            2. We can avoid the extra locking altogether.

            If the outcome is the first one, then I would appreciate your help in writing a predicate function.

            marko Marko Mäkelä added a comment - Andrei, I hope you can run some tests and suggest what should be done. I see two possible outcomes, but I may be wrong: We can avoid the extra locking inside InnoDB, except on replication masters and slaves when using statement-based replication. We can avoid the extra locking altogether. If the outcome is the first one, then I would appreciate your help in writing a predicate function.
            Elkin Andrei Elkin added a comment -

            marko There always has existed a method to defeat non-determinism without involving engine. As which key to chose to detect a conflict
            is of the server layer, we could just record the decision within the replication event, Query_log_event that is 'cos only SBR is concerned.
            On the slave the chosen key solely will be consulted. By my quick check this must be feasible.
            Let's check out with optimizer people, their approval of the idea would be a warrant to withdraw Bug #50413 fixes.
            As next step we would implement the refined duplicate key check on the slave.

            Elkin Andrei Elkin added a comment - marko There always has existed a method to defeat non-determinism without involving engine. As which key to chose to detect a conflict is of the server layer, we could just record the decision within the replication event, Query_log_event that is 'cos only SBR is concerned. On the slave the chosen key solely will be consulted. By my quick check this must be feasible. Let's check out with optimizer people, their approval of the idea would be a warrant to withdraw Bug #50413 fixes. As next step we would implement the refined duplicate key check on the slave.

            I see locking as a way to make things serialized or more deterministic in a distributed system.

            We discussed three main sources of nondeterminism between a master and a slave:

            1. Statement-based replication: Which records will be locked? To ensure serializable execution, we will probably have to lock the same set of records on both master and slaves.
            2. On which unique index is the duplicate detected?
            3. In which order will InnoDB processes secondary indexes and lock the records?

            In InnoDB, the PRIMARY KEY or the clustered index is always modified first. On CREATE TABLE or after rebuilding a table with ALGORITHM=COPY, InnoDB would create the dict_table_t::indexes in the order of TABLE::key_info[], that is, UNIQUE KEY before non-unique ones. It makes sense to process unique keys first, to avoid modifying non-unique indexes in case a duplicate key error occurs.

            If ALTER TABLE…ADD UNIQUE KEY or CREATE UNIQUE INDEX was executed with ALGORITHM=INPLACE, then InnoDB would add the index last (while it would be among the first ones in the .frm file.

            If the master and slaves use different settings for old_alter_table, or if some system was initialized from mysqldump while others are running on files where an unique index was created with ALGORITHM=INPLACE, then InnoDB would be locking and processing the index records in different order.

            It seems to me that when statement-based replication is used, we must keep the current level of locking.

            For row-based replication, Elkin suggested that the replication event should indicate the set of indexes where locks were acquired in the INSERT phase before the ON DUPLICATE KEY kicked in. The slave would then ensure that the equivalent locks will be acquired. In this way, the INSERT phase could avoid locking the records in all indexes, thus reducing the potential for lock conflicts.

            Reducing the number of locks acquired will not only reduce the potential for deadlocks, but also the potential for lock waits and lock wait timeouts.

            marko Marko Mäkelä added a comment - I see locking as a way to make things serialized or more deterministic in a distributed system. We discussed three main sources of nondeterminism between a master and a slave: Statement-based replication: Which records will be locked? To ensure serializable execution, we will probably have to lock the same set of records on both master and slaves. On which unique index is the duplicate detected? In which order will InnoDB processes secondary indexes and lock the records? In InnoDB, the PRIMARY KEY or the clustered index is always modified first. On CREATE TABLE or after rebuilding a table with ALGORITHM=COPY , InnoDB would create the dict_table_t::indexes in the order of TABLE::key_info[] , that is, UNIQUE KEY before non-unique ones. It makes sense to process unique keys first, to avoid modifying non-unique indexes in case a duplicate key error occurs. If ALTER TABLE…ADD UNIQUE KEY or CREATE UNIQUE INDEX was executed with ALGORITHM=INPLACE , then InnoDB would add the index last (while it would be among the first ones in the .frm file. If the master and slaves use different settings for old_alter_table , or if some system was initialized from mysqldump while others are running on files where an unique index was created with ALGORITHM=INPLACE , then InnoDB would be locking and processing the index records in different order. It seems to me that when statement-based replication is used, we must keep the current level of locking. For row-based replication, Elkin suggested that the replication event should indicate the set of indexes where locks were acquired in the INSERT phase before the ON DUPLICATE KEY kicked in. The slave would then ensure that the equivalent locks will be acquired. In this way, the INSERT phase could avoid locking the records in all indexes, thus reducing the potential for lock conflicts. Reducing the number of locks acquired will not only reduce the potential for deadlocks, but also the potential for lock waits and lock wait timeouts.

            Valerii Kravchuk posted some insightful comments on Bug #52020. I hope he can construct a test case that demonstrates the increased deadlock rate, so that we can see what can be improved here.

            I created MDEV-17522 that has a test case showing deadlocks with INSERT ON DUPLICATE UPDATE and unique keys.

            GeoffMontee Geoff Montee (Inactive) added a comment - Valerii Kravchuk posted some insightful comments on Bug #52020. I hope he can construct a test case that demonstrates the increased deadlock rate, so that we can see what can be improved here. I created MDEV-17522 that has a test case showing deadlocks with INSERT ON DUPLICATE UPDATE and unique keys.

            As a simple fix, I would keep the extra locking only when binlog is enabled and statement-based replication is in use. In this way, we will get deterministic locking on master and slave, while not paying penalty for standalone operation or row-based replication:

            diff --git a/sql/sql_class.cc b/sql/sql_class.cc
            index e581ed0af25..71d5b80eaa3 100644
            --- a/sql/sql_class.cc
            +++ b/sql/sql_class.cc
            @@ -4523,6 +4523,11 @@ extern "C" int thd_rpl_is_parallel(const MYSQL_THD thd)
               return thd->rgi_slave && thd->rgi_slave->is_parallel_exec;
             }
             
            +extern "C" int thd_rpl_stmt_based(const MYSQL_THD thd)
            +{
            +  return !thd->is_current_stmt_binlog_format_row() &&
            +    !thd->is_current_stmt_binlog_disabled();
            +}
             
             /* Returns high resolution timestamp for the start
               of the current query. */
            

            marko Marko Mäkelä added a comment - As a simple fix, I would keep the extra locking only when binlog is enabled and statement-based replication is in use. In this way, we will get deterministic locking on master and slave, while not paying penalty for standalone operation or row-based replication: diff --git a/sql/sql_class.cc b/sql/sql_class.cc index e581ed0af25..71d5b80eaa3 100644 --- a/sql/sql_class.cc +++ b/sql/sql_class.cc @@ -4523,6 +4523,11 @@ extern "C" int thd_rpl_is_parallel(const MYSQL_THD thd) return thd->rgi_slave && thd->rgi_slave->is_parallel_exec; } +extern "C" int thd_rpl_stmt_based(const MYSQL_THD thd) +{ + return !thd->is_current_stmt_binlog_format_row() && + !thd->is_current_stmt_binlog_disabled(); +} /* Returns high resolution timestamp for the start of the current query. */

            I pushed a fix to 10.2. With the following patch, it trips a check at the upper level:

            diff --git a/mysql-test/suite/innodb/include/innodb_binlog.combinations b/mysql-test/suite/innodb/include/innodb_binlog.combinations
            index 46d31e733b1..c4286c0a171 100644
            --- a/mysql-test/suite/innodb/include/innodb_binlog.combinations
            +++ b/mysql-test/suite/innodb/include/innodb_binlog.combinations
            @@ -1,3 +1,4 @@
             [log-bin]
             log-bin
            +binlog-format=statement
             [skip-log-bin]
            

            innodb.auto_increment_dup 'innodb,log-bin' [ fail ]
                    Test ended at 2018-11-02 15:05:53
             
            CURRENT_TEST: innodb.auto_increment_dup
            mysqltest: At line 146: query 'INSERT INTO t1(k) VALUES (2), (4), (5) ON DUPLICATE KEY UPDATE c='2'' failed: 1665: Cannot execute statement: impossible to write to binary log since BINLOG_FORMAT = STATEMENT and at least one table uses a storage engine limited to row-based logging. InnoDB is limited to row-logging when transaction isolation level is READ COMMITTED or READ UNCOMMITTED.
            

            Maybe a better fix would be to revert the upstream change entirely and improve the check that was originally added in MySQL 5.1.20 in ha_innobase::table_flags() and in 5.1.21 moved to ha_innobase::external_lock().

            marko Marko Mäkelä added a comment - I pushed a fix to 10.2 . With the following patch, it trips a check at the upper level: diff --git a/mysql-test/suite/innodb/include/innodb_binlog.combinations b/mysql-test/suite/innodb/include/innodb_binlog.combinations index 46d31e733b1..c4286c0a171 100644 --- a/mysql-test/suite/innodb/include/innodb_binlog.combinations +++ b/mysql-test/suite/innodb/include/innodb_binlog.combinations @@ -1,3 +1,4 @@ [log-bin] log-bin +binlog-format=statement [skip-log-bin] innodb.auto_increment_dup 'innodb,log-bin' [ fail ] Test ended at 2018-11-02 15:05:53   CURRENT_TEST: innodb.auto_increment_dup mysqltest: At line 146: query 'INSERT INTO t1(k) VALUES (2), (4), (5) ON DUPLICATE KEY UPDATE c='2'' failed: 1665: Cannot execute statement: impossible to write to binary log since BINLOG_FORMAT = STATEMENT and at least one table uses a storage engine limited to row-based logging. InnoDB is limited to row-logging when transaction isolation level is READ COMMITTED or READ UNCOMMITTED. Maybe a better fix would be to revert the upstream change entirely and improve the check that was originally added in MySQL 5.1.20 in ha_innobase::table_flags() and in 5.1.21 moved to ha_innobase::external_lock() .
            marko Marko Mäkelä added a comment - It turns out that upstream reverted the problematic fix and implemented a cleaner fix in MySQL 5.7.26.

            In MariaDB Server 10.2 and later, thanks to MDEV-17614, replication will work correctly (or warnings will be issued to the user) in this scenario. The MDEV-17614 fix thus allows (and its test case requires) us to revise this MDEV-17073 fix. Instead of selectively disabling the problematic change that we inherited from MySQL 5.7, we must remove that problematic change altogether. I did that as part of merging MDEV-17614 from 10.1 to 10.2.

            marko Marko Mäkelä added a comment - In MariaDB Server 10.2 and later, thanks to MDEV-17614 , replication will work correctly (or warnings will be issued to the user) in this scenario. The MDEV-17614 fix thus allows (and its test case requires) us to revise this MDEV-17073 fix. Instead of selectively disabling the problematic change that we inherited from MySQL 5.7, we must remove that problematic change altogether . I did that as part of merging MDEV-17614 from 10.1 to 10.2.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              3 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.