MDEV-20605: Awakened transaction can miss records inserted by another transaction due to wrong persistent cursor restoration

Details

    Description

      Some users are seeing duplicate key errors like the following when using the optimistic and aggressive modes of parallel replication:

      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 172.30.0.154
                        Master_User: repl
                        Master_Port: 3306
                      Connect_Retry: 60
                    Master_Log_File: mariadb-bin.000055
                Read_Master_Log_Pos: 593262418
                     Relay_Log_File: ip-172-30-0-228-relay-bin.000002
                      Relay_Log_Pos: 33725828
              Relay_Master_Log_File: mariadb-bin.000039
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
                    Replicate_Do_DB:
                Replicate_Ignore_DB:
                 Replicate_Do_Table:
             Replicate_Ignore_Table:
            Replicate_Wild_Do_Table:
        Replicate_Wild_Ignore_Table:
                         Last_Errno: 1062
                         Last_Error: Error 'Duplicate entry '...' for key 'name'' on query. Default database: 'sample3'. Query: 'INSERT INTO ...'
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 705697132
                    Relay_Log_Space: 17101184558
                    Until_Condition: None
                     Until_Log_File:
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File:
                 Master_SSL_CA_Path:
                    Master_SSL_Cert:
                  Master_SSL_Cipher:
                     Master_SSL_Key:
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:
                     Last_SQL_Errno: 1062
                     Last_SQL_Error: Error 'Duplicate entry '...' for key 'name'' on query. Default database: 'sample3'. Query: 'INSERT INTO ...'
        Replicate_Ignore_Server_Ids:
                   Master_Server_Id: 1
                     Master_SSL_Crl:
                 Master_SSL_Crlpath:
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 0-1-42163518
            Replicate_Do_Domain_Ids:
        Replicate_Ignore_Domain_Ids:
                      Parallel_Mode: aggressive
                          SQL_Delay: 0
                SQL_Remaining_Delay: NULL
            Slave_SQL_Running_State:
      1 row in set (0.00 sec)
      

      However, when the problem is analyzed, it is clear that the duplicate key error should not occur.

      The configuration:

      [mariadb]
      server_id=2
      log_bin=mariadb-bin
      log_slave_updates
      binlog_format=MIXED
      slave_parallel_threads=16
      slave_parallel_max_queued=1048576
      slave_parallel_mode=aggressive
      

      The configuration used with slave_parallel_mode=optimistic was otherwise identical.

      Some observations:

      • When the table is queried on the slave, the row does indeed exist, so it is clear that the duplicate key error is real.
      • The SQL script contains an UPDATE and a DELETE before the INSERT, so the row should have been deleted, which should have prevented the duplicate key error.
      • The row still exists, so it seems that the DELETE statement had no effect.
      • The values in the existing row in the table are the old non-updated values, so it seems that the UPDATE statement also had no effect.
      • The slave's binary log contains both the UPDATE and DELETE statements, which should indicate that the statements were successfully executed on the slave.
      • I haven't yet been able to reproduce the problem when the master has binlog_format=ROW set, so row-based replication may be immune to the problem.

      Attachments

        1. MDEV-20605.patch
          8 kB
          Marko Mäkelä
        2. MDEV-20605-deadlock.test
          2 kB
          Vladislav Lesin
        3. MDEV-20605-locking.test
          3 kB
          Vladislav Lesin
        4. MDEV-20605-mysql.test
          3 kB
          Vladislav Lesin
        5. MDEV-20605-restore-cursor.test
          3 kB
          Vladislav Lesin
        6. MDEV-20605-restore-cursor-v2.test
          3 kB
          Vladislav Lesin
        7. thiru_20605.yy
          0.9 kB
          Matthias Leich


          Activity

            Vladislav Lesin added a comment (edited):

            Some updates and analysis.

            The initial issue.

            During the last RQG testing, "CHECK TABLE" reported an error from the following code:

            int ha_innobase::check(...)                                                     
            {                                                                               
            ...                                                                             
                            if (index == dict_table_get_first_index(m_prebuilt->table)) {   
                                    n_rows_in_table = n_rows;                               
                            } else if (!(index->type & DICT_FTS)                            
                                       && (n_rows != n_rows_in_table)) {                    
                                    push_warning_printf(                                    
                                            thd, Sql_condition::WARN_LEVEL_WARN,            
                                            ER_NOT_KEYFILE,                                 
                                            "InnoDB: Index '%-.200s' contains " ULINTPF     
                                            " entries, should be " ULINTPF ".",             
                                            index->name(), n_rows, n_rows_in_table);        
                                    is_ok = false;                                          
                                    dict_set_corrupted(index, "CHECK TABLE; Wrong count",   
                                                       false);                              
                            }                                                               
            ...                                                                             
            }
            

            The number of rows in the first (clustered) index does not match the number of rows in the subsequent indexes.

            The initial analysis

            The scenario is the following. Suppose we have a pair of records in a secondary index with the same unique fields; the first record is non-delete-marked, the second is delete-marked-committed. The "CHECK TABLE" transaction does a non-locking read of the secondary index. After a record from the secondary index is read in row_search_mvcc(), the corresponding clustered index record is also requested (see the mtr_has_extra_clust_latch local variable in row_search_mvcc()). After that, before requesting the next secondary index record, row_search_mvcc() stores the cursor position, commits the mtr, starts a new mtr, and restores the cursor position (to preserve the latching order of clustered index pages). During the window when the page was unlatched (after the mtr commit), the cursor's record was purged, and the cursor position was then restored to the previous secondary index record. As the previous record had the same unique fields as the record stored in the cursor, the cursor was not moved forward, and the same record was pushed into the result set twice.
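
            To make the failure easier to see, here is a minimal standalone sketch of the scenario. This is not InnoDB code: the secondary index, the cursor and purge are modelled with a plain vector, and all names are illustrative.

            #include <cstdio>
            #include <string>
            #include <vector>

            // One secondary index entry: its unique fields and delete mark.
            struct Rec { std::string key; bool delete_marked; };

            int main()
            {
                    // Two entries with the same unique fields: the first is
                    // non-delete-marked, the second is delete-marked-committed.
                    std::vector<Rec> index = {{"a", false}, {"a", true}, {"b", false}};

                    // The cursor stands on the delete-marked "a" and stores its
                    // position (btr_pcur_store_position() in InnoDB).
                    std::string stored_key = index[1].key;

                    // While the page is unlatched, purge removes the stored record.
                    index.erase(index.begin() + 1);

                    // Restore lands on the last entry with key <= stored_key,
                    // i.e. on the previous record with equal unique fields.
                    size_t pos = 0;
                    while (pos + 1 < index.size() && index[pos + 1].key <= stored_key)
                            pos++;

                    // The "same unique fields" logic keeps the cursor here instead
                    // of moving it forward, so "a" enters the result set twice.
                    std::printf("restored on key=%s (delete_marked=%d)\n",
                                index[pos].key.c_str(), (int) index[pos].delete_marked);
                    return 0;
            }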

            The above can happen only if the cursor position is stored, the page is unlatched, the purge thread removes the record stored in the cursor, and the cursor position is then restored. There are three cases in which row_search_mvcc() does this (see the btr_pcur_store_position() and sel_restore_position_for_mysql() calls in row_search_mvcc()):

            1) When row_search_mvcc() is finishing; it can then be invoked with the same cursor again to continue reading the index (see ha_innobase::index_next() for an example);
            2) before the thread suspends on a waiting lock;
            3) when there is a search in a secondary index and we need to request the clustered index record too (see the "CHECK TABLE" example above).

            Consider the cases for locking and non-locking reads.

            Non-locking read

            Case (1) should not cause the issue because if the read view can see the record and the view is still open, the record can't be purged.

            Case (2) can't happen for a non-locking read by definition.

            For a non-locking read, in case (3) the following code requests the clustered index record even if the secondary index record is delete-marked:

            dberr_t                                                                         
            row_search_mvcc(...)                                                            
            {                                                                               
            ...                                                                             
                    if (prebuilt->select_lock_type != LOCK_NONE) {                          
                    ...                                                                     
                    } else {                                                                
                    ...                                                                     
                            if (trx->isolation_level == TRX_ISO_READ_UNCOMMITTED            
                                || prebuilt->table->is_temporary()                          
                                || prebuilt->table->no_rollback()) {                        
                                                                                            
                                    /* Do nothing: we let a non-locking SELECT read the     
                                    latest version of the record */                         
                                                                                            
                            } else if (index == clust_index) {                              
                            ...                                                             
                            } else {                                                        
                                    /* We are looking into a non-clustered index,           
                                    and to get the right version of the record we           
                                    have to look also into the clustered index: this        
                                    is necessary, because we can only get the undo          
                                    information via the clustered index record. */          
                                    ...                                                     
                                    if (!srv_read_only_mode) {                              
                                            ...                                             
                                            if (trx->read_view.sees(trx_id)) {              
                                                    goto locks_ok;                          
                                            }                                               
                                            /* We should look at the clustered index.       
                                            However, as this is a non-locking read,         
                                            we can skip the clustered index lookup if       
                                            the condition does not match the secondary      
                                            index entry. */                                 
                                            switch (row_search_idx_cond_check(              
                                                            buf, prebuilt, rec, offsets)) { 
                                            ...                                             
                                            case CHECK_POS:
                                                    goto requires_clust_rec;                
                                            }                                               
                                                                                            
                                            ut_error;                                       
                                    }                                                       
                            }                                                               
                    }                                                                       
            ...                                                                             
            }            
            

            That is why, for a non-locking read, row_search_mvcc() can unlatch the page after the clustered index record is read:

            dberr_t                                                                         
            row_search_mvcc(...)                                                            
            {                                                                               
            ...                                                                             
                    } else if (mtr_has_extra_clust_latch) {                                 
                            /* If we have extra cluster latch, we must commit               
                            mtr if we are moving to the next non-clustered                  
                            index record, because we could break the latching               
                            order if we would access a different clustered                  
                            index page right away without releasing the previous. */        
                                                                                            
                            btr_pcur_store_position(pcur, &mtr);                            
                            mtr.commit();                                                   
                            mtr_has_extra_clust_latch = FALSE;                              
                                                                                            
                            mtr.start();                                                    
                                                                                            
                            if (sel_restore_position_for_mysql(&same_user_rec,              
                                                               BTR_SEARCH_LEAF,             
                                                               pcur, moves_up, &mtr)) {     
                                    goto rec_loop;                                          
                            }                                                               
                    }                                                                       
                                                                                            
            ...                                                                             
            }
            

            And when the page is unlatched, purge is able to remove the record stored in the cursor, which causes row duplication in the result set.

            marko proposed using mtr_t::get_savepoint() and mtr_t::release_s_latch_at_savepoint() to solve the issue: we can create an mtr savepoint before requesting the clustered index record, and then release only the clustered index page latches instead of releasing all of the mtr's latches. See MDEV-27557.
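
            A minimal model of that idea (illustrative only: MiniMtr stands in for mtr_t, and the real mtr_t::release_s_latch_at_savepoint() releases a specific latch rather than truncating a stack):

            #include <cstdio>
            #include <string>
            #include <vector>

            // Toy mtr: a stack of page latches acquired in order.
            struct MiniMtr {
                    std::vector<std::string> latches;
                    size_t get_savepoint() const { return latches.size(); }
                    void s_latch(const std::string &page) { latches.push_back(page); }
                    // Release every latch taken after the savepoint.
                    void release_at_savepoint(size_t sp) { latches.resize(sp); }
            };

            int main()
            {
                    MiniMtr mtr;
                    mtr.s_latch("secondary index leaf");  // the cursor's page

                    size_t sp = mtr.get_savepoint();      // before the clustered lookup
                    mtr.s_latch("clustered index leaf");  // the extra latch

                    // Instead of mtr.commit() + mtr.start(), drop only the clustered
                    // index page latch; the secondary index page stays latched, so
                    // purge cannot remove the record stored in the cursor.
                    mtr.release_at_savepoint(sp);

                    std::printf("still latched: %s\n", mtr.latches.front().c_str());
                    return 0;
            }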

            Another way to fix it is not to execute the "don't move the cursor forward if a record with the same unique fields is found" logic for non-locking reads at all, because it makes no sense for a non-locking read. For a locking read, that logic was implemented so that a just-inserted record is found after the reading transaction is awakened and the cursor position is restored; this is necessary for correct locking order. For a non-locking read, locking order is irrelevant, and even if something was inserted after the read view was created, it will not be visible to that read view (depending on the isolation level).
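
            As a sketch, the rule reduces to the following predicate (names are illustrative, not the actual patch):

            // Whether the restored cursor should stay on a record whose unique
            // fields equal those of the stored (now purged) record. For a
            // non-locking read the read view already hides later inserts, so
            // the cursor can simply move forward.
            bool stay_on_equal_unique_fields(bool locking_read,
                                             bool equal_unique_fields)
            {
                    return locking_read && equal_unique_fields;
            }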

            I would prefer to create a separate task to implement marko's idea as a future optimisation, and to implement my approach now.

            Locking read

            For a locking read, row_search_mvcc() contains the following code:

            dberr_t row_search_mvcc(...)                                                    
            {                                                                               
            ...                                                                             
            locks_ok:                                                                       
                    /* NOTE that at this point rec can be an old version of a clustered     
                    index record built for a consistent read. We cannot assume after this   
                    point that rec is on a buffer pool page. Functions like                 
                    page_rec_is_comp() cannot be used! */                                   
                                                                                            
                    if (rec_get_deleted_flag(rec, comp)) {                                  
            locks_ok_del_marked:                                                            
                            /* In delete-marked records, DB_TRX_ID must                     
                            always refer to an existing undo log record. */                 
                            ut_ad(index != clust_index                                      
                                  || row_get_rec_trx_id(rec, index, offsets));              
                                                                                            
                            /* The record is delete-marked: we can skip it */               
                                                                                            
                            /* This is an optimization to skip setting the next key lock    
                            on the record that follows this delete-marked record. This      
                            optimization works because of the unique search criteria        
                            which precludes the presence of a range lock between this       
                            delete marked record and the record following it.               
                                                                                            
                            For now this is applicable only to clustered indexes while      
                            doing a unique search except for HANDLER queries because        
                            HANDLER allows NEXT and PREV even in unique search on           
                            clustered index. There is scope for further optimization        
                            applicable to unique secondary indexes. Current behaviour is    
                            to widen the scope of a lock on an already delete marked record 
                            if the same record is deleted twice by the same transaction */  
                            if (index == clust_index && unique_search                       
                                && !prebuilt->used_in_HANDLER) {                            
                                                                                            
                                    err = DB_RECORD_NOT_FOUND;                              
                                                                                            
                                    goto normal_return;                                     
                            }
                                                                                            
                            goto next_rec;                                                  
                    }                                                                       
            ...                                                                             
            }
            

            I.e. a delete-marked record is ignored, and either the next record is read or row_search_mvcc() returns DB_RECORD_NOT_FOUND, so case (1) is impossible. Case (3) is also impossible for the same reason: the above code shows that the clustered index record is not requested for a delete-marked secondary index record.

            Consider case (2). To reproduce the issue, there must be two records with the same unique fields: the first record (in index order) must be non-delete-marked, and the second must be delete-marked. Some transaction must hold a lock on the delete-marked record, while some other transaction holds a lock on the non-delete-marked record, creates a waiting lock on the delete-marked record, and suspends. In any case, the first record will always be locked before the second record, so any transaction with a conflicting lock will be blocked on the first record. The locks will be released on transaction commit, and during this process the lock hash table cell will be latched, which means any other transaction will wait until both locks are released. So there can't be a situation where some other transaction acquires the first record's lock and then blocks on the second record's lock (which would let purge remove the record stored in the persistent cursor). So case (2) is impossible too.

            Summary

            The issue caused by my changes does not exist for locking reads. It does exist for non-locking reads, but for non-locking reads my changes don't make sense, so we can simply switch them off in this case. We can also avoid the mtr commit when we have to read the clustered index record during a secondary index search, and unlatch only the clustered index pages, as marko suggested; this is safe because the clustered index pages are S-latched in this case (see MDEV-27557).


            Matthias Leich added a comment:

            origin/bb-10.5-MDEV-20605-cur-pos-fix 9f032f03fa8fd674c9b63401a12aed144ad3aa34 2022-02-10T21:52:50+03:00
            performed well in RQG testing.
            The issues observed occur on main source trees too.


            Marko Mäkelä added a comment:

            OK to push. The MDEV-27557 optimization can be implemented later.


            Vladislav Lesin added a comment:

            The branch bb-10.2-MDEV-27025-MDEV-20605 is for 10.2-based custom builds.


            Vladislav Lesin added a comment:

            I've just attached a modified test for MySQL (the bug is still not fixed in MySQL 9.0.1): MDEV-20605-mysql.test.


            People

              Assignee: Vladislav Lesin
              Reporter: Geoff Montee (Inactive)

