[MDEV-30010] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 Created: 2022-11-14  Updated: 2023-10-27  Resolved: 2023-01-24

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB
Affects Version/s: 10.4.27, 10.6.11
Fix Version/s: 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3

Type: Bug Priority: Minor
Reporter: Rob Schwyzer Assignee: Andrei Elkin
Resolution: Done Votes: 1
Labels: None
Environment:

https://mariadbcorp.atlassian.net/wiki/spaces/CustomerEngineers/pages/1518829569/snsvr01

CentOS 7 running on 2x Intel Xeon Gold 6134 with 592GB RAM on a PCI-E 3.0 x8 NVMe SSD


Attachments: MDEV-30010-pk.test, MDEV-30010.test, primary_global_variables.csv, replica_global_variables.csv
Issue Links:
Relates
relates to MDEV-20605 Awaken transaction can miss inserted ... Closed
relates to MDEV-20628 If temporary error occurs with optimi... Closed
relates to MDEV-30225 RR isolation violation with locking u... Closed

 Description   

Please review comments for reproduction information.

More to the point, this appears to be a direct reproduction of MDEV-20628 on a MariaDB version patched against MDEV-20605.

The primary uses binlog_format=mixed. The replica uses the following key configuration:

slave_parallel_mode=optimistic
slave_parallel_threads=16
slave_domain_parallel_threads=16
slave_parallel_max_queued=2147483647

The table structure uses the following critical columns:

CREATE TABLE test (
  id char(32) NOT NULL PRIMARY KEY,
  name VARCHAR(255) DEFAULT NULL UNIQUE KEY,
  touched datetime DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

The main workload consists of patterns such as:

INSERT INTO test (id, name, touched) VALUES('000000000000000000000000000000001','SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001', '2019-09-12 19:36:46');
UPDATE test SET touched = '2019-09-12 19:36:46' WHERE test.name = 'SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001';
DELETE FROM test WHERE test.name = 'SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001'

As expected with optimistic parallel replication, this kind of workload results in large numbers of rollbacks on the replica (to be clear, the primary is fine throughout), such as:

2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) TRANSACTION:
 
TRANSACTION 1757073418, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 1
MySQL thread id 16, OS thread handle 139911793866496, query id 182677598 Update
INSERT INTO sys_mutex (`sys_id`,`sys_updated_by`,`system_id`,`sys_created_on`,`name`,`sys_mod_count`,`sys_updated_on`,`sys_created_by`) VALUES('cdd75b0d22e849008aa321f01222ec46','replay.admin.29','available','2019-09-12 19:36:46','SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b',0,'2019-09-12 19:36:46','replay.admin.29')
2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) HOLDS THE LOCK(S):
 
RECORD LOCKS space id 26 page no 4 n bits 72 index name of table `master_4`.`sys_mutex` trx id 1757073418 lock mode S
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;
 
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
 0: len 30; hex 53595354454d5f4c4f434b3a646973636f766572795f6465766963655f68; asc SYSTEM_LOCK:discovery_device_h; (total 69 bytes);
 1: len 30; hex 643039653233346535356137343363326266616264656532393166313561; asc d09e234e55a743c2bfabdee291f15a; (total 32 bytes);
 
2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
 
RECORD LOCKS space id 26 page no 4 n bits 72 index name of table `master_4`.`sys_mutex` trx id 1757073418 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
 0: len 30; hex 53595354454d5f4c4f434b3a646973636f766572795f6465766963655f68; asc SYSTEM_LOCK:discovery_device_h; (total 69 bytes);
 1: len 30; hex 643039653233346535356137343363326266616264656532393166313561; asc d09e234e55a743c2bfabdee291f15a; (total 32 bytes);
 
2022-11-11 22:29:39 16 [Note] InnoDB: *** WE ROLL BACK TRANSACTION (1)

Despite these deadlocks happening constantly, in my attempts the replica held out for over 24 hours before finally throwing the error below:

2022-11-11 22:29:39 17 [ERROR] Slave SQL: Error 'Duplicate entry 'SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77...' for key 'name'' on query. Default database: 'master_6'. Query: 'INSERT INTO sys_mutex (`sys_id`,`sys_updated_by`,`system_id`,`sys_created_on`,`name`,`sys_mod_count`,`sys_updated_on`,`sys_created_by`) VALUES('a21dfb83830740b38da2db5e67d6c2c6','replay.admin.29','available','2019-09-12 19:36:46','SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b',0,'2019-09-12 19:36:46','replay.admin.29')', Gtid 0-1-445975955, Internal MariaDB error code: 1062
2022-11-11 22:29:39 17 [Warning] Slave: Duplicate entry 'SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77...' for key 'name' Error_code: 1062
2022-11-11 22:29:39 17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'CS_10.000174' position 791116152; GTID position '0-1-445975954'
2022-11-11 22:29:39 16 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-11-11 22:29:39 16 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-11-11 22:29:39 16 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'CS_10.000174' position 791116152; GTID position '0-1-445975954'
2022-11-11 22:29:39 15 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2022-11-11 22:29:39 15 [Warning] Slave: Connection was killed Error_code: 1927
2022-11-11 22:29:39 15 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2022-11-11 22:29:39 15 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

The resolved binary log from this repro is available via a link in the comments.

The data left over on the replica after this crash is:

SELECT * FROM master_6.sys_mutex\G
*************************** 1. row ***************************
          name: SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b
     system_id: app128033.iad105.service-now.com:ciscotesttemp2021
      acquired: 2019-09-12 19:36:46
       touched: 2019-09-12 19:36:46
        sys_id: 05259991e9ac49cb9b8860533c352800
sys_updated_by: replay.admin.29
sys_updated_on: 2019-09-12 19:36:46
sys_created_by: replay.admin.29
sys_created_on: 2019-09-12 19:36:46
 sys_mod_count: 0

The same query on the primary returns-

SELECT * FROM master_6.sys_mutex;
Empty set (0.000 sec)

SHOW GLOBAL VARIABLES output from the primary and replica is also attached, while the full error log from the replica is in the comments.



 Comments   
Comment by Vladislav Lesin [ 2022-11-21 ]

An update: I managed to reproduce and record it with rr on 10.6. What I see now in my debug session is that one transaction rolls back a DELETE, and another transaction gets a "duplicate key error" on an INSERT. My suggestion, based on what I saw in the debug session, is that there are 3 groups which can be executed in parallel but must be committed sequentially. The 2nd group conflicts with the 1st group and rolls back to let the 1st group finish its commit. The 3rd group must also be rolled back, but before rolling back it tries to execute an INSERT after the 2nd group has rolled back its DELETE, which causes the "duplicate key error". I am not 100% sure about this scenario, as I am not so familiar with the replication code, and, although Andrei is helping me to understand some details, it will take some more time to check the assumption.
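The in-order commit rule this scenario relies on can be sketched in a few lines. This is an illustrative model only, not server code; the GTID strings and error numbers mirror the log excerpts in the description (1062 = duplicate key, 1964 = ER_PRIOR_COMMIT_FAILED):

```python
# Illustrative model: optimistic parallel replication lets groups execute
# concurrently, but they must commit in binlog (GTID) order, so a hard
# failure in an earlier group cascades to every later one as error 1964.
def commit_in_order(groups):
    """groups: list of (gtid, ok) in binlog order; returns (gtid, errno)."""
    out, prior_failed = [], False
    for gtid, ok in groups:
        if prior_failed:
            out.append((gtid, 1964))   # commit refused: earlier commit failed
        elif ok:
            out.append((gtid, 0))      # committed normally
        else:
            out.append((gtid, 1062))   # e.g. the duplicate-key error above
            prior_failed = True
    return out

print(commit_in_order([("0-1-445975955", False), ("0-1-445975956", True)]))
# -> [('0-1-445975955', 1062), ('0-1-445975956', 1964)]
```

This matches the error log above: one worker fails with 1062, and every worker behind it reports error 1964.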

Comment by Vladislav Lesin [ 2022-11-29 ]

In this scenario there are two records in the secondary index with the same secondary index key but different primary key values: (sk, pk1) = r1 and (sk, pk2) = r2, where sk is the unique secondary index key and, importantly, pk1 > pk2. r1 is delete-marked.

Let's denote queries as [UDI]\d+, where U = UPDATE, D = DELETE, I = INSERT, and the number is the sequence number in the GTID.

Master's binary log contains the following sequence of statements:
1. I355 INSERT INTO ... (pk2, sk) - inserts r2
...
2. D366 DELETE FROM ... WHERE secondary_index_field = sk - deletes r2
...
3. I371 INSERT INTO ... (pk_n, sk)

I.e. D366 deletes the record inserted by I355; that's why I371 can insert a record with the same secondary key without a "duplicate key error".

On the slave, the statements execute in the following sequence:

1. U395 locks r1.
2. D366 looks for the record with sk; the record is r1. It creates a waiting lock, as r1 is locked by U395, then stores the record in a persistent cursor and suspends.
3. I355 inserts r2. Note that on the master, I355 is committed before D366. As pk1 > pk2, r2 is inserted prior to r1 in logical order.
4. D366 is resumed. It restores the persistent cursor position on r1 and locks it successfully, but r1 has already been delete-marked by some other transaction, so it gets the next record. The next record is the supremum, as pk1 > pk2 and r2 was inserted prior to r1 in logical order. So D366 skips the record inserted by I355, does nothing with r2, and commits.
5. Then D419 tries to do something with r2, but that does not matter, as D366 was supposed to delete it and did not. This statement is mentioned here because we set a watchpoint on r2 just after I371 got the duplicate key error, and we see D419 changing r2 in reverse execution.
6. I371 tries to insert a record with sk, but the record exists, as D366 did not delete it, and I371 gets a "duplicate key error".

When D366 restores the cursor position at step 4, it restores it to the same record it was on before suspending, as the record still exists. But, as the record is delete-marked, row_search_mvcc() goes to the next record; see the locks_ok_del_marked label for details:

locks_ok:                                                                       
        /* NOTE that at this point rec can be an old version of a clustered     
        index record built for a consistent read. We cannot assume after this   
        point that rec is on a buffer pool page. Functions like                 
        page_rec_is_comp() cannot be used! */                                   
                                                                                
        if (rec_get_deleted_flag(rec, comp)) {                                  
locks_ok_del_marked:                                                            
                /* In delete-marked records, DB_TRX_ID must                     
                always refer to an existing undo log record. */                 
                ut_ad(index != clust_index                                      
                      || row_get_rec_trx_id(rec, index, offsets));              
                                                                                
                /* The record is delete-marked: we can skip it */               
                                                                                
                /* This is an optimization to skip setting the next key lock    
                on the record that follows this delete-marked record. This      
                optimization works because of the unique search criteria        
                which precludes the presence of a range lock between this       
                delete marked record and the record following it.               
                                                                                
                For now this is applicable only to clustered indexes while      
                doing a unique search except for HANDLER queries because        
                HANDLER allows NEXT and PREV even in unique search on           
                clustered index. There is scope for further optimization        
                applicable to unique secondary indexes. Current behaviour is    
                to widen the scope of a lock on an already delete marked record 
                if the same record is deleted twice by the same transaction */  
                if (index == clust_index && unique_search                       
                    && !prebuilt->used_in_HANDLER) {                            
                                                                                
                        err = DB_RECORD_NOT_FOUND;                              
                                                                                
                        goto normal_return;                                     
                }                                                               
                                                                                
                goto next_rec;                                                  
        } 

There are several questions which are still open for me:
1. Why is D366 committed successfully if it deleted a row on the master but did not delete anything on the slave?
2. Should we do "goto next_rec" in the above code in the case of READ COMMITTED, a point query, and resuming from sleep, instead of doing the search again?

Comment by Vladislav Lesin [ 2022-12-02 ]

I have just attached MDEV-30010.test. It uses the same names as in the comment above.

There can be several ways to fix it:

1. For secondary indexes, restore the cursor position to the first record with the same key instead of the exact record it was set on before suspending.

2. Check the result of the DELETE on the slave, and if it did not delete the same number of rows, just restart it.
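Fix idea 1 can be sketched on a toy (sk, pk)-ordered index model. Illustrative only; restore_to_first_match is a made-up name, not an InnoDB function:

```python
# Toy secondary index after I355's insert: r2 before delete-marked r1.
sk, pk1, pk2 = 100, 10, 5
index = [{"sk": sk, "pk": pk2, "deleted": False},   # r2, inserted by I355
         {"sk": sk, "pk": pk1, "deleted": True}]    # r1, delete-marked

def restore_to_first_match(index, key):
    """Fix idea 1: position on the FIRST record whose sk equals `key`,
    instead of the exact (sk, pk) record stored before suspending."""
    return next(i for i, r in enumerate(index) if r["sk"] == key)

pos = restore_to_first_match(index, sk)
matches = [r for r in index[pos:] if r["sk"] == sk and not r["deleted"]]
print(matches)   # [{'sk': 100, 'pk': 5, 'deleted': False}] -- D366 sees r2
```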

Comment by Vladislav Lesin [ 2022-12-02 ]

MDEV-30010.test also contains the analysis in its comments.

I am afraid the general cause of the issue is a different transaction execution order on master and slave. Even if we fix this particular issue, I am quite sure there will be other issues caused by the same root cause. The order of record locking is not defined in general.

I would suggest simply not mixing statement-based replication and parallel execution on the slave, to avoid such issues.

Comment by Vladislav Lesin [ 2022-12-02 ]

Just a note for a possible future fix. Here is the logic of cursor position restoration used in our case:

btr_pcur_t::restore_status                                                      
btr_pcur_t::restore_position(ulint restore_latch_mode, mtr_t *mtr)              
{                                                                               
...                                                                             
        switch (restore_latch_mode) {                                           
        case BTR_SEARCH_LEAF:                                                   
        case BTR_MODIFY_LEAF:                                                   
        case BTR_SEARCH_PREV:                                                   
        case BTR_MODIFY_PREV:                                                   
                /* Try optimistic restoration. */                               
                                                                                
                if (block_when_stored.run_with_hint(                            
                        optimistic_latch_leaves(this, &restore_latch_mode,      
                                                mtr))) {                        
                        pos_state = BTR_PCUR_IS_POSITIONED;                     
                        latch_mode = restore_latch_mode;                        
                                                                                
                        if (rel_pos == BTR_PCUR_ON) {                           
                                ...                                             
                                return restore_status::SAME_ALL;                
                        }                                                       
                        ...                                                     
                }                                                               
        }                                                                       
...                                                                             
} 

The cursor restores the same position because block_when_stored.run_with_hint() returns true. Otherwise it would search the tree and find the record inserted by I355. I am not sure about that last claim; it needs to be checked.

Comment by Vladislav Lesin [ 2022-12-07 ]

The general question raised during the analysis of this case was why, in DELETE_366, the cursor position is restored to the same (sk, pk) pair in the secondary index after awakening, even after INSERT_355 inserted and committed a new record before that pair. One more question was about the behaviour for primary keys: is it possible to create the same test for the primary index, and if not, then why?

Let's try to answer these questions with code analysis.

Secondary index.

Let's remember the test case MDEV-30010.test.

We have a (10, 100) row, where 10 is the primary key and 100 is the unique secondary key. The row, i.e. both the primary and secondary index records, is locked by trx 0, which executed SELECT FOR UPDATE. Then some other transaction, trx 1, executes INSERT ... VALUES (5, 100), which is blocked; then another transaction, trx 2, executes DELETE ... WHERE sk = 100, which is also blocked; then trx 0 executes DELETE ... WHERE sk = 100 and commits. Then the INSERT of trx 1 inserts (5, 100) and commits, and then the DELETE of trx 2 skips the inserted value and deletes nothing.

How does it work?

The INSERT of trx 1 inserts a record into the primary index. Nothing prevents the INSERT from inserting the record into the primary index, as it does a lookup for the largest key less than or equal to the key being inserted (see row_ins_clust_index_entry_low() for details), finds the infimum, and nobody holds gap locks on the record next to the infimum. Then the INSERT does the secondary index duplicates check. It uses an S-lock for this purpose. As the secondary key the INSERT looks for is locked with an X-lock by trx 0, the INSERT pushes its lock into the waiting queue and suspends.

The DELETE of trx 2 tries to find the record to delete using the secondary key condition. It uses an X-lock, as it is going to change the record; it tries to lock the secondary key, which is already locked, pushes its X-lock into the waiting queue, and suspends.

When trx 0 commits, it wakes up the INSERT of trx 1: trx 0 clears the "waiting" flag in the lock that the INSERT of trx 1 requested. The INSERT does the secondary index duplicates check again and finds the delete-marked record. It requests an S-lock again and, as trx 1 already holds an S-lock, it does not create a new lock but checks the record's "deleted" flag. As the flag is set, it just continues scanning; the next record is the supremum, so it's OK, no secondary index duplicate was found. The INSERT then inserts the new record into the secondary index. Nothing bothers the INSERT, as nobody holds gap locks in the secondary index.

Note the INSERT does a search in the secondary index to set the cursor to the position after which the new record will be inserted. It looks up the (sk, pk) pair which is going to be inserted. This pair does not exist yet in the secondary index, but the delete-marked record with the same sk does; as the (sk, pk) pair is less than (sk, pk_deleted), the cursor is positioned on the infimum. See page_cur_search_with_match() and cmp_dtuple_rec_with_match() for details. That's why the record inserted into the secondary index ends up in the position prior to the delete-marked record.

The question is why both secondary and primary key values are compared during such a search in the secondary index. The entry to compare with records is created in row_build_index_entry_low() with a dtuple_set_n_fields_cmp() call. The number of fields to compare is obtained from dict_index_get_n_unique_in_tree(), which returns the number of unique fields for a clustered index and the number of all fields for a secondary index. Note that during the secondary index duplicates check only the secondary index key is compared, because the number of fields to compare is set to 1 before the search and set back after the search; see the dtuple_set_n_fields_cmp() calls in row_ins_scan_sec_index_for_duplicate().
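The effect of varying the number of compared fields can be illustrated with a toy comparison function (a sketch of the idea only, not cmp_dtuple_rec_with_match() itself):

```python
# Toy illustration of dtuple_set_n_fields_cmp(): the same entry compares
# differently depending on how many fields take part in the comparison.
def cmp_tuples(entry, rec, n_fields_cmp):
    a, b = entry[:n_fields_cmp], rec[:n_fields_cmp]
    return (a > b) - (a < b)   # -1 / 0 / +1

entry = (100, 5)            # (sk, pk) pair the INSERT is adding
delete_marked = (100, 10)   # existing delete-marked record

# Duplicate check: only the secondary key is compared -> "equal", so the
# delete-marked record IS examined (and then skipped as deleted).
print(cmp_tuples(entry, delete_marked, 1))   # 0

# Insert positioning: all fields are compared -> (100, 5) < (100, 10),
# so the new record goes BEFORE the delete-marked one.
print(cmp_tuples(entry, delete_marked, 2))   # -1
```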

So the INSERT of trx 1 commits and wakes up the DELETE of trx 2, which restores its cursor position on the delete-marked record and continues scanning, missing the record inserted before the delete-marked one. See btr_pcur_store_position() and btr_pcur_t::restore_position() for details.

Note that btr_pcur_store_position() sets cursor->old_n_fields, i.e. the number of fields used in the search during cursor restoration, with the same dict_index_get_n_unique_in_tree() call, which, as we already know, returns the number of unique fields for a clustered index and the number of all fields for a secondary index.

The above means the cursor will be restored to the position with the same unique field values for a clustered index, and with the same values of all fields for a secondary index.

Primary index.

The test case MDEV-30010-pk.test is the following

Suppose we have the following table.
CREATE TABLE t (pk int PRIMARY KEY, c varchar(10)) ENGINE=InnoDB;
INSERT INTO t VALUES (10, "ccc");

Trx 0 locks the (10, "ccc") record with SELECT ... FOR UPDATE. The INSERT of trx 1 tries to insert (10, "aaaaa") and is blocked by trx 0. Then the DELETE of trx 2 tries to delete the record with primary key 10, and it is also blocked by trx 0. Then trx 0 marks (10, "ccc") as deleted and commits. Then the INSERT of trx 1 inserts (10, "aaaaa") and commits. Then the DELETE of trx 2 finishes with a DEADLOCK error.

How does it work?

The INSERT of trx 1 does the primary index duplicates check. It still requests an S-lock. Then it finds the conflicting X-lock and goes to sleep, doing the same things before sleeping as described above for the secondary index. The same story applies to the DELETE of trx 2: it requests an X-lock and goes to sleep. When the INSERT of trx 1 is woken up, it does the same things, i.e. it finds the delete-marked record during the primary index duplicates check, does not create a new lock as it already holds an S-lock, skips the delete-marked record, and finds out there is no duplicate key in the primary index.

Then the INSERT of trx 1 tries to insert the record by modifying the already-found delete-marked record; see row_ins_clust_index_entry_by_modify() for details. For this purpose it requests an X-lock on that delete-marked record. But the DELETE of trx 2 already holds a waiting X-lock on that record, and in InnoDB a requested lock can conflict with a waiting lock. I don't want to explain that here; there is a separate explanation in the comments.

So, as the X-lock of trx 1's INSERT conflicts with the waiting X-lock of trx 2's DELETE, the deadlock resolution procedure chooses trx 2 as a victim and kills it.
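The resulting wait-for cycle can be sketched with a toy graph. Illustrative only; InnoDB's real deadlock detector works on its lock queues, not on a dict like this:

```python
# Toy wait-for graph for the primary-key scenario above. After trx 0
# commits: trx 2's waiting X-lock conflicts with trx 1's granted S-lock,
# and trx 1's new X-lock request conflicts with trx 2's waiting X-lock.
waits_for = {"trx2": "trx1", "trx1": "trx2"}

def in_cycle(start, waits_for):
    """Follow the wait-for edges; revisiting a node means deadlock."""
    seen, t = set(), start
    while t in waits_for:
        if t in seen:
            return True
        seen.add(t)
        t = waits_for[t]
    return False

print(in_cycle("trx1", waits_for))   # True -- a victim (trx 2) is killed
```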

That's why I can't create a test case like this for primary keys.

But that does not matter, as the principle of field comparison for the restoration search has already been described, and we already know how the cursor is stored and restored for clustered and secondary indexes.

Summary

The key role is played by the dict_index_get_n_unique_in_tree() function. It intentionally returns the number of all fields for secondary indexes, and only the number of unique fields for clustered indexes. I think this was done because secondary indexes can be non-unique, and for non-unique indexes we need to restore the persistent cursor position to the same (sk, pk) pair it was stored on before going to sleep. For a non-unique secondary index there can be several non-delete-marked records with the same secondary key but different primary keys.

Comment by Vladislav Lesin [ 2022-12-08 ]

Here is an illustration of MDEV-30010.test.

Comment by Marko Mäkelä [ 2022-12-09 ]

vlad.lesin, thank you! I think an important observation is that, due to the way MVCC and locking are implemented in InnoDB, there may be multiple secondary index records that point to the same primary key. In the clustered index (primary key index), there can be at most one record for each potential key. Old versions are in a singly linked list of undo log records (pointed to by DB_ROLL_PTR; the validity of those pointers is based on the value of DB_TRX_ID in each record).

We should also keep in mind that InnoDB does not have any concept of "table row lock". InnoDB transactions only know "table locks" and "index record locks". When there exist secondary indexes on a table, the type of locking conflicts depends on the access path. A DELETE or UPDATE or similar statement will start with a locking read. That read may use a secondary index, that is, lock a secondary index record before potentially locking the clustered index record.

If the INSERT had added the secondary index record before the DELETE looked up the secondary key, the DELETE would still have waited for a lock created by the INSERT, but it would have been on the INSERT'ed record, not on the record that the INSERT read-locked for the duplicate key check.

As far as I understand, this problem affects UPDATE and possibly REPLACE as well, and it is not really limited to unique indexes. I see some possible solutions to this.

  1. If the affected mode of replication is active, and if DELETE or UPDATE or REPLACE could make use of a secondary index, require row-based replication events to be written.
  2. In InnoDB, after a lock wait on a secondary index record resumes, restart the search from the first matching key, instead of continuing from the current index position.
  3. In InnoDB, after a lock wait on a delete-marked secondary index record resumes, look up and lock the clustered index record and check if it matches the WHERE condition. Sorry, I realized that this would not work even in the scenario of MDEV-30010.test. We would not find the primary key of the INSERT.

I would consider any InnoDB change as a work-around. I am concerned that changing the logic in InnoDB could cause a serious performance regression for some types of workloads. The "restart the index scan" fix could potentially introduce a bug where we get into an infinite loop if the circumstances are suitable (say, there are lots of concurrent page splits or merges on the index tree). It would unnecessarily slow down execution on all servers even when the affected mode of replication is not being used at all.

Comment by Vladislav Lesin [ 2022-12-09 ]

marko, it seems that if we removed non-gap locks from secondary indexes, the "table row lock" concept would be implemented.

And MDEV-30010.test would then end up with a deadlock error for the DELETE, like MDEV-30010-pk.test.

One more solution would be to request next-key locks for secondary indexes. This would also cause a deadlock error for the DELETE. But the performance impact would have to be measured.

Comment by Vladislav Lesin [ 2022-12-15 ]

After long discussions it was decided that skipping the record is a bug, which was filed as MDEV-30225, and we can fix it quite trivially for RR. But for RC we would need to rescan the range every time after a transaction resumes, and such rescanning is quite expensive from a performance perspective. That's why it was decided not to fix it for RC, and instead to flag DELETE/UPDATE/REPLACE as unsafe for statement-based replication when a secondary index may be used and the isolation level is READ (UN)COMMITTED.
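The decision rule described here can be sketched as follows (the function name and arguments are hypothetical, not the server's actual implementation):

```python
# Sketch of the unsafe-statement rule decided on: DELETE/UPDATE/REPLACE
# become unsafe for statement-based binlogging when a secondary index may
# be used under READ (UN)COMMITTED. Names here are illustrative only.
def stmt_unsafe_for_sbr(stmt, may_use_secondary_index, isolation):
    return (stmt in {"DELETE", "UPDATE", "REPLACE"}
            and may_use_secondary_index
            and isolation in {"READ-COMMITTED", "READ-UNCOMMITTED"})

print(stmt_unsafe_for_sbr("DELETE", True, "READ-COMMITTED"))    # True
print(stmt_unsafe_for_sbr("DELETE", True, "REPEATABLE-READ"))   # False
```

Under binlog_format=mixed, statements flagged unsafe like this are logged as row events instead, which sidesteps the execution-order problem.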

I am fixing MDEV-30225 and returning this issue to the replication team.

Comment by Andrei Elkin [ 2023-01-24 ]

Closed with 'Done' to indicate that the pushed commit merely extends the test base and clarifies the documented binlog-related behavior.

Generated at Thu Feb 08 10:12:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.