Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 10.2.27, 5.5(EOL), 10.0(EOL), 10.1(EOL), 10.3(EOL), 10.4(EOL)
- None
Description
Some users are seeing duplicate key errors like the following when using the optimistic and aggressive modes of parallel replication:
MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.30.0.154
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mariadb-bin.000055
Read_Master_Log_Pos: 593262418
Relay_Log_File: ip-172-30-0-228-relay-bin.000002
Relay_Log_Pos: 33725828
Relay_Master_Log_File: mariadb-bin.000039
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Error 'Duplicate entry '...' for key 'name'' on query. Default database: 'sample3'. Query: 'INSERT INTO ...'
Skip_Counter: 0
Exec_Master_Log_Pos: 705697132
Relay_Log_Space: 17101184558
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Error 'Duplicate entry '...' for key 'name'' on query. Default database: 'sample3'. Query: 'INSERT INTO ...'
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-1-42163518
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: aggressive
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
1 row in set (0.00 sec)
However, when the problem is analyzed, it is clear that the duplicate key should not occur.
The configuration:
[mariadb]
server_id=2
log_bin=mariadb-bin
log_slave_updates
binlog_format=MIXED
slave_parallel_threads=16
slave_parallel_max_queued=1048576
slave_parallel_mode=aggressive
The configuration with slave_parallel_mode=optimistic was identical.
Some observations:
- When the table is queried on the slave, the row does indeed exist, so it is clear that the duplicate key error is real.
- The SQL script contains an UPDATE and a DELETE before the INSERT, so the row should have been deleted, which should have prevented the duplicate key error.
- The row still exists, so it seems that the DELETE statement had no effect.
- The values in the existing row in the table are the old non-updated values, so it seems that the UPDATE statement also had no effect.
- The slave's binary log contains both the UPDATE and DELETE statements, which should indicate that the statements were successfully executed on the slave.
- I haven't yet been able to reproduce the problem when the master has binlog_format=ROW set, so row-based replication may be immune.
Attachments
Issue Links
- blocks
  - MDEV-17814 Server crashes in is_current_stmt_binlog_format_row (Open)
  - MDEV-30225 RR isolation violation with locking unique search (Closed)
- causes
  - MDEV-6860 Parallel async replication hangs on a Galera node (Closed)
  - MDEV-33802 Weird read view after ROLLBACK of other transactions. (Closed)
- is blocked by
  - MDEV-26206 gap lock is not set if implicit lock exists (Closed)
- is duplicated by
  - MDEV-28200 Rows just inserted sometimes not returned by select afterwards (Closed)
- relates to
  - MDEV-10962 Deadlock with 3 concurrent DELETEs by unique key (Closed)
  - MDEV-14589 InnoDB should not lock a delete-marked record (Closed)
  - MDEV-18648 slave_parallel_mode= optimistic default in 10.5 (Closed)
  - MDEV-19544 Remove innodb_locks_unsafe_for_binlog (Closed)
  - MDEV-20628 If temporary error occurs with optimistic mode of parallel replication, error message has false information (Closed)
  - MDEV-24224 Gap lock on delete in 10.5 using READ COMMITTED (Closed)
  - MDEV-25757 Assertion `!lock->is_waiting()' in lock_discard_for_index (Closed)
  - MDEV-26475 InnoDB must not lock delete-marked records (Open)
  - MDEV-27557 InnoDB unnecessarily commits mtr during secondary index search to preserve clustered index latching order (Closed)
  - MDEV-28422 Page split breaks a gap lock (Closed)
  - MDEV-30010 Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 (Closed)
  - MDEV-16208 [Draft] Assertion `lock_get_wait(other_lock)' failed in lock_rec_queue_validate (Closed)
  - MDEV-16406 Refactor the InnoDB record locks (Open)
  - MDEV-17843 Assertion `page_rec_is_leaf(rec)' failed in lock_rec_queue_validate upon SHOW ENGINE INNODB STATUS (Closed)
  - MDEV-18706 ER_LOCK_DEADLOCK on concurrent read and insert into already locked gap (In Review)
  - MDEV-20604 Duplicate key value is silently truncated to 64 characters in print_keydup_error (Closed)
  - MDEV-22634 Assertion `dict_index_is_clust(index) || !dict_index_is_online_ddl(index)' failed in lock_rec_create_low (Closed)
  - MDEV-23197 Assertion `(index)->is_clust() || !dict_index_is_online_ddl(index)' failed. in lock_rec_create_low (Closed)
  - MDEV-27025 insert-intention lock conflicts with waiting ORDINARY lock (Closed)
  - MDEV-30777 InnoDB: Failing assertion: pcur->restore_position(BTR_MODIFY_TREE, mtr) == btr_pcur_t::SAME_ALL (Closed)
Some updates and analysis.
The initial issue
During the latest RQG testing, "CHECK TABLE" reported an error in the following code:
{
	...
	if (...) {
		n_rows_in_table = n_rows;
	} else if (...
		   && (n_rows != n_rows_in_table)) {
		push_warning_printf(
			thd, Sql_condition::WARN_LEVEL_WARN,
			ER_NOT_KEYFILE,
			...,
			index->name(), n_rows, n_rows_in_table);
	}
	...
}
The number of rows counted for the first index does not match the count for the subsequent indexes.
The initial analysis
The scenario is the following. Suppose we have a pair of records in a secondary index with the same unique fields; the first record is non-delete-marked, the second record is delete-marked-committed. The "CHECK TABLE" transaction does a non-locking read of the secondary index. After a record is read from the secondary index in row_search_mvcc(), the corresponding clustered index record is also requested (see the mtr_has_extra_clust_latch local variable in row_search_mvcc()). After that, before requesting the next secondary index record, row_search_mvcc() stores the cursor position, commits the mtr, starts a new mtr and restores the cursor position (to preserve the latching order of clustered index pages). In the window while the page was unlatched (after the mtr commit), the cursor's record was purged, and the cursor position was then restored to the previous secondary index record. As the previous record had the same unique fields as the record stored in the cursor, the cursor was not moved forward, and the same record was pushed into the result set twice.
The above case can happen only if the cursor position is stored, the page is unlatched, the purge thread removes the record stored in the cursor, and the cursor position is then restored. There are 3 cases when row_search_mvcc() does the above (see the btr_pcur_store_position() and sel_restore_position_for_mysql() calls in row_search_mvcc()):
1) When row_search_mvcc() is finishing. It can be invoked with the same cursor again to continue reading the index; see ha_innobase::index_next() as an example;
2) before suspending when it encounters a waiting lock;
3) when there is a search in a secondary index and we need to request the clustered index too (see the above example for "CHECK TABLE").
Consider the cases for locking and non-locking reads.
Non-locking read
Case (1) should not cause the issue, because if the read view can see the record and the view is still open, the record can't be purged.
Case (2) can't happen for non-locking read by definition.
For a non-locking read in case (3), the following code requests the clustered index record even if the secondary index record is delete-marked:
dberr_t
row_search_mvcc(...)
{
	...
	if (...
	    || prebuilt->table->is_temporary()
	    || prebuilt->table->no_rollback()) {
		...
	} else {
		/* ... and to get the right version of the record we
		have to look also into the clustered index: this
		is necessary, because we can only get the undo
		... */
		...
		/* ... However, as this is a non-locking read,
		we can skip the clustered index lookup if
		the condition does not match the secondary
		index entry. */
		if (...(buf, prebuilt, rec, offsets)) {
			...
		}
		ut_error;
	}
	...
}
That is why for a non-locking read row_search_mvcc() can unlatch the page after the clustered index record is read:
dberr_t
row_search_mvcc(...)
{
	...
	if (mtr_has_extra_clust_latch) {
		/* ... mtr if we are moving to the next non-clustered
		index record, because we could break the latching
		order if we would access a different clustered
		... */
		btr_pcur_store_position(pcur, &mtr);
		mtr.commit();
		mtr_has_extra_clust_latch = FALSE;
		mtr.start();

		if (sel_restore_position_for_mysql(...,
						   BTR_SEARCH_LEAF,
						   pcur, moves_up, &mtr)) {
			...
		}
	}
	...
}
And when the page is unlatched, purge is able to remove the record stored in the cursor, which causes row duplication in the result set.
marko proposed to use mtr_t::get_savepoint() and mtr_t::release_s_latch_at_savepoint() to solve the issue. I.e. we can create a mtr savepoint before requesting the clustered index record, and then release only the clustered index page latches instead of releasing all of the mtr's latches. See MDEV-27557.
One more way to fix it is to not execute the "don't move the cursor forward if a record with the same unique fields is found" logic for non-locking reads at all, because it does not make sense for a non-locking read. For locking reads that logic was implemented to find a just-inserted record after the reading transaction is woken up and the cursor position is restored; this is necessary for correct locking order. For a non-locking read, locking order does not matter, and even if something was inserted after the read view was created, it will not be visible to that read view (depending on the isolation level).
I would prefer to create a separate task to implement marko's idea as a future optimisation, and to implement my approach now.
Locking read
For a locking read, row_search_mvcc() contains the following code:
dberr_t
row_search_mvcc(...)
{
	...
locks_ok:
	/* NOTE that at this point rec can be an old version of a clustered
	index record built for a consistent read. We cannot assume after this
	point that rec is on a buffer pool page. Functions like
	... */

	if (rec_get_deleted_flag(rec, comp)) {
locks_ok_del_marked:
		ut_ad(index != clust_index
		      || row_get_rec_trx_id(rec, index, offsets));

		/* The record is delete-marked: we can skip it */

		/* ... an optimization to skip setting the next key lock
		on the record that follows this delete-marked record. This
		optimization works because of the unique search criteria
		which precludes the presence of a range lock between this
		delete marked record and the record following it.

		For now this is applicable only to clustered indexes while
		doing a unique search except for HANDLER queries because
		HANDLER allows NEXT and PREV even in unique search on
		clustered index. There is scope for further optimization
		applicable to unique secondary indexes. Current behaviour is
		to widen the scope of a lock on an already delete marked record
		... */
		if (...
		    && !prebuilt->used_in_HANDLER) {
			err = DB_RECORD_NOT_FOUND;
			...
		}

		...
	}
	...
}
I.e. the delete-marked record is ignored, and either the next record is read or row_search_mvcc() returns DB_RECORD_NOT_FOUND. So case (1) is impossible. Case (3) is also impossible for the same reason: the code above shows that the clustered index record is not requested for a delete-marked secondary index record.
Consider case (2). To reproduce the issue there must be two records with the same unique fields: the first record (in index order) must be non-delete-marked, and the second record must be delete-marked. Some transaction must hold a lock on the delete-marked record, and some other transaction must hold a lock on the non-delete-marked record, create a waiting lock on the delete-marked record and suspend. In any case the first record will always be locked before the second record, so any transaction with a conflicting lock will be blocked on the first record. The locks will be released on transaction commit, and during this process the lock hash table cell will be latched, which means any other transaction will wait until both locks are released. There can therefore be no situation in which some other transaction has acquired the first record's lock and blocked on the second record's lock (which would let purge remove the record stored in the persistent cursor). So case (2) is impossible too.
Summary
The issue caused by my changes does not exist for locking reads. It does exist for non-locking reads, but for a non-locking read my changes don't make sense, so we can simply switch them off in that case. We can also avoid the mtr commit in the case when we have to read the clustered index record during a secondary index search, and unlatch only the clustered index pages, as marko suggested; this is safe because the clustered index pages are S-latched in this case (see MDEV-27557).