[MDEV-20605] Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration Created: 2019-09-16 Updated: 2023-06-07 Resolved: 2022-02-14 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 5.5, 10.0, 10.1, 10.2.27, 10.3, 10.4 |
| Fix Version/s: | 10.3.35, 10.4.25, 10.5.16, 10.6.8, 10.7.4, 10.8.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Geoff Montee (Inactive) | Assignee: | Vladislav Lesin |
| Resolution: | Fixed | Votes: | 5 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
Some users are seeing duplicate key errors like the following when using the optimistic and aggressive modes of parallel replication:
However, when the problem is analyzed, it is clear that the duplicate key should not occur. The configuration:
The configuration with slave_parallel_mode=optimistic was identical. Some observations:
|
| Comments |
| Comment by Marko Mäkelä [ 2019-10-09 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I debugged this in Elkin’s environment today and concluded that InnoDB is to blame. The InnoDB purge thread seems to be wrongly converting a waiting gap lock request that is attached to a purgeable record, to a granted gap lock request on the page supremum. I have suspected that the gap-lock-widening of the purge thread could cause problems, but mostly I had the non-waiting scenarios in mind, and did not come up with a specific example. The gap locking rules are summarized in MDEV-16406. MDEV-20605.patch The test case that I debugged was running on a replication slave, and it would not fail every time. It is much more convenient to test and fix this without involving replication. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2019-10-11 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-12 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Let me give some background for understanding the problem and a possible solution. InnoDB implements multi-versioning by guaranteeing to keep the full history of committed transactions until there is no old read view that would see the database state as it was before those transactions were committed. In particular, records are not deleted immediately; they are merely marked for deletion. Only the ROLLBACK of an INSERT can immediately delete a record. So, there may be many delete-marked records in indexes. Some of them could be purgeable. There may be a significant purge lag, for example, in the instant when a very old read view was destroyed when its transaction was committed, millions of records could become purgeable. InnoDB implements next-key locking by gap locks that are anchored to the record that is succeeding the preceding record of the gap. (There are two mutually exclusive types of gap locks, in MDEV-16406 called read-gap and write-gap locks. Traditionally, read-gap locks are called ‘gap locks’ and write-gap locks are ‘insert intention locks.) I think that at the root of the problem is the fact that the anchor records of the gap locks may be committed (and possibly purgeable) delete-marked records. This problem has been described earlier in The problemAs proven with MDEV-20605.patch Possible workaround: Suspend purgeA simple ‘fix’ could be to stop the purge when it detects this situation. There are numerous problems with this, such as:
Possible fix: Acquire gap locks on all successive delete-marked committed recordsThere seems to be another way of guaranteeing that purge will be invisible to transactions. Let us assume that the index contents is as follows, and we want a gap-lock that covers the non-existing key (2): (0),delete-marked(5),delete-marked(9),13 Currently, we would only acquire a gap lock on the delete-marked key (5), and this gap lock would cover the keys in the open range (0,5). If purge removes the key (5), it would ‘widen’ the gap by attaching the gap lock on the delete-marked key (9). So, the gap would cover (0,9). What if we changed the gap lock acquisition algorithm? Instead of acquiring a gap lock only on the successor record, let us keep acquiring record+gap locks on its successors until we reach the end of the index, or a record that is not delete-marked. With this approach, we would lock the semi-open ranges (0,5],(5,9] before locking the final open range (9,13). This combined range (0,5],(5,9],(9,13) is equivalent to (0,13). Thus, we are immune to the scenario where purge would remove the delete-marked anchor records in between! Note: I am proposing that we acquire also record locks on the delete-marked records. I think that we are already doing that, except for READ COMMITTED isolation level the fix of With this extended gap locking, whenever we remove a record, we can simply remove the attached gap lock without having to create a gap lock for the successor record. If there are waiting gap lock requests on the record that we are removing, we would have to wake up the waiting transactions. Note on determinismI believe that the extended gap locking could make InnoDB gap locks fully deterministic and thus possibly help make some cases of replication deterministic. Currently, without my proposed change, depending on how much purge is lagging behind, gap locks could be ‘narrow’ or ‘wide’. Let us consider my above example again:
With my proposed fix, both the SELECT and INSERT would lock the whole range of 0<k<13, and they would conflict with each other, leading into a lock wait timeout for one of the transactions if neither transaction is committed. Currently, depending on when purge removes the delete-marked committed keys k=5, k=9, both transactions could actually succeed, and never observe a lock wait timeout, no matter how long they wait. And, I believe that both would eventually be holding conflicting locks (read-gap a.k.a. ‘gap’ lock and write-gap a.k.a. ‘insert intention’ lock) on the full range 0<k<13. This feels very wrong! | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Thirunarayanan Balathandayuthapani [ 2019-10-16 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The following test case can get crash for instrumented patch which basically checks whether purge grants the waiting lock.
I wrote the comments in test case for explanation. I am not convinced that instrumental patch repeats the correct issue.
I may need to get the perspective from marko as well. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
My debug instrumentation may be too eager. I still think that my idea of acquiring multiple gap locks (and simply discarding them on purge) is correct. I must to come up with an executable test case that demonstrates my line of thought. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I debugged thiru’s test cases (with some fixes) on 10.3, with MDEV-20605.patch
Curiously, if I do not add the records f1<0 to the table, the SELECT * FROM t1 WHERE f1>2 FOR UPDATE will actually lock the delete-marked record f1=1, because the WHERE condition would not be passed down to prebuilt->search_tuple. Apparently, it will be a full table scan, via handler::ha_index_first(). This is causing some unnecessary locking. The SELECT should not really acquire a lock on the delete-marked record itself, but instead only acquire a special gap lock that will prevent the delete-marked key from being inserted by other transactions. That is, there should be no conflict between the two locking SELECT operations to begin with, because neither one matched any records. Both SELECT should be allowed to continue in the first place. If the extra records f1<0 are present in the table, then that will be the case: only a gap lock on the supremum record will be acquired, only blocking any future INSERT. I believe that implementing my idea of acquiring multiple gap locks when encountering delete-marked records should fix also this scenario. If the record was committed as delete-marked, we must acquire a read-gap-only lock on that record, and keep advancing to next records until an uncommitted record or committed non-delete-marked record or the end of the index is reached. Furthermore, we must ensure that row_ins_sec_index_entry_by_modify() and row_ins_clust_index_entry_by_modify() and similar functions will conflict with any read-gap-only locks that might exist on the successor of the committed delete-marked record. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I tried and failed to create a test case that involves INSERT (which was part of the original replication scenario). While debugging my test, I see that we are unnecessarily locking a delete-marked committed record f1=1 (but not the preceding gap) here:
The current statement is
The delete-marked committed record is f1=1, which is at the very start of the search range. The search is also locking ‘one row too much’, f1=4 (and its preceding gap), which is after the search range. If I add two ‘sentinel records’, hoping to not see a conflict between the INSERT and SELECT, also f1=5 will be locked by the SELECT:
For that particular bug (locking too many records after a range), MySQL 8.0.18 included a fix that we should evaluate and try to port to MariaDB. Side note: When purge is removing the record f1=1, it is creating a LOCK_GAP|LOCK_X|LOCK_REC on the successor record f1=3, which is delete-marked and committed. It fails to notice that there already was a covering LOCK_X|LOCK_REC on that record, and unnecessarily allocates another record lock. My proposed fix for this bug would avoid any lock creation when purge or rollback are deleting records. Such unnecessary allocation would also be avoided if we introduce a single record lock bitmap with 4 bits per record, as proposed in MDEV-16406. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-25 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
An alternative approach was suggested, to make purge more visible to concurrent transactions. We already have an unfortunate case of that, for SPATIAL INDEX (see MDEV-15284 and MDEV-16269). What would be the benefit of making purge more visible? To me, doing so would seem to cause more locking inconsistency between server instances in a replication or Galera setup, which is something that should be avoided. I would prefer things to be so regular that they can be described in simple abstract terms. A maintenance task that runs in the background and lacks theoretical foundation should be as invisible as possible. Side note: This was a main motivation for remove the InnoDB background change buffer merge in It is a bug if purge or rollback are creating or acquiring any locks.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2019-12-19 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Let's consider the following test (the same test which was previously considered by marko and thiru under 10.2:
The first "select" sets LOCK_ORDINARY locks on delete marked records (10), (20) and on record (30). The locks are set with the following stack trace:
Let's take a look at the following code in row_search_mvcc():
Please pay attention on "goto next_rec;". I.e. when delete marked record is locked, the lock on the next record is acquired. That is why the first "select" locks two delete marked records (10), (20) and one record (30). If we apply the patch proposed by thiru and modify it a little bit:
I.e. purge thread will not extend lock from delete marked record to it's successor, we will see that the second "select" will fail with "lock wait timeout" error even after all delete marked records are purged, because the first "select" set conflicting lock on record (30). In other words the requested by marko functionality is already implemented. If so then why the lock is granted for the second "select" in the original test after delete marked (10) is purged? Let's now consider the original test. I.e. replace:
In this case the first "select" locks delete marked (10) record with LOCK_ORDINARY type and supremum. All locks on supremum are considered as gap locks. When purge thread unlocks delete marked (10), the second "select" is unlocked(see lock_rec_reset_and_release_wait() call in lock_update_delete(), which is invoked from purge thread) and acquires lock on supremum. And the lock is granted with the following stack trace:
Let's look at the following code in lock_rec_has_to_wait():
As locks on supremum are considered as gap locks, and gap locks without LOCK_INSERT_INTENTION flag are not conflicting, then the above code and behaviour are correct. Note if we try to insert some record in the table after delete marked (10) is purged, the "insert" will be locked as expected even if we don't invoke lock_rec_inherit_to_gap() for delete marked (10) from purge thread, because the first "select" set lock on supremum. As well I would add, that even if we invoke lock_rec_inherit_to_gap() from purge thread, it will not add new lock for supremum and for (30) in the previous test, because it does not create new lock if it already exists, so there is no need to change the code here. I have not yet checked the behaviour on "rollback", but the above allows to claim, that the test for the root case is wrong and the requested functionality is already implemented. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-12-20 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
A few quick observations:
My observation on 2019-10-09 suggests that it is necessary to remove some lock_rec_inherit_to_gap() calls to fix the scenario that I analyzed:
It may be the case that in order to fix the problems, we do not only have to remove lock_rec_inherit_to_gap(), but also actually fix | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-12-20 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Further observations: A full fix of
I was actually expecting that the SELECT would block due to the preceding INSERT, but that is not the case, even if the DELETE statement is removed or the condition k=6 is replaced with anything else than k=3. Only if I change the condition to use >=, such as k>=10, there will be a locking conflict on record heap number 6, which is the record k=3. This is apparently because of an optimizer misfeature that appears to prefer a full table scan over range scans for small tables. vlad.lesin, for a complete understanding of this, I hope that you can determine when exactly INSERT is acquiring write-gap locks. I think that in the basic case, INSERT relies solely on implicit locking (no record locks). If the INSERT were acquiring a write-gap lock, then it obviously should block a later locking read operation with k=4, because that key would fall within the same gap. Anyway, the last INSERT after the locking SELECT is showing the problem that I wanted to illustrate. As long as the delete-marked record k=5 exists, the last INSERT can proceed. If the record k=5 is purged while both transactions are active, then we will have both a write-gap and read-gap lock on the remaining gap, which is a violation of the gap locking rules! I think that replication (as well as some users) can reasonably expect locking to be completely agnostic to the purge of history. Fixing this (that is, ensuring that the SELECT and INSERT will conflict no matter if delete-marked records exist between them) might introduce more locking conflicts, so we might want to introduce a configuration option (or abuse innodb_locks_unsafe_for_binlog) to enable the old non-conflicting gap locking. Note: that deprecated option was removed from MariaDB Server 10.5.0 in | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2020-01-09 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The above test shows the problem. The following is the explanation why the last INSERT is locked after purge and is not locked before purge. 1. Connection "default", repeatable read, "insert into t values(3)":
where dmc - delete-marked committed 2. Connection "con1", serializable, "select * from t where k=6":
The lock on (9) is set with the following backtrace:
3. Connection "default", repeatable read, "insert into t values(10)": a) before purge
(10) is inserted as there is no gap lock on (13) b) after purge
As we can see, the problem is that gap locks on delete-marked records are not propagated till their non-delete-marked successors for point queries:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-08-12 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I will take over this. But I am afraid that I must address | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2020-09-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko I have just pushed my draft to https://github.com/MariaDB/server/tree/10.2-MDEV-20605-gap-locks-debug. I have implemented and tested changes in row_search_mvcc(). I have implemented and not tested changes in row_sel(). I started implementation of system variable to test row_sel(), you can see it in one of the commits. I have implemented changes for foreign key constraints check. But I don't like this implementation, because it's complexity is n*m. See comment in commit message. I have not implemented duplicates check for foreign check. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sorry for the delay; I was occupied by 10.6 development work. The changes in the 10.2 based branch look mostly OK to me. For testing, I think that we will need a 10.6-based branch. My understanding is that the function row_sel() is never invoked on spatial indexes. It might also be that all locking use of it is for system tables or InnoDB persistent statistics tables. If functions like fts_doc_fetch_by_doc_id() never execute a locking read, then we might not have to modify row_sel() at all. A more thorough description of how gap locks work in the FOREIGN KEY locks will be needed. Also, whenever we check that a record is delete-marked, is it certain that the record has been committed? I am fairly sure that a gap lock can be safely attached to a delete-marked non-committed record. Another related doubt: Is it safe to attach a gap lock to a record that has been inserted by a not-yet-committed transaction? Maybe, in such cases we should always also acquire a lock on the ‘anchor’ record itself (so that we will wait for the transaction commit), and not only on the preceding gap. As part of testing this, I think that we must repeat and analyze the failure of MDEV-22889. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2021-03-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'm on 10.5.8 and struggling with this problem. The server, in 'optimistic' mode encounters a 1964 error, and the Slave_SQL_Running goes to No, and the whole thing breaks. So, optimistic does not work in 10.5.8. I'm testing with 'conservative', but the slaves are unable to keep up with the master presently. Just watching as I type, I have not seen the Seconds_Behind_Master decrease since I restarted in 'conservative' mode. I'm using the 'mixed' binlog format. Switching to 'row' likely won't help either. We are already generating 5TB of binary log's per day in 'mixed' mode. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-19 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I think that the following must be part of the fix:
In a FOREIGN KEY check, if we encounter a delete-marked record, we must ensure that the record is committed, by acquiring a lock on both the record and the preceding gap. And if it was the case, we must continue acquiring such locks on succeeding records until the end of the index, or until a record that is not delete-marked. On that non-delete-marked record, we must acquire a gap-only lock. And we must not recurse into that non-delete-marked record, because we already found that the first record was delete-marked. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-19 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Why should it suffice to acquire a gap-only lock on that last non-delete-marked record in the traversal in row_ins_check_foreign_constraint()? What if that last record is later delete-marked and committed? Can it be purged, so that the last segment of our gap lock would be widened in purge? The answer is that normally, the transaction that is executing the traversal in row_ins_check_foreign_constraint() has a read view that will be older than a possible future commit of delete-marking the record that was not delete-marked during our traversal. Because purge cannot advance beyond the oldest existing read view, the record could not be purged. However, there are exceptions at two isolation levels: In either case, there is nothing that would prevent purge from advancing to that last record. In these cases, it could be simplest to acquire LOCK_ORDINARY on the last segment of the gap (when the anchor record is not delete-marked). That would prevent such a future delete-mark operation from occurring before the gap-owning transaction is committed or rolled back. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Larry Adams [ 2021-03-19 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Per the recommendation made by others, I switched to ROW. So far so good. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-22 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
One more thought: In When innodb_locks_unsafe_for_binlog was set, InnoDB would skip gap locking in many cases. The non-descriptive name of the option hints that it was known to break statement-based replication. Later, the option was made redundant by incorporating such logic in the READ UNCOMMITTED and READ COMMITTED isolation levels. That those isolation levels acquire fewer gap locks is expected, as noted in All this suggests that the limitations that applied to innodb_locks_unsafe_for_binlog=ON will also apply to the READ UNCOMMITTED and READ COMMITTED isolation levels. Having said all this, we definitely do have a problem with the gap locks on delete-marked records that affects the default REPEATABLE READ isolation level. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-22 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The option innodb_locks_unsafe_for_binlog was deprecated in MySQL 5.6.3. Already back then, the logic had been incorporated to the READ UNCOMMITTED and READ COMMITTED isolation levels. So, those isolation levels are as unsafe to use with replication as the deprecated setting innodb_locks_unsafe_for_binlog=1. We will not change the reduced acquisition of gap locks in this ticket Note: I do not know what exactly was or is unsafe about innodb_locks_unsafe_for_binlog. I hope that Elkin can shed light on that. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-11-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
could you do a small test case to demonstrate that behavior? you can use, for example, sleep() to make sure statements are executed in the desired order on the slave. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2021-11-10 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The test case is attached. It repeats the customer's root case with secondary keys locking. But more likely the same wrong cursor position restoring behaviour can be repeated for primary keys too. The test was developed for 10.6, I did not check it for the earlier versions, it might be some syncpoints missed for the earlier versions. MDEV-20605-restore-cursor.test | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2021-11-10 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
There is one more test case with deadlock. MDEV-20605-deadlock.test It looks strange to me. The situation is the following. We have two transactions and one record. The first transaction holds ORDINARY S-lock on the record, the second transaction created waiting ORDINARY X-lock and waits for the first transaction. Then the first transaction requests insert-intention lock on the record. And this lock conflicts with the waiting ORDINARY X-lock of the second transaction. What causes deadlock. Why it should conflict? The first transaction already holds lock on the record. And the second's transaction lock contains "waiting" flag. This should be reported in the separate MDEV. See | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2021-11-11 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko, I think the causer is not https://github.com/MariaDB/server/commit/f4fd68857b2e7767b559dbc09a8d4abbe7ccec29 commit. If we take a look the initial 5.0 import to git, we will see the following code:
The comment of btr_pcur_restore_position() says:
So "success" in sel_restore_position_for_mysql() contains two conditions joined with "and":
I understand why the cursor is moved forward if the first condition is false, i.e. the restored position is infinum or supremum. But I don't understand why it should be moved if the second condition is false. For our case the cursor position was restored to the new record with the same key as in the stored record, but with different value. I think this is error and when the code was developed, the developer took into account the first condition and did not take into account the second one. So the fix could be in dividing those two conditions. btr_pcur_restore_position() could, for example return true if cursor is not supremum or infinum, and have one more out parameter to notify if restored position points to the record with the same fields as the stored one. What do you think about it? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2021-11-19 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Some update. The first I changed my test a little bit. There is no need to execute the last INSERT in parallel with the previous DELETE. We can finish the previous DELETE, and then execute the INSERT. The updated version is here: MDEV-20605-restore-cursor-v2.test I fixed cursor restore position function and the test passes with the fix. But there are test failures because of the fix: rpl.rpl_parallel_optimistic_nobinlog rpl.rpl_parallel_optimistic. I have not yet fully analysed them, but I suppose the issue is with locking and commit order. About locking order. Elkin found the the test fails even if we remove purging. I analysed it too and came to conclusion that this is two independent issues. Here is the test: MDEV-20605-locking.test Theoretically slave must resolve such issues allowing commits on slave to be in the same order as on master. But I have some doubts about it, need to analyse rpl.rpl_parallel_optimistic_nobinlog failures to be sure. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2021-12-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Update: local testing of my fix looks good, pushed https://github.com/MariaDB/server/tree/bb-10.6-MDEV-20605-cur-pos-fix for buildbot testing. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2022-01-20 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Some update and analyses. The initial issue.During the last RQG testing "CHECK TABLE" showed error in the following code:
The number of rows does not match for the first and the next indexes. The initial analysesThe scenario is the following. Suppose we have a pair of records in secondary index with the same unique fields, the first record is non-delete-marked, the second record is delete-marked-committed. The transaction of "CHECK TABLE" does non-locking read of secondary index. After the record from secondary index is read in row_search_mvcc(), the correspondent clustered index record is also requested(see mtr_has_extra_clust_latch local variable in row_search_mvcc()). After that before requesting the next secondary index record in row_search_mvcc() cursor position is stored, mtr is committed, mtr is started and cursor position is restored(to preserve clustered index pages latching order). In the room when the page was unlatched(after mtr commit) the cursor's record was purged, and then cursor position was restored to the previous secondary index record. As the previous record had the same unique fields as the unique fields in the stored in cursor record, the cursor was not moved forward, and the same record was pushed twice in results set. The above case can happen only if cursor position is stored, the page is unlatched, purge thread removed stored in cursor record, and then the cursor position is restored. There can be 3 cases when row_search_mvcc() does the above(see btr_pcur_store_position() and sel_restore_position_for_mysql() calls in row_search_mvcc()): 1) When row_search_mvcc() is finishing. It can be invoked with the same cursor again to continue index reading, see ha_innobase::index_next() as an example; Consider the cases for locking and non-locking reads. Non-locking readCase (1) should not cause the issue because if the locking read view can see the record and the view is still opened, the record can't be purged. Case (2) can't happen for non-locking read by definition. For non-locking read for case(3) the following code requests clustered index record even if the secondary index record is delete-marked:
That is why for locking read row_search_mvcc() can unlatch the page after clustered index record is read:
And when the page is unlatched, purge has ability to remove stored in the cursor record, what causes rows duplication in result set. marko proposed to use mtr_t::get_savepoint() and mtr_t::release_s_latch_at_savepoint() to solve the issue. I.e. we can create mtr savepoint before requesting clustered index record, then release only clustered index page latches instead of releasing the whole mtr latches. See One more way to fix it is not to execute that "don't move forward cursor if the record with the same unique fields is found" logic for non-locking reads at all. Because it does not make sense for non-locking read. For locking read that logic was implemented to found just inserted record after reading transaction is awaken, and cursor position is restored. This is necessary for correct locking order. For non-locking read locking order does not make sense. And even if something was inserted after read view is created, it will not be visible for that read view(depending on isolation level). I would prefer to create separate task to implement marko's idea for the future optimisation and implement my approach now. Locking readFor locking read row_search_mvcc() contains the following code:
I.e. delete-marked record is ignored, and either the next record is read, or row_search_mvcc() returns DB_RECORD_NOT_FOUND. So case (1) is impossible. Case (3) is also impossible for the same reason, the above code shows that clustered index record is not requested for delete-marked secondary index record. Consider case (2). To repeat the issue there must be two records with the same unique fields, the first record (in index order) must be non-delete-marked, the second record must be delete-marked. Some transaction must hold lock on delete-marked record, and some other transaction must hold lock on non-delete-marked record, and create waiting lock and suspend on delete-marked record. In any case the first record will always be locked before the second record. So any transaction with conflicting lock will be blocked on the first record. The locks will be released on transaction commit, and during this process lock hash table cell will be latched, what means any other transaction will wait until both locks are released, and there can't be a situation when some other transaction acquired first record lock, and blocked on the second record lock(to let "purge" to remove stored in persistent cursor record). Is the case (2) is impossible too. SummaryThe issue caused by my changes does not exist for locking read. It does exist for non-locking read, but for non-locking read my changes don't make sense, so we can just switch them off for this case. We can also prevent mtr commit in the case when we have to read clustered index record during secondary index search, and unlatch only clustered index pages, as marko suggested, it's safe because clustered index pages are S-latched in this case(see | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2022-02-11 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
OK to push. The | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2022-02-22 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The branch bb-10.2- |