[MDEV-16664] InnoDB: Failing assertion: !other_lock || wsrep_thd_is_BF ... if innodb_lock_schedule_algorithm=VATS Created: 2018-07-02 Updated: 2023-02-28 Resolved: 2022-06-30 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB, Storage Engine - XtraDB |
| Affects Version/s: | 10.1.19, 10.3.4, 10.2.13, 10.4.0, 10.5.0 |
| Fix Version/s: | 10.6.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Matthias Leich | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | regression-10.2, rr-profile | ||
| Environment: |
Ubuntu 17.04 (Zesty Zapus) but I assume that the OS is not important. |
||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
Sorry, but the frontend did not allow me to fill in the MariaDB versions used.
|
| Comments |
| Comment by Matthias Leich [ 2018-07-02 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Bug reports with similar content simplified RQG YY grammar:
The bug seems to be a concurrency problem, because attempts with only one session did not replay. |
| Comment by Matthias Leich [ 2018-07-04 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Marko gave the hint to try the test with "innodb_lock_schedule_algorithm=fcfs". | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-07-04 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
If this problem is not repeatable with innodb_lock_schedule_algorithm=fcfs, it seems to be a regression caused by | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
I posted feedback about this to the
For the record, MySQL 8.0 refers to the contribution as WL#10793, MySQL Bug #84266, pull request #115, and mysql/mysql-server@fb056f4. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2018-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Similar fast replay with 10.1.35-MariaDB-debug (InnoDB: Percona XtraDB (http://www.percona.com) 5.6.39-83.1) and innodb_lock_schedule_algorithm=vats. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-07-07 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
In MariaDB Server 10.2.17 and 10.3.9, we will revert to the old algorithm (innodb_lock_schedule_algorithm=fcfs) until this problem has been isolated and fixed. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2018-07-31 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-09-30 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
This looks very similar to
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2019-08-30 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-04-21 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
I am setting this to Blocker priority for the 10.6 release, with the intention that if this cannot be fixed before the 10.6.0 release, we will remove the feature. If the bug can be fixed, the fix should be applied to 10.1 and later releases. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-04-30 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
In MySQL 8.0, there have been some fixes to CATS (as the feature is called there). The latest one as of this writing appears to be a complete rewrite of CATS in MySQL 8.0.20. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-07-20 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Analysis: There are basically two cases
In both cases we have a condition like
This means that the lock request has neither the WAIT flag nor the GAP flag. For VATS and Galera, the WAIT flag is the interesting one here. Traditionally, InnoDB used First-Come-First-Served (FCFS) scheduling. Under FCFS, if a new lock request does not have the WAIT flag, there cannot be any other conflicting lock request for the same record. If there had been one, the new request would not have been granted; it would have received the WAIT flag and been added to the end of the queue. However, Galera Brute-Force (BF) and VATS might schedule differently.
After a second analysis, it seems that we might order a new lock request before an already granted lock request in the queue, which is naturally a bug. |
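The FCFS reasoning above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not InnoDB code: the names `Request`, `conflicts`, `enqueue_fcfs`, `queue_is_valid` and `assert_queue_valid` are hypothetical, gap locks are ignored, and only shared/exclusive modes are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only; none of these names exist in InnoDB.
enum class Mode { S, X };

struct Request {
    int trx_id;    // requesting transaction
    Mode mode;     // shared or exclusive
    bool waiting;  // models the WAIT flag
};

// Requests of different transactions conflict if at least one is exclusive.
inline bool conflicts(const Request& a, const Request& b) {
    return a.trx_id != b.trx_id && (a.mode == Mode::X || b.mode == Mode::X);
}

// FCFS: the new request always goes to the end of the queue, and it gets
// the WAIT flag if any request already in the queue conflicts with it.
// Returns true if the request was granted immediately.
inline bool enqueue_fcfs(std::vector<Request>& queue, int trx_id, Mode mode) {
    Request req{trx_id, mode, false};
    for (const Request& other : queue) {
        if (conflicts(req, other)) {
            req.waiting = true;
            break;
        }
    }
    queue.push_back(req);
    return !req.waiting;
}

// The invariant behind the assertion, in toy form: no two granted
// (non-waiting) requests in the queue may conflict with each other.
inline bool queue_is_valid(const std::vector<Request>& queue) {
    for (std::size_t i = 0; i < queue.size(); ++i) {
        for (std::size_t j = i + 1; j < queue.size(); ++j) {
            if (!queue[i].waiting && !queue[j].waiting &&
                conflicts(queue[i], queue[j])) {
                return false;
            }
        }
    }
    return true;
}

// Debug-style check, loosely analogous to a queue validation assertion.
inline void assert_queue_valid(const std::vector<Request>& queue) {
    assert(queue_is_valid(queue));
}
```

With pure FCFS enqueueing, `queue_is_valid` holds after every call to `enqueue_fcfs`. Any scheduler that grants or reorders requests in another way has to re-establish the same invariant by other means.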
| Comment by Jan Lindström (Inactive) [ 2020-07-20 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
marko, I leave it to you to decide whether we take whatever MySQL did for 8.0 or just fix the debug assertion. In my opinion, the new implementation looks like quite a big change for GA releases. |
| Comment by Marko Mäkelä [ 2020-07-27 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Because The assertion has been ‘watered down’ for Galera. The original assertion expression is simply !other_lock. This bug was repeated without enabling any Galera replication. What matters is that other_lock is set, that is, a conflicting lock exists when it is not expected to. This problem is specific to innodb_lock_schedule_algorithm=VATS only. If there are no resources to fix the problematic VATS implementation, I think that the most meaningful course of action is to remove it from the MariaDB 10.6 release. Note: MariaDB 10.2 also inherited a similar bug from MySQL 5.7, which we fixed in | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-08-17 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
I believe that the ALTER TABLE statement in the grammar that mleich posted could be replaced with COMMIT. I tried to create an MTR test case based on the RQG grammar, but my test failed to reproduce the assertion failure on a recent 10.5 branch. |
| Comment by Marko Mäkelä [ 2020-09-18 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
I analyzed a recent trace of this. We have two DELETE statements executing the locking read part, and both are being granted an exclusive lock on the same record (in a clustered index root page). The lock breakage appears to occur in the following section of the function lock_rec_insert_by_trx_age():
Right before the call, lock_rec_enqueue_waiting() removed the lock from the hash table:
In lock_rec_has_to_wait_in_queue(), we will find no conflicting lock, even though the lock that was acquired by the first thread still exists:
In that loop, we fail to invoke lock_has_to_wait(wait_lock, lock) on the other_lock object that will be flagged by the assertion failure in lock_rec_queue_validate(), invoked by lock_clust_rec_read_check_and_lock(), even though that other_lock remains granted in LOCK_X mode for the same record. At the time the conflicting LOCK_X other_lock was granted to the other DELETE, our DELETE was waiting for block->lock on the clustered index root page. You definitely should not need a big table to repeat this, because the clustered index consists of the root page only. The test case does not involve any secondary indexes. The PAGE_N_RECS is 6 at the time of the failure, but there are some PAGE_GARBAGE records in the page. I suspect that the bug is at the end of lock_rec_insert_by_trx_age(). |
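To make the symptom easier to picture, here is a toy sketch of one way a conflict check can miss a granted lock. It assumes a check that only examines the queue entries ahead of the waiting request. That assumption, and the names `LockReq`, `has_conflict`, `must_wait_scanning_predecessors` and `granted_exclusive_count`, are hypothetical and are not taken from the InnoDB source. It is an illustration of the failure shape, not a diagnosis of the actual defect.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only; none of these names exist in InnoDB.
struct LockReq {
    int trx_id;      // requesting transaction
    bool exclusive;  // true for an exclusive request, false for shared
    bool waiting;    // models the WAIT flag
};

// Requests of different transactions conflict if at least one is exclusive.
inline bool has_conflict(const LockReq& a, const LockReq& b) {
    return a.trx_id != b.trx_id && (a.exclusive || b.exclusive);
}

// Decide whether the request at position pos still has to wait, looking
// only at the entries that precede it in the queue. This is sufficient as
// long as new requests are always appended at the end of the queue.
inline bool must_wait_scanning_predecessors(const std::vector<LockReq>& queue,
                                            std::size_t pos) {
    assert(pos < queue.size());
    for (std::size_t i = 0; i < pos; ++i) {
        if (has_conflict(queue[pos], queue[i])) {
            return true;
        }
    }
    return false;
}

// Number of granted exclusive requests; more than one on the same record
// means that the lock queue is broken.
inline int granted_exclusive_count(const std::vector<LockReq>& queue) {
    int n = 0;
    for (const LockReq& r : queue) {
        if (r.exclusive && !r.waiting) {
            ++n;
        }
    }
    return n;
}
```

If a waiting exclusive request is appended after a granted exclusive lock, the scan reports that it must wait. If the same request is placed ahead of the granted lock, the scan finds nothing, and granting it leaves two granted exclusive requests on one record, which is the kind of state that the failing assertion rejects.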
| Comment by Marko Mäkelä [ 2020-10-02 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
MariaDB Server 10.2, 10.3, 10.4, 10.5 will issue a warning message if the parameter is set to innodb_lock_schedule_algorithm=VATS, but the buggy behaviour will not be removed. The parameter innodb_lock_schedule_algorithm will be removed in MariaDB Server 10.6. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-10-05 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Starting with MariaDB Server 10.2.35, 10.3.26, 10.4.16, 10.5.7, a deprecation and corruption warning will be issued if the server is being started up with innodb_lock_schedule_algorithm=VATS. Starting with MariaDB Server 10.6, the parameter innodb_lock_schedule_algorithm will be removed. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-10-06 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
https://github.com/MariaDB/server/commit/f35b29674ec22f1ee7d944dc3765707b070e0ea0 Fix candidate tested with :
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-10-07 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
jplindst, I think that it is confusing to reopen a bug that we already used for removing (10.6) and deprecating a feature that did not work. It might be better to file a separate ticket for fixing the crash. Please rebase your work to a branch that includes a fix of | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-10-22 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
This ticket will be used for tracking the originally reported corruption that is caused by the setting innodb_lock_schedule_algorithm=VATS. Once the feature is fixed to work correctly, and if it has been demonstrated to improve performance, it can be permanently enabled in a later version. The code does not exist in the current MariaDB 10.6 development branch. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-10-27 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
The attached test case produces a lock wait queue exactly as in the crash case, but the actual crash still does not reproduce, so something more is needed. |
| Comment by Marko Mäkelä [ 2021-01-28 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
While working on | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2021-02-11 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Unfortunately, I have not found a test case that would reproduce this, and my guessed fixes do not work. Therefore, based on my current knowledge, I must say that I do not know how to fix this issue. If we want, maybe we should use the same implementation as MySQL 8.0 (this change might be out of scope for 10.6). |
| Comment by Marko Mäkelä [ 2022-06-30 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
This was fixed in MariaDB Server 10.6.0 by removing the parameter. |