MariaDB Server
MDEV-16664

InnoDB: Failing assertion: !other_lock || wsrep_thd_is_BF ... if innodb_lock_schedule_algorithm=VATS

Details

    Description

      Sorry, but the frontend did not allow me to fill in the MariaDB versions used.
      No replay on 10.1.35 commit 1d10c9afe0f2f4fba73892e6c12ea6efe90d5931.
      Replay on

      • 10.2.17 commit 400cf017152c732387c89deaa082b43c8fb42d71
      • 10.3.8 commit 71144afa966a85d08053eb616a1021fd339102d1
        all around 2018-07-02 + compiled with debug.

      Server error log entries + Backtrace of 10.3:
      InnoDB: Assertion failure in file storage/innobase/lock/lock0lock.cc line 5019
      InnoDB: Failing assertion: !other_lock || wsrep_thd_is_BF(lock->trx->mysql_thd, FALSE) || wsrep_thd_is_BF(other_lock->trx->mysql_thd, FALSE)
      ...
      Query: DELETE FROM t1 WHERE col1 = 1 OR col1 IS NULL  /* E_R Thread1 QNO 6083 CON_ID 15 */
      ...
      Status: NOT_KILLED
      #3  <signal handler called>
      #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
      #5  0x00007f8acda5237a in __GI_abort () at abort.c:89
      #6  0x000055e6267f1eaf in ut_dbg_assertion_failed (expr=0x55e626e559d8 "!other_lock || wsrep_thd_is_BF(lock->trx->mysql_thd, FALSE) || wsrep_thd_is_BF(other_lock->trx->mysql_thd, FALSE)", file=0x55e626e53a40 "storage/innobase/lock/lock0lock.cc", line=5019) at storage/innobase/ut/ut0dbg.cc:61
      #7  0x000055e62664b06c in lock_rec_queue_validate (locked_lock_trx_sys=false, block=0x7f8aaba3e100, rec=0x7f8aac52c149 "", index=0x7f8a2c2cdb18, offsets=0x7f8ac41923b0) at storage/innobase/lock/lock0lock.cc:5017
      #8  0x000055e62664dd24 in lock_clust_rec_read_check_and_lock (flags=0, block=0x7f8aaba3e100, rec=0x7f8aac52c149 "", index=0x7f8a2c2cdb18, offsets=0x7f8ac41923b0, mode=LOCK_X, gap_mode=1024, thr=0x7f8a2c2de240) at storage/innobase/lock/lock0lock.cc:5862
      #9  0x000055e62674f5e1 in sel_set_rec_lock (pcur=0x7f8a2c2ddd38, rec=0x7f8aac52c149 "", index=0x7f8a2c2cdb18, offsets=0x7f8ac41923b0, mode=3, type=1024, thr=0x7f8a2c2de240, mtr=0x7f8ac4192cd0) at storage/innobase/row/row0sel.cc:1247
      #10 0x000055e6267588d4 in row_search_mvcc (buf=0x7f8a2c2dd720 "\377", mode=PAGE_CUR_G, prebuilt=0x7f8a2c2ddb78, match_mode=0, direction=0) at storage/innobase/row/row0sel.cc:5020
      #11 0x000055e6265c17bd in ha_innobase::index_read (this=0x7f8a2c2dcf90, buf=0x7f8a2c2dd720 "\377", key_ptr=0x0, key_len=0, find_flag=HA_READ_AFTER_KEY) at storage/innobase/handler/ha_innodb.cc:9315
      #12 0x000055e6265c2ab6 in ha_innobase::index_first (this=0x7f8a2c2dcf90, buf=0x7f8a2c2dd720 "\377") at storage/innobase/handler/ha_innodb.cc:9732
      #13 0x000055e6265c2cc5 in ha_innobase::rnd_next (this=0x7f8a2c2dcf90, buf=0x7f8a2c2dd720 "\377") at storage/innobase/handler/ha_innodb.cc:9825
      #14 0x000055e6263b16f1 in handler::ha_rnd_next (this=0x7f8a2c2dcf90, buf=0x7f8a2c2dd720 "\377") at sql/handler.cc:2764
      #15 0x000055e62652db86 in rr_sequential (info=0x7f8ac41935c0) at sql/records.cc:481
      #16 0x000055e62602aa5d in READ_RECORD::read_record (this=0x7f8ac41935c0) at sql/records.h:73
      #17 0x000055e62654ae08 in mysql_delete (thd=0x7f8a2c000a98, table_list=0x7f8a2c014c78, conds=0x7f8a2c0157b8, order_list=0x7f8a2c0052f0, limit=18446744073709551615, options=549755813888, result=0x0) at sql/sql_delete.cc:727
      #18 0x000055e6260c5c03 in mysql_execute_command (thd=0x7f8a2c000a98) at sql/sql_parse.cc:4916
      #19 0x000055e6260cfe9e in mysql_parse (thd=0x7f8a2c000a98, rawbuf=0x7f8a2c014b20 "DELETE FROM t1 WHERE col1 = 1 OR col1 IS NULL  /* E_R Thread1 QNO 6083 CON_ID 15 */", length=83, parser_state=0x7f8ac4194640, is_com_multi=false, is_next_command=false) at sql/sql_parse.cc:8076
      ...
      

      Attachments

        Issue Links

          Activity

            mleich Matthias Leich created issue -
            mleich Matthias Leich added a comment - - edited

            Bug reports with similar content: MDEV-15326, but with a slightly different backtrace.

            simplified RQG YY grammar:

            query:
                \# The ALTER TABLE t1 is required for replay, but what is that doing at all?
                ALTER TABLE t1  |
                DELETE FROM t1 WHERE col1 = 1 OR col1 IS NULL ; INSERT INTO t1 (col1) VALUES (1) ;
            query_init:
                start_delay ; CREATE TABLE IF NOT EXISTS t1 (col1 INT) ENGINE = InnoDB ; SET AUTOCOMMIT = 0;
            start_delay:
                { sleep 5; return undef };
            

            The bug seems to be a concurrency problem, because attempts with one session only did not replay.
            With 8 concurrent sessions, one attempt lasting ~60 s is usually enough.

            mleich Matthias Leich added a comment - - edited

            Marko gave the hint to try the test with "innodb_lock_schedule_algorithm=fcfs".
            None of the ~170 RQG runs with the grammar above replayed the problem. All runs passed.


            MDEV-11039 merged a contributed change that changed the InnoDB lock conflict resolution and introduced a parameter innodb_lock_schedule_algorithm=vats. The old behaviour is available by setting innodb_lock_schedule_algorithm=fcfs.

            If this problem is not repeatable with innodb_lock_schedule_algorithm=fcfs, it seems to be a regression caused by MDEV-11039.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            marko Marko Mäkelä made changes -
            Labels regression-10.2

            I posted feedback about this to the MDEV-11039 pull request #248, including the following:

            In MySQL 8.0, I found the following bug fixes related to this:
            mysql/mysql-server@f395242 (a test case refers to MySQL Bug #89829, which I cannot access).
            mysql/mysql-server@2248a3e, mysql/mysql-server@c917825 (these could be addressing a similar issue as MDEV-16664)
            mysql/mysql-server@c1990af (memory management issue, apparently specific to the MySQL 8.0 implementation; in MariaDB lock_grant_and_move_on_page() which corresponds to the MySQL 8.0 lock_grant_vats() does not allocate memory)

            It looks like MySQL 8.0 employs a threshold based on lock_sys->n_waiting < LOCK_VATS_THRESHOLD (defined as 32), so it would be dynamically choosing between the algorithms FCFS and VATS (also called CATS by them) based on load. This could make it harder to repeat the MDEV-16664 failure scenario in MySQL 8.0.
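
The load-based switch described above can be sketched as a toy model. The enum, the function name, and the assumption that FCFS is used below the threshold are illustrative; only the `LOCK_VATS_THRESHOLD` value of 32 is taken from the comment above:

```cpp
#include <cassert>

// Toy sketch of the dynamic choice described for MySQL 8.0
// (hypothetical names; only the threshold value 32 is from the discussion).
enum class Algo { FCFS, VATS };

constexpr int LOCK_VATS_THRESHOLD = 32;

// Presumably: light contention -> plain FCFS; heavy contention -> VATS/CATS.
Algo pick_algorithm(int n_waiting) {
    return n_waiting < LOCK_VATS_THRESHOLD ? Algo::FCFS : Algo::VATS;
}
```

If the switch works this way, a test run would have to sustain more than 32 waiting lock requests before the VATS code paths are exercised at all, which would indeed make the failure harder to repeat in MySQL 8.0.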

            For the record, MySQL 8.0 refers to the contribution as WL#10793, MySQL Bug #84266, pull request #115, and mysql/mysql-server@fb056f4.

            marko Marko Mäkelä added a comment -
            mleich Matthias Leich made changes -
            Affects Version/s 10.1.19 [ 22111 ]

            Similar fast replay with 10.1.35-MariaDB-debug (InnoDB: Percona XtraDB (http://www.percona.com) 5.6.39-83.1) and innodb_lock_schedule_algorithm=vats.
            https://jira.mariadb.org/browse/MDEV-11039 affected 10.1.19, 10.2.3, 10.3.0.

            mleich Matthias Leich added a comment -
            marko Marko Mäkelä made changes -
            Component/s Storage Engine - XtraDB [ 10135 ]

            In MariaDB Server 10.2.17 and 10.3.9, we will revert to the old algorithm (innodb_lock_schedule_algorithm=fcfs) until this problem has been isolated and fixed.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ]
            mleich Matthias Leich added a comment - - edited

            2019-08   This seems to be https://jira.mariadb.org/browse/MDEV-15326.
            I seem to have hit a case during RQG testing with innodb_lock_schedule_algorithm=fcfs which shows the
            same backtrace. This issue is currently under investigation.
            /work/RQG_new/rqg.pl \ 
            --grammar=conf/engines/many_indexes.yy \ 
            --gendata=conf/engines/many_indexes.zz \ 
            --mysqld=--innodb_lock_schedule_algorithm=fcfs \ 
            --mysqld=--innodb_stats_persistent=off \ 
            --mysqld=--loose-idle_write_transaction_timeout=0 \ 
            --mysqld=--loose-idle_transaction_timeout=0 \ 
            --mysqld=--loose-idle_readonly_transaction_timeout=0 \ 
            --mysqld=--connect_timeout=60 \ 
            --mysqld=--interactive_timeout=28800 \ 
            --mysqld=--slave_net_timeout=60 \ 
            --mysqld=--net_read_timeout=30 \ 
             --mysqld=--net_write_timeout=60 \ 
            --mysqld=--loose-table_lock_wait_timeout=50 \ 
             --mysqld=--wait_timeout=28800 \ 
            --mysqld=--lock-wait-timeout=86400 \ 
            --mysqld=--innodb-lock-wait-timeout=50 \ 
            --no-mask \ 
            --queries=10M \ 
            --duration=350 \ 
            --seed=6 \ 
            --views \ 
            --validators=none \ 
            --redefine=conf/mariadb/general-workarounds.yy \ 
            --redefine=conf/mariadb/alter_table.yy \ 
            --redefine=conf/mariadb/bulk_insert.yy \ 
            --redefine=conf/mariadb/xa.yy \ 
            --redefine=conf/mariadb/versioning.yy \ 
            --redefine=conf/mariadb/sequences.yy \ 
            --mysqld=--log_output=FILE \ 
            --mysqld=--log-bin \ 
            --mysqld=--log_bin_trust_function_creators=1 \ 
            --mysqld=--loose-max-statement-time=30 \ 
            --mysqld=--loose-debug_assert_on_not_freed_memory=0 \ 
            --reporters=Backtrace,ErrorLog,Deadlock \ 
            --engine=InnoDB \ 
            --mysqld=--innodb-buffer-pool-size=256M \ 
            --mysqld=--innodb_use_native_aio=0 \ 
            --seed=random
            MariaDB 10.2 commit fd378fc613851a12be346329d32e1666f10610d7 compiled with debug
            

            elenst Elena Stepanova made changes -
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            marko Marko Mäkelä made changes -

            This looks very similar to MDEV-14552.
            Also, sachin.setiya.007 says that he can reproduce this assertion failure by repeating a single test enough times:

            rpl.rpl_parallel_optimistic_nobinlog 'innodb,stmt' w1 [ fail ]
                    Test ended at 2018-09-30 14:09:20
             
            CURRENT_TEST: rpl.rpl_parallel_optimistic_nobinlog
            mysqltest: In included file "./include/sync_with_master_gtid.inc": 
            included from /home/sachin/10.2/mysql-test/suite/rpl/t/rpl_parallel_optimistic_nobinlog.test at line 59:
            At line 43: query 'let $_result= `SELECT master_gtid_wait('$master_pos', $_slave_timeout)`' failed: 2013: Lost connection to MySQL server during query
            

            marko Marko Mäkelä added a comment -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4 [ 22408 ]
            mleich Matthias Leich made changes -
            Summary InnoDB: Failing assertion: !other_lock || wsrep_thd_is_BF(lock->trx->mysql_thd, FALSE) || wsrep_thd_is_BF(other_lock->trx->mysql_thd, FALSE) for DELETE InnoDB: Failing assertion: !other_lock || wsrep_thd_is_BF ... if innodb_lock_schedule_algorithm=VATS
            mleich Matthias Leich made changes -
            Attachment other_lock_assert.tgz [ 48862 ]

            The uploaded other_lock_assert.tgz contains
            - a rqg_batch.pl config file: other_lock_assert.cc
            - the RQG grammar: other_lock_assert.yy, because that grammar is not and will not be pushed into the official repository
            Running an RQG test campaign according to other_lock_assert.cc on 10.3 (2019-08) revealed the following two bugs:
            - the current bug https://jira.mariadb.org/browse/MDEV-16664
            - https://jira.mariadb.org/browse/MDEV-15326
            Both bugs show:   InnoDB: Failing assertion: !other_lock || wsrep_thd_is_BF ...
            
            

            mleich Matthias Leich added a comment -
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Marko Mäkelä [ marko ] Jan Lindström [ jplindst ]

            I am setting this to Blocker priority for the 10.6 release, with the intention that if this cannot be fixed before the 10.6.0 release, we will remove the feature.

            If the bug can be fixed, the fix should be applied to 10.1 and later releases.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Affects Version/s 10.5.0 [ 23709 ]
            Affects Version/s 10.4.0 [ 23115 ]
            Priority Major [ 3 ] Blocker [ 1 ]

            In MySQL 8.0, there have been some fixes to CATS (as the feature is called there). The latest one as of this writing appears to be a complete rewrite of CATS in MySQL 8.0.20.

            marko Marko Mäkelä added a comment -

            Analysis: There are basically two cases:

            1. lock_rec_add_to_queue : here we add a new record lock request to the queue
            2. lock_rec_queue_validate : here we validate a single record's lock request queue

            In both cases we have a condition like

            	if (!(type_mode & (LOCK_WAIT | LOCK_GAP))) {
            

            This means that the lock request has neither the WAIT flag nor the GAP flag. For VATS and Galera, the WAIT flag is the interesting one here. Traditionally, InnoDB used First-Come-First-Served (FCFS) scheduling. Under FCFS, if a new lock request does not have the WAIT flag, there cannot be any other conflicting lock request on the same record: if there had been one, this lock request would not have been granted; instead the WAIT flag would have been set and the request added to the end of the queue.
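
For illustration, the quoted flag check can be modeled as below. The numeric flag values are assumptions made for this sketch, not copied from the actual InnoDB headers:

```cpp
#include <cassert>

// Hypothetical flag values for this sketch (the real definitions live in
// the InnoDB lock headers under storage/innobase/include/).
constexpr unsigned LOCK_WAIT = 256;  // request is waiting, not yet granted
constexpr unsigned LOCK_GAP  = 512;  // gap lock on the gap before a record

// The quoted condition: true only for a granted (non-waiting), non-gap
// record lock. Under FCFS such a request implies that no conflicting lock
// can already be queued on the same record.
bool granted_record_lock(unsigned type_mode) {
    return !(type_mode & (LOCK_WAIT | LOCK_GAP));
}
```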

            However, Galera Brute-Force (BF) and VATS may schedule differently:

            1. BF: brute-force transactions should not wait for any locks. There are two ways of trying to avoid that:
              1. If the record we want to lock is free, we add the lock request to the front of the queue, even when a conflicting lock request arrived before it. This leads to a situation where the new lock request does not have the WAIT flag while the queue may contain another conflicting lock waiting for its turn.
              2. If a conflicting lock on the same record is already granted, we kill that transaction to let the BF transaction continue.
            2. VATS: here record lock requests may be reordered based on how long a transaction has been waiting for locks. This means that whenever a long-running transaction makes a new lock request, it may get its turn before other waiting conflicting lock requests. Thus, the lock request does not have the WAIT flag even though the queue may contain other conflicting lock requests waiting for their turn.

            After a second analysis it seems that we might order a new lock request before an already granted lock request in the queue; this is naturally a bug.
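
A minimal model of why VATS can break the FCFS invariant that the assertion checks. All names and the "waited long enough" eagerness rule here are invented for the sketch; the real VATS uses a transaction age/weight metric, not a fixed threshold:

```cpp
#include <cassert>
#include <vector>

// Toy record lock queue; every lock here is exclusive, so any two conflict.
struct Lock {
    int  trx;   // owning transaction
    bool wait;  // LOCK_WAIT: queued behind a conflicting lock
};

// FCFS: a conflicting newcomer always waits at the back of the queue.
void fcfs_enqueue(std::vector<Lock>& q, int trx, long age) {
    (void)age;                       // FCFS ignores how long the trx waited
    q.push_back({trx, !q.empty()});
}

// VATS-like (invented rule): a transaction that has waited "long enough"
// is granted immediately and placed at the front, bypassing the queue.
void vats_enqueue(std::vector<Lock>& q, int trx, long age) {
    if (!q.empty() && age > 100) {
        q.insert(q.begin(), {trx, false});  // granted despite a conflict
        return;
    }
    q.push_back({trx, !q.empty()});
}

// The invariant behind the assertion: with exclusive locks, at most one
// request in the queue may be granted (i.e. have no WAIT flag).
bool queue_valid(const std::vector<Lock>& q) {
    int granted = 0;
    for (const Lock& l : q)
        if (!l.wait) ++granted;
    return granted <= 1;
}
```

Under fcfs_enqueue the second transaction simply waits and the queue stays valid; under vats_enqueue a long-waiting transaction is granted in front of an already granted lock, which is exactly the "other_lock is set" state that lock_rec_queue_validate asserts against.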

            jplindst Jan Lindström (Inactive) added a comment - - edited

            marko I leave it to you to decide whether we take whatever MySQL did for 8.0 or just fix the debug assertion. In my opinion, for GA releases the new implementation looks like quite a big change.

            jplindst Jan Lindström (Inactive) added a comment -
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Marko Mäkelä [ marko ]
            marko Marko Mäkelä made changes -

            Because MDEV-12837 already hard-wired Galera to use innodb_lock_schedule_algorithm=FCFS, the observations on the Galera transaction certification process (which may abort other transactions) should not be very relevant here.

            The assertion has been ‘watered down’ for Galera. The original assertion expression is simply !other_lock. This bug was repeated without enabling any Galera replication. What matters is that other_lock is set, that is, a conflicting lock exists when it is not expected to. This problem is specific to innodb_lock_schedule_algorithm=VATS only.

            If there are no resources to fix the problematic VATS implementation, I think that the most meaningful course of action is to remove it from the MariaDB 10.6 release.

            Note: MariaDB 10.2 also inherited a similar bug from MySQL 5.7, which we fixed in MDEV-15326. That was due to a race condition in transaction commit, which basically failed to release all implicit locks atomically when the transaction state changed to committed.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Description
            serg Sergei Golubchik made changes -

            I believe that the ALTER TABLE statement in the grammar that mleich posted could be replaced with COMMIT. I tried to create a MTR test case based on the RQG grammar, but my test failed to reproduce the assertion failure on a recent 10.5 branch.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Labels regression-10.2 → need_rr regression-10.2
            mleich Matthias Leich made changes -
            Labels need_rr regression-10.2 → regression-10.2 rr-profile
            mleich Matthias Leich made changes -
            Comment [ A comment with security level 'Developers' was removed. ]

            I analyzed a recent trace of this. We have two DELETE statements executing the locking read part, and both are being granted an exclusive lock on the same record (in a clustered index root page). The lock breakage appears to occur in the following section of the function lock_rec_insert_by_trx_age():

            	if (lock_get_wait(in_lock) && !lock_rec_has_to_wait_in_queue(in_lock)) {
            		lock_grant_have_trx_mutex(in_lock);
            

            Right before the call, lock_rec_enqueue_waiting() removed the lock from the hash table:

            	if (innodb_lock_schedule_algorithm
            	    == INNODB_LOCK_SCHEDULE_ALGORITHM_VATS
            	    && !prdt
            	    && !thd_is_replication_slave_thread(lock->trx->mysql_thd)) {
            		HASH_DELETE(lock_t, hash, &lock_sys.rec_hash,
            			    lock_rec_lock_fold(lock), lock);
            		dberr_t res = lock_rec_insert_by_trx_age(lock);
            

            In lock_rec_has_to_wait_in_queue(), we will find no conflicting lock, even though the lock that was acquired by the first thread still exists:

            	for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no);
            	     lock != wait_lock;
            	     lock = lock_rec_get_next_on_page_const(lock)) {
            

            In that loop, we fail to invoke lock_has_to_wait(wait_lock, lock) on the other_lock object that will later be flagged by the assertion failure in lock_rec_queue_validate(), invoked by lock_clust_rec_read_check_and_lock(), even though that other_lock remains granted in LOCK_X mode for the same record.

            At the time the conflicting LOCK_X other_lock was granted to the other DELETE, our DELETE was waiting for block->lock on the clustered index root page. You definitely should not need a big table to repeat this, because the clustered index consists of the root page only. The test case does not involve any secondary indexes. The PAGE_N_RECS is 6 at the time of the failure, but there are some PAGE_GARBAGE records in the page.

            I suspect that the bug is at the end of lock_rec_insert_by_trx_age().

            marko Marko Mäkelä added a comment -
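            To make the queue-reordering hazard in the analysis above concrete, here is a toy, self-contained C++ model (the struct and function names are illustrative stand-ins, not the real InnoDB lock_sys code): a waiting request that conflicts with a granted lock stops being seen as conflicting as soon as age-based re-insertion moves it ahead of the granted lock in the queue, because the wait check only scans the locks that precede it.

            ```cpp
            #include <cassert>
            #include <iostream>
            #include <iterator>
            #include <list>

            // Toy model of a record lock queue; NOT actual InnoDB code.
            // Every pair of locks here is assumed to conflict (think LOCK_X
            // vs LOCK_X on the same record).
            struct Lock {
                int  trx_id;
                bool waiting;  // true = waiting request, false = granted lock
            };

            // Shaped like lock_rec_has_to_wait_in_queue(): only the locks
            // that precede wait_lock in the queue are checked for a conflict.
            static bool has_to_wait_in_queue(const std::list<Lock>& queue,
                                             std::list<Lock>::const_iterator wait_lock) {
                for (std::list<Lock>::const_iterator it = queue.begin();
                     it != wait_lock; ++it) {
                    if (!it->waiting) {
                        return true;  // a conflicting granted lock precedes us
                    }
                }
                return false;
            }

            int main() {
                std::list<Lock> queue;
                queue.push_back(Lock{1, false});  // trx 1 holds a granted X lock
                queue.push_back(Lock{2, true});   // trx 2 waits behind it

                // FIFO order: the scan sees trx 1's granted lock, so trx 2 must wait.
                assert(has_to_wait_in_queue(queue, std::next(queue.begin())));

                // VATS-style step: the waiting lock is taken out and re-inserted
                // by transaction age, which here places it BEFORE the granted lock.
                Lock moved = queue.back();
                queue.pop_back();
                queue.push_front(moved);

                // The scan now stops immediately and finds no conflict, so this
                // model would grant trx 2 an X lock while trx 1 still holds one.
                if (!has_to_wait_in_queue(queue, queue.begin())) {
                    std::cout << "wrongly granted" << std::endl;
                }
                return 0;
            }
            ```

            In this simplified model, the reordered queue yields exactly the state that lock_rec_queue_validate() asserts against: two conflicting X locks granted on the same record.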

            MariaDB Server 10.2, 10.3, 10.4, 10.5 will issue a warning message if the parameter is set to innodb_lock_schedule_algorithm=VATS, but the buggy behaviour will not be removed.

            The parameter innodb_lock_schedule_algorithm will be removed in MariaDB Server 10.6.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            GeoffMontee Geoff Montee (Inactive) made changes -
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Marko Mäkelä [ marko ] → Jan Lindström [ jplindst ]
            jplindst Jan Lindström (Inactive) made changes -
            Status Open [ 1 ] → In Progress [ 3 ]
            marko Marko Mäkelä added a comment - - edited

            Starting with MariaDB Server 10.2.35, 10.3.26, 10.4.16, 10.5.7, a deprecation and corruption warning will be issued if the server is being started up with innodb_lock_schedule_algorithm=VATS.

            Starting with MariaDB Server 10.6, the parameter innodb_lock_schedule_algorithm will be removed.

            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2020-10-05 08:43:42.0 → 2020-10-05 08:43:42.374
            marko Marko Mäkelä made changes -
            Fix Version/s 10.2.34 [ 24505 ]
            Fix Version/s 10.3.25 [ 24506 ]
            Fix Version/s 10.4.15 [ 24507 ]
            Fix Version/s 10.5.6 [ 24508 ]
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Assignee Jan Lindström [ jplindst ] → Marko Mäkelä [ marko ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] → Closed [ 6 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Marko Mäkelä [ marko ] → Jan Lindström [ jplindst ]
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] → Stalled [ 10000 ]

            https://github.com/MariaDB/server/commit/f35b29674ec22f1ee7d944dc3765707b070e0ea0

            Fix candidate tested with:

            perl ./runall-new.pl \
                 --sqltrace=MarkErrors \
                 --grammar=conf/engines/engine_stress.yy \
                 --gendata=conf/engines/engine_stress.zz \
                 --duration=10300 \
                 --queries=100000000 \
                 --threads=16 \
                 --mysqld=--log-error=mariadb.err \
                 --seed=1 \
                 --basedir=/home/jan/mysql/10.2-bugs \
                 --vardir=/home/jan/mysql/galera-test \
                 --mysqld=--core \
                 --mysqld=--general-log \
                 --mysqld=--general-log-file=queries.log \
                 --mysqld=--log-output=file \
                 --mysqld=--wsrep-on=OFF \
                 --mysqld=--default-storage-engine=innodb \
                 --mysqld=--sort_buffer_size=200M \
                 --mysqld=--innodb-autoinc-lock-mode=2 \
                 --mysqld=--innodb-lock-wait-timeout=1 \
                 --mysqld=--binlog-format=ROW \
                 --mysqld=--innodb-locks-unsafe-for-binlog=1 \
                 --mysqld=--innodb-lock-schedule-algorithm=VATS
            

            jplindst Jan Lindström (Inactive) added a comment - - edited
            jplindst Jan Lindström (Inactive) made changes -
            Status Stalled [ 10000 ] → In Progress [ 3 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] → Marko Mäkelä [ marko ]
            Status In Progress [ 3 ] → In Review [ 10002 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.5.7 [ 25019 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.5.6 [ 24508 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.4.16 [ 25020 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.4.15 [ 24507 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.3.26 [ 25021 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.3.25 [ 24506 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.2.35 [ 25022 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.2.34 [ 24505 ]
            marko Marko Mäkelä made changes -

            jplindst, I think that it is confusing to reopen a bug that we already used for removing (10.6) and deprecating a feature that did not work. It might be better to file a separate ticket for fixing the crash.

            Please rebase your work to a branch that includes a fix of MDEV-23897, and include a deterministic test case. Based on my analysis of the rr record trace, it should be relatively easy to create a DEBUG_SYNC test case that repeats the problem. We have a few DEBUG_SYNC points in the btr and row subdirectories that could be useful. Hopefully, at most one new sync point would have to be added.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] → Jan Lindström [ jplindst ]
            Status In Review [ 10002 ] → Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.5.7 [ 25019 ]
            Fix Version/s 10.4.16 [ 25020 ]
            Fix Version/s 10.3.26 [ 25021 ]
            Fix Version/s 10.2.35 [ 25022 ]
            jplindst Jan Lindström (Inactive) made changes -
            Status Stalled [ 10000 ] → In Progress [ 3 ]
            jplindst Jan Lindström (Inactive) made changes -
            Status In Progress [ 3 ] → Stalled [ 10000 ]
            jplindst Jan Lindström (Inactive) made changes -
            Priority Blocker [ 1 ] → Major [ 3 ]
            marko Marko Mäkelä made changes -

            MDEV-23877 was filed for documenting the deprecation of the parameter innodb_lock_schedule_algorithm.

            This ticket will be used for tracking the originally reported corruption that is caused by the setting innodb_lock_schedule_algorithm=VATS. Once the feature is fixed to work correctly, and if it has been demonstrated to improve performance, it can be permanently enabled in a later version. The code does not exist in the current MariaDB 10.6 development branch.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            jplindst Jan Lindström (Inactive) made changes -
            Attachment MDEV-16664.test [ 54480 ]
            jplindst Jan Lindström (Inactive) made changes -
            Attachment MDEV-16664.test [ 54480 ]
            jplindst Jan Lindström (Inactive) made changes -
            Attachment innodb_vats.opt [ 54503 ]
            Attachment innodb_vats.test [ 54504 ]

            The attached test case produces a lock wait queue exactly as in the crash case, but the actual crash still does not reproduce, so something more is needed.

            jplindst Jan Lindström (Inactive) added a comment -

            While working on MDEV-20612, I noticed that possibly related to the MySQL 8.0 variant of this (called CATS), the deadlock detection was refactored and moved into a dedicated thread.

            marko Marko Mäkelä added a comment -

            Unfortunately, I have not found a test case that would reproduce this, and my attempted fixes do not work. Based on my current knowledge, I must therefore say that I do not know how to fix this issue. If we want this feature, perhaps we should adopt the same approach as MySQL 8.0 (that change might be out of scope for 10.6).

            jplindst Jan Lindström (Inactive) added a comment -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 88192 ] → MariaDB v4 [ 143552 ]
            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Resolution Won't Fix [ 2 ]
            Status Stalled [ 10000 ] → Closed [ 6 ]
            elenst Elena Stepanova made changes -
            Resolution Won't Fix [ 2 ]
            Status Closed [ 6 ] → Stalled [ 10000 ]
            elenst Elena Stepanova made changes -
            Resolution Cannot Reproduce [ 5 ]
            Status Stalled [ 10000 ] → Closed [ 6 ]
            marko Marko Mäkelä made changes -
            Assignee Jan Lindström [ jplindst ] → Marko Mäkelä [ marko ]
            Resolution Cannot Reproduce [ 5 ]
            Status Closed [ 6 ] → Stalled [ 10000 ]

            This was fixed in MariaDB Server 10.6.0 by removing the parameter.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s N/A [ 14700 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] → Closed [ 6 ]
            marko Marko Mäkelä made changes -

            People

              Assignee: marko Marko Mäkelä
              Reporter: mleich Matthias Leich
              Votes: 1
              Watchers: 13