[MDEV-25750] Assertion `wait_lock->is_waiting()' in lock_wait_rpl_report Created: 2021-05-21  Updated: 2021-06-15  Resolved: 2021-06-10

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.1
Fix Version/s: 10.6.2

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: rr-profile

Issue Links:
Problem/Incident
is caused by MDEV-24738 Improve the InnoDB deadlock checker Closed

 Description   

mysqld: /data/Server/bb-10.6-marko/storage/innobase/lock/lock0lock.cc:1657: void lock_wait_rpl_report(trx_t*): Assertion `wait_lock->is_waiting()' failed.
Query (0x62b0000c4238): UPDATE t1 SET col2 = NULL LIMIT 2
Connection ID (thread ID): 15
# 2021-05-20T19:03:42 | Status: KILL_TIMEOUT
rr) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x000026e174dc3859 in __GI_abort () at abort.c:79
#2  0x000026e174dc3729 in __assert_fail_base (fmt=0x26e174f59588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x564c9dcced40 "wait_lock->is_waiting()", file=0x564c9dccc3a0 "/data/Server/bb-10.6-marko/storage/innobase/lock/lock0lock.cc", line=1657, 
    function=<optimized out>) at assert.c:92
#3  0x000026e174dd4f36 in __GI___assert_fail (assertion=0x564c9dcced40 "wait_lock->is_waiting()", file=0x564c9dccc3a0 "/data/Server/bb-10.6-marko/storage/innobase/lock/lock0lock.cc", line=1657, function=0x564c9dccf0c0 "void lock_wait_rpl_report(trx_t*)")
    at assert.c:101
#4  0x0000564c9c7baf56 in lock_wait_rpl_report (trx=0x54dc78c30a48) at /data/Server/bb-10.6-marko/storage/innobase/lock/lock0lock.cc:1657
#5  0x0000564c9c7bbd2a in lock_wait (thr=0x6200001ec860) at /data/Server/bb-10.6-marko/storage/innobase/lock/lock0lock.cc:1786
#6  0x0000564c9c9be712 in row_mysql_handle_errors (new_err=0x6400011bf440, trx=0x54dc78c30a48, thr=0x6200001ec860, savept=0x0) at /data/Server/bb-10.6-marko/storage/innobase/row/row0mysql.cc:744
#7  0x0000564c9ca4ebc5 in row_search_mvcc (buf=0x6190004f29c8 "\377", mode=PAGE_CUR_G, prebuilt=0x6200001ec108, match_mode=0, direction=0) at /data/Server/bb-10.6-marko/storage/innobase/row/row0sel.cc:5784
#8  0x0000564c9c6545c3 in ha_innobase::index_read (this=0x61d0006644b8, buf=0x6190004f29c8 "\377", key_ptr=0x0, key_len=0, find_flag=HA_READ_AFTER_KEY) at /data/Server/bb-10.6-marko/storage/innobase/handler/ha_innodb.cc:8913
#9  0x0000564c9c656f40 in ha_innobase::index_first (this=0x61d0006644b8, buf=0x6190004f29c8 "\377") at /data/Server/bb-10.6-marko/storage/innobase/handler/ha_innodb.cc:9274
#10 0x0000564c9c65717e in ha_innobase::rnd_next (this=0x61d0006644b8, buf=0x6190004f29c8 "\377") at /data/Server/bb-10.6-marko/storage/innobase/handler/ha_innodb.cc:9367
#11 0x0000564c9bd3f76c in handler::ha_rnd_next (this=0x61d0006644b8, buf=0x6190004f29c8 "\377") at /data/Server/bb-10.6-marko/sql/handler.cc:3123
#12 0x0000564c9c1477b5 in rr_sequential (info=0x6400011c0a40) at /data/Server/bb-10.6-marko/sql/records.cc:519
#13 0x0000564c9b407a30 in READ_RECORD::read_record (this=0x6400011c0a40) at /data/Server/bb-10.6-marko/sql/records.h:81
#14 0x0000564c9b87664b in mysql_update (thd=0x62b0000bd218, table_list=0x62b0000c4398, fields=..., values=..., conds=0x0, order_num=0, order=0x0, limit=2, ignore=false, found_return=0x6400011c1270, updated_return=0x6400011c1290)
    at /data/Server/bb-10.6-marko/sql/sql_update.cc:998
#15 0x0000564c9b59b9f6 in mysql_execute_command (thd=0x62b0000bd218) at /data/Server/bb-10.6-marko/sql/sql_parse.cc:4402
#16 0x0000564c9b5b3f0f in mysql_parse (thd=0x62b0000bd218, rawbuf=0x62b0000c4238 "UPDATE t1 SET col2 = NULL LIMIT 2  /* E_R Thread1 QNO 876 CON_ID 15 */", length=70, parser_state=0x6400011c1b20) at /data/Server/bb-10.6-marko/sql/sql_parse.cc:8019
#17 0x0000564c9b58c1ac in dispatch_command (command=COM_QUERY, thd=0x62b0000bd218, packet=0x629000c12219 "UPDATE t1 SET col2 = NULL LIMIT 2  /* E_R Thread1 QNO 876 CON_ID 15 */ ", packet_length=71, blocking=true) at /data/Server/bb-10.6-marko/sql/sql_parse.cc:1897
#18 0x0000564c9b589581 in do_command (thd=0x62b0000bd218, blocking=true) at /data/Server/bb-10.6-marko/sql/sql_parse.cc:1406
#19 0x0000564c9b984fda in do_handle_one_connection (connect=0x608000003138, put_in_cache=true) at /data/Server/bb-10.6-marko/sql/sql_connect.cc:1410
#20 0x0000564c9b98493c in handle_one_connection (arg=0x608000002c38) at /data/Server/bb-10.6-marko/sql/sql_connect.cc:1312
#21 0x00004dc10bd5c609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x000026e174ec0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(rr)
pluto:/data/Results/1621525973/TBR-943/dev/shm/vardir/1621525973/69/1/rr
_RR_TRACE_DIR="." rr replay --mark-stdio
 
GIT_SHOW: HEAD, origin/bb-10.6-marko 25d150d3fe193b5a41948a589c0ce30368c054e1 2021-05-20T18:36:07+03:00
 
RQG
====
git clone https://github.com/mleich1/rqg --branch experimental RQG
 
perl rqg.pl \
--grammar=conf/runtime/alter_online.yy \
--gendata=conf/runtime/alter_online.zz \
--mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
--mysqld=--loose-idle_write_transaction_timeout=0 \
--mysqld=--loose-idle_transaction_timeout=0 \
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
--mysqld=--connect_timeout=60 \
--mysqld=--interactive_timeout=28800 \
--mysqld=--slave_net_timeout=60 \
--mysqld=--net_read_timeout=30 \
--mysqld=--net_write_timeout=60 \
--mysqld=--loose-table_lock_wait_timeout=50 \
--mysqld=--wait_timeout=28800 \
--mysqld=--lock-wait-timeout=86400 \
--mysqld=--innodb-lock-wait-timeout=50 \
--no-mask \
--queries=10000000 \
--seed=random \
--reporters=Backtrace \
--reporters=ErrorLog \
--reporters=Deadlock1 \
--validators=None \
--mysqld=--log_output=none \
--mysqld=--log-bin \
--mysqld=--log_bin_trust_function_creators=1 \
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
--engine=InnoDB \
--restart_timeout=240 \
--mysqld=--plugin-load-add=file_key_management.so \
--mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
--duration=300 \
--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
--mysqld=--innodb_stats_persistent=off \
--mysqld=--innodb_adaptive_hash_index=off \
--mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
--mysqld=--loose-max-statement-time=30 \
--threads=2 \
--mysqld=--innodb-use-native-aio=0 \
--rr=Extended \
--rr_options="--chaos --wait" \
--mysqld=--innodb_page_size=8K \
--mysqld=--innodb-buffer-pool-size=8M \
--no_mask \
--workdir=<localsettings> \
--vardir=<localsettings> \
--mtr-build-thread=<localsettings> \
--basedir1=<localsettings> \
--script_debug=_nix_
 
RQG Error pattern
[ 'TBR-943-MDEV-25750' , 'mysqld: .{1,200}lock0lock.cc:.{1,20}: void lock_wait_rpl_report.{1,500}: Assertion .wait_lock->is_waiting\(\). failed' ],
 



 Comments   
Comment by Matthias Leich [ 2021-05-21 ]

Hitting this assert is rare.
It was observed first time mid of March 2021 on some 10.6 development tree but up till today never on a main tree.

Comment by Marko Mäkelä [ 2021-05-21 ]

The race is visualized by running the following commands on the rr replay trace:

frame 4
watch -l wait_lock->type_mode
reverse-continue
reverse-continue
thread apply 2 bt
bt

The later-asserting thread (thread 2) was waiting for a latch:

Thread 2 (Thread 1566528.1567578):
#0  ssux_lock_low::wr_lock_try (this=0x564c9fa0b300 <lock_sys+64>) at /data/Server/bb-10.6-marko/storage/innobase/include/srw_lock.h:194
#1  0x0000564c9c7ddc08 in lock_sys_t::wr_lock_try (this=0x564c9fa0b2c0 <lock_sys>) at /data/Server/bb-10.6-marko/storage/innobase/include/lock0lock.h:753
#2  0x0000564c9c7bae55 in lock_wait_rpl_report (trx=0x54dc78c30a48) at /data/Server/bb-10.6-marko/storage/innobase/lock/lock0lock.cc:1644

The LOCK_WAIT flag had been cleared by a thread that was running UPDATE:

#0  0x0000564c9c7bf232 in lock_rec_move (receiver_cell=..., receiver=..., receiver_id=..., 
    donator_cell=..., donator_id=..., receiver_heap_no=0, donator_heap_no=9)
    at /data/Server/bb-10.6-marko/storage/innobase/lock/lock0lock.cc:2181
#1  0x0000564c9c7c5649 in lock_rec_store_on_page_infimum (block=0x2b134a8e4020, rec=0x2b134aaba238 "")
    at /data/Server/bb-10.6-marko/storage/innobase/lock/lock0lock.cc:3056
#2  0x0000564c9cbb0d0f in btr_cur_pessimistic_update (flags=10, cursor=0x611000176a00, 
    offsets=0x764566886a40, offsets_heap=0x764566886b40, entry_heap=0x6190003a7a80, 
    big_rec=0x764566886a20, update=0x620000015a38, cmpl_info=1, thr=0x620000015c78, trx_id=4810, 
    mtr=0x764566886e40) at /data/Server/bb-10.6-marko/storage/innobase/btr/btr0cur.cc:5113

The assertion fails immediately after the latch acquisition, so it is obviously wrong:

  if (!lock_sys.wr_lock_try())
  ut_ad(wait_lock->is_waiting());

I believe that the fix should be simple:

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index 058f8386434..0f68ebc2423 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -1653,8 +1653,10 @@ static void lock_wait_rpl_report(trx_t *trx)
       lock_sys.wr_unlock();
       return;
     }
+    ut_ad(wait_lock->is_waiting());
   }
-  ut_ad(wait_lock->is_waiting());
+  else if (!wait_lock->is_waiting())
+    goto func_exit;
   ut_ad(!(wait_lock->type_mode & LOCK_AUTO_INC));
 
   if (wait_lock->is_table())

Note: You will have to run with binlog enabled to exercise this code. (Until MDEV-24948 has been fixed, this code is unnecessarily executed even when parallel replication is not in use.)

Generated at Thu Feb 08 09:40:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.