[MDEV-28709] unexpected X lock on Supremum in READ COMMITTED Created: 2022-05-31  Updated: 2023-12-09  Resolved: 2022-10-24

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB
Affects Version/s: 10.5.16
Fix Version/s: 10.5.18, 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1

Type: Bug Priority: Critical
Reporter: Andrei Elkin Assignee: Vladislav Lesin
Resolution: Fixed Votes: 2
Labels: XA

Attachments: File MDEV-28709-stable.test     File MDEV-28709_x_sup.diff    
Issue Links:
Blocks
is blocked by MDEV-29575 Access to innodb_trx, innodb_locks an... Closed
Relates
relates to MDEV-30165 X-lock on supremum for prepared trans... Closed
relates to MDEV-17814 Server crashes in is_current_stmt_bin... Open
relates to MDEV-26682 slave lock timeout with xa and gap locks Closed

 Description   

Affects/is observed on 821808c45dd

Transaction that update,insert,delete a distinctive single record specified with a PK value
should not have any gap lock. E.g such one as call foo1(); where the table and procedure
are defined as follows

CREATE TABLE `t2_text` (
  `a` int(11) NOT NULL,
  `b` int(11) DEFAULT NULL,
  `c` text DEFAULT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB
 
 
create procedure foo1()
begin
     declare av int;
     declare ni int;
     declare mk int;
     declare u0,u1,d int;
     declare skip_max int;
 
     set av = ceil(rand()*100);
     set ni = ceil(rand()*10);
     set skip_max = 2;
     set d = av + mod(ceil(rand()*100), skip_max*ni);
     set u0 = av + mod(ceil(rand()*100), skip_max*ni);
     set u1 = av + mod(ceil(rand()*100), skip_max*ni);
 
     while (ni > 0) do
       set mk = mod(ceil(rand()*100),4);
       replace into t2_text values (av, av, repeat('a', mk*1024));
       set ni = ni - 1;
       set av = av + 1 + mod(ceil(rand()*100), skip_max);
     end while;  
     delete from t2_text where a = d;
     update t2_text set a=u1,b=u1 where a = u0;
end|
delimiter ;

However a X lock on Supremum record appears when the above `call foo1()` runs concurrently, like

--connection one
--send \
     select sleep(0.1);xa start '1'; call foo1(); xa end '1'; xa prepare '1'; select sleep(0.05); xa commit '1'; select sleep(0.1);
 
--connection two
--send \
     select sleep(0.1);xa start '2'; call foo1(); xa end '2'; xa prepare '2'; select sleep(0.05); xa commit '2'; select sleep(0.1);

so an assert, see the diff file attached, fires after few repeats.

It also survives MDEV-26682 commit's action to clear the user XA out of gap locks.
The latter, as it was in MDEV-26682 context, may have as a consequence seemingly non-conflicting XA transactions to become actually conflicting.
when they replayed on slave.

The attached diff suggests how to fix this though its idea is merely to work around xa replication issues, and not possible
isolation ones (when the presence of X on supremum indeed 'unexpected' and can't be justified).

(Edited to remove unnecessary Unique constraint from `b`)



 Comments   
Comment by Vladislav Lesin [ 2022-06-07 ]

Some update. I have analyzed the test case. DELETE's and UPDATE's are not important to reproduce it, the key role plays REPLACE.

The lock on supremum is set during page split when the lock from successor of the infimum of the right page is inherited to the supremum of the left page:

#0  0x0000564f2d80978b in lock_rec_set_nth_bit (lock=0x6fac37a8, i=1) at ./storage/innobase/include/lock0priv.inl:105
#1  0x0000564f2d80de0a in lock_rec_create_low (c_lock=0x0, thr=0x0, type_mode=35, page_id=..., page=0x6f87c000 "", heap_no=1, index=0x39a61420f758, trx=0x6fac31c0, holds_trx_mutex=false)
    at ./storage/innobase/lock/lock0lock.cc:1352
#2  0x0000564f2d808d66 in lock_rec_create (c_lock=0x0, thr=0x0, type_mode=35, block=0x6f2e1338, heap_no=1, index=0x39a61420f758, trx=0x6fac31c0, caller_owns_trx_mutex=false)
    at ./storage/innobase/include/lock0lock.inl:97
#3  0x0000564f2d80f9eb in lock_rec_add_to_queue (type_mode=35, block=0x6f2e1338, heap_no=1, index=0x39a61420f758, trx=0x6fac31c0, caller_owns_trx_mutex=false)
    at ./storage/innobase/lock/lock0lock.cc:1837
#4  0x0000564f2d81173a in lock_rec_inherit_to_gap (heir_block=0x6f2e1338, block=0x6f2e14a0, heir_heap_no=1, heap_no=2)
    at ./storage/innobase/lock/lock0lock.cc:2332
#5  0x0000564f2d8137bb in lock_update_split_right (right_block=0x6f2e14a0, left_block=0x6f2e1338) at ./storage/innobase/lock/lock0lock.cc:2946
#6  0x0000564f2da0cf91 in btr_page_split_and_insert (flags=0, cursor=0x689e6d5b37c0, offsets=0x689e6d5b3758, heap=0x689e6d5b3750, tuple=0x5ba51c980138, n_ext=0, mtr=0x689e6d5b3b10)
    at ./storage/innobase/btr/btr0btr.cc:3037
#7  0x0000564f2da09cbb in btr_root_raise_and_insert (flags=0, cursor=0x689e6d5b37c0, offsets=0x689e6d5b3758, heap=0x689e6d5b3750, tuple=0x5ba51c980138, n_ext=0, mtr=0x689e6d5b3b10)
    at ./storage/innobase/btr/btr0btr.cc:2041
#8  0x0000564f2da38f1b in btr_cur_pessimistic_insert (flags=0, cursor=0x689e6d5b37c0, offsets=0x689e6d5b3758, heap=0x689e6d5b3750, entry=0x5ba51c980138, rec=0x689e6d5b3760,
    big_rec=0x689e6d5b374b, n_ext=0, thr=0x5ba51c06e6b0, mtr=0x689e6d5b3b10) at ./storage/innobase/btr/btr0cur.cc:3777
#9  0x0000564f2d8f5c84 in row_ins_clust_index_entry_low (flags=0, mode=33, index=0x39a61420f758, n_uniq=1, entry=0x5ba51c980138, n_ext=0, thr=0x5ba51c06e6b0)

The lock is inherited because lock->trx->duplicates is set, i.e. the transaction of the lock executed REPLACE:

lock_rec_inherit_to_gap(...)                                                    
{                                                                               
...                                                                             
        /* At READ UNCOMMITTED or READ COMMITTED isolation level,               
        we do not want locks set                                                
        by an UPDATE or a DELETE to be inherited as gap type locks. But we      
        DO want S-locks/X-locks(taken for replace) set by a consistency         
        constraint to be inherited also then. */                                
                                                                                
        for (lock = lock_rec_get_first(&lock_sys.rec_hash, block, heap_no);     
             lock != NULL;                                                      
             lock = lock_rec_get_next(heap_no, lock)) {                         
                                                                                
                if (!lock_rec_get_insert_intention(lock)                        
                    && (lock->trx->isolation_level > TRX_ISO_READ_COMMITTED     
                        || lock_get_mode(lock) !=                               
                        (lock->trx->duplicates ? LOCK_S : LOCK_X))) {           
                        lock_rec_add_to_queue(                                  
                                LOCK_REC | LOCK_GAP | lock_get_mode(lock),      
                                heir_block, heir_heap_no, lock->index,          
                                lock->trx, FALSE);                              
                }                                                               
        }                                                                       
}     

The comment of the code above says(as I understood it) that we intentionally inherit the lock in the case of REPLACE to keep consistency during constraints check. This makes sense when lock_rec_inherit_to_gap() is invoked from purge, as we need to inherit the lock of the purged record. But does it make sense in the case of page split, more precisely, in the case of lock_update_split_right(), and for clustered index?

I continue the analysis, I plan to make stable test case to check if it can be reproduced on 10.6, if not, then it might already contain the fix.

Comment by Marko Mäkelä [ 2022-06-07 ]

I think that it should be possible to remove the indicated code snippet by implementing MDEV-17814.

Comment by Vladislav Lesin [ 2022-06-23 ]

Some update. I developed stable test case for it MDEV-28709-stable.test. It causes the following assertion:

--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -4251,6 +4251,7 @@ void lock_release_on_prepare(trx_t *trx)
       {
         ut_ad(trx->dict_operation ||
               lock->index->table->id >= DICT_HDR_FIRST_ID);
+        ut_ad(!lock_rec_get_nth_bit(lock, PAGE_HEAP_NO_SUPREMUM));
 retain_lock:
         lock= UT_LIST_GET_PREV(trx_locks, lock);
         continue;

I have not yet tested if it is repeatable on 10.6 or not.

One more issue I see with that test is that InnoDB transaction can change trx_t::duplicates value during execution(it's reset from TRX_DUP_REPLACE to 0 in mysql_insert(), see comments in the test for details) , but gap lock inheritance, invoked from another transaction, depends on that value. This is some kind of undefined behavior because something in the middle of one transaction influence gap lock inheritance in the other transaction. It's like changing isolation level in the middle of transaction. Gap lock inheritance happens not only during page splitting, but also during purging. It can also influence some other code, it needs to be checked.

Comment by Vladislav Lesin [ 2022-06-24 ]

The test fails on 10.6 too with the following assertion:

--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -3993,8 +3993,10 @@ static bool lock_release_on_prepare_try(trx_t *trx)
     if (!lock->is_table())
     {
       ut_ad(!lock->index->table->is_temporary());
-      if (lock->mode() == LOCK_X && !lock->is_gap())
+      if (lock->mode() == LOCK_X && !lock->is_gap()) {
+        ut_ad(!lock_rec_get_nth_bit(lock, PAGE_HEAP_NO_SUPREMUM));
         continue;
+      }
       auto &lock_hash= lock_sys.hash_get(lock->type_mode);
       auto cell= lock_hash.cell_get(lock->un_member.rec_lock.page_id.fold());
       auto latch= lock_sys_t::hash_table::latch(cell);
@@ -4057,6 +4059,7 @@ void lock_release_on_prepare(trx_t *trx)
       ut_ad(!lock->index->table->is_temporary());
       if (lock->mode() != LOCK_X || lock->is_gap())
         lock_rec_dequeue_from_page(lock, false);
+      ut_d(else ut_a(!lock_rec_get_nth_bit(lock, PAGE_HEAP_NO_SUPREMUM));)
     }
     else
     {

Comment by Vladislav Lesin [ 2022-06-27 ]

I think MDEV-17814 and related commit does not relate to this issue. Even if we remove trx_t::duplicates from trx_t, the issue will stay the same, i.e. we have the same function with the same algorithm of gap lock inheritance both for purging and for page splitting. What does not look correct to me.

Suppose we have a table with two columns (col1, col2), primary key is col1, secondary key is col2. Suppose we have the following records in clustered index: (10, 'a'), (20, 'b'), (30, 'c'). And we want to replace (20, 'b') with (20, 'd'). Transaction isolation level is READ COMMITTED. 'REPLACE' must set non-gap X-lock on ('b', 20) secondary index record to prevent its reading by some other transaction until the REPLACE's transaction is finished. Purge must inherit the non-gap X-lock from ('b', 20) to gap lock on the next ('c', 30) record to prevent some other transaction from inserting ('b', 20).

But what if the page is split, and the ('b', 20) record is the first record on the right page after splitting? The non-gap lock will be inherited from ('b', 20) to the supremum of the left page. It made sense if the isolation level would be REPEATABLE READ. But why do we need it for READ COMMITTED? ('b', 20) is still exist, and non-gap X lock on the record is still exist too, so another transaction can't read it, so the lock on supremum is useless, right?

marko, do you see any hidden rocks?

Comment by Vladislav Lesin [ 2022-06-29 ]

Pushed bb-10.[56]-MDEV-28709-sup_X_lock for testing.

Comment by Marko Mäkelä [ 2022-06-30 ]

The suggestion makes sense to me. Page splits would not set gap locks on the page supremum when moving a record on which a non-gap lock is held by a READ UNCOMMITTED or READ COMMITTED transaction.

Extensive testing, possibly developing additional tests, and evaluating existing tools for exposing inconsistency, will be needed. Previously, the innocent-looking improvement MDEV-27025 caused the incorrect-result bug MDEV-27992 when the PRIMARY KEY of a table was concurrently updated to a smaller value.

Comment by Vladislav Lesin [ 2022-07-07 ]

Some thoughts about testing. We need some tool to test isolation levels to prevent the cases like MDEV-27025, when we released the fix, but it caused MDEV-27992, which was reported after the release.

Marko gave me some hint for possible testing method: https://aphyr.com/posts/327-call-me-maybe-mariadb-galera-cluster. The general idea is to generate some operations on some limited amount of rows to have some concurrency, and then check data consistency.

This method does not work for all isolation levels, and even does not work for some kinds of queries. For example, here was shown, that the initial test was expected to cause data inconsistency for single-node RR level. Even if some transaction works in RR level, locking reads and writes work in RC mode.

But it's still can be used for some particular cases, for example Serializable testing, and this test can be a starting point for it.

One more interesting project is here. This can also be used as a starting point, it can give the idea what to test, but the initial tests does not involve some concurrency related operations, like pages splitting or purging.

Comment by Matthias Leich [ 2022-08-29 ]

Preliminary result of RQG testing on 
origin/bb-10.5-MDEV-28709-sup_X_lock d2b334aaa7656e7fa41868536ae3c65df94db470 2022-08-29T18:17:04+03:00
 
# 2022-08-29T15:39:56 [1054959] | [rr 1057634 252092]mysqld: /data/Server/bb-10.5-MDEV-28709-sup_X_lock/storage/innobase/lock/lock0lock.cc:4257: void lock_release_on_prepare(trx_t*): Assertion `lock->trx->isolation_level > 1 || !lock_rec_get_nth_bit(lock, 1U)' failed.
 
pluto:/data/results/1661787227/TBR-1590$ _RR_TRACE_DIR=./1/rr/ rr replay
 
# 2022-08-29T15:39:56 [1054959] | Thread 3 (Thread 1057634.1061052):
# 2022-08-29T15:39:56 [1054959] | #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
# 2022-08-29T15:39:56 [1054959] | #1  0x000073d1297eb859 in __GI_abort () at abort.c:79
# 2022-08-29T15:39:56 [1054959] | #2  0x000073d1297eb729 in __assert_fail_base (fmt=0x73d129981588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x564a259c53a0 "lock->trx->isolation_level > 1 || !lock_rec_get_nth_bit(lock, 1U)", file=0x564a259c02e0 "/data/Server/bb-10.5-MDEV-28709-sup_X_lock/storage/innobase/lock/lock0lock.cc", line=4257, function=<optimized out>) at assert.c:92
# 2022-08-29T15:39:56 [1054959] | #3  0x000073d1297fcf36 in __GI___assert_fail (assertion=0x564a259c53a0 "lock->trx->isolation_level > 1 || !lock_rec_get_nth_bit(lock, 1U)", file=0x564a259c02e0 "/data/Server/bb-10.5-MDEV-28709-sup_X_lock/storage/innobase/lock/lock0lock.cc", line=4257, function=0x564a259c52c0 "void lock_release_on_prepare(trx_t*)") at assert.c:101
# 2022-08-29T15:39:56 [1054959] | #4  0x0000564a24633a48 in lock_release_on_prepare (trx=0x3c774a8544f8) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/storage/innobase/lock/lock0lock.cc:4257
# 2022-08-29T15:39:56 [1054959] | #5  0x0000564a249b00b9 in trx_prepare (trx=0x3c774a8544f8) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/storage/innobase/trx/trx0trx.cc:1988
# 2022-08-29T15:39:56 [1054959] | #6  0x0000564a249b019e in trx_prepare_for_mysql (trx=0x3c774a8544f8) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/storage/innobase/trx/trx0trx.cc:2001
# 2022-08-29T15:39:56 [1054959] | #7  0x0000564a244d42b1 in innobase_xa_prepare (hton=0x615000002b18, thd=0x62b000150218, prepare_trx=true) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/storage/innobase/handler/ha_innodb.cc:17003
# 2022-08-29T15:39:56 [1054959] | #8  0x0000564a23b70162 in prepare_or_error (ht=0x615000002b18, thd=0x62b000150218, all=true) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/handler.cc:1381
# 2022-08-29T15:39:56 [1054959] | #9  0x0000564a23b70387 in ha_prepare (thd=0x62b000150218) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/handler.cc:1419
# 2022-08-29T15:39:56 [1054959] | #10 0x0000564a239dec23 in trans_xa_prepare (thd=0x62b000150218) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/xa.cc:532
# 2022-08-29T15:39:56 [1054959] | #11 0x0000564a233f93ea in mysql_execute_command (thd=0x62b000150218) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/sql_parse.cc:5911
# 2022-08-29T15:39:56 [1054959] | #12 0x0000564a2340674a in mysql_parse (thd=0x62b000150218, rawbuf=0x62b000157238 "XA PREPARE 'xid65'  /* E_R Thread2 QNO 1411 CON_ID 70 */", length=56, parser_state=0x137e2d34abb0, is_com_multi=false, is_next_command=false) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/sql_parse.cc:8101
# 2022-08-29T15:39:56 [1054959] | #13 0x0000564a233de835 in dispatch_command (command=COM_QUERY, thd=0x62b000150218, packet=0x629000cda219 "XA PREPARE 'xid65'  /* E_R Thread2 QNO 1411 CON_ID 70 */ ", packet_length=57, is_com_multi=false, is_next_command=false) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/sql_parse.cc:1891
# 2022-08-29T15:39:56 [1054959] | #14 0x0000564a233db93d in do_command (thd=0x62b000150218) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/sql_parse.cc:1375
# 2022-08-29T15:39:56 [1054959] | #15 0x0000564a237c5fc7 in do_handle_one_connection (connect=0x6080000030b8, put_in_cache=true) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/sql_connect.cc:1418
# 2022-08-29T15:39:56 [1054959] | #16 0x0000564a237c57e0 in handle_one_connection (arg=0x6080000035b8) at /data/Server/bb-10.5-MDEV-28709-sup_X_lock/sql/sql_connect.cc:1312
# 2022-08-29T15:39:56 [1054959] | #17 0x0000564a277b8609 in start_thread (arg=<optimized out>) at pthread_create.c:477
# 2022-08-29T15:39:56 [1054959] | #18 0x000073d1298e8293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
There are also several test fails (no use of rr) where InnoDB reports
[ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 300 seconds. We intentionally crash the server because it appears to be hung.   

Comment by Matthias Leich [ 2022-09-02 ]

origin/bb-10.6-MDEV-28709-sup_X_lock 5acabbf1c2cc5c3880ef152bed36922d23cfba45 2022-08-31T15:04:56+03:00
performed well in RQG testing. No new bad effects. The replayed one are in the main trees too.
Please be aware that my RQG tests are unable to reveal the bad effect MDEV-28709 is about.

Comment by Matthias Leich [ 2022-09-07 ]

origin/bb-10.5-MDEV-28709-sup_X_lock-debug 45238f905bbaa5c5186129bfafe072814f283371 2022-09-05T20:07:41+03:00
performed well in RQG testing.

Comment by Vladislav Lesin [ 2022-10-14 ]

I have added one more fix which forbids any gap lock inheritance after XA is prepared.

Generated at Thu Feb 08 10:02:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.