[MDEV-31185] rw_trx_hash_t::find() unpins pins too early Created: 2023-05-04  Updated: 2023-09-21  Resolved: 2023-05-19

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5
Fix Version/s: 10.4.31, 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2

Type: Bug Priority: Major
Reporter: Vladislav Lesin Assignee: Vladislav Lesin
Resolution: Fixed Votes: 1
Labels: parallelslave

Attachments: File rpl_debug.test     File trx_sys_t_find_lf_hash_error.diff    
Issue Links:
Problem/Incident
causes MDEV-31038 Parallel Replication Breaks if XA PRE... Closed
Relates
relates to MDEV-31780 InnoDB: Assertion failure in file D:\... Closed
relates to MDEV-31038 Parallel Replication Breaks if XA PRE... Closed

 Description   

Take a look at rw_trx_hash_t::find():

    rw_trx_hash_element_t *element= reinterpret_cast<rw_trx_hash_element_t*>
      (lf_hash_search(&hash, pins, reinterpret_cast<const void*>(&trx_id),
                      sizeof(trx_id_t)));
    if (element)
    {
      mutex_enter(&element->mutex);
      lf_hash_search_unpin(pins);
      if ((trx= element->trx)) {
        DBUG_ASSERT(trx_id == trx->id);
        ...
      }
      mutex_exit(&element->mutex);
    }

It acquires element->mutex, then unpins the search pins. After that, the element can be deallocated and reused by some other thread.

If we look at the rw_trx_hash_t::insert()->lf_hash_insert()->lf_alloc_new() call chain, we will not find any element->mutex acquisition, as the mutex is not yet initialized at allocation time. My assumption is that rw_trx_hash_t::insert() can easily reuse the chunk that was unpinned in rw_trx_hash_t::find().

The scenario is the following:

1. Thread 1 has just executed lf_hash_search() in rw_trx_hash_t::find(), but has not acquired element->mutex yet.
2. Thread 2 removes the element from the hash table with a rw_trx_hash_t::erase() call.
3. Thread 1 acquires element->mutex and unpins pin 2 with the lf_hash_search_unpin(pins) call.
4. Some thread purges the memory of the element.
5. Thread 3 reuses the memory for the element and fills in element->id and element->trx.
6. Thread 1 crashes with the failed DBUG_ASSERT(trx_id == trx->id) assertion.

The fix is to invoke "lf_hash_search_unpin(pins);" after "mutex_exit(&element->mutex);" call in rw_trx_hash_t::find().

The above scenario is indirectly confirmed by the following trick. If we add one my_sleep(1) before the mutex_enter(&element->mutex) call in rw_trx_hash_t::find(), and another my_sleep(1) after the lf_hash_search_unpin(pins) call in rw_trx_hash_t::find(), then the assertion failure is reproduced much faster with the rpl_debug.test test case that caused it.

To reproduce it, just run the test with several instances in a loop, like:

./mtr --max-test-fail=1 --suite-timeout=999999999 --testcase-timeout=99999999 --parallel=60 rpl_debug{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} --mem --repeat=5000

Without the trick described above, it can take up to 12 hours to reproduce; with the trick it reproduces within several minutes.

The following comments can also be useful for bug analysis: 1, 2.

UPD: the scenario is fully confirmed with an rr trace, recorded with the above delays and rpl_debug.test.



 Comments   
Comment by Vladislav Lesin [ 2023-05-11 ]

The following stable test case proves the above scenario: trx_sys_t_find_lf_hash_error.diff.

Comment by Vladislav Lesin [ 2023-05-11 ]

A comment from Marko on how it can influence MDEV-31038: "I am convinced that it can cause locking mystery. The scenario is that transaction A holds an implicit lock that would block some operation in transaction B. While B is converting the implicit lock to explicit, transaction A is committed (all implicit locks should be released), but the memory object is being reused for just starting transaction C. Transaction B will assign the committed transaction A’s lock to C, and C could block B for a long time, even though it is not even."

Comment by Vladislav Lesin [ 2023-05-12 ]

The bug is quite ancient, and the question is why we did not catch it earlier during RQG testing. The explanation can be the following.

lf_pinbox_free() invokes lf_pinbox_real_free() only on the "if (pins->purgatory_count % LF_PURGATORY_SIZE == 0)" condition, i.e. once per 100 elements in the purgatory. And the probability that trx_sys.rw_trx_hash.hash.alloc.top points to the same element that was unpinned in rw_trx_hash_t::find() is quite low, as lf_pinbox_real_free() pushes the pointer to the first element of the linked list of unpinned elements, gathered from all the thread purgatories, to trx_sys.rw_trx_hash.hash.alloc.top (see alloc_free() and the pinbox->free_func(first, last, pinbox->free_func_arg) call in lf_pinbox_real_free()). In other words, the element unpinned in rw_trx_hash_t::find() is more likely not at the head of the list.

On the other hand, lf_pinbox_real_free() can be invoked not only from lf_pinbox_free(), but also from lf_pinbox_put_pins(). rpl_debug.test produces the following stack trace on the slave:

#0  0x000056548628651c in alloc_free (first=0x7fd1ec0bbdb8 "\251W\t", last=0x7fd1ec0bbdb8 "\251W\t", allocator=0x565487568928 <trx_sys+2536>) at ./mysys/lf_alloc-pin.c:430
#1  0x00005654862864c0 in lf_pinbox_real_free (pins=pins@entry=0x5654895a58a0) at ./mysys/lf_alloc-pin.c:403
#2  0x00005654862866dc in lf_pinbox_put_pins (pins=0x5654895a58a0) at ./mysys/lf_alloc-pin.c:231
#3  0x00005654860872e4 in rw_trx_hash_t::put_pins (trx=0x7fd2122aab00, this=<optimized out>) at ./storage/innobase/include/trx0sys.h:581
#4  trx_t::free (this=this@entry=0x7fd2122aab00) at ./storage/innobase/trx/trx0trx.cc:400
#5  0x0000565485e5a4d9 in innobase_commit_by_xid (hton=<optimized out>, xid=0x7fd1ec08cac0) at ./storage/innobase/handler/ha_innodb.cc:17093
#6  0x0000565485b20e4b in xacommit_handlerton (unused1=unused1@entry=0x0, plugin=<optimized out>, arg=arg@entry=0x7fd1d4f7b310) at ./sql/handler.cc:2223
#7  0x00005654858acb7e in plugin_foreach_with_mask (thd=thd@entry=0x0, func=<optimized out>, type=type@entry=1, state_mask=state_mask@entry=8, arg=arg@entry=0x7fd1d4f7b310)
    at ./sql/sql_plugin.cc:2508                                                 
#8  0x0000565485b26364 in ha_commit_or_rollback_by_xid (xid=<optimized out>, commit=<optimized out>) at ./sql/handler.cc:2257
#9  0x0000565485a93671 in trans_xa_commit (thd=thd@entry=0x7fd1ec082f48) at ./sql/xa.cc:638
#10 0x00005654858a0105 in mysql_execute_command (thd=thd@entry=0x7fd1ec082f48) at ./sql/sql_parse.cc:5923
#11 0x00005654858a2540 in mysql_parse (thd=0x7fd1ec082f48, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7fd1d4f7ba60, is_com_multi=is_com_multi@entry=false,
    is_next_command=is_next_command@entry=false) at ./sql/sql_parse.cc:8127     
#12 0x0000565485c90e84 in Query_log_event::do_apply_event (this=0x7fd1e40afa58, rgi=0x7fd1e40af040, query_arg=0x7fd1e40afbb3 "XA COMMIT X'696e735f33303030303030',X'',1", q_len_arg=<optimized out>)
    at ./sql/sql_class.h:233 

I.e. the bug is more reproducible when XA COMMIT is executed by a slave thread. But our RQG testing does not cover replication well.

Comment by Marko Mäkelä [ 2023-05-16 ]

The fix looks OK to me. I did not try to run the test with and without the fix. In the test, you could write CREATE TABLE…STATS_PERSISTENT=0 so that there is no need to SET GLOBAL innodb_stats_persistent=OFF.

Generated at Thu Feb 08 10:21:55 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.