[MDEV-31185] rw_trx_hash_t::find() unpins pins too early - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.5(EOL)
Fix Version/s: 10.4.31, 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2
Component/s: Storage Engine - InnoDB
Labels:
- parallelslave

Description

Take a look at rw_trx_hash_t::find():

    rw_trx_hash_element_t *element= reinterpret_cast<rw_trx_hash_element_t*>
      (lf_hash_search(&hash, pins, reinterpret_cast<const void*>(&trx_id),
                      sizeof(trx_id_t)));
    if (element)
      mutex_enter(&element->mutex);
      lf_hash_search_unpin(pins);
      if ((trx= element->trx)) {
        DBUG_ASSERT(trx_id == trx->id);
    ...
      mutex_exit(&element->mutex);

It acquires element->mutex and then immediately unpins the pins. From that point on, the element can be deallocated and reused by another thread, even though find() is still reading it.

If we follow the rw_trx_hash_t::insert()->lf_hash_insert()->lf_alloc_new() call chain, we find no element->mutex acquisition, because the mutex is not yet initialized before the element is allocated. My assumption is that rw_trx_hash_t::insert() can therefore easily reuse the chunk unpinned in rw_trx_hash_t::find().

The scenario is the following:

1. Thread 1 has just executed lf_hash_search() in rw_trx_hash_t::find(), but has not acquired element->mutex yet.
2. Thread 2 removes the element from the hash table with an rw_trx_hash_t::erase() call.
3. Thread 1 acquires element->mutex and unpins pin 2 with the lf_hash_search_unpin(pins) call.
4. Some thread purges the element's memory.
5. Thread 3 reuses the memory for a new element and fills element->id and element->trx.
6. Thread 1 crashes on the failed "DBUG_ASSERT(trx_id == trx->id)" assertion.
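
The interleaving above can be replayed deterministically in a small single-threaded model. This is only a sketch: Trx, Element, try_reuse() and replay() are hypothetical stand-ins, not the real InnoDB or lf_hash types, and the model assumes that memory may be recycled only when the element is neither in the hash nor pinned.

```cpp
#include <cstdint>

// Hypothetical stand-ins; these are NOT the real InnoDB/lf_hash structures.
struct Trx { std::uint64_t id; };

struct Element {
  std::uint64_t id = 0;
  Trx *trx = nullptr;
  bool pinned = false;   // models the pin taken by the search
  bool in_hash = false;  // models membership in the hash table
};

// Models purge + reuse (steps 4 and 5): the memory is recycled only if the
// element is neither in the hash nor pinned. Note that no mutex is taken,
// mirroring the observation about the insert path.
bool try_reuse(Element &e, Trx &new_trx) {
  if (e.in_hash || e.pinned) return false;
  e.id = new_trx.id;
  e.trx = &new_trx;
  e.in_hash = true;
  return true;
}

// Replays steps 1-6 in a fixed order. Returns true if thread 1 still sees
// trx->id == trx_id at the end, i.e. the assertion would hold.
bool replay(bool unpin_before_read) {
  Trx old_trx{10}, new_trx{20};
  Element e;
  e.id = 10; e.trx = &old_trx; e.in_hash = true;
  const std::uint64_t trx_id = 10;

  e.pinned = true;                        // step 1: search pins the element
  e.in_hash = false;                      // step 2: erase() by thread 2
  if (unpin_before_read) e.pinned = false; // step 3: early unpin (the bug)
  try_reuse(e, new_trx);                  // steps 4-5: purge and reuse
  Trx *trx = e.trx;                       // step 6: thread 1 reads the element
  const bool ok = trx != nullptr && trx->id == trx_id;
  e.pinned = false;                       // late unpin (no-op if already done)
  return ok;
}
```

In this model replay(true) returns false (thread 1 observes the id of the new transaction), while replay(false) returns true because the pin blocks the reuse until thread 1 is done.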

The fix is to invoke "lf_hash_search_unpin(pins);" after the "mutex_exit(&element->mutex);" call in rw_trx_hash_t::find(), so that the element stays pinned for as long as it is being read.
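
The required ordering (unpin only after the mutex is released) can also be illustrated with scope guards, where the destruction order enforces it. This is a sketch of the ordering only, not the actual patch: PinGuard, LoggedMutex and find_fixed_order() are hypothetical names, and the PinGuard destructor merely plays the role of lf_hash_search_unpin().

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the pin taken by a lock-free hash search.
// Its destructor plays the role of lf_hash_search_unpin().
class PinGuard {
public:
  explicit PinGuard(std::vector<std::string> &log) : log_(log) {
    log_.push_back("pin");
  }
  ~PinGuard() { log_.push_back("unpin"); }
  PinGuard(const PinGuard &) = delete;
  PinGuard &operator=(const PinGuard &) = delete;
private:
  std::vector<std::string> &log_;
};

// Lockable wrapper that records lock/unlock so the order can be inspected.
class LoggedMutex {
public:
  explicit LoggedMutex(std::vector<std::string> &log) : log_(log) {}
  void lock() { m_.lock(); log_.push_back("mutex_enter"); }
  void unlock() { log_.push_back("mutex_exit"); m_.unlock(); }
private:
  std::mutex m_;
  std::vector<std::string> &log_;
};

// Returns the sequence of events for one lookup with the fixed ordering.
std::vector<std::string> find_fixed_order() {
  std::vector<std::string> log;
  LoggedMutex element_mutex(log);
  {
    PinGuard pin(log);  // declared first, therefore destroyed last
    std::lock_guard<LoggedMutex> lock(element_mutex);
    log.push_back("read element->trx");  // covered by both pin and mutex
  }  // destructors run in reverse order: mutex_exit, then unpin
  return log;
}
```

Calling find_fixed_order() yields the events pin, mutex_enter, read element->trx, mutex_exit, unpin, in that order. Declaring the pin guard before the lock guard is what guarantees that the unpin cannot happen while the mutex is still held.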

The above scenario is indirectly confirmed by the following trick. If we add one my_sleep(1) before the mutex_enter(&element->mutex) call and another my_sleep(1) after the lf_hash_search_unpin(pins) call in rw_trx_hash_t::find(), the assertion failure reproduces much faster with rpl_debug.test, the test case that triggered it.

To reproduce it, just run several instances of the test in parallel in a loop, for example:

./mtr --max-test-fail=1 --suite-timeout=999999999 --testcase-timeout=99999999 --parallel=60 rpl_debug{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} --mem --repeat=5000

Without the trick described above, it can take up to 12 hours to reproduce; with the trick, it reproduces within several minutes.

The following comments can also be useful for bug analysis: 1, 2.

UPD: the scenario is fully confirmed by an rr trace recorded with the above delays and rpl_debug.test.

Attachments

- trx_sys_t_find_lf_hash_error.diff (4 kB, 2023-05-11 12:22)
- rpl_debug.test (2 kB, 2023-05-04 06:51)

Issue Links

causes

MDEV-31038 Parallel Replication Breaks if XA PREPARE Fails Updating Slave GTID State

Closed

relates to

MDEV-31780 InnoDB: Assertion failure in file D:\winx64-packages\build\src\storage\innobase\trx\trx0trx.cc line 1252

Closed

MDEV-31038 Parallel Replication Breaks if XA PREPARE Fails Updating Slave GTID State

Closed

People

Assignee: Vladislav Lesin

Reporter: Vladislav Lesin

Votes: 1

Watchers: 7

Dates

Created: 2023-05-04 06:50

Updated: 2023-12-06 20:06

Resolved: 2023-05-19 13:53
