[MDEV-35626] Race condition between buf_page_create_low() and read completion - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6.20, 10.11.10, 11.2.6, 11.4.4, 11.6.2
Fix Version/s: 10.6.21, 10.11.11, 11.4.5, 11.7.2
Component/s: Storage Engine - InnoDB
Labels:

Description

The fix of ~~MDEV-34453~~ removed a read of a block descriptor state in case we are able to acquire an exclusive latch on the page without waiting:

diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc

index ae604f0826c..ba49754db6a 100644

--- a/storage/innobase/buf/buf0buf.cc

+++ b/storage/innobase/buf/buf0buf.cc

@@ -3322,6 +3322,8 @@ static buf_block_t *buf_page_create_low(page_id_t page_id, ulint zip_size,

         ut_ad(!bpage->is_io_fixed(state));

         ut_ad(bpage->buf_fix_count(state));

+      else

+        state= bpage->state();

       ut_ad(state >= buf_page_t::FREED);

       ut_ad(state < buf_page_t::READ_FIX);

In an rr replay trace, we have buf_page_t::read_complete() completing between the initial read of bpage->state() (before the above code snippet) and the time the last quoted assertion on failing because state == buf_page_t::READ_FIX + 1.

As far as I can tell, this bug may cause a totally incorrect state of the block to be set in a subsequent call:

        bpage->set_reinit(state & buf_page_t::LRU_MASK);

Furthermore, there seems to be some duplicated or redundant code that affects ROW_FORMAT=COMPRESSED pages that exist in the buffer pool only in compressed form:

@@ -3345,24 +3347,12 @@ static buf_block_t *buf_page_create_low(page_id_t page_id, ulint zip_size,

       else

-        auto state= bpage->state();

-        ut_ad(state >= buf_page_t::FREED);

-        ut_ad(state < buf_page_t::READ_FIX);

         page_hash_latch &hash_lock= buf_pool.page_hash.lock_get(chain);

         /* It does not make sense to use transactional_lock_guard here,

         because buf_relocate() would likely make the memory transaction

         too large. */

         hash_lock.lock();

-        if (state < buf_page_t::UNFIXED)

-          bpage->set_reinit(buf_page_t::FREED);

-        else

-        {

-          bpage->set_reinit(state & buf_page_t::LRU_MASK);

-          ibuf_exist= (state & buf_page_t::LRU_MASK) == buf_page_t::IBUF_EXIST;

-        }

         mysql_mutex_lock(&buf_pool.flush_list_mutex);

         buf_relocate(bpage, &free_block->page);

         free_block->page.lock.x_lock();

As far as I can tell, this code should not have had any ill effect. The else branch should always be taken because a state of FREED would already have been transformed to UNFIXED.

Attachments

Issue Links

is caused by

MDEV-34453 Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1

Closed

is duplicated by

MDEV-35181 Assertion 'state < buf_page_t::READ_FIX' in buf_page_create_low()

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2024-12-11 11:46

ssh sdp

rr replay /data/results/1733310532/TBR-2190/1/rr/latest-trace

10.6-MDEV-35494 1a4b97f1b4dc4ad4b71f15f98ae3a4c83cf9e300
mariadbd: /data/Server/10.6-MDEV-35494/storage/innobase/buf/buf0buf.cc:3327: buf_block_t* buf_page_create_low(page_id_t, ulint, mtr_t, buf_block_t): Assertion `state < buf_page_t::READ_FIX' failed.

Marko Mäkelä added a comment - 2024-12-11 11:46 ssh sdp rr replay /data/results/1733310532/TBR-2190/1/rr/latest-trace 10.6-MDEV-35494 1a4b97f1b4dc4ad4b71f15f98ae3a4c83cf9e300 mariadbd: /data/Server/10.6-MDEV-35494/storage/innobase/buf/buf0buf.cc:3327: buf_block_t* buf_page_create_low(page_id_t, ulint, mtr_t*, buf_block_t*): Assertion `state < buf_page_t::READ_FIX' failed.

Marko Mäkelä added a comment - 2024-12-11 11:58

We already had ~~MDEV-35181~~ about exactly the same, but without an rr replay trace.

Marko Mäkelä added a comment - 2024-12-11 11:58 We already had MDEV-35181 about exactly the same, but without an rr replay trace.

Debarun Banerjee added a comment - 2024-12-11 13:26

The issue got introduced during review of ~~MDEV-34453~~ where some local optimizations were added. Here is the simplest patch that marko has already pasted that should solve it.

https://github.com/MariaDB/server/pull/3690

Debarun Banerjee added a comment - 2024-12-11 13:26 The issue got introduced during review of MDEV-34453 where some local optimizations were added. Here is the simplest patch that marko has already pasted that should solve it. https://github.com/MariaDB/server/pull/3690

Marko Mäkelä added a comment - 2024-12-11 13:44

In https://github.com/MariaDB/server/pull/3690 I posted some analysis regarding my suggested code removal, which is not strictly needed for fixing this race condition. I may be mistaken, but it seems to me that removing that code prevent fix a bug similar to ~~MDEV-32132~~ in a ROW_FORMAT=COMPRESSED table. A possible scenario could involve some innodb_change_buffering activity followed by DROP INDEX and an index page allocation in one of the surviving indexes. Please investigate and file a separate bug if you agree about that analysis.

Marko Mäkelä added a comment - 2024-12-11 13:44 In https://github.com/MariaDB/server/pull/3690 I posted some analysis regarding my suggested code removal, which is not strictly needed for fixing this race condition. I may be mistaken, but it seems to me that removing that code prevent fix a bug similar to MDEV-32132 in a ROW_FORMAT=COMPRESSED table. A possible scenario could involve some innodb_change_buffering activity followed by DROP INDEX and an index page allocation in one of the surviving indexes. Please investigate and file a separate bug if you agree about that analysis.

Debarun Banerjee added a comment - 2024-12-13 15:46

I need to investigate the independent compressed page scenario and file a new MDEV.

Debarun Banerjee added a comment - 2024-12-13 15:46 I need to investigate the independent compressed page scenario and file a new MDEV.

Debarun Banerjee added a comment - 2024-12-17 05:35

This issue reports a debug assert and I investigated more to check the real impact of this issue in release mode. The issue is with page state being noted while the page was being read from disk.

auto state= bpage->fix();

state: 0x80 00 00 00 [buf_page_t::READ_FIX]

bpage->zip.fix: 0x80 00 00 01 [Buffer fixed]

After the page is successfully read and latched the state is set in buf_page_t::read_complete()

bpage->zip.fix: 0x20 00 00 01 [buf_page_t::UNFIXED]

We miss reading the state again after successfully latching the page and the state is 0x80 00 00 00 instead of 0x20 00 00 01. In buf_page_create_low, we use this state information only to update buf_page_t::REINIT state.

if (state < buf_page_t::UNFIXED)

   bpage->set_reinit(buf_page_t::FREED);

else

   bpage->set_reinit(state & buf_page_t::LRU_MASK);

buf_page_t::REINIT: 0x60 00 00 00

After the IO completion we go to else part, and the state is set as

 void set_reinit(uint32_t prev_state)

  ut_d(const auto s=) zip.fix.fetch_add(REINIT - prev_state);

...

0x20 00 00 01  + 0x60 00 00 00 - 0x80 00 00 00 = 0x00 00 00 01

Ideally it should have been ...

0x20 00 00 01  + 0x60 00 00 00 - 0x20 00 00 00 = 0x60 00 00 01

So, in effect we simply fail to set the buf_page_t::REINIT state and the rest of the state is correct. The ::REINIT state allows for skipping the double-write buffer for newly created pages and is only an optimization. Thus the current issue doesn't seem to create any functional impact in release mode. I checked the same by enforcing such incorrect state in release mode and no impacts were visible.

While the issue might result is some asserts being triggered, it cannot cause any data corruption and likely no visible impact for end user.

Debarun Banerjee added a comment - 2024-12-17 05:35 This issue reports a debug assert and I investigated more to check the real impact of this issue in release mode. The issue is with page state being noted while the page was being read from disk. auto state= bpage->fix(); state: 0x80 00 00 00 [buf_page_t::READ_FIX] bpage->zip.fix: 0x80 00 00 01 [Buffer fixed] After the page is successfully read and latched the state is set in buf_page_t::read_complete() bpage->zip.fix: 0x20 00 00 01 [buf_page_t::UNFIXED] We miss reading the state again after successfully latching the page and the state is 0x80 00 00 00 instead of 0x20 00 00 01. In buf_page_create_low, we use this state information only to update buf_page_t::REINIT state. if (state < buf_page_t::UNFIXED) bpage->set_reinit(buf_page_t::FREED); else bpage->set_reinit(state & buf_page_t::LRU_MASK); buf_page_t::REINIT: 0x60 00 00 00 After the IO completion we go to else part, and the state is set as void set_reinit(uint32_t prev_state) { ut_d(const auto s=) zip.fix.fetch_add(REINIT - prev_state); ... 0x20 00 00 01 + 0x60 00 00 00 - 0x80 00 00 00 = 0x00 00 00 01 Ideally it should have been ... 0x20 00 00 01 + 0x60 00 00 00 - 0x20 00 00 00 = 0x60 00 00 01 So, in effect we simply fail to set the buf_page_t::REINIT state and the rest of the state is correct. The ::REINIT state allows for skipping the double-write buffer for newly created pages and is only an optimization. Thus the current issue doesn't seem to create any functional impact in release mode. I checked the same by enforcing such incorrect state in release mode and no impacts were visible. While the issue might result is some asserts being triggered, it cannot cause any data corruption and likely no visible impact for end user.

Debarun Banerjee added a comment - 2024-12-18 08:20

For the independent issue observed for compressed row format I have filed ~~MDEV-35679~~.

Debarun Banerjee added a comment - 2024-12-18 08:20 For the independent issue observed for compressed row format I have filed MDEV-35679 .

People

Assignee:: Debarun Banerjee

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2024-12-11 11:42

Updated:: 2024-12-18 08:20

Resolved:: 2024-12-13 15:46

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server