Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35626

Race condition between buf_page_create_low() and read completion

Details

    Description

      The fix of MDEV-34453 removed a read of a block descriptor state in case we are able to acquire an exclusive latch on the page without waiting:

      diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
      index ae604f0826c..ba49754db6a 100644
      --- a/storage/innobase/buf/buf0buf.cc
      +++ b/storage/innobase/buf/buf0buf.cc
      @@ -3322,6 +3322,8 @@ static buf_block_t *buf_page_create_low(page_id_t page_id, ulint zip_size,
               ut_ad(!bpage->is_io_fixed(state));
               ut_ad(bpage->buf_fix_count(state));
             }
      +      else
      +        state= bpage->state();
       
             ut_ad(state >= buf_page_t::FREED);
             ut_ad(state < buf_page_t::READ_FIX);
      

      In an rr replay trace, we have buf_page_t::read_complete() completing between the initial read of bpage->state() (before the above code snippet) and the time the last quoted assertion on failing because state == buf_page_t::READ_FIX + 1.

      As far as I can tell, this bug may cause a totally incorrect state of the block to be set in a subsequent call:

              bpage->set_reinit(state & buf_page_t::LRU_MASK);
      

      Furthermore, there seems to be some duplicated or redundant code that affects ROW_FORMAT=COMPRESSED pages that exist in the buffer pool only in compressed form:

      @@ -3345,24 +3347,12 @@ static buf_block_t *buf_page_create_low(page_id_t page_id, ulint zip_size,
             }
             else
             {
      -        auto state= bpage->state();
      -        ut_ad(state >= buf_page_t::FREED);
      -        ut_ad(state < buf_page_t::READ_FIX);
      -
               page_hash_latch &hash_lock= buf_pool.page_hash.lock_get(chain);
               /* It does not make sense to use transactional_lock_guard here,
               because buf_relocate() would likely make the memory transaction
               too large. */
               hash_lock.lock();
       
      -        if (state < buf_page_t::UNFIXED)
      -          bpage->set_reinit(buf_page_t::FREED);
      -        else
      -        {
      -          bpage->set_reinit(state & buf_page_t::LRU_MASK);
      -          ibuf_exist= (state & buf_page_t::LRU_MASK) == buf_page_t::IBUF_EXIST;
      -        }
      -
               mysql_mutex_lock(&buf_pool.flush_list_mutex);
               buf_relocate(bpage, &free_block->page);
               free_block->page.lock.x_lock();
      

      As far as I can tell, this code should not have had any ill effect. The else branch should always be taken because a state of FREED would already have been transformed to UNFIXED.

      Attachments

        Issue Links

          Activity

            ssh sdp
            rr replay /data/results/1733310532/TBR-2190/1/rr/latest-trace
            

            10.6-MDEV-35494 1a4b97f1b4dc4ad4b71f15f98ae3a4c83cf9e300

            mariadbd: /data/Server/10.6-MDEV-35494/storage/innobase/buf/buf0buf.cc:3327: buf_block_t* buf_page_create_low(page_id_t, ulint, mtr_t*, buf_block_t*): Assertion `state < buf_page_t::READ_FIX' failed.
            

            marko Marko Mäkelä added a comment - ssh sdp rr replay /data/results/1733310532/TBR-2190/1/rr/latest-trace 10.6-MDEV-35494 1a4b97f1b4dc4ad4b71f15f98ae3a4c83cf9e300 mariadbd: /data/Server/10.6-MDEV-35494/storage/innobase/buf/buf0buf.cc:3327: buf_block_t* buf_page_create_low(page_id_t, ulint, mtr_t*, buf_block_t*): Assertion `state < buf_page_t::READ_FIX' failed.

            We already had MDEV-35181 about exactly the same, but without an rr replay trace.

            marko Marko Mäkelä added a comment - We already had MDEV-35181 about exactly the same, but without an rr replay trace.

            The issue got introduced during review of MDEV-34453 where some local optimizations were added. Here is the simplest patch that marko has already pasted that should solve it.

            https://github.com/MariaDB/server/pull/3690

            debarun Debarun Banerjee added a comment - The issue got introduced during review of MDEV-34453 where some local optimizations were added. Here is the simplest patch that marko has already pasted that should solve it. https://github.com/MariaDB/server/pull/3690

            In https://github.com/MariaDB/server/pull/3690 I posted some analysis regarding my suggested code removal, which is not strictly needed for fixing this race condition. I may be mistaken, but it seems to me that removing that code prevent fix a bug similar to MDEV-32132 in a ROW_FORMAT=COMPRESSED table. A possible scenario could involve some innodb_change_buffering activity followed by DROP INDEX and an index page allocation in one of the surviving indexes. Please investigate and file a separate bug if you agree about that analysis.

            marko Marko Mäkelä added a comment - In https://github.com/MariaDB/server/pull/3690 I posted some analysis regarding my suggested code removal, which is not strictly needed for fixing this race condition. I may be mistaken, but it seems to me that removing that code prevent fix a bug similar to MDEV-32132 in a ROW_FORMAT=COMPRESSED table. A possible scenario could involve some innodb_change_buffering activity followed by DROP INDEX and an index page allocation in one of the surviving indexes. Please investigate and file a separate bug if you agree about that analysis.

            I need to investigate the independent compressed page scenario and file a new MDEV.

            debarun Debarun Banerjee added a comment - I need to investigate the independent compressed page scenario and file a new MDEV.

            This issue reports a debug assert and I investigated more to check the real impact of this issue in release mode. The issue is with page state being noted while the page was being read from disk.

            auto state= bpage->fix();
            state: 0x80 00 00 00 [buf_page_t::READ_FIX]
            bpage->zip.fix: 0x80 00 00 01 [Buffer fixed]
            

            After the page is successfully read and latched the state is set in buf_page_t::read_complete()

            bpage->zip.fix: 0x20 00 00 01 [buf_page_t::UNFIXED]
            

            We miss reading the state again after successfully latching the page and the state is 0x80 00 00 00 instead of 0x20 00 00 01. In buf_page_create_low, we use this state information only to update buf_page_t::REINIT state.

            if (state < buf_page_t::UNFIXED)
               bpage->set_reinit(buf_page_t::FREED);
            else 
               bpage->set_reinit(state & buf_page_t::LRU_MASK);
            

            buf_page_t::REINIT: 0x60 00 00 00

            After the IO completion we go to else part, and the state is set as

             void set_reinit(uint32_t prev_state)
            {
              ut_d(const auto s=) zip.fix.fetch_add(REINIT - prev_state);
              ...
            

            0x20 00 00 01  + 0x60 00 00 00 - 0x80 00 00 00 = 0x00 00 00 01
            

            Ideally it should have been ...

            0x20 00 00 01  + 0x60 00 00 00 - 0x20 00 00 00 = 0x60 00 00 01
            

            So, in effect we simply fail to set the buf_page_t::REINIT state and the rest of the state is correct. The ::REINIT state allows for skipping the double-write buffer for newly created pages and is only an optimization. Thus the current issue doesn't seem to create any functional impact in release mode. I checked the same by enforcing such incorrect state in release mode and no impacts were visible.

            While the issue might result is some asserts being triggered, it cannot cause any data corruption and likely no visible impact for end user.

            debarun Debarun Banerjee added a comment - This issue reports a debug assert and I investigated more to check the real impact of this issue in release mode. The issue is with page state being noted while the page was being read from disk. auto state= bpage->fix(); state: 0x80 00 00 00 [buf_page_t::READ_FIX] bpage->zip.fix: 0x80 00 00 01 [Buffer fixed] After the page is successfully read and latched the state is set in buf_page_t::read_complete() bpage->zip.fix: 0x20 00 00 01 [buf_page_t::UNFIXED] We miss reading the state again after successfully latching the page and the state is 0x80 00 00 00 instead of 0x20 00 00 01. In buf_page_create_low, we use this state information only to update buf_page_t::REINIT state. if (state < buf_page_t::UNFIXED) bpage->set_reinit(buf_page_t::FREED); else bpage->set_reinit(state & buf_page_t::LRU_MASK); buf_page_t::REINIT: 0x60 00 00 00 After the IO completion we go to else part, and the state is set as void set_reinit(uint32_t prev_state) { ut_d(const auto s=) zip.fix.fetch_add(REINIT - prev_state); ... 0x20 00 00 01 + 0x60 00 00 00 - 0x80 00 00 00 = 0x00 00 00 01 Ideally it should have been ... 0x20 00 00 01 + 0x60 00 00 00 - 0x20 00 00 00 = 0x60 00 00 01 So, in effect we simply fail to set the buf_page_t::REINIT state and the rest of the state is correct. The ::REINIT state allows for skipping the double-write buffer for newly created pages and is only an optimization. Thus the current issue doesn't seem to create any functional impact in release mode. I checked the same by enforcing such incorrect state in release mode and no impacts were visible. While the issue might result is some asserts being triggered, it cannot cause any data corruption and likely no visible impact for end user.

            For the independent issue observed for compressed row format I have filed MDEV-35679.

            debarun Debarun Banerjee added a comment - For the independent issue observed for compressed row format I have filed MDEV-35679 .

            People

              debarun Debarun Banerjee
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.