Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32269

InnoDB after ALTER TABLE…IMPORT TABLESPACE may not be crash safe

Details

    Description

      Today, I got a local test failure that looks much like this old failure:

      10.6 cce994057bf7e2bdb62686075de03aeb

      CURRENT_TEST: innodb.import_corrupted
      mysqltest: At line 63: query 'ALTER TABLE t2 IMPORT TABLESPACE' failed with wrong errno <Unknown> (2013): 'Lost connection to server during query', instead of ER_NOT_KEYFILE (1034)...
      ...
      2022-02-09 21:18:29 79 [ERROR] InnoDB: Trying to read 16384 bytes at 376832 outside the bounds of the file: ./test/t2.ibd
      mariadbd: /home/buildbot/buildbot/build/mariadb-10.6.7/storage/innobase/buf/buf0flu.cc:2507: void buf_flush_validate_low(): Assertion `om == 1 || !bpage || __builtin_expect(recv_sys.recovery_on, (0)) || om >= bpage->oldest_modification()' failed.
      

      In the core dump of my local failure, the IMPORT thread was waiting in mtr_t::commit():

            if (UNIV_UNLIKELY(m_made_dirty)) /* This should be IMPORT TABLESPACE */
              mysql_mutex_lock(&log_sys.flush_order_mutex);
      

      The flush list validation was executing as part of a mtr_t::commit() in trx_purge_truncate_rseg_history(). That mini-transaction was only holding page latches on tablespace 0x44 (the tablespace that is being imported).

      The bpage in the assertion expression was page 344 in the system tablespace. The bpage->list.prev was page 13 in the tablespace that is being imported. Neither page was buffer-fixed or latched, and apparently no asynchronous page reads or writes were in progress:

      (gdb) p write_slots.m_cache.m_pos
      $4 = 0
      (gdb) p read_slots.m_cache.m_pos
      $5 = 0
      

      Notably, the status of the previous page (13 in tablespace 0x44 in my case) is buf_page_t::FREED. While searching for failures in the cross-reference, I found several hangs during a test, and also the following assertion failure that could share a root cause with this one:

      bb-10.8-merge ed204a5cf9ec524e00f9ae9680d68911599c8ae5

      2022-11-17 18:10:01 7 [Note] InnoDB: Phase II - Purge records from index `PRIMARY`
      Assertion failed: s > buf_page_t::FREED, file D:\Buildbot\amd64-windows\build\storage\innobase\mtr\mtr0mtr.cc, line 230
      

      Attachments

        Issue Links

          Activity

            This was observed in another IMPORT TABLESPACE test: https://buildbot.mariadb.org/#/builders/534/builds/11223/steps/7/logs/stdio

            10.6 b52b7b41296fa8857bcd2abc42f9ab6721b86f8b

            CURRENT_TEST: innodb_zip.wl5522_debug_zip
            mysqltest: At line 372: query 'ALTER TABLE t1 IMPORT TABLESPACE' failed: <Unknown> (2013): Lost connection to server during query
            2023-11-06 14:58:39 3 [Note] InnoDB: Phase II - Purge records from index `idx1`
            mariadbd: /home/buildbot/amd64-ubuntu-2204-debug-ps/build/storage/innobase/buf/buf0flu.cc:2642: void buf_flush_validate_low(): Assertion `om == 1 || !bpage || __builtin_expect(recv_sys.recovery_on, (0)) || om >= bpage->oldest_modification()' failed.
            

            marko Marko Mäkelä added a comment - This was observed in another IMPORT TABLESPACE test: https://buildbot.mariadb.org/#/builders/534/builds/11223/steps/7/logs/stdio 10.6 b52b7b41296fa8857bcd2abc42f9ab6721b86f8b CURRENT_TEST: innodb_zip.wl5522_debug_zip mysqltest: At line 372: query 'ALTER TABLE t1 IMPORT TABLESPACE' failed: <Unknown> (2013): Lost connection to server during query … 2023-11-06 14:58:39 3 [Note] InnoDB: Phase II - Purge records from index `idx1` mariadbd: /home/buildbot/amd64-ubuntu-2204-debug-ps/build/storage/innobase/buf/buf0flu.cc:2642: void buf_flush_validate_low(): Assertion `om == 1 || !bpage || __builtin_expect(recv_sys.recovery_on, (0)) || om >= bpage->oldest_modification()' failed.

            mleich was able to reproduce an rr replay trace of this. The following would seem to fix this:

            diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc
            index 7ce54a9b04d..ccbd88c5231 100644
            --- a/storage/innobase/mtr/mtr0mtr.cc
            +++ b/storage/innobase/mtr/mtr0mtr.cc
            @@ -160,10 +160,16 @@ void mtr_t::commit()
                 {
                   ut_ad(m_log_mode == MTR_LOG_NO_REDO);
                   ut_ad(m_log.size() == 0);
            -      m_commit_lsn= log_sys.get_lsn();
            -      lsns= { m_commit_lsn, PAGE_FLUSH_NO };
                   if (UNIV_UNLIKELY(m_made_dirty)) /* This should be IMPORT TABLESPACE */
            +      {
            +        mysql_mutex_lock(&log_sys.mutex);
            +        m_commit_lsn= log_sys.get_lsn();
                     mysql_mutex_lock(&log_sys.flush_order_mutex);
            +        mysql_mutex_unlock(&log_sys.mutex);
            +      }
            +      else
            +        m_commit_lsn= log_sys.get_lsn();
            +      lsns= { m_commit_lsn, PAGE_FLUSH_NO };
                 }
             
                 if (m_freed_pages)
            

            As far as I can tell, this should be applicable to 10.5 too.

            The user impact on non-debug builds should be that the server may not be crash safe after an ALTER TABLE…IMPORT TABLESPACE

            I found a related failure on 10.5, which I think could be fixed by the above patch.

            10.5 6af0bd69074725c8d8b10f07ed4ccc01

            innodb_zip.wl5522_debug_zip '4k,innodb'  w2 [ fail ]
                    Test ended at 2020-04-18 10:39:10
             
            CURRENT_TEST: innodb_zip.wl5522_debug_zip
            mysqltest: At line 378: query 'INSERT INTO t1
            SELECT 100, REPEAT('Karanbir', 128), REPEAT('Ajeeth', 1200)
            FROM seq_1_to_256' failed: 2013: Lost connection to MySQL server during query
            2020-04-18 10:39:06 3 [Note] InnoDB: Phase IV - Flush complete
            2020-04-18 10:39:06 3 [Note] InnoDB: `test`.`t1` autoinc value set to 129
            mariadbd: /home/buildbot/buildbot/build/mariadb-10.5.3/storage/innobase/log/log0log.cc:1367: bool log_checkpoint(): Assertion `oldest_lsn >= log_sys.last_checkpoint_lsn' failed.
            

            marko Marko Mäkelä added a comment - mleich was able to reproduce an rr replay trace of this. The following would seem to fix this: diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc index 7ce54a9b04d..ccbd88c5231 100644 --- a/storage/innobase/mtr/mtr0mtr.cc +++ b/storage/innobase/mtr/mtr0mtr.cc @@ -160,10 +160,16 @@ void mtr_t::commit() { ut_ad(m_log_mode == MTR_LOG_NO_REDO); ut_ad(m_log.size() == 0); - m_commit_lsn= log_sys.get_lsn(); - lsns= { m_commit_lsn, PAGE_FLUSH_NO }; if (UNIV_UNLIKELY(m_made_dirty)) /* This should be IMPORT TABLESPACE */ + { + mysql_mutex_lock(&log_sys.mutex); + m_commit_lsn= log_sys.get_lsn(); mysql_mutex_lock(&log_sys.flush_order_mutex); + mysql_mutex_unlock(&log_sys.mutex); + } + else + m_commit_lsn= log_sys.get_lsn(); + lsns= { m_commit_lsn, PAGE_FLUSH_NO }; } if (m_freed_pages) As far as I can tell, this should be applicable to 10.5 too. The user impact on non-debug builds should be that the server may not be crash safe after an ALTER TABLE…IMPORT TABLESPACE I found a related failure on 10.5, which I think could be fixed by the above patch. 10.5 6af0bd69074725c8d8b10f07ed4ccc01 innodb_zip.wl5522_debug_zip '4k,innodb' w2 [ fail ] Test ended at 2020-04-18 10:39:10   CURRENT_TEST: innodb_zip.wl5522_debug_zip mysqltest: At line 378: query 'INSERT INTO t1 SELECT 100, REPEAT('Karanbir', 128), REPEAT('Ajeeth', 1200) FROM seq_1_to_256' failed: 2013: Lost connection to MySQL server during query … 2020-04-18 10:39:06 3 [Note] InnoDB: Phase IV - Flush complete 2020-04-18 10:39:06 3 [Note] InnoDB: `test`.`t1` autoinc value set to 129 mariadbd: /home/buildbot/buildbot/build/mariadb-10.5.3/storage/innobase/log/log0log.cc:1367: bool log_checkpoint(): Assertion `oldest_lsn >= log_sys.last_checkpoint_lsn' failed.

            LGTM

            vlad.lesin Vladislav Lesin added a comment - LGTM

            In MariaDB Server 10.8 and later major releases, this code was last touched by MDEV-28708, and it is not affected by this race condition:

            /** Insert a modified block into buf_pool.flush_list on IMPORT TABLESPACE. */
            static void insert_imported(buf_block_t *block)
            {
              if (block->page.oldest_modification() <= 1)
              {
                log_sys.latch.rd_lock(SRW_LOCK_CALL);
                const lsn_t lsn= log_sys.last_checkpoint_lsn;
                mysql_mutex_lock(&buf_pool.flush_list_mutex);
                buf_pool.insert_into_flush_list
                  (buf_pool.prepare_insert_into_flush_list(lsn), block, lsn);
                log_sys.latch.rd_unlock();
                mysql_mutex_unlock(&buf_pool.flush_list_mutex);
              }
            }
            

            marko Marko Mäkelä added a comment - In MariaDB Server 10.8 and later major releases, this code was last touched by MDEV-28708 , and it is not affected by this race condition: /** Insert a modified block into buf_pool.flush_list on IMPORT TABLESPACE. */ static void insert_imported(buf_block_t *block) { if (block->page.oldest_modification() <= 1) { log_sys.latch.rd_lock(SRW_LOCK_CALL); const lsn_t lsn= log_sys.last_checkpoint_lsn; mysql_mutex_lock(&buf_pool.flush_list_mutex); buf_pool.insert_into_flush_list (buf_pool.prepare_insert_into_flush_list(lsn), block, lsn); log_sys.latch.rd_unlock(); mysql_mutex_unlock(&buf_pool.flush_list_mutex); } }

            It turns out that 10.8 and later major releases in fact were affected by something similar, filed as MDEV-33363.

            marko Marko Mäkelä added a comment - It turns out that 10.8 and later major releases in fact were affected by something similar, filed as MDEV-33363 .

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.