  1. MariaDB Server
  2. MDEV-33189

Server crash after reading outside of bounds on ibdata1, file corrupted, no auto-recovery

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Incomplete
    • 10.11.6, 11.2(EOL), 11.4
    • N/A
    • None
    • Debian GNU/Linux 11 (bullseye)
      Dell PowerEdge R750
      XFS filesystem

    Description

      We experienced a one-time server crash in production, so far not reproducible.

      We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11), installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye), as the database primary for a read- and write-heavy application. It runs on a bare-metal Dell PowerEdge R750 server with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz) and 512 GiB RAM, on a software RAID-1 of NVMe drives with an XFS filesystem.

      The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

      The server itself crashed with:

      InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
      InnoDB: File './ibdata1' is corrupted
      

      and two assertion failures in trx0undo.cc and buf0lru.cc. All subsequent restart attempts failed so we switched the application over to the replica database.

      We did not attempt any forced recovery. The assertion failures:

      InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
      InnoDB: Failing assertion: rollback
      231221 14:24:48 [ERROR] mysqld got signal 6 ;
      

      The backtrace printed only one line before the next assertion failure occurred.

      stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
      InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
      InnoDB: Failing assertion: !block->page.in_file()
      

      See attachment db-syslog.2023-12-21.txt for all the relevant syslog entries.

      We have preserved the corrupt 716 MiB ibdata1 (750780416 B) file for further inspection, should the need arise.
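
The numbers in the error message can be put in relation to the file size with a little arithmetic. The following is a minimal sketch, not a diagnosis: it assumes that the InnoDB page size equals the 16384-byte read length from the log, and the names `page_number`, `pages_in_file` and `print_offset_summary` are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Values copied from the error message and the description above.
constexpr std::uint64_t kReadOffset = 70368744161280ULL;  // byte offset of the failed read
constexpr std::uint64_t kReadLength = 16384;              // bytes requested
constexpr std::uint64_t kFileSize   = 750780416ULL;       // size of the preserved ibdata1

// Assumption: the page size equals the read length (16 KiB).
constexpr std::uint64_t kPageSize = kReadLength;

// Page number that a given byte offset falls into.
constexpr std::uint64_t page_number(std::uint64_t byte_offset) {
  return byte_offset / kPageSize;
}

// Number of whole pages contained in a file of the given size.
constexpr std::uint64_t pages_in_file(std::uint64_t file_size) {
  return file_size / kPageSize;
}

// Prints the derived numbers; call it from main() or a small test harness.
inline void print_offset_summary() {
  assert(kReadOffset % kPageSize == 0);  // the failed read was page-aligned
  std::printf("requested page: %llu (0x%llX)\n",
              static_cast<unsigned long long>(page_number(kReadOffset)),
              static_cast<unsigned long long>(page_number(kReadOffset)));
  std::printf("pages in file:  %llu\n",
              static_cast<unsigned long long>(pages_in_file(kFileSize)));
}
```

Under that assumption, the requested page number is 4294967295 (0xFFFFFFFF), while the preserved file holds only 45824 pages (750780416 B = 716 MiB). 0xFFFFFFFF is the value InnoDB uses for FIL_NULL, its "no page" marker, so the failed read looks like an attempt to follow a null page reference rather than an access slightly past the end of the file. This is an observation from the arithmetic only.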

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

            I retested a CMAKE_BUILD_TYPE=RelWithDebInfo build of the same 11.2 commit e4cb1e3295f7e6f0e5287d97884d6149a2390d22 with the following patch:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index ebcf430d532..311ccdff802 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -2608,6 +2608,11 @@ static void buf_flush_page_cleaner()
                 }
                 else if (buf_flush_async_lsn <= oldest_lsn)
                   goto check_oldest_and_set_idle;
            +    else
            +    {
            +      abort();
            +      mysql_mutex_lock(&buf_pool.mutex);
            +    }
             
                 n= n >= n_flushed ? n - n_flushed : 0;
                 goto LRU_flush;
            diff --git a/storage/innobase/include/ut0new.h b/storage/innobase/include/ut0new.h
            index f4183e4c61a..85c2f662760 100644
            --- a/storage/innobase/include/ut0new.h
            +++ b/storage/innobase/include/ut0new.h
            @@ -234,7 +234,7 @@ struct ut_new_pfx_t {
             #endif
             };
             
            -#if defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)
            +#if 0 && defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)
             static inline void ut_dontdump(void *ptr, size_t m_size, bool dontdump)
             {
             	ut_a(ptr != NULL);
            

            The first hunk would cause the server to crash if the code line was executed. The second hunk would ensure that any core dumps will include a copy of the buffer pool.
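
For readers unfamiliar with the mechanism behind the second hunk: the preprocessor guard mentions HAVE_MADVISE and MADV_DODUMP, which suggests that the buffer pool is excluded from core dumps via the Linux madvise(2) advice values MADV_DONTDUMP and MADV_DODUMP. The following is a minimal standalone sketch of that mechanism under this assumption. The helpers `allocate_region`, `release_region` and `set_core_dump_inclusion` are hypothetical and are not the InnoDB functions.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Hypothetical helper (not InnoDB code): map an anonymous region that
// stands in for a large buffer such as a buffer pool.
inline void *allocate_region(std::size_t size) {
  void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

// Hypothetical helper: unmap a region obtained from allocate_region().
inline void release_region(void *ptr, std::size_t size) {
  munmap(ptr, size);
}

// Hypothetical helper: ask the kernel to include (true) or exclude (false)
// the region in core dumps. Returns true if the advice was accepted.
inline bool set_core_dump_inclusion(void *ptr, std::size_t size, bool include) {
#if defined(MADV_DONTDUMP) && defined(MADV_DODUMP)
  return madvise(ptr, size, include ? MADV_DODUMP : MADV_DONTDUMP) == 0;
#else
  // Advice values not available on this platform; nothing to do.
  (void) ptr;
  (void) size;
  (void) include;
  return false;
#endif
}
```

A region marked with MADV_DONTDUMP is left out of core dumps, which keeps them small but hides page contents from post-mortem analysis. Disabling that marking, as the patch does, trades larger core files for the ability to inspect buffer pool pages after a crash.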

            On the first run of the default test suites, there was no crash. I then started 60 concurrent runs of the test stress.ddl_innodb. The code was not reached in the first attempt, which used the mtr default innodb_log_file_size=10m. Second attempt: innodb_log_file_size=100m. Third attempt: additionally set innodb_buffer_pool_size=5m (instead of the mtr default innodb_buffer_pool_size=8m). Fourth attempt: additionally set innodb_io_capacity=10000. Fifth attempt: changed to innodb_buffer_pool_size=100m. None of these attempts reached this piece of code.

            I think that we will need a longer-running, DML-heavy test to cover this piece of code. The stress.ddl_innodb only runs for 160 seconds (less than 3 minutes).

            The test atomic.rename_table, which is where I observed the debug assertion failure that pointed to this mutex, is not usable in this build, because it depends on debug injection.


            marko Marko Mäkelä added a comment

            I filed MDEV-33275 for the observed bug that might explain this. We do not have enough data to say whether it really fixes this.

            marko Marko Mäkelä added a comment

            On our CI systems, the test stress.ddl_innodb occasionally fails. The first failure that loosely matches the symptoms here occurred 3 months after MDEV-26827 had been pushed:

            bb-11.2-MDEV-5816 ffb445d2b9337478b8bd750b06dac7336983503d

            stress.ddl_innodb 'innodb'               w9 [ fail ]
                    Test ended at 2023-06-14 04:49:36

            CURRENT_TEST: stress.ddl_innodb
            mysqltest: In included file "./suite/stress/include/ddl4.inc": 
            included from /home/buildbot/aarch64-fedora-37/build/mysql-test/suite/stress/t/ddl_innodb.test at line 41:
            At line 331: query 'EXECUTE create_table1' failed: <Unknown> (2013): Lost connection to server during query
            Server log from this test:
            ----------SERVER LOG START-----------
            2023-06-14 04:49:35 0xffff9c218060  InnoDB: Assertion failure in file /home/buildbot/aarch64-fedora-37/build/storage/innobase/trx/trx0purge.cc line 354
            InnoDB: Failing assertion: flst_add_first(rseg_header, TRX_RSEG + TRX_RSEG_HISTORY, undo_page, uint16_t(page_offset(undo_header) + TRX_UNDO_HISTORY_NODE), mtr) == DB_SUCCESS

            marko Marko Mäkelä added a comment

            The fix of MDEV-33275 was included in 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3. Is this bug reproducible with those server versions?

            There is also the open bug MDEV-33363, which could potentially explain this.

            marko Marko Mäkelä added a comment

            Unfortunately, we still have some corruption going on. See the linked related tickets.

            People

              marko Marko Mäkelä
              wschemmel Wolfgang Schemmel