Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34453

Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1

Details

    Description

      Customer gets the following error when starting 10.6.17 immediately following sofware update from previous version, 10.6.16.

      2024-06-19 21:05:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
      2024-06-19 21:05:44 10 [ERROR] InnoDB: File './ibdata1' is corrupted
      2024-06-19 21:05:44 10 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
      2024-06-19 21:05:44 10 [ERROR] mariadbd: Index for table 'gtid_slave_pos' is corrupt; try to repair it
      2024-06-19 21:05:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
      2024-06-19 21:05:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
      2024-06-19 21:05:44 0x7f266c669700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/trx/trx0purge.cc line 268
      

      For customer, this has occurred more than one time.

      Customer discovered a workaround:
      Start database one time with

      innodb_force_recovery=1
      

      Shutdown, remove innodb_force_recovery configuration.
      Start database normally.

      Attachments

        Issue Links

          Activity

            70368744161280 is the byte offset of the largest 32-bit page number (4294967295, also known as FIL_NULL in the source code) when using the default innodb_page_size=16k.

            In many persistent InnoDB data structures, such as linked lists of pages, FIL_NULL is being used as a "null pointer". A list somewhere is corrupted, and we do not know where.

            It would be great if a copy of the corrupted data had been saved before engaging the workaround, and MariaDB 10.6.18 could be started on such a copy. I would like to see if the changes made in MDEV-33325 would have any impact on this. Ultimately, I would like to know where exactly the incorrect page number was being read without validation, so that we could get closer to the root cause (the write that corrupted the database).

            Based on the fact that innodb_force_recovery=1 seemed to heal the database, I assume that the corruption is somewhere in the undo logs, and as a result of this corruption, some purging of the committed transaction history was skipped. I would expect some CHECK TABLE…EXTENDED (MDEV-24402) to report some orphan secondary index records. But, we must keep in mind that such orphans can also be caused by the known bug MDEV-29823 even if this undo log corruption bug wasn’t involved.

            marko Marko Mäkelä added a comment - 70368744161280 is the byte offset of the largest 32-bit page number (4294967295, also known as FIL_NULL in the source code) when using the default innodb_page_size=16k . In many persistent InnoDB data structures, such as linked lists of pages, FIL_NULL is being used as a "null pointer". A list somewhere is corrupted, and we do not know where. It would be great if a copy of the corrupted data had been saved before engaging the workaround, and MariaDB 10.6.18 could be started on such a copy. I would like to see if the changes made in MDEV-33325 would have any impact on this. Ultimately, I would like to know where exactly the incorrect page number was being read without validation, so that we could get closer to the root cause (the write that corrupted the database). Based on the fact that innodb_force_recovery=1 seemed to heal the database, I assume that the corruption is somewhere in the undo logs, and as a result of this corruption, some purging of the committed transaction history was skipped. I would expect some CHECK TABLE…EXTENDED ( MDEV-24402 ) to report some orphan secondary index records. But, we must keep in mind that such orphans can also be caused by the known bug MDEV-29823 even if this undo log corruption bug wasn’t involved.

            I did not have time to check this. However, it would be good if debarun can check this. I think that a matching copy of the ib_logfile0 will be necessary as well. We may have to tweak some recovery code, and to hack the checkpoint fields to start recovery from an earlier checkpoint LSN, to understand the issue. Having just the ibdata1 is not that useful. If we also got the log file, we could say more.

            MDEV-32817 and the related bugs look promising, because apparently some Chinese people are able to reproduce corruption using Sysbench. I think that we must engage with them.

            marko Marko Mäkelä added a comment - I did not have time to check this. However, it would be good if debarun can check this. I think that a matching copy of the ib_logfile0 will be necessary as well. We may have to tweak some recovery code, and to hack the checkpoint fields to start recovery from an earlier checkpoint LSN, to understand the issue. Having just the ibdata1 is not that useful. If we also got the log file, we could say more. MDEV-32817 and the related bugs look promising, because apparently some Chinese people are able to reproduce corruption using Sysbench. I think that we must engage with them.

            Please provide the corrupt redo logs for further analysis.

            debarun Debarun Banerjee added a comment - Please provide the corrupt redo logs for further analysis.

            This comment for some reason has the wrong visibility:

            Marko Mäkelä added a comment - 2024-06-28 14:57 - Restricted to Developers

            It is a long shot, but in MDEV-32817 I just posted that MDEV-33588 might have fixed this corruption. This guess has not been checked yet.

            knielsen Kristian Nielsen added a comment - This comment for some reason has the wrong visibility: Marko Mäkelä added a comment - 2024-06-28 14:57 - Restricted to Developers It is a long shot, but in MDEV-32817 I just posted that MDEV-33588 might have fixed this corruption. This guess has not been checked yet.

            As a data point, I just got something similar to this in heavy benchmark runs:

            2024-08-28 14:42:54 0 [Note] /mdev34705/install/ma1104_base/bin/mariadbd: ready for connections.
            Version: '11.4.2-MariaDB-log'  socket: '/tmp/mysql.sock'  port: 3306  Source distribution
            2024-08-28 15:48:10 696 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo011
            2024-08-28 15:48:10 696 [ERROR] InnoDB: File './/undo011' is corrupted
            2024-08-28 15:48:10 0x7feb974696c0  InnoDB: Assertion failure in file /mdev34705/src/mariadb-11.4-base/storage/innobase/trx/trx0purge.cc line 268
            

            2024-08-28 15:48:11 0 [Note] Starting MariaDB 11.4.2-MariaDB-log source revision ae6684d79f4183cdc69b3bba9b146e7455addc0b as process 494163
            2024-08-28 15:48:11 0 [Note] mariadbd: Aria engine: starting recovery
            recovered pages: 0% 11% 22% 33% 51% 66% 100% (0.0 seconds); tables to flush: 2 1 0 (0.0 seconds); 
            2024-08-28 15:48:11 0 [Note] mariadbd: Aria engine: recovery done
            2024-08-28 15:48:11 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
            2024-08-28 15:48:11 0 [Note] InnoDB: Number of transaction pools: 1
            2024-08-28 15:48:11 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
            2024-08-28 15:48:11 0 [Note] InnoDB: Using Linux native AIO
            2024-08-28 15:48:11 0 [Note] InnoDB: Initializing buffer pool, total size = 8.000GiB, chunk size = 128.000MiB
            2024-08-28 15:48:11 0 [Note] InnoDB: Completed initialization of buffer pool
            2024-08-28 15:48:11 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
            2024-08-28 15:48:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=173397039313
            2024-08-28 15:48:26 0 [Note] InnoDB: Parsed redo log up to LSN=175522198374; to recover: 175470 pages
            2024-08-28 15:48:34 0 [ERROR] InnoDB: Corrupted page identifier at 176274948073; set innodb_force_recovery=1 to ignore the record.
            2024-08-28 15:48:34 0 [Note] InnoDB: End of log at LSN=176274948073
            2024-08-28 15:48:34 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
            

            I got this with old 11.4 version on commit ae6684d79f4183cdc69b3bba9b146e7455addc0b . Note that this version does not have the MDEV-33588 fix.

            Full error log attached as Debian-bookworm-latest-amd64-base.err.

            This bug happened on a new clean install of the database, after 1 hour of heavy load. As seen in the attached error log, after the crash the database is reported as corrupted. Thus, unlike the original report, this is not related to upgrading.

            This is from an automated benchmark run and occurred only once so far, so I don't have any way to reproduce, nor any saved data files or redo/undo logs. But I thought it was still worth it to mention as a data point.

            knielsen Kristian Nielsen added a comment - As a data point, I just got something similar to this in heavy benchmark runs: 2024-08-28 14:42:54 0 [Note] /mdev34705/install/ma1104_base/bin/mariadbd: ready for connections. Version: '11.4.2-MariaDB-log' socket: '/tmp/mysql.sock' port: 3306 Source distribution 2024-08-28 15:48:10 696 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo011 2024-08-28 15:48:10 696 [ERROR] InnoDB: File './/undo011' is corrupted 2024-08-28 15:48:10 0x7feb974696c0 InnoDB: Assertion failure in file /mdev34705/src/mariadb-11.4-base/storage/innobase/trx/trx0purge.cc line 268 2024-08-28 15:48:11 0 [Note] Starting MariaDB 11.4.2-MariaDB-log source revision ae6684d79f4183cdc69b3bba9b146e7455addc0b as process 494163 2024-08-28 15:48:11 0 [Note] mariadbd: Aria engine: starting recovery recovered pages: 0% 11% 22% 33% 51% 66% 100% (0.0 seconds); tables to flush: 2 1 0 (0.0 seconds); 2024-08-28 15:48:11 0 [Note] mariadbd: Aria engine: recovery done 2024-08-28 15:48:11 0 [Note] InnoDB: Compressed tables use zlib 1.2.13 2024-08-28 15:48:11 0 [Note] InnoDB: Number of transaction pools: 1 2024-08-28 15:48:11 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions 2024-08-28 15:48:11 0 [Note] InnoDB: Using Linux native AIO 2024-08-28 15:48:11 0 [Note] InnoDB: Initializing buffer pool, total size = 8.000GiB, chunk size = 128.000MiB 2024-08-28 15:48:11 0 [Note] InnoDB: Completed initialization of buffer pool 2024-08-28 15:48:11 0 [Note] InnoDB: Buffered log writes (block size=512 bytes) 2024-08-28 15:48:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=173397039313 2024-08-28 15:48:26 0 [Note] InnoDB: Parsed redo log up to LSN=175522198374; to recover: 175470 pages 2024-08-28 15:48:34 0 [ERROR] InnoDB: Corrupted page identifier at 176274948073; set innodb_force_recovery=1 to ignore the record. 2024-08-28 15:48:34 0 [Note] InnoDB: End of log at LSN=176274948073 2024-08-28 15:48:34 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error I got this with old 11.4 version on commit ae6684d79f4183cdc69b3bba9b146e7455addc0b . Note that this version does not have the MDEV-33588 fix. Full error log attached as Debian-bookworm-latest-amd64-base.err. This bug happened on a new clean install of the database, after 1 hour of heavy load. As seen in the attached error log, after the crash the database is reported as corrupted. Thus, unlike the original report, this is not related to upgrading. This is from an automated benchmark run and occurred only once so far, so I don't have any way to reproduce, nor any saved data files or redo/undo logs. But I thought it was still worth it to mention as a data point.

            Thank you, it is great that you found and fixed my mistake.

            The proposed fix looks correct to me. But it looks like we can eliminate a pair of unfix() and fix() operations. I would also like to see some form of intermittent fault injection that would better cover this rare code path at least in one regression test, such as stress.ddl_innodb.

            marko Marko Mäkelä added a comment - Thank you, it is great that you found and fixed my mistake. The proposed fix looks correct to me. But it looks like we can eliminate a pair of unfix() and fix() operations. I would also like to see some form of intermittent fault injection that would better cover this rare code path at least in one regression test, such as stress.ddl_innodb .

            This issue was not completely fixed. Thanks to marko and mleich for identifying and filing MDEV-35181to handle it.

            debarun Debarun Banerjee added a comment - This issue was not completely fixed. Thanks to marko and mleich for identifying and filing MDEV-35181 to handle it.

            People

              debarun Debarun Banerjee
              edward Edward Stoever
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.