MariaDB Server / MDEV-19081

DB crashes periodically with InnoDB: Assertion failure

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 10.2.22, 10.2.23
    • Fix Version/s: N/A

    Description

      The DB service stops from time to time, at irregular intervals (mostly between 11 pm and 9 am), with the following error:

      2019-03-28 23:31:08 0x7ff4dc62d700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.2.23/storage/innobase/rem/rem0rec.cc line 574
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mysqld startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
      InnoDB: about forcing recovery.
      190328 23:31:08 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.2.23-MariaDB-10.2.23+maria~stretch
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=234
      max_threads=752
      thread_count=273
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1783435 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x7ff4d8001f28
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7ff4dc62ccc8 thread_stack 0x49000
      *** buffer overflow detected ***: /usr/sbin/mysqld terminated
      ======= Backtrace: =========
      /lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7ffa09af5bfb]
      /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7ffa09b7e437]
      /lib/x86_64-linux-gnu/libc.so.6(+0xf7570)[0x7ffa09b7c570]
      /lib/x86_64-linux-gnu/libc.so.6(+0xf93aa)[0x7ffa09b7e3aa]
      /usr/sbin/mysqld(my_addr_resolve+0xe2)[0x55ca42284922]
      /usr/sbin/mysqld(my_print_stacktrace+0x1bb)[0x55ca4226b1eb]
      /usr/sbin/mysqld(handle_fatal_signal+0x41d)[0x55ca41d0a01d]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7ffa0b4180e0]
      /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7ffa09ab7fff]
      /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7ffa09ab942a]
      /usr/sbin/mysqld(+0x40f971)[0x55ca41ab8971]
      /usr/sbin/mysqld(+0x887df6)[0x55ca41f30df6]
      /usr/sbin/mysqld(+0x863673)[0x55ca41f0c673]
      /usr/sbin/mysqld(+0x96648e)[0x55ca4200f48e]
      /usr/sbin/mysqld(+0x89b559)[0x55ca41f44559]
      /usr/sbin/mysqld(+0x8a15e4)[0x55ca41f4a5e4]
      /usr/sbin/mysqld(+0x8a2187)[0x55ca41f4b187]
      /usr/sbin/mysqld(+0x8b1a20)[0x55ca41f5aa20]
      /usr/sbin/mysqld(+0x7f5c04)[0x55ca41e9ec04]
      /usr/sbin/mysqld(_ZN7handler12ha_write_rowEPh+0x107)[0x55ca41d140d7]
      /usr/sbin/mysqld(_Z12write_recordP3THDP5TABLEP12st_copy_info+0x72)[0x55ca41b4b992]
      /usr/sbin/mysqld(_Z12mysql_insertP3THDP10TABLE_LISTR4ListI4ItemERS3_IS5_ES6_S6_15enum_duplicatesb+0x1206)[0x55ca41b560f6]
      /usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x3f68)[0x55ca41b6bee8]
      /usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x28a)[0x55ca41b70e4a]
      /usr/sbin/mysqld(+0x4c864f)[0x55ca41b7164f]
      /usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x1a7c)[0x55ca41b737fc]
      /usr/sbin/mysqld(_Z10do_commandP3THD+0x176)[0x55ca41b748a6]
      /usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x25a)[0x55ca41c3ec0a]
      /usr/sbin/mysqld(handle_one_connection+0x3d)[0x55ca41c3ed7d]
      /usr/sbin/mysqld(+0xb75791)[0x55ca4221e791]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7ffa0b40e4a4]
      /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ffa09b6dd0f]
      ======= Memory map: ========
      55ca416a9000-55ca4278f000 r-xp 00000000 fe:02 1320586                    /usr/sbin/mysqld
      55ca4298e000-55ca42a60000 r--p 010e5000 fe:02 1320586                    /usr/sbin/mysqld
      55ca42a60000-55ca42b17000 rw-p 011b7000 fe:02 1320586                    /usr/sbin/mysqld
      55ca42b17000-55ca433a9000 rw-p 00000000 00:00 0 
      55ca43458000-55ca508b9000 rw-p 00000000 00:00 0                          [heap]
      7ff3d9e04000-7ff3d9e05000 ---p 00000000 00:00 0 
      7ff3d9e05000-7ff3d9e4f000 rw-p 00000000 00:00 0 
      7ff3d9e4f000-7ff3d9e50000 ---p 00000000 00:00 0 
      ...
      

      1. The DB node which crashed first tonight (db1b) was set up yesterday (running 10.2.23).
      2. The second server (db1c) crashed 70 seconds later, after becoming the primary DB node and during the reconnect/resync (SST) of db1b (last updated 3 weeks ago from 10.2.14 to 10.2.22).
      3. The third DB node (db1a) had not been running since the weekend and was therefore not involved in tonight's crash.

      It seems that the processes somehow keep running, but the service status (systemctl status mariadb.service) is "failed", nobody can connect to or use the database, and the process has to be killed manually.
      After that I was able to rebuild the cluster by bootstrapping a new Galera cluster (galera_new_cluster) from the most advanced node.
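
      For reference, a minimal sketch of that bootstrap procedure, assuming the default data directory /var/lib/mysql (the commands are the standard Galera ones; exact steps can differ per setup):

      # On each stopped node, check which one committed the most transactions
      cat /var/lib/mysql/grastate.dat     # compare the seqno: values; note safe_to_bootstrap:

      # On the most advanced node only: if needed, edit grastate.dat and set safe_to_bootstrap: 1,
      # then bootstrap a new cluster from that node
      galera_new_cluster

      # Start the remaining nodes normally; they rejoin via IST/SST
      systemctl start mariadb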

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment -

            InnoDB is crashing because it detects corruption in a record header of a table that does not use ROW_FORMAT=REDUNDANT (which was the original and only format before MySQL 5.0.3). The function rec_get_offsets_func() is reading the 3 bits from rec_get_status(), and expects the most significant bit of it to be zero.

            In MDEV-13542 we should make the error handling of InnoDB more robust, so that crashes like this would be avoided.

            But, the cause of the corruption is unclear and cannot be found without a repeatable test case. I fixed some corruption bugs recently (MDEV-14126 just missed the 10.2.23 release), but I don’t think those fixes would affect this kind of corruption.

            I would suggest using innodb_compression_algorithm=strict_crc32 to improve the chances of detecting this kind of corruption earlier, when reading pages into the buffer pool. If corrupted pages enter the buffer pool, InnoDB can crash in very many places.

            There could be a problem in the snapshot transfer (SST) procedure, or the corruption could simply have been propagated by the SST. That is the downside of physical replication.

            If you can provide more information, such as page dumps (which could be obtained by using a debugger), I can try to help further.

            stefman_87 Stefan B. added a comment -

            Firstly, thanks a lot for the quick response.

            After reading your comment I have some questions which might help me resolve our situation.

            1. We have some tables which use ROW_FORMAT=COMPRESSED. Is it possible to alter the row format for these tables, and if so, would this work around our problem? (see the sketch after this list)
            2. You suggest setting innodb_compression_algorithm=strict_crc32, but in the official docs I can't find this algorithm as a supported option (https://mariadb.com/kb/en/library/compression/#choosing-compression-algorithm).
            3. If this option leads to earlier detection of corruption, does that mean that corrupted pages are also not loaded into the buffer pool?
            4. Assuming that the web servers using these tables (see 1) are less important, would it be helpful to change some table- (or schema-) specific options, forbid write access, re-import the tables from existing DB dumps, or even shut down the application?
            5. If I understand correctly, there is currently no bug fix for this in any of the newer versions?
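
            For reference, a hedged sketch of how the tables in question could be listed and rebuilt with a different row format (the table name app.orders is purely illustrative; whether this actually helps is discussed in the following comments):

            -- List InnoDB tables that currently use ROW_FORMAT=COMPRESSED
            SELECT TABLE_SCHEMA, TABLE_NAME, ROW_FORMAT
              FROM information_schema.TABLES
             WHERE ENGINE = 'InnoDB' AND ROW_FORMAT = 'Compressed';

            -- Rebuild one of them with an uncompressed row format
            ALTER TABLE app.orders ROW_FORMAT=DYNAMIC;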

            marko Marko Mäkelä added a comment -

            Sorry, I meant to write innodb_checksum_algorithm=strict_crc32. In MariaDB 10.4 there will be an even better option, full_crc32, but it will not affect the format of ROW_FORMAT=COMPRESSED pages.
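
            A minimal sketch of applying this setting (the variable is dynamic, so it can be set at runtime; it can also be made persistent under [mysqld] in the server configuration file):

            -- Check the current value, then change it at runtime
            SHOW GLOBAL VARIABLES LIKE 'innodb_checksum_algorithm';
            SET GLOBAL innodb_checksum_algorithm = strict_crc32;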

            I do not think that this particular form of corruption should be possible with ROW_FORMAT=COMPRESSED pages, because the status bits are generated when the page is decompressed. (I am the author of that format.) The only way how you could get this assertion failure for ROW_FORMAT=COMPRESSED is that something (software bug or hardware fault) corrupts the InnoDB buffer pool, which I think is rather unlikely.

            Rebuilding tables can sometimes remove corruption, if that corruption is not in the pages that must be read in order to copy the table. SELECT * (possibly done by mysqldump) or a table rebuild will have to read all leaf pages of the clustered index tree, and all BLOB pages that are pointed to by leaf-page records. Additionally, all pages on the path from the root to the first leaf page will have to be read. If you are lucky, the corruption is in a secondary index page or in a non-leaf page that will not be accessed when copying the table.
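
            A hedged sketch of the copy/rebuild operations described above (the table name app.orders is again illustrative):

            -- Read every row of the clustered index; this is essentially what mysqldump does per table
            SELECT * FROM app.orders;

            -- Or rebuild the table, which copies every row into a freshly built table and indexes
            ALTER TABLE app.orders FORCE;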

            I would have to see a dump of the corrupted page in order to determine the cause of the corruption. We have not seen this form of page corruption in our internal testing, except when some bug in mariabackup has been involved.

            stefman_87 Stefan B. added a comment - edited

            > Sorry, I meant to write innodb_checksum_algorithm=strict_crc32

            Am I mistaken, or is this exactly what you wrote in your first comment?
            Ah OK, I see... sorry, it's Friday.
            Unfortunately the servers are already configured with this algorithm.

            > Rebuilding tables can sometimes remove corruption, if that corruption is not in the pages that must be read in order to copy the table. SELECT * (possibly done by mysqldump) or a table rebuild will have to read all leaf pages of the clustered index tree, and all BLOB pages that are pointed to by leaf-page records. Additionally, all pages on the path from the root to the first leaf page will have to be read. If you are lucky, the corruption is in a secondary index page or in a non-leaf page that will not be accessed when copying the table.

            I will try this in the next few days; hopefully nothing will happen until then.

            > I would have to see a dump of the corrupted page in order to determine the cause of the corruption. We have not seen this form of page corruption in our internal testing, except when some bug in mariabackup has been involved.

            Sorry for asking, but how could I create this dump, and within what time after a crash could I still get it?

            stefman_87 Stefan B. added a comment -

            Today we ran into the same error. I have attached the backtrace and the memory map as they appear in the log file.

            I hope this is what you mean by a dump of the corrupted page.
            Otherwise I'm not really sure how to get such a dump.
            db1b_backtrace-MemoryMap


            marko Marko Mäkelä added a comment -

            You could get a dump of the corrupted page by attaching a debugger to the mysqld process before the crash, and running suitable commands. In GDB, you would use up or frame and then dump the buf_block_t::frame and also the buf_block_t::page metadata. In the stack frame of the failing assertion in db1b_backtrace-MemoryMap, that would be something like the following:

            print latch_leaves.blocks[0]->page
            dump binary memory /tmp/page.bin latch_leaves.blocks[0]->frame latch_leaves.blocks[0]->frame+srv_page_size
            

            We’d also want to identify the index and the table. For our support customers, we would provide detailed instructions on how to do that.
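
            For completeness, a rough sketch of such a debugging session, assuming debug symbols for /usr/sbin/mysqld are installed and that InnoDB's assertion handler is named ut_dbg_assertion_failed (the name is taken from the InnoDB source tree; the exact frame to select may differ):

            # attach to the running server before the crash
            gdb -p $(pidof mysqld)
            (gdb) break ut_dbg_assertion_failed     # stop inside InnoDB's assertion handler
            (gdb) continue
            # ... once the assertion fires, walk up the stack to the frame that contains latch_leaves:
            (gdb) up
            (gdb) print latch_leaves.blocks[0]->page
            (gdb) dump binary memory /tmp/page.bin latch_leaves.blocks[0]->frame latch_leaves.blocks[0]->frame + srv_page_size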


            People

              Assignee: marko Marko Mäkelä
              Reporter: stefman_87 Stefan B.
              Votes: 0
              Watchers: 3

