Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-19743

Crash while reorganizing an index page

Details

    Description

      A user saw the following crash:

      2019-06-11 18:00:09 0 [Note] WSREP: Created page /data/galera/gcache.page.000000 of size 299963248 bytes
      2019-06-11 18:00:18 0 [Note] WSREP: Created page /data/galera/gcache.page.000001 of size 162454568 bytes
      2019-06-11 18:00:35 0 [Warning] WSREP: Failed to report last committed 34212182, -110 (Connection timed out)
      2019-06-11 18:00:45 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000000
      2019-06-11 18:00:57 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000001
      2019-06-11 18:01:35 5 [ERROR] InnoDB: Page old data size 890 new data size 2347, page old max ins size 15356 new max ins size 13899
      2019-06-11 18:01:35 5 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      2019-06-11 18:01:35 4 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 6181, cur2 offset 16009
      190611 18:01:35 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.3.14-MariaDB-log
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=14
      max_threads=5002
      thread_count=16
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 11126969 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x7f43c00009a8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f43d625ac30 thread_stack 0x49000
      *** buffer overflow detected ***: /usr/sbin/mysqld terminated
      ======= Backtrace: =========
      /lib64/libc.so.6(__fortify_fail+0x37)[0x7f5d8a7f39e7]
      /lib64/libc.so.6(+0x115b62)[0x7f5d8a7f1b62]
      /lib64/libc.so.6(+0x117947)[0x7f5d8a7f3947]
      /usr/sbin/mysqld(my_addr_resolve+0xda)[0x560c0de6e8fa]
      /usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x560c0de57f92]
      /usr/sbin/mysqld(handle_fatal_signal+0x357)[0x560c0d8f7cc7]
      /lib64/libpthread.so.0(+0xf5d0)[0x7f5d8c43f5d0]
      /lib64/libc.so.6(gsignal+0x37)[0x7f5d8a712207]
      /lib64/libc.so.6(abort+0x148)[0x7f5d8a7138f8]
      /usr/sbin/mysqld(+0xa6c663)[0x560c0dbe1663]
      /usr/sbin/mysqld(+0x99b801)[0x560c0db10801]
      /usr/sbin/mysqld(+0xa7656e)[0x560c0dbeb56e]
      /usr/sbin/mysqld(+0xa79b1d)[0x560c0dbeeb1d]
      /usr/sbin/mysqld(+0xa7b016)[0x560c0dbf0016]
      /usr/sbin/mysqld(+0xa8b94e)[0x560c0dc0094e]
      /usr/sbin/mysqld(+0xa8e0ee)[0x560c0dc030ee]
      /usr/sbin/mysqld(+0x9f51be)[0x560c0db6a1be]
      /usr/sbin/mysqld(+0x9f7939)[0x560c0db6c939]
      /usr/sbin/mysqld(+0x9f859b)[0x560c0db6d59b]
      /usr/sbin/mysqld(+0x9b3d7a)[0x560c0db28d7a]
      /usr/sbin/mysqld(+0xa219cd)[0x560c0db969cd]
      /lib64/libpthread.so.0(+0x7dd5)[0x7f5d8c437dd5]
      /lib64/libc.so.6(clone+0x6d)[0x7f5d8a7d9ead]
      

      There seems to be two different errors here.

      The first error is:

      2019-06-11 18:01:35 5 [ERROR] InnoDB: Page old data size 890 new data size 2347, page old max ins size 15356 new max ins size 13899
      2019-06-11 18:01:35 5 [ERROR] InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      

      This first error seems to occur here:

      https://github.com/MariaDB/server/blob/mariadb-10.3.14/storage/innobase/btr/btr0btr.cc#L1706

      This first error was also seen in MDEV-18519.

      The second error is:

      2019-06-11 18:01:35 4 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 6181, cur2 offset 16009
      

      This second error seems to occur here:

      https://github.com/MariaDB/server/blob/mariadb-10.3.14/storage/innobase/page/page0page.cc#L612

      Some of the messages immediately prior to the crash indicate that a large Galera Cluster transaction was executed:

      2019-06-11 18:00:09 0 [Note] WSREP: Created page /data/galera/gcache.page.000000 of size 299963248 bytes
      2019-06-11 18:00:18 0 [Note] WSREP: Created page /data/galera/gcache.page.000001 of size 162454568 bytes
      2019-06-11 18:00:35 0 [Warning] WSREP: Failed to report last committed 34212182, -110 (Connection timed out)
      2019-06-11 18:00:45 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000000
      2019-06-11 18:00:57 0 [Note] WSREP: Deleted page /data/galera/gcache.page.000001
      

      Note that "Failed to report last committed" sounds scary, but it is pretty harmless. See MDEV-17550.

      Attachments

        Issue Links

          Activity

            The error messages immediately preceding the intentional crash are complaining that the payload size of the index page is changing during a reorganize operation. The reorganize should only do ‘garbage collection’, not affecting the total size of the contained records.

            It is possible that the page was corrupted when the reorganize was initiated.

            I remember seeing this a lot during the development of MDEV-15562 (instant DROP COLUMN) in 10.4, but that was fixed before the initial push. Here, it is being reported for 10.3. Potentially, MDEV-11369 (instant ADD COLUMN) in 10.3 could affect this, but I would rather tend to believe that the bug is possible also in earlier versions.

            I am assigning this to Matthias, for the creation of a test case.

            marko Marko Mäkelä added a comment - The error messages immediately preceding the intentional crash are complaining that the payload size of the index page is changing during a reorganize operation. The reorganize should only do ‘garbage collection’, not affecting the total size of the contained records. It is possible that the page was corrupted when the reorganize was initiated. I remember seeing this a lot during the development of MDEV-15562 (instant DROP COLUMN) in 10.4, but that was fixed before the initial push. Here, it is being reported for 10.3. Potentially, MDEV-11369 (instant ADD COLUMN) in 10.3 could affect this, but I would rather tend to believe that the bug is possible also in earlier versions. I am assigning this to Matthias, for the creation of a test case.

            I just found this case, and this seems similar to what we encountered and I tried to describe in https://jira.mariadb.org/browse/MDEV-19783 (i linked both cases).

            wonko Bernard Grymonpon added a comment - I just found this case, and this seems similar to what we encountered and I tried to describe in https://jira.mariadb.org/browse/MDEV-19783 (i linked both cases).

            I have a plausible explanation of this bug in MDEV-19783, which makes also this bug a very likely duplicate of MDEV-18519.

            marko Marko Mäkelä added a comment - I have a plausible explanation of this bug in MDEV-19783 , which makes also this bug a very likely duplicate of MDEV-18519 .

            I believe that I may have found an explanation why mleich failed to repeat this corruption with recent 10.3 or 10.4.
            This error could be fully explained by MDEV-19916, which was fixed in MariaDB Server 10.3.17 and 10.4.7.

            Note: upgrading to a fixed version will not fix existing corruption. I am afraid that the corruption can only be fixed by restoring the table from a logical or physical backup. You might try your luck with

            ALTER TABLE tablename FORCE;
            

            Even if it did not crash, it could mean that some data (in particular, the contents of any instantly added columns) could be corrupted.

            marko Marko Mäkelä added a comment - I believe that I may have found an explanation why mleich failed to repeat this corruption with recent 10.3 or 10.4. This error could be fully explained by MDEV-19916 , which was fixed in MariaDB Server 10.3.17 and 10.4.7. Note: upgrading to a fixed version will not fix existing corruption. I am afraid that the corruption can only be fixed by restoring the table from a logical or physical backup. You might try your luck with ALTER TABLE tablename FORCE ; Even if it did not crash, it could mean that some data (in particular, the contents of any instantly added columns) could be corrupted.

            I think that this report duplicates MDEV-19916, which was fixed in MariaDB 10.3.17 and 10.4.7.

            marko Marko Mäkelä added a comment - I think that this report duplicates MDEV-19916 , which was fixed in MariaDB 10.3.17 and 10.4.7.

            People

              marko Marko Mäkelä
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.