MDEV-19783: Random crashes and corrupt data in INSTANT-added columns



    Description

      This will be a bit of a vague bug report, but I'll file it anyhow, as I think it relates to a couple of other reports and might give additional insight. We run a 5-node Galera cluster, 1.1 TB of data, with quite heavy read/write load and high parallelism. This runs on CentOS 7.

      The core of the issue is random crashes, terminating on either signal 6 or signal 11. The signal 6 crashes are always intentional: InnoDB kills the server when it detects some sort of corruption. The signal 11 crashes mostly come after it reports a corrupted index, and then the server crashes without further info a couple of seconds later.

      In retrospect, we believe it started while we were on version 10.3.14. The first manifestation was this stack trace on one of our nodes:

      2019-05-23 19:53:16 4 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 14135, cur2 offset 16209
      190523 19:53:16 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.
      Server version: 10.3.14-MariaDB-log
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=485
      max_threads=1502
      thread_count=179
      It is possible that mysqld could use up to
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3432903 K bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
      Thread pointer: 0x7f55780009a8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f558cff8c30 thread_stack 0x49000
      *** buffer overflow detected ***: /usr/sbin/mysqld terminated
      ======= Backtrace: =========
      /lib64/libc.so.6(__fortify_fail+0x37)[0x7f6c417a19e7]
      /lib64/libc.so.6(+0x115b62)[0x7f6c4179fb62]
      /lib64/libc.so.6(+0x117947)[0x7f6c417a1947]
      /usr/sbin/mysqld(my_addr_resolve+0xda)[0x55cf5d7408fa]
      /usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55cf5d729f92]
      /usr/sbin/mysqld(handle_fatal_signal+0x357)[0x55cf5d1c9cc7]
      /lib64/libpthread.so.0(+0xf5d0)[0x7f6c433ec5d0]
      /lib64/libc.so.6(gsignal+0x37)[0x7f6c416c0207]
      /lib64/libc.so.6(abort+0x148)[0x7f6c416c18f8]
      /usr/sbin/mysqld(+0xa6c663)[0x55cf5d4b3663]
      /usr/sbin/mysqld(+0x99b801)[0x55cf5d3e2801]
      /usr/sbin/mysqld(+0xa7656e)[0x55cf5d4bd56e]
      /usr/sbin/mysqld(+0xa79b1d)[0x55cf5d4c0b1d]
      /usr/sbin/mysqld(+0xa7b016)[0x55cf5d4c2016]
      /usr/sbin/mysqld(+0xa8b94e)[0x55cf5d4d294e]
      

      The server restarted and continued happily after that. We didn't find a cause and wrote it off as a freak accident.

      However, after a while we started seeing issues with our backups. We take the backups with mariabackup, and part of the backup process is to restore the full backup again and verify that it loads and runs fine on a test instance. Part of the verification is running mysqlcheck. We had a couple of tables reported as corrupt (secondary indexes being corrupt), but an OPTIMIZE TABLE fixed this. We blamed it on running an older version of mariabackup on the recovery node to apply the logs, but now think this might be unrelated.
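
      For reference, the backup/verify cycle looks roughly like the following sketch (the real scripts differ, and the paths and option values here are placeholders):

      mariabackup --backup --galera-info --target-dir=/backup/full
      mariabackup --prepare --target-dir=/backup/full
      # restore onto a separate test instance
      mariabackup --copy-back --target-dir=/backup/full --datadir=/var/lib/mysql
      # verification step where the corruption warnings showed up
      mysqlcheck --all-databases --check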

      Things got worse last week. We had a crash on 3 of our nodes at exactly the same time, with exactly the same error on all three nodes (this is probably Galera applying a change to the same table on all nodes):

      2019-06-13 13:23:18 9 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 11337, cur2 offset 16190
      190613 13:23:18 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.
      Server version: 10.3.15-MariaDB-log
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=301
      max_threads=1502
      thread_count=138
      It is possible that mysqld could use up to
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3433196 K bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
      Thread pointer: 0x7f06940009a8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f18682b8d60 thread_stack 0x49000
      *** buffer overflow detected ***: /usr/sbin/mysqld terminated
      ======= Backtrace: =========
      /lib64/libc.so.6(__fortify_fail+0x37)[0x7f1d770279e7]
      /lib64/libc.so.6(+0x115b62)[0x7f1d77025b62]
      /lib64/libc.so.6(+0x117947)[0x7f1d77027947]
      /usr/sbin/mysqld(my_addr_resolve+0xda)[0x5574e38f647a]
      /usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x5574e38dfab2]
      /usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x5574e337c53f]
      /lib64/libpthread.so.0(+0xf5d0)[0x7f1d78c735d0]
      /lib64/libc.so.6(gsignal+0x37)[0x7f1d76f46207]
      /lib64/libc.so.6(abort+0x148)[0x7f1d76f478f8]
      /usr/sbin/mysqld(+0xa6c223)[0x5574e366a223]
      /usr/sbin/mysqld(+0x99ab61)[0x5574e3598b61]
      /usr/sbin/mysqld(+0xa760be)[0x5574e36740be]
      /usr/sbin/mysqld(+0xa83fde)[0x5574e3681fde]
      /usr/sbin/mysqld(+0xa8e718)[0x5574e368c718]
      /usr/sbin/mysqld(+0xa0da82)[0x5574e360ba82]
      /usr/sbin/mysqld(+0xa12f99)[0x5574e3610f99]
      /usr/sbin/mysqld(+0xa13c44)[0x5574e3611c44]
      /usr/sbin/mysqld(+0x9e497b)[0x5574e35e297b]
      /usr/sbin/mysqld(+0x92042c)[0x5574e351e42c]
      /usr/sbin/mysqld(_ZN7handler13ha_update_rowEPKhS1_+0x1c2)[0x5574e3388032]
      /usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEP14rpl_group_info+0x291)[0x5574e3480ca1]
      2019-06-13 13:23:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.53285S), skipping check
      /usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x24c)[0x5574e347425c]
      /usr/sbin/mysqld(wsrep_apply_cb+0x4ac)[0x5574e32f8e9c]
      /usr/lib64/galera/libgalera_smm.so(_ZNK6galera9TrxHandle5applyEPvPF15wsrep_cb_statusS1_PKvmjPK14wsrep_trx_metaERS6_+0xd8)[0x7f1d73314598]
      /usr/lib64/galera/libgalera_smm.so(+0x1de753)[0x7f1d73351753]
      /usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x15c)[0x7f1d733548ac]
      /usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM11process_trxEPvPNS_9TrxHandleE+0x10e)[0x7f1d73357a8e]
      /usr/lib64/galera/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x1b8)[0x7f1d73332f38]
      /usr/lib64/galera/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPvRb+0x4c)[0x7f1d7333462c]
      /usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x6b)[0x7f1d7335806b]
      /usr/lib64/galera/libgalera_smm.so(galera_recv+0x18)[0x7f1d73366068]
      /usr/sbin/mysqld(+0x6fbb49)[0x5574e32f9b49]
      /usr/sbin/mysqld(start_wsrep_THD+0x29e)[0x5574e32ed58e]
      /lib64/libpthread.so.0(+0x7dd5)[0x7f1d78c6bdd5]
      /lib64/libc.so.6(clone+0x6d)[0x7f1d7700dead]
      ======= Memory map: ========
      5574e2bfe000-5574e3eb1000 r-xp 00000000 ca:01 12760256 /usr/sbin/mysqld
      5574e40b0000-5574e41af000 r--p 012b2000 ca:01 12760256 /usr/sbin/mysqld
      

      We recovered from this issue, and further investigation revealed a number of tables with index corruption (some names redacted):

      db_a.table_1
      Warning  : InnoDB: Index IDX_33BF90829E6B1585 is marked as corrupted
      Warning  : InnoDB: Index <redacted> is marked as corrupted
      Warning  : InnoDB: Index <redacted> is marked as corrupted
      error    : Corrupt
      db_2.table_2
      Warning  : InnoDB: Index IDX_2AF5A5C39C6FCE9 is marked as corrupted
      Warning  : InnoDB: Index IDX_2AF5A5C41859289 is marked as corrupted
      Warning  : InnoDB: Index <redacted> is marked as corrupted
      Warning  : InnoDB: Index IDX_2AF5A5C54B9D732 is marked as corrupted
      Warning  : InnoDB: Index <redacted> is marked as corrupted
      Warning  : InnoDB: Index FK_2AF5A5CCD17CB07 is marked as corrupted
      Warning  : InnoDB: Index <redacted> is marked as corrupted
      error    : Corrupt
      db_3.table_3
      Warning  : InnoDB: Index UNIQ_1B2C3247D0F6D69 is marked as corrupted
      Warning  : InnoDB: Index IDX_1B2C324796915FC is marked as corrupted
      Warning  : InnoDB: Index IDX_1B2C3247108B7592 is marked as corrupted
      error    : Corrupt
      db_4.table_4
      Warning  : InnoDB: Index IDX_5615FCE3E7A1254A is marked as corrupted
      error    : Corrupt
      db_5.table_5
      Warning  : InnoDB: Index IDX_4C62E638A76ED395 is marked as corrupted
      error    : Corrupt
      db_6.table_6
      Warning  : InnoDB: Index IDX_F84AA10D613FECDF is marked as corrupted
      Warning  : InnoDB: Index IDX_F84AA10D5DA1941 is marked as corrupted
      Warning  : InnoDB: Index IDX_F84AA10DCCFA12B8 is marked as corrupted
      Warning  : InnoDB: Index IDX_F84AA10DF025383A is marked as corrupted
      error    : Corrupt
      db_7.table_7
      Warning  : InnoDB: The B-tree of index PRIMARY is corrupted.
      Warning  : InnoDB: Index IDX_F84AA10D613FECDF is marked as corrupted
      Warning  : InnoDB: Index IDX_F84AA10D5DA1941 is marked as corrupted
      Warning  : InnoDB: Index IDX_F84AA10DCCFA12B8 is marked as corrupted
      Warning  : InnoDB: Index IDX_F84AA10DF025383A is marked as corrupted
      error    : Corrupt
      db_8.table_8
      Warning  : InnoDB: Index 'IDX_FD2BEAC094A4C7D4' contains 1374306 entries, should be 1374337.
      Warning  : InnoDB: Index 'IDX_FD2BEAC0A76ED395' contains 1374306 entries, should be 1374337.
      Warning  : InnoDB: Index '<redacted>' contains 1374333 entries, should be 1374337.
      Warning  : InnoDB: Index '<redacted>' contains 1374332 entries, should be 1374337.
      Warning  : InnoDB: Index '<redacted>' contains 1374334 entries, should be 1374337.
      Warning  : InnoDB: Index 'IDX_FD2BEAC0727ACA70' contains 1374322 entries, should be 1374337.
      Warning  : InnoDB: Index 'FK_FD2BEAC067433D9C' contains 1374333 entries, should be 1374337.
      Warning  : InnoDB: Index 'IDX_FD2BEAC0727BAC71' contains 1374312 entries, should be 1374337.
      Warning  : InnoDB: Index 'IDX_FD2BEAC0E25D857E47CC8C92B5D554E5' contains 1374329 entries, should be 1374337.
      Warning  : InnoDB: Index '<redacted>' contains 1374333 entries, should be 1374337.
      Warning  : InnoDB: Index '<redacted>' contains 1374333 entries, should be 1374337.
      error    : Corrupt
      db_9.table_9
      Warning  : InnoDB: Index 'IDX_9B8A8D46A76ED395' contains 669 entries, should be 670.
      Warning  : InnoDB: Index 'IDX_9B8A8D4694A4C7D4' contains 669 entries, should be 670.
      Warning  : InnoDB: Index 'IDX_9B8A8D46CCFA12B8' contains 669 entries, should be 670.
      error    : Corrupt
      

      We eventually recovered from this by mysqldump-ing the data, dropping the tables and reloading the content. In one case the dump required starting the server with innodb_force_recovery set to 6.
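
      The recovery looked roughly like the following sketch (the table names are the placeholders from the list above; innodb_force_recovery was only needed in the one case mentioned):

      # worst case only: add to the [mysqld] section before starting the server
      #   innodb_force_recovery = 6
      mysqldump db_a table_1 > table_1.sql
      # remove innodb_force_recovery again, restart, then drop and reload the table
      mysql -e "DROP TABLE db_a.table_1"
      mysql db_a < table_1.sql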

      There is one symptom which made us raise some eyebrows, and maybe this might help in debugging. The tables which had corruption all had a column appended (ALTER TABLE ... ADD COLUMN varchar ...), using the new "instant" method (https://mariadb.com/kb/en/library/instant-add-column-for-innodb/). When mysqldumping the content of the corrupted tables, this column was actually filled with corrupted data, while it should have been NULL (we added the column but didn't use/fill it yet). The data in the column looked like binary content of the table itself, as we could recognise parts of the other varchar columns in there. All other fields seemed correct.
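
      To illustrate the kind of change involved (the table and column names below are placeholders, not our real schema): the column was added with a plain ALTER TABLE, which MariaDB 10.3 performs as an instant ADD COLUMN when possible, and a full rebuild should recreate the table without the instant-column metadata:

      # plain ADD COLUMN; MariaDB 10.3 takes the instant path for this by default
      # (ALGORITHM=INSTANT can be given explicitly to make it fail if it cannot be instant)
      mysql -e "ALTER TABLE db_a.table_1 ADD COLUMN extra_field VARCHAR(255) DEFAULT NULL"
      # a full rebuild (OPTIMIZE TABLE or ALTER TABLE ... FORCE) recreates the table
      # without the instant metadata
      mysql -e "OPTIMIZE TABLE db_a.table_1"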

      My apologies if this is vague. We still have the binary data files on disk in our backups, so if needed we can "restore" the corrupted situation.

      We think our case has similarities to https://jira.mariadb.org/browse/MDEV-19666 and https://jira.mariadb.org/browse/MDEV-18223.


People

  Assignee: Marko Mäkelä (marko)
  Reporter: Bernard Grymonpon (wonko)
  Votes: 1
  Watchers: 13
