[MDEV-19783] Random crashes and corrupt data in INSTANT-added columns - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Duplicate
Affects Version/s: 10.3.13, 10.3.14, 10.3.15
Fix Version/s: 10.3.17, 10.4.7
Component/s: Galera, Server, Storage Engine - InnoDB
Labels:
- corruption
- instant
Environment:
Centos 7

Description

This will be a bit vague bug report, but I'll file it anyhow, as I think it relates to a couple of other reports and might give additional insight. We run a 5 node galera cluster, 1.1TB of data, quite heavy read/write during usage and high parallellism. This runs on Centos7.

The core of the issue are random crashes, giving either terminations on signal 6 or signal 11. The signal 6 ones are always intentionally crashed by innodb as it detects some sort of corruption, the signal 11s mostly come after it reports on a corrupted index, and then crashes without info a couple of seconds later.

In retrospect, we believe it started while we were at version 10.3.14. The first manifestation was with this stacktrace on one of our nodes:

2019-05-23 19:53:16 4 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 14135, cur2 offset 16209

190523 19:53:16 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.3.14-MariaDB-log

key_buffer_size=134217728

read_buffer_size=131072

max_used_connections=485

max_threads=1502

thread_count=179

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3432903 K bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f55780009a8

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x7f558cff8c30 thread_stack 0x49000

*** buffer overflow detected ***: /usr/sbin/mysqld terminated

======= Backtrace: =========

/lib64/libc.so.6(__fortify_fail+0x37)[0x7f6c417a19e7]

/lib64/libc.so.6(+0x115b62)[0x7f6c4179fb62]

/lib64/libc.so.6(+0x117947)[0x7f6c417a1947]

/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55cf5d7408fa]

/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55cf5d729f92]

/usr/sbin/mysqld(handle_fatal_signal+0x357)[0x55cf5d1c9cc7]

/lib64/libpthread.so.0(+0xf5d0)[0x7f6c433ec5d0]

/lib64/libc.so.6(gsignal+0x37)[0x7f6c416c0207]

/lib64/libc.so.6(abort+0x148)[0x7f6c416c18f8]

/usr/sbin/mysqld(+0xa6c663)[0x55cf5d4b3663]

/usr/sbin/mysqld(+0x99b801)[0x55cf5d3e2801]

/usr/sbin/mysqld(+0xa7656e)[0x55cf5d4bd56e]

/usr/sbin/mysqld(+0xa79b1d)[0x55cf5d4c0b1d]

/usr/sbin/mysqld(+0xa7b016)[0x55cf5d4c2016]

/usr/sbin/mysqld(+0xa8b94e)[0x55cf5d4d294e]

Server restarted and continued happily after that. We didn't find a cause, and blamed it as a freak accident.

However, after a while, we started seeing issues on our backups. We take the backups with mariabackup, and part of the backup process is to actually restore the full backup again and verify that it load and runs fine on a test instance. Part of the verification is running a mysqlcheck. We had a couple of tables reporting that they were corrupt, but an optimize fixed this (secondary indexes being corrupt). We blamed this on running an older version of mariabackup on the recovery node to apply the logs, but think this might be unrelated.

Things got worse last week. We had a crash on 3 of our nodes, exactly at the same time, with exactly the same error on all three nodes (this is probably galera pushing a change to the same table on all nodes):

2019-06-13 13:23:18 9 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 11337, cur2 offset 16190

190613 13:23:18 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.3.15-MariaDB-log

key_buffer_size=134217728

read_buffer_size=131072

max_used_connections=301

max_threads=1502

thread_count=138

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3433196 K bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f06940009a8

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x7f18682b8d60 thread_stack 0x49000

*** buffer overflow detected ***: /usr/sbin/mysqld terminated

======= Backtrace: =========

/lib64/libc.so.6(__fortify_fail+0x37)[0x7f1d770279e7]

/lib64/libc.so.6(+0x115b62)[0x7f1d77025b62]

/lib64/libc.so.6(+0x117947)[0x7f1d77027947]

/usr/sbin/mysqld(my_addr_resolve+0xda)[0x5574e38f647a]

/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x5574e38dfab2]

/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x5574e337c53f]

/lib64/libpthread.so.0(+0xf5d0)[0x7f1d78c735d0]

/lib64/libc.so.6(gsignal+0x37)[0x7f1d76f46207]

/lib64/libc.so.6(abort+0x148)[0x7f1d76f478f8]

/usr/sbin/mysqld(+0xa6c223)[0x5574e366a223]

/usr/sbin/mysqld(+0x99ab61)[0x5574e3598b61]

/usr/sbin/mysqld(+0xa760be)[0x5574e36740be]

/usr/sbin/mysqld(+0xa83fde)[0x5574e3681fde]

/usr/sbin/mysqld(+0xa8e718)[0x5574e368c718]

/usr/sbin/mysqld(+0xa0da82)[0x5574e360ba82]

/usr/sbin/mysqld(+0xa12f99)[0x5574e3610f99]

/usr/sbin/mysqld(+0xa13c44)[0x5574e3611c44]

/usr/sbin/mysqld(+0x9e497b)[0x5574e35e297b]

/usr/sbin/mysqld(+0x92042c)[0x5574e351e42c]

/usr/sbin/mysqld(_ZN7handler13ha_update_rowEPKhS1_+0x1c2)[0x5574e3388032]

/usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEP14rpl_group_info+0x291)[0x5574e3480ca1]

2019-06-13 13:23:19 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.53285S), skipping check

/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x24c)[0x5574e347425c]

/usr/sbin/mysqld(wsrep_apply_cb+0x4ac)[0x5574e32f8e9c]

/usr/lib64/galera/libgalera_smm.so(_ZNK6galera9TrxHandle5applyEPvPF15wsrep_cb_statusS1_PKvmjPK14wsrep_trx_metaERS6_+0xd8)[0x7f1d73314598]

/usr/lib64/galera/libgalera_smm.so(+0x1de753)[0x7f1d73351753]

/usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x15c)[0x7f1d733548ac]

/usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM11process_trxEPvPNS_9TrxHandleE+0x10e)[0x7f1d73357a8e]

/usr/lib64/galera/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x1b8)[0x7f1d73332f38]

/usr/lib64/galera/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPvRb+0x4c)[0x7f1d7333462c]

/usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x6b)[0x7f1d7335806b]

/usr/lib64/galera/libgalera_smm.so(galera_recv+0x18)[0x7f1d73366068]

/usr/sbin/mysqld(+0x6fbb49)[0x5574e32f9b49]

/usr/sbin/mysqld(start_wsrep_THD+0x29e)[0x5574e32ed58e]

/lib64/libpthread.so.0(+0x7dd5)[0x7f1d78c6bdd5]

/lib64/libc.so.6(clone+0x6d)[0x7f1d7700dead]

======= Memory map: ========

5574e2bfe000-5574e3eb1000 r-xp 00000000 ca:01 12760256 /usr/sbin/mysqld

5574e40b0000-5574e41af000 r--p 012b2000 ca:01 12760256 /usr/sbin/mysqld

We recovered from this issue, and further investigation gave some tables which had the index corruption (redacted some names):

db_a.table_1

Warning  : InnoDB: Index IDX_33BF90829E6B1585 is marked as corrupted

Warning  : InnoDB: Index <redacted> is marked as corrupted

Warning  : InnoDB: Index <redacted> is marked as corrupted

error    : Corrupt

db_2.table_2

Warning  : InnoDB: Index IDX_2AF5A5C39C6FCE9 is marked as corrupted

Warning  : InnoDB: Index IDX_2AF5A5C41859289 is marked as corrupted

Warning  : InnoDB: Index <redacted> is marked as corrupted

Warning  : InnoDB: Index IDX_2AF5A5C54B9D732 is marked as corrupted

Warning  : InnoDB: Index <redacted> is marked as corrupted

Warning  : InnoDB: Index FK_2AF5A5CCD17CB07 is marked as corrupted

Warning  : InnoDB: Index <redacted> is marked as corrupted

error    : Corrupt

db_3.table_3

Warning  : InnoDB: Index UNIQ_1B2C3247D0F6D69 is marked as corrupted

Warning  : InnoDB: Index IDX_1B2C324796915FC is marked as corrupted

Warning  : InnoDB: Index IDX_1B2C3247108B7592 is marked as corrupted

error    : Corrupt

db_4.table_4

Warning  : InnoDB: Index IDX_5615FCE3E7A1254A is marked as corrupted

error    : Corrupt

db_5.table_5

Warning  : InnoDB: Index IDX_4C62E638A76ED395 is marked as corrupted

error    : Corrupt

db_6.table_6

Warning  : InnoDB: Index IDX_F84AA10D613FECDF is marked as corrupted

Warning  : InnoDB: Index IDX_F84AA10D5DA1941 is marked as corrupted

Warning  : InnoDB: Index IDX_F84AA10DCCFA12B8 is marked as corrupted

Warning  : InnoDB: Index IDX_F84AA10DF025383A is marked as corrupted

error    : Corrupt

db_7.table_7

Warning  : InnoDB: The B-tree of index PRIMARY is corrupted.

Warning  : InnoDB: Index IDX_F84AA10D613FECDF is marked as corrupted

Warning  : InnoDB: Index IDX_F84AA10D5DA1941 is marked as corrupted

Warning  : InnoDB: Index IDX_F84AA10DCCFA12B8 is marked as corrupted

Warning  : InnoDB: Index IDX_F84AA10DF025383A is marked as corrupted

error    : Corrupt

db_8.table_8

Warning  : InnoDB: Index 'IDX_FD2BEAC094A4C7D4' contains 1374306 entries, should be 1374337.

Warning  : InnoDB: Index 'IDX_FD2BEAC0A76ED395' contains 1374306 entries, should be 1374337.

Warning  : InnoDB: Index '<redacted>' contains 1374333 entries, should be 1374337.

Warning  : InnoDB: Index '<redacted>' contains 1374332 entries, should be 1374337.

Warning  : InnoDB: Index '<redacted>' contains 1374334 entries, should be 1374337.

Warning  : InnoDB: Index 'IDX_FD2BEAC0727ACA70' contains 1374322 entries, should be 1374337.

Warning  : InnoDB: Index 'FK_FD2BEAC067433D9C' contains 1374333 entries, should be 1374337.

Warning  : InnoDB: Index 'IDX_FD2BEAC0727BAC71' contains 1374312 entries, should be 1374337.

Warning  : InnoDB: Index 'IDX_FD2BEAC0E25D857E47CC8C92B5D554E5' contains 1374329 entries, should be 1374337.

Warning  : InnoDB: Index '<redacted>' contains 1374333 entries, should be 1374337.

Warning  : InnoDB: Index '<redacted>' contains 1374333 entries, should be 1374337.

error    : Corrupt

db_9.table_9

Warning  : InnoDB: Index 'IDX_9B8A8D46A76ED395' contains 669 entries, should be 670.

Warning  : InnoDB: Index 'IDX_9B8A8D4694A4C7D4' contains 669 entries, should be 670.

Warning  : InnoDB: Index 'IDX_9B8A8D46CCFA12B8' contains 669 entries, should be 670.

error    : Corrupt

We recovered eventually from this by mysqldump-ing the data, dropping the table and reload the content. The dumping needed in one case the startup of the server with force_innodb_recovery set to 6.

There is one symptom which made us raise some eyebrows, and maybe this might help in debugging. The tables which had corruption had a field appended (alter table add column varchar...), using the new "instant" method (https://mariadb.com/kb/en/library/instant-add-column-for-innodb/). When mysqldumping the content from the corrupted tables, this field was actually filled with corrupted data (while it should have been NULL, we added it but didn't use/fill it yet). The data in the field looked like binary content of the table itself, as we could recognise parts of the other varchar columns in there. All other fields seemed correct.

My apologies if this is vague. We have the binary datafiles still on disk in our backups, so if needed we can "restore" the corrupted situation.

We thought our case has similarities to https://jira.mariadb.org/browse/MDEV-19666 and https://jira.mariadb.org/browse/MDEV-18223.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

innodb_evict_tables_on_commit_debug.patch
7 kB
2019-08-23 14:45

Issue Links

duplicates

MDEV-18519 0x7f0118195700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.3.7/storage/innobase/btr/btr0cur.cc line 4308

Closed

MDEV-19916 Corruption after instant ADD/DROP and shrinking the index tree

Closed

is caused by

MDEV-11369 Instant add column for InnoDB

Closed

relates to

MDEV-19743 Crash while reorganizing an index page

Closed

MDEV-19916 Corruption after instant ADD/DROP and shrinking the index tree

Closed

MDEV-20590 Introduce a file format constraint to ALTER TABLE

Closed

MDEV-21251 CHECK TABLE fails to check info_bits of records

Closed

MDEV-21371 Assertion failure in page_rec_get_next_low() during innodb_gis.rtree_compress

Closed

MDEV-28786 InnoDB crash leads to pagesize comparison failure

Closed

MDEV-18932 MariaDB 10.3.10-10.3.13 corrupts table and refuses to start with assertion in row0sel.cc 2986

Closed

MDEV-20413 Mariadb Galera Cluster Crash and won't restart

Closed

MDEV-22404 Server Crash after Galera WSREP event received

Closed

(7 relates to)

Activity

Ascending order - Click to sort in descending order

View 16 older comments

Marko Mäkelä added a comment - 2019-09-25 10:04

We should ensure that CHECK TABLE will report corruption if:

!table->is_instant() holds and we find a metadata record
we encounter any other leaf-page where REC_INFO_MIN_REC_FLAG is set

Note: REC_INFO_MIN_REC_FLAG did not occur in any leaf-page records until ~~MDEV-11369~~ introduced it to distinguish the hidden metadata record. Such records are always at the start of the leftmost non-leaf pages (with FIL_PAGE_PREV containing FIL_NULL and PAGE_LEVEL not being 0). The flag causes the key of the record to be ignored, and makes the record smaller than any other key.

It would also help if normal SQL operations noticed these forms of corruption.
Such an instrumented build should be used in an attempt to reproduce the problem. Tests with the recently introduced debug parameter

SET GLOBAL innodb_evict_tables_on_commit_debug=ON;

were not successful yet. I think that in order to be able to reproduce this bug, we must try to evict and reload table definitions as often as possible. If the table remains in the InnoDB dict_sys cache, any changes to the FIL_PAGE_TYPE of the clustered index root page should not matter at all.

Marko Mäkelä added a comment - 2019-09-25 10:04 We should ensure that CHECK TABLE will report corruption if: !table->is_instant() holds and we find a metadata record we encounter any other leaf-page where REC_INFO_MIN_REC_FLAG is set Note: REC_INFO_MIN_REC_FLAG did not occur in any leaf-page records until MDEV-11369 introduced it to distinguish the hidden metadata record. Such records are always at the start of the leftmost non-leaf pages (with FIL_PAGE_PREV containing FIL_NULL and PAGE_LEVEL not being 0). The flag causes the key of the record to be ignored, and makes the record smaller than any other key. It would also help if normal SQL operations noticed these forms of corruption. Such an instrumented build should be used in an attempt to reproduce the problem. Tests with the recently introduced debug parameter SET GLOBAL innodb_evict_tables_on_commit_debug= ON ; were not successful yet. I think that in order to be able to reproduce this bug, we must try to evict and reload table definitions as often as possible. If the table remains in the InnoDB dict_sys cache, any changes to the FIL_PAGE_TYPE of the clustered index root page should not matter at all.

Eugene Kosov (Inactive) added a comment - 2019-09-30 18:34

Please review CHECK TABLE improvements.

Eugene Kosov (Inactive) added a comment - 2019-09-30 18:34 Please review CHECK TABLE improvements.

Marko Mäkelä added a comment - 2019-10-08 15:46

I pushed the CHECK TABLE improvements and improved checks to a staging tree of 10.2 after testing it WITH_INNODB_EXTRA_DEBUG enabled. I don’t think that it hurts to have better checks in 10.2 already.

Marko Mäkelä added a comment - 2019-10-08 15:46 I pushed the CHECK TABLE improvements and improved checks to a staging tree of 10.2 after testing it WITH_INNODB_EXTRA_DEBUG enabled. I don’t think that it hurts to have better checks in 10.2 already.

Marko Mäkelä added a comment - 2019-11-12 09:54

I believe that I may have found an explanation why mleich failed to repeat this corruption with recent 10.3 or 10.4.
This error could be fully explained by ~~MDEV-19916~~, which was fixed in MariaDB Server 10.3.17 and 10.4.7. Notably, "affected versions" do not list anything after those versions.

Marko Mäkelä added a comment - 2019-11-12 09:54 I believe that I may have found an explanation why mleich failed to repeat this corruption with recent 10.3 or 10.4. This error could be fully explained by MDEV-19916 , which was fixed in MariaDB Server 10.3.17 and 10.4.7. Notably, "affected versions" do not list anything after those versions.

Marko Mäkelä added a comment - 2019-11-15 11:56

I think that this report duplicates ~~MDEV-19916~~, which was fixed in MariaDB 10.3.17 and 10.4.7.

Marko Mäkelä added a comment - 2019-11-15 11:56 I think that this report duplicates MDEV-19916 , which was fixed in MariaDB 10.3.17 and 10.4.7.

MariaDB Server

Random crashes and corrupt data in INSTANT-added columns

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration