MariaDB Server / MDEV-19081

DB crashes periodically with InnoDB: Assertion failure

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Incomplete
    • Affects Version/s: 10.2.22, 10.2.23
    • Fix Version/s: N/A

    Description

      The DB service stops from time to time, at irregular intervals (mostly between 11 pm and 9 am), with the following error:

      2019-03-28 23:31:08 0x7ff4dc62d700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.2.23/storage/innobase/rem/rem0rec.cc line 574
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mysqld startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
      InnoDB: about forcing recovery.
      190328 23:31:08 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.2.23-MariaDB-10.2.23+maria~stretch
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=234
      max_threads=752
      thread_count=273
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1783435 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x7ff4d8001f28
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7ff4dc62ccc8 thread_stack 0x49000
      *** buffer overflow detected ***: /usr/sbin/mysqld terminated
      ======= Backtrace: =========
      /lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7ffa09af5bfb]
      /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7ffa09b7e437]
      /lib/x86_64-linux-gnu/libc.so.6(+0xf7570)[0x7ffa09b7c570]
      /lib/x86_64-linux-gnu/libc.so.6(+0xf93aa)[0x7ffa09b7e3aa]
      /usr/sbin/mysqld(my_addr_resolve+0xe2)[0x55ca42284922]
      /usr/sbin/mysqld(my_print_stacktrace+0x1bb)[0x55ca4226b1eb]
      /usr/sbin/mysqld(handle_fatal_signal+0x41d)[0x55ca41d0a01d]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7ffa0b4180e0]
      /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7ffa09ab7fff]
      /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7ffa09ab942a]
      /usr/sbin/mysqld(+0x40f971)[0x55ca41ab8971]
      /usr/sbin/mysqld(+0x887df6)[0x55ca41f30df6]
      /usr/sbin/mysqld(+0x863673)[0x55ca41f0c673]
      /usr/sbin/mysqld(+0x96648e)[0x55ca4200f48e]
      /usr/sbin/mysqld(+0x89b559)[0x55ca41f44559]
      /usr/sbin/mysqld(+0x8a15e4)[0x55ca41f4a5e4]
      /usr/sbin/mysqld(+0x8a2187)[0x55ca41f4b187]
      /usr/sbin/mysqld(+0x8b1a20)[0x55ca41f5aa20]
      /usr/sbin/mysqld(+0x7f5c04)[0x55ca41e9ec04]
      /usr/sbin/mysqld(_ZN7handler12ha_write_rowEPh+0x107)[0x55ca41d140d7]
      /usr/sbin/mysqld(_Z12write_recordP3THDP5TABLEP12st_copy_info+0x72)[0x55ca41b4b992]
      /usr/sbin/mysqld(_Z12mysql_insertP3THDP10TABLE_LISTR4ListI4ItemERS3_IS5_ES6_S6_15enum_duplicatesb+0x1206)[0x55ca41b560f6]
      /usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x3f68)[0x55ca41b6bee8]
      /usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x28a)[0x55ca41b70e4a]
      /usr/sbin/mysqld(+0x4c864f)[0x55ca41b7164f]
      /usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x1a7c)[0x55ca41b737fc]
      /usr/sbin/mysqld(_Z10do_commandP3THD+0x176)[0x55ca41b748a6]
      /usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x25a)[0x55ca41c3ec0a]
      /usr/sbin/mysqld(handle_one_connection+0x3d)[0x55ca41c3ed7d]
      /usr/sbin/mysqld(+0xb75791)[0x55ca4221e791]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7ffa0b40e4a4]
      /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ffa09b6dd0f]
      ======= Memory map: ========
      55ca416a9000-55ca4278f000 r-xp 00000000 fe:02 1320586                    /usr/sbin/mysqld
      55ca4298e000-55ca42a60000 r--p 010e5000 fe:02 1320586                    /usr/sbin/mysqld
      55ca42a60000-55ca42b17000 rw-p 011b7000 fe:02 1320586                    /usr/sbin/mysqld
      55ca42b17000-55ca433a9000 rw-p 00000000 00:00 0 
      55ca43458000-55ca508b9000 rw-p 00000000 00:00 0                          [heap]
      7ff3d9e04000-7ff3d9e05000 ---p 00000000 00:00 0 
      7ff3d9e05000-7ff3d9e4f000 rw-p 00000000 00:00 0 
      7ff3d9e4f000-7ff3d9e50000 ---p 00000000 00:00 0 
      ...
      

      1. The DB node which crashed first tonight (db1b) was set up yesterday (running 10.2.23).
      2. The second server (db1c) crashed 70 seconds later, after becoming the primary DB node and during the reconnect/resync (SST) of db1b (last updated 3 weeks ago from 10.2.14 to 10.2.22).
      3. The third DB node (db1a) had not been running since the weekend and was therefore not involved in tonight's crash.

      It seems that the processes somehow keep running, but the service status (systemctl status mariadb.service) is "failed", nobody can connect to or use the database, and the process has to be killed manually.
      After that I was able to rebuild the cluster by bootstrapping a new Galera cluster (galera_new_cluster) from the most advanced node.
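
      For reference, a minimal sketch of that bootstrap procedure, assuming the default data directory /var/lib/mysql (the commands are the standard Galera ones; exact steps can differ per setup):

      # On each stopped node, check which one committed the most transactions
      cat /var/lib/mysql/grastate.dat     # compare the seqno: values; note safe_to_bootstrap:

      # On the most advanced node only: if needed, edit grastate.dat and set safe_to_bootstrap: 1,
      # then bootstrap a new cluster from that node
      galera_new_cluster

      # Start the remaining nodes normally; they rejoin via IST/SST
      systemctl start mariadb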

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment -

            InnoDB is crashing because it detects corruption in a record header of a table that does not use ROW_FORMAT=REDUNDANT (which was the original and only format before MySQL 5.0.3). The function rec_get_offsets_func() is reading the 3 bits from rec_get_status(), and expects the most significant bit of it to be zero.

            In MDEV-13542 we should make the error handling of InnoDB more robust, so that crashes like this would be avoided.

            But, the cause of the corruption is unclear and cannot be found without a repeatable test case. I fixed some corruption bugs recently (MDEV-14126 just missed the 10.2.23 release), but I don’t think those fixes would affect this kind of corruption.

            I would suggest using innodb_compression_algorithm=strict_crc32 to improve the chances of detecting this kind of corruption earlier, when reading pages into the buffer pool. If corrupted pages enter the buffer pool, InnoDB can crash in very many places.

            There could be a problem in the snapshot transfer (SST) procedure, or the corruption could simply have been propagated by the SST. That is the downside of physical replication.

            If you can provide more information, such as page dumps (which could be obtained by using a debugger), I can try to help further.

            stefman_87 Stefan B. added a comment -

            Firstly, thanks a lot for the quick response.

            After reading your comment I have some questions which might help me resolve our situation.

            1. We have some tables which use ROW_FORMAT=COMPRESSED. Is it possible to alter the row format for these tables, and if so, would this work around our problem? (see the sketch after this list)
            2. You suggest setting innodb_compression_algorithm=strict_crc32, but in the official docs I can't find this algorithm as a supported option (https://mariadb.com/kb/en/library/compression/#choosing-compression-algorithm).
            3. If this option leads to earlier detection of corruption, does that mean that corrupted pages are also not loaded into the buffer pool?
            4. Assuming that the web servers using these tables (see 1) are less important, would it be helpful to change some table- (or schema-) specific options, forbid write access, re-import the tables from existing DB dumps, or even shut down the application?
            5. If I understand correctly, there is currently no bug fix for this in any of the newer versions?
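
            For reference, a hedged sketch of how the tables in question could be listed and rebuilt with a different row format (the table name app.orders is purely illustrative; whether this actually helps is discussed in the following comments):

            -- List InnoDB tables that currently use ROW_FORMAT=COMPRESSED
            SELECT TABLE_SCHEMA, TABLE_NAME, ROW_FORMAT
              FROM information_schema.TABLES
             WHERE ENGINE = 'InnoDB' AND ROW_FORMAT = 'Compressed';

            -- Rebuild one of them with an uncompressed row format
            ALTER TABLE app.orders ROW_FORMAT=DYNAMIC;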

            marko Marko Mäkelä added a comment -

            Sorry, I meant to write innodb_checksum_algorithm=strict_crc32. In MariaDB 10.4 there will be an even better option, full_crc32, but it will not affect the format of ROW_FORMAT=COMPRESSED pages.
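
            A minimal sketch of applying this setting (the variable is dynamic, so it can be set at runtime; it can also be made persistent under [mysqld] in the server configuration file):

            -- Check the current value, then change it at runtime
            SHOW GLOBAL VARIABLES LIKE 'innodb_checksum_algorithm';
            SET GLOBAL innodb_checksum_algorithm = strict_crc32;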

            I do not think that this particular form of corruption should be possible with ROW_FORMAT=COMPRESSED pages, because the status bits are generated when the page is decompressed. (I am the author of that format.) The only way how you could get this assertion failure for ROW_FORMAT=COMPRESSED is that something (software bug or hardware fault) corrupts the InnoDB buffer pool, which I think is rather unlikely.

            Rebuilding tables can sometimes remove corruption, if that corruption is not in the pages that must be read in order to copy the table. SELECT * (possibly done by mysqldump) or a table rebuild will have to read all leaf pages of the clustered index tree, and all BLOB pages that are pointed to by leaf-page records. Additionally, all pages on the path from the root to the first leaf page will have to be read. If you are lucky, the corruption is in a secondary index page or in a non-leaf page that will not be accessed when copying the table.
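
            A hedged sketch of the copy/rebuild operations described above (the table name app.orders is again illustrative):

            -- Read every row of the clustered index; this is essentially what mysqldump does per table
            SELECT * FROM app.orders;

            -- Or rebuild the table, which copies every row into a freshly built table and indexes
            ALTER TABLE app.orders FORCE;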

            I would have to see a dump of the corrupted page in order to determine the cause of the corruption. We have not seen this form of page corruption in our internal testing, except when some bug in mariabackup has been involved.

            stefman_87 Stefan B. added a comment - edited

            > Sorry, I meant to write innodb_checksum_algorithm=strict_crc32

            Am I mistaken, or is this exactly what you wrote in your first comment?
            Ah OK, I see... sorry, it's Friday.
            Unfortunately the servers are already configured with this algorithm.

            > Rebuilding tables can sometimes remove corruption, if that corruption is not in the pages that must be read in order to copy the table. SELECT * (possibly done by mysqldump) or a table rebuild will have to read all leaf pages of the clustered index tree, and all BLOB pages that are pointed to by leaf-page records. Additionally, all pages on the path from the root to the first leaf page will have to be read. If you are lucky, the corruption is in a secondary index page or in a non-leaf page that will not be accessed when copying the table.

            I will try this in the next few days; hopefully nothing will happen until then.

            > I would have to see a dump of the corrupted page in order to determine the cause of the corruption. We have not seen this form of page corruption in our internal testing, except when some bug in mariabackup has been involved.

            Sorry for asking, but how could I create this dump, and within what time after a crash could I still get it?

            stefman_87 Stefan B. added a comment -

            Today we ran into the same error. I have attached the backtrace and the memory map as they appear in the log file.

            I hope this is what you mean by a dump of the corrupted page.
            Otherwise I'm not really sure how to get such a dump.
            db1b_backtrace-MemoryMap


            marko Marko Mäkelä added a comment -

            You could get a dump of the corrupted page by attaching a debugger to the mysqld process before the crash, and running suitable commands. In GDB, you would use up or frame and then dump the buf_block_t::frame and also the buf_block_t::page metadata. In the stack frame of the failing assertion in db1b_backtrace-MemoryMap, that would be something like the following:

            print latch_leaves.blocks[0]->page
            dump binary memory /tmp/page.bin latch_leaves.blocks[0]->frame latch_leaves.blocks[0]->frame+srv_page_size
            

            We’d also want to identify the index and the table. For our support customers, we would provide detailed instructions on how to do that.
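
            For completeness, a rough sketch of such a debugging session, assuming debug symbols for /usr/sbin/mysqld are installed and that InnoDB's assertion handler is named ut_dbg_assertion_failed (the name is taken from the InnoDB source tree; the exact frame to select may differ):

            # attach to the running server before the crash
            gdb -p $(pidof mysqld)
            (gdb) break ut_dbg_assertion_failed     # stop inside InnoDB's assertion handler
            (gdb) continue
            # ... once the assertion fires, walk up the stack to the frame that contains latch_leaves:
            (gdb) up
            (gdb) print latch_leaves.blocks[0]->page
            (gdb) dump binary memory /tmp/page.bin latch_leaves.blocks[0]->frame latch_leaves.blocks[0]->frame + srv_page_size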


            People

              Assignee: marko Marko Mäkelä
              Reporter: stefman_87 Stefan B.
              Votes: 0
              Watchers: 3

