[MDEV-19081] DB crashes periodically with - InnoDB: Assertion failure Created: 2019-03-29 Updated: 2019-05-13 Resolved: 2019-05-13 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.2.22, 10.2.23 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stefan B. | Assignee: | Marko Mäkelä |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | KVM, LXC, corruption, galera, innodb, need_feedback | ||
| Description |
|
The DB service stops from time to time, though not at regular intervals (by our estimate mostly between 11pm and 9am), with the following error:
It seems that the processes keep running in some form, but the service status (systemctl status mariadb.service) shows a failed state and nobody can connect to or use the DB services. Furthermore, the process has to be killed manually. |
| Comments |
| Comment by Marko Mäkelä [ 2019-03-29 ] | ||
|
InnoDB is crashing because it detects corruption in a record header of a table that does not use ROW_FORMAT=REDUNDANT (which was the original and only format before MySQL 5.0.3). The function rec_get_offsets_func() is reading the 3 bits from rec_get_status(), and expects the most significant bit of it to be zero. In
But, the cause of the corruption is unclear and cannot be found without a repeatable test case. I fixed some corruption bugs recently (
I would suggest using innodb_compression_algorithm=strict_crc32 to improve the chances of detecting this kind of corruption earlier, when reading pages into the buffer pool. If corrupted pages enter the buffer pool, InnoDB can crash in very many places.
There could be a problem in the snapshot transfer (SST) procedure, or the corruption could simply have been propagated by the SST. That is the downside of physical replication.
If you can provide more information, such as page dumps (which could be obtained by using a debugger), I can try to help further. | ||
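As a rough illustration of the check described in the comment above, the following Python sketch models a 3-bit status field whose most significant bit must be zero. It is not the InnoDB source: the function names `get_status` and `check_status` are hypothetical, the placement of the field in the low 3 bits of a single header byte is an assumption, and the four status names are given as commonly documented for the 10.2 series.

```python
# Toy model only; names and bit placement are assumptions, not InnoDB code.
REC_STATUS_ORDINARY = 0   # user record on a leaf page
REC_STATUS_NODE_PTR = 1   # node pointer record on a non-leaf page
REC_STATUS_INFIMUM = 2    # page infimum pseudo-record
REC_STATUS_SUPREMUM = 3   # page supremum pseudo-record

STATUS_MASK = 0b111       # the status field is 3 bits wide


def get_status(header_byte: int) -> int:
    """Extract the 3-bit status field from one header byte (assumed low bits)."""
    return header_byte & STATUS_MASK


def check_status(header_byte: int) -> int:
    """Return the status, or raise if the most significant status bit is set."""
    status = get_status(header_byte)
    if status & 0b100:
        # The comment above says this bit is expected to be zero, so the
        # values 4..7 are treated here as evidence of a damaged header.
        raise AssertionError(f"corrupted record header: status={status}")
    return status
```

In this model a single flipped bit in the header byte is enough to turn a valid status such as 1 into an invalid one such as 5, which is why the assertion can fire long after the damage occurred.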
| Comment by Stefan B. [ 2019-03-29 ] | ||
|
First, thanks a lot for the quick response. After reading your comment, I have some questions whose answers might help us resolve our situation.
| ||
| Comment by Marko Mäkelä [ 2019-03-29 ] | ||
|
Sorry, I meant to write innodb_checksum_algorithm=strict_crc32. In MariaDB 10.4 there will be an even better option, full_crc32, but it will not affect the format of ROW_FORMAT=COMPRESSED pages.
I do not think that this particular form of corruption should be possible with ROW_FORMAT=COMPRESSED pages, because the status bits are generated when the page is decompressed. (I am the author of that format.) The only way you could get this assertion failure for ROW_FORMAT=COMPRESSED is if something (a software bug or a hardware fault) corrupts the InnoDB buffer pool, which I think is rather unlikely.
Rebuilding tables can sometimes remove corruption, if that corruption is not in the pages that must be read in order to copy the table. SELECT * (possibly done by mysqldump) or a table rebuild will have to read all leaf pages of the clustered index tree, and all BLOB pages that are pointed to by leaf-page records. Additionally, all pages on the path from the root to the first leaf page will have to be read. If you are lucky, the corruption is in a secondary index page or in a non-leaf page that will not be accessed when copying the table.
I would have to see a dump of the corrupted page in order to determine the cause of the corruption. We have not seen this form of page corruption in our internal testing, except when some bug in mariabackup has been involved. | ||
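To make the page-access argument in the comment above concrete, here is a minimal Python sketch of a toy clustered index tree. The `Page` class and `pages_read_by_full_scan` function are hypothetical names invented for this illustration; the model only encodes what the comment states (the root-to-first-leaf path, all leaf pages, and the BLOB pages they point to are read) and ignores secondary indexes entirely.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set


@dataclass
class Page:
    page_no: int
    children: List["Page"] = field(default_factory=list)  # empty for a leaf page
    blob_pages: List[int] = field(default_factory=list)   # BLOB pages referenced by leaf records


def _leaves(page: Page) -> Iterator[Page]:
    """Yield leaf pages in key order (modeling shortcut for following sibling links)."""
    if not page.children:
        yield page
    else:
        for child in page.children:
            yield from _leaves(child)


def pages_read_by_full_scan(root: Page) -> Set[int]:
    """Page numbers that a full scan of this toy clustered index would read."""
    read: Set[int] = set()
    # 1. Descend along the leftmost path from the root to the first leaf.
    node = root
    while node.children:
        read.add(node.page_no)
        node = node.children[0]
    # 2. Read every leaf page and every BLOB page its records point to.
    for leaf in _leaves(root):
        read.add(leaf.page_no)
        read.update(leaf.blob_pages)
    return read


# Example tree: root 1 -> non-leaf pages 2 and 3 -> leaves 4..7; leaf 6 references BLOB page 20.
root = Page(1, [
    Page(2, [Page(4), Page(5)]),
    Page(3, [Page(6, blob_pages=[20]), Page(7)]),
])
print(sorted(pages_read_by_full_scan(root)))  # [1, 2, 4, 5, 6, 7, 20]
```

Non-leaf page 3 never appears in the result, so in this model corruption confined to that page would not be encountered while copying the table, whereas corruption in any leaf or referenced BLOB page would be.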
| Comment by Stefan B. [ 2019-03-29 ] | ||
|
> Sorry, I meant to write innodb_checksum_algorithm=strict_crc32
> Rebuilding tables can sometimes remove corruption, if that corruption is not in the pages that must be read in order to copy the table. SELECT * (possibly done by mysqldump) or a table rebuild will have to read all leaf pages of the clustered index tree, and all BLOB pages that are pointed to by leaf-page records. Additionally, all pages on the path from the root to the first leaf page will have to be read. If you are lucky, the corruption is in a secondary index page or in a non-leaf page that will not be accessed when copying the table.
> I would have to see a dump of the corrupted page in order to determine the cause of the corruption. We have not seen this form of page corruption in our internal testing, except when some bug in mariabackup has been involved. | ||
| Comment by Stefan B. [ 2019-04-10 ] | ||
|
Today we ran into the same error. I have attached the backtrace and the memory map as they showed up. I hope this is what you meant by a dump of the corrupted page. | ||
| Comment by Marko Mäkelä [ 2019-04-12 ] | ||
|
You could get a dump of the corrupted page by attaching a debugger to the mysqld process before the crash, and running suitable commands. In GDB, you would use up or frame and then dump the buf_block_t::frame and also the buf_block_t::page metadata. In the stack frame of the failing assertion in db1b_backtrace-MemoryMap
We’d also want to identify the index and the table. For our support customers, we would provide detailed instructions on how to do that. |