[MDEV-30397] InnoDB crash due to DB_FAIL reported for a corrupted page
Created: 2023-01-12  Updated: 2023-03-02  Resolved: 2023-02-16

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0
Fix Version/s: 10.11.3, 11.0.1, 10.6.13, 10.7.8, 10.8.8, 10.9.6, 10.10.4

Type: Bug
Priority: Critical
Reporter: Richard Green
Assignee: Marko Mäkelä
Resolution: Fixed
Votes: 0
Labels: None
Environment: Windows 10 Pro 22H2 64-bit 19045.2364


Attachments: DX-APP06.err, my.ini, mysqld.dmp
Issue Links:
Blocks
Relates
relates to MDEV-13542 Crashing on a corrupted page is unhel... Closed
relates to MDEV-30598 Mariadb crashes Closed

 Description   

I can connect to MariaDB with HeidiSQL and MySqlConnector, but when I try to view table data (all tables are InnoDB) or run a query, the MariaDB service crashes.

I get a popup: "SQL Error (2013): Lost connection to the MySQL server during query."

From the .err log:

2023-01-12 17:33:30 3 [ERROR] [FATAL] InnoDB: Unknown error Failed, retry may succeed
230112 17:33:30 [ERROR] mysqld got exception 0x80000003 ;



 Comments   
Comment by Marko Mäkelä [ 2023-01-13 ]

Thank you for the report. There is a nice stack trace in DX-APP06.err:

mariadb-10.9.4

server.dll!ib::fatal::~fatal()[ut0ut.cc:527]
server.dll!row_mysql_handle_errors()[row0mysql.cc:631]
server.dll!row_search_mvcc()[row0sel.cc:5842]
server.dll!ha_innobase::index_read()[ha_innodb.cc:9003]
server.dll!ha_innobase::index_last()[ha_innodb.cc:9400]
server.dll!handler::ha_index_last()[handler.cc:3585]
server.dll!join_read_last()[sql_select.cc:22441]
server.dll!sub_select()[sql_select.cc:21363]
server.dll!do_select()[sql_select.cc:20909]
server.dll!JOIN::exec_inner()[sql_select.cc:4818]
server.dll!mysql_select()[sql_select.cc:5076]
server.dll!handle_select()[sql_select.cc:579]
server.dll!execute_sqlcom_select()[sql_parse.cc:6261]
server.dll!mysql_execute_command()[sql_parse.cc:3945]
server.dll!mysql_parse()[sql_parse.cc:8027]

The InnoDB error code is DB_FAIL.

The SELECT statement is also present there. Is this reproducible when loading a copy of the table imcleandata.imcleansyncdataadv into a newly initialized database? Can you produce a minimal set of SQL statements (CREATE TABLE, INSERT, SELECT) that reproduces this crash?

Comment by Richard Green [ 2023-01-13 ]

Hi,
It's difficult because the table is 100GB in size. I was able to dump about 1GB to a .sql file and import it into another table in the same database.

With DBeaver I was able to see the data in the table briefly before it crashed (the service stopped). With HeidiSQL I cannot get that far; it crashes before showing any data.

Any query from MySqlConnector in C# causes the MariaDB service to stop. This would be the minimal statement that does it:
"SELECT ID_ORIG FROM imcleansyncdataadv ORDER BY ID DESC LIMIT 1"

Comment by Marko Mäkelä [ 2023-01-14 ]

Can you provide the CREATE TABLE statement for this table?

Comment by Vladislav Vaintroub [ 2023-01-14 ]

If it crashes with HeidiSQL, a LIMIT might be involved, as in your C# example, which contains both LIMIT and "ORDER BY ... DESC".
Does it ever crash on a query that uses neither of those?

Comment by Marko Mäkelä [ 2023-01-14 ]

The InnoDB error code DB_FAIL is usually associated with data modifications, not with reads such as the one in the stack trace. It is a long shot, but the error could be issued due to a failed change buffer merge (which would have crashed the server until MDEV-13542 was fixed). A custom-built server could be installed to confirm or refute this guess.

Comment by Richard Green [ 2023-01-16 ]

Here is the CREATE statement:

CREATE TABLE `imcleansyncdataadv` (
`ID` BIGINT(20) NOT NULL AUTO_INCREMENT,
`ID_ORIG` BIGINT(20) NULL DEFAULT NULL,
`IDX` UUID NOT NULL,
`SerialNumberShort` INT(11) NOT NULL DEFAULT '0',
`SerialNumber` VARCHAR(16) NOT NULL DEFAULT '0' COLLATE 'latin1_swedish_ci',
`Time` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
`Payload` LONGBLOB NOT NULL,
`IPAddress` VARCHAR(16) NOT NULL DEFAULT '' COLLATE 'latin1_swedish_ci',
`ConnectionResult` BIT(1) NULL DEFAULT b'0',
`VinMask` VARCHAR(20) NOT NULL COLLATE 'latin1_swedish_ci',
`VehicleRecord` INT(11) NULL DEFAULT NULL,
`RecordParsed` INT(11) NULL DEFAULT NULL,
`Edition` INT(11) NULL DEFAULT NULL,
PRIMARY KEY (`ID`) USING BTREE,
INDEX `ID_ORIG` (`ID_ORIG`) USING BTREE,
INDEX `Time` (`Time`) USING BTREE,
INDEX `VinMask` (`VinMask`) USING BTREE
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
;

Comment by Richard Green [ 2023-01-16 ]

I'm afraid I have had to move on; I uninstalled MariaDB from the server.

I am now trying MySQL to see if I run into the same issues.

Comment by Marko Mäkelä [ 2023-01-17 ]

Because there is a secondary index on the column id_orig, corruption of the change buffer is a possible cause of this. It is unfortunate that we lost the data set.

Comment by Marko Mäkelä [ 2023-01-19 ]

I do not see how the change buffer code could return DB_FAIL up to the caller. The cause of this crash remains a mystery.

Comment by Marko Mäkelä [ 2023-02-08 ]

I can reproduce this behaviour on 10.6 by modifying buf_LRU_free_page() so that it attempts to evict modified pages of temporary tables. There is obviously some flaw in that modification, but it reproduces this error in a couple of tests that exercise temporary tables. In other words, we are attempting some operation on a corrupted page, getting DB_FAIL, and not handling it in a consistent fashion (presumably by initiating a rollback).
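
To illustrate the failure mode described above (an error code reaching a handler that has no branch for it), here is a minimal sketch. Everything in it is a hypothetical stand-in: the `Err` and `Action` enums, `describe()` and `handle_error()` are not InnoDB's `dberr_t` or `row_mysql_handle_errors()`, and the exception merely stands in for a fatal abort. Only the "Failed, retry may succeed" text is taken from the log quoted in the description.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical subset of error codes; not the real dberr_t values.
enum class Err { Success, Fail, PageCorrupted, LockWait };

// Hypothetical outcomes a statement-level error handler could choose.
enum class Action { Continue, RollbackStatement, RetryAfterWait };

// Human-readable text per code. Only the Fail text is taken from the
// reported log line; the others are illustrative.
const char *describe(Err e) {
  switch (e) {
  case Err::Success: return "Success";
  case Err::Fail: return "Failed, retry may succeed";
  case Err::PageCorrupted: return "Page is corrupted";
  case Err::LockWait: return "Lock wait";
  }
  return "Unknown";
}

// A handler that knows what to do with corruption (roll back) but has
// no branch for Fail, so Fail falls through to the default branch.
Action handle_error(Err e) {
  switch (e) {
  case Err::Success: return Action::Continue;
  case Err::LockWait: return Action::RetryAfterWait;
  case Err::PageCorrupted: return Action::RollbackStatement;
  default:
    // Stand-in for a fatal abort of the whole server process.
    throw std::logic_error(std::string("Unknown error ") + describe(e));
  }
}
```

In this sketch, `handle_error(Err::PageCorrupted)` asks for a rollback, while `handle_error(Err::Fail)` ends in the default branch with a message of the same shape as the one in the report.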

Comment by Marko Mäkelä [ 2023-02-08 ]

In my case (caused by a buggy code change that I am working on), we are reading a page that has been filled with NUL bytes, and returning DB_FAIL due to that. Other corruption detected by that function results in a different error code:

diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index 5c81e34856b..7b18906f395 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -3600,7 +3600,7 @@ dberr_t buf_page_t::read_complete(const fil_node_t &node)
     else if (read_id == page_id_t(0, 0))
     {
       /* This is likely an uninitialized (all-zero) page. */
-      err= DB_FAIL;
+      err= DB_PAGE_CORRUPTED;
       goto release_page;
     }
     else if (!node.space->full_crc32() &&
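
As a rough illustration of the "all-zero page" condition that the hunk above is concerned with, the following sketch checks a raw page buffer in two ways: whether the page identifier stored in the header is (0, 0), and whether every byte is NUL. The helper names and the header offsets (`kPageNoOffset`, `kSpaceIdOffset`) are assumptions made for this sketch, not code from buf0buf.cc. A zero identifier is only suspicious when the page that was requested is not itself page (0, 0).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed byte offsets of the page number and tablespace ID in the page
// header; treat these as illustrative values for this sketch.
constexpr std::size_t kPageNoOffset = 4;
constexpr std::size_t kSpaceIdOffset = 34;

// Read a 32-bit big-endian integer from a byte buffer.
std::uint32_t read_be32(const std::uint8_t *p) {
  return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
         (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}

// True if the identifier stored in the header is (space 0, page 0).
bool header_id_is_zero(const std::vector<std::uint8_t> &page) {
  if (page.size() < kSpaceIdOffset + 4) return false;
  return read_be32(page.data() + kSpaceIdOffset) == 0 &&
         read_be32(page.data() + kPageNoOffset) == 0;
}

// True if the whole buffer consists of NUL bytes.
bool is_all_zero(const std::vector<std::uint8_t> &page) {
  return std::all_of(page.begin(), page.end(),
                     [](std::uint8_t b) { return b == 0; });
}
```

A freshly zero-filled buffer satisfies both checks. A buffer whose header is zero but whose body contains data satisfies only the first, which is why the header check alone can only say that a page is "likely" uninitialized.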

Comment by Marko Mäkelä [ 2023-02-08 ]

The DB_FAIL return value was added to that function in MDEV-13542. I would not call this bug a regression caused by that fix, but rather an omission in it. Before MDEV-13542, we would typically simply crash due to the corrupted page.

Comment by Marko Mäkelä [ 2023-02-09 ]

The special return value is needed so that fil_aio_callback() can avoid reporting an error when read-ahead is covering an unallocated page. The error code needs to be mapped for synchronous reads, in buf_read_page_low().
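
The split described above can be sketched as follows: the read-completion step keeps its special "looks unallocated" code, the asynchronous read-ahead path stays silent about it, and the synchronous path translates it into a code that callers already handle. The enum and both function names are hypothetical and only model the idea; they are not the actual `fil_aio_callback()` or `buf_read_page_low()` code.

```cpp
// Hypothetical error codes for this sketch; not the real dberr_t values.
// Fail models the internal "page looks unallocated (all zero)" signal.
enum class Err { Success, Fail, PageCorrupted };

// Asynchronous read-ahead: no caller asked for this particular page, so
// an apparently unallocated page is not worth reporting as an error.
bool should_report_async(Err e) {
  return e != Err::Success && e != Err::Fail;
}

// Synchronous read: the caller needs this exact page, so the internal
// signal is mapped to a code that upper layers know how to handle.
Err map_for_sync_read(Err e) {
  return e == Err::Fail ? Err::PageCorrupted : e;
}
```

With this split, `Err::Fail` never escapes from a synchronous read, so a statement-level handler only ever sees `Err::PageCorrupted` for such a page.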
