[MDEV-13542] Crashing on a corrupted page is unhelpful Created: 2017-08-16  Updated: 2024-02-07  Resolved: 2022-06-06

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 5.5, 10.0, 10.1, 10.2, 10.3
Fix Version/s: 10.6.9, 10.7.5, 10.8.4, 10.9.2

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 7
Labels: corruption, encryption, page_compression, recovery

Attachments: File cursor_restore_supremum.patch    
Issue Links:
Blocks
blocks MDEV-18519 0x7f0118195700 InnoDB: Assertion fai... Closed
blocks MDEV-18976 Implement a CHECKSUM redo log record ... Closed
blocks MDEV-21098 Crash in rec_get_offsets_func() due t... Closed
blocks MDEV-22388 Corrupted undo log record leads to se... Closed
blocks MDEV-27593 Crashing on I/O error is unhelpful Closed
blocks MDEV-28457 Crash in page_dir_find_owner_slot() Closed
blocks MDEV-29603 btr_cur_open_at_index_side() is missi... Closed
Duplicate
is duplicated by MDEV-14170 InnoDB: Failing assertion: sibling_mo... Closed
is duplicated by MDEV-16497 Server crashes on corrupted pages dur... Closed
is duplicated by MDEV-26550 [FATAL] [ERROR] mysqld got signal 6 ;... Closed
is duplicated by MDEV-27865 buf_page_read_complete assertions dur... Closed
is duplicated by MDEV-28148 InnoDB slave crash / signal 6 Closed
is duplicated by MDEV-28574 Assertion failure in fil_invalid_page... Closed
is duplicated by MDEV-28786 InnoDB crash leads to pagesize compar... Closed
is duplicated by MDEV-29126 DB is crashing repeatedly with error ... Closed
is duplicated by MDEV-29130 InnoDB: Assertion failure in file sto... Closed
is duplicated by MDEV-29150 DB continuously rebooting Closed
is duplicated by MDEV-33398 Server crashes when executing DML wit... Closed
Problem/Incident
causes MDEV-28845 InnoDB: Failing assertion: bpage->can... Closed
causes MDEV-28950 Assertion `*err == DB_SUCCESS' failed... Closed
causes MDEV-29321 Percona XtraDB 5.7 can't be upgrade t... Closed
causes MDEV-29374 Frequent "Data structure corruption" ... Closed
causes MDEV-29383 Assertion mysql_mutex_assert_owner(&l... Closed
causes MDEV-29435 CHECK TABLE forgets to release latche... Closed
causes MDEV-31333 fsp_free_page() fails to move the ext... Closed
Relates
relates to MDEV-12112 corruption in encrypted table may be ... Closed
relates to MDEV-13680 InnoDB may crash when btr_page_alloc(... Closed
relates to MDEV-17810 Improve error printout when decryptio... Closed
relates to MDEV-17957 Make Innodb_checksum_algorithm strict... Closed
relates to MDEV-17958 Make bug-endian innodb_checksum_algor... Closed
relates to MDEV-18025 Mariabackup fails to detect corrupted... Closed
relates to MDEV-18026 Make mariabackup option to double rea... Closed
relates to MDEV-18691 Failing assertion: page_is_comp(next_... Closed
relates to MDEV-18932 MariaDB 10.3.10-10.3.13 corrupts tabl... Closed
relates to MDEV-19081 DB crashes periodicly with - InnoDB: ... Closed
relates to MDEV-19541 InnoDB crashes when trying to recover... Closed
relates to MDEV-21030 MariaDB keepts crashing on start Closed
relates to MDEV-21513 Fix some crashes in ALTER TABLE…IMPOR... Closed
relates to MDEV-21863 InnoDB: Assertion failure rem0rec.cc... Closed
relates to MDEV-22373 Unable to find a record to delete-mar... Closed
relates to MDEV-22481 Server crash after access to table wi... Closed
relates to MDEV-22624 mariadb used for wordpress restarts a... Closed
relates to MDEV-23344 InnoDB: Fatal: Trying to access page ... Closed
relates to MDEV-23734 InnoDB: File (unknown): 'read' retur... Closed
relates to MDEV-26780 Tablespaces corrupt, MariaDB does not... Closed
relates to MDEV-26893 innodb assertion on startup - rem0rec... Closed
relates to MDEV-28143 Data table corruption/crashing on btrfs Open
relates to MDEV-28975 AWS RDS mariadb 10.5.12 crashes upon ... Closed
relates to MDEV-29082 InnoDB: Failing assertion: !cursor->i... Closed
relates to MDEV-29938 InnoDB: Assertion failure in btr0pcur... Open
relates to MDEV-30132 Crash after recovery, with InnoDB: Tr... Closed
relates to MDEV-31353 InnoDB recovery hangs after reporting... Closed
relates to MDEV-32269 InnoDB after ALTER TABLE…IMPORT TABLE... Closed
relates to MDEV-33325 Crash in flst_read_addr on corrupted ... Confirmed
relates to MDEV-9663 InnoDB assertion failure: *cursor->in... Closed
relates to MDEV-11125 Introduce a reduced doublewrite mode,... Closed
relates to MDEV-12253 Buffer pool blocks are accessed after... Closed
relates to MDEV-12288 Reset DB_TRX_ID when the history is r... Closed
relates to MDEV-13536 DB_TRX_ID is not actually being reset... Closed
relates to MDEV-13559 encryption.innodb-redo-badkey failed ... Closed
relates to MDEV-14132 innodb.innodb-64k failed in buildbot,... Closed
relates to MDEV-14481 Execute InnoDB crash recovery in the ... Closed
relates to MDEV-16763 SIGNAL 6 ERROR Closed
relates to MDEV-17482 InnoDB fails to say which fatal error... Closed
relates to MDEV-17638 Improve error message about corruptio... Closed
relates to MDEV-17641 start slave and mariadb server crash Closed
relates to MDEV-17947 mariabackup SST encounters InnoDB tab... Closed
relates to MDEV-18115 Remove dummy tablespace for the redo log Closed
relates to MDEV-18529 InnoDB wrongly skips decryption of en... Closed
relates to MDEV-18606 innodb crashes on large update and it... Closed
relates to MDEV-19711 Assertion failure during IMPORT TABLE... Closed
relates to MDEV-20931 ALTER...IMPORT can crash the server Closed
relates to MDEV-24287 mysqlcheck unexpectedly causes table ... Closed
relates to MDEV-24412 Mariadb 10.5: InnoDB: Upgrade after a... Closed
relates to MDEV-24578 MariaDB 10.5 fails to join Galera clu... Closed
relates to MDEV-26022 InnoDB page data size mismatch crash Closed
relates to MDEV-26537 InnoDB corrupts files due to incorrec... Closed
relates to MDEV-27053 Crash on assertion failure in btr0cur... Closed
relates to MDEV-27538 MariaDB server startup crash Closed
relates to MDEV-28002 Failing assertion: btr_page_get_prev(... Closed
relates to MDEV-28174 InnoDB: Assertion failure Closed
relates to MDEV-28349 Provide "crash safe" options for CHEC... Open
relates to MDEV-29276 [FATAL] InnoDB: Page old data size 78... Closed
relates to MDEV-30397 InnoDB crash due to DB_FAIL reported ... Closed
relates to MDEV-31245 Assertion Failure results in crash Closed
relates to MDEV-31767 InnoDB tables are being flagged as co... Closed
relates to MDEV-32044 Mariadb crash after upgrading to 11.0... Closed

 Description   

Since the very first version, InnoDB was aborting the whole MySQL or MariaDB server when trying to access a corrupted page.

Earlier, small steps have been taken to improve MariaDB stability, such as MDEV-12253.

We must enable error handling for corrupted pages instead of allowing the server to crash. This is not only about detecting a checksum failure when reading a corrupted page (like suggested in MySQL Bug#10132 Crashing the server on corrupt InnoDB page is unhelpful, filed on April 25, 2005), but also about performing some operations on corrupted pages whose checksum appears valid.

In some background operations, such as the purge of committed transaction history, errors may be reported to the server error log only, or silently ignored. While executing SQL, errors should be reported back to the client, and the transaction could be rolled back.



 Comments   
Comment by Marko Mäkelä [ 2018-09-17 ]

As part of fixing this, we should also ensure that when fil_space_verify_crypt_checksum() has reported corruption, access to the corrupted page will not be allowed, because that could cause various assertion failures inside InnoDB, like this one somewhere inside a B-tree:

10.2.12

2018-09-15 22:01:17 139673797834496 [ERROR] InnoDB:  Page 39432:264336 may be corrupted. Post encryption checksum 2055092086 stored [1704382167:1704382167] key_version 1
2018-09-15 22:01:17 0x7f08571ab700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/rem/rem0rec.cc line 580

Comment by Marko Mäkelä [ 2018-11-08 ]

Also for ROW_FORMAT=COMPRESSED tables, it looks like corrupted pages will be wrongly added to the buffer pool. I was under the impression that MySQL would remove the corrupted page from the buffer pool (or not add it), but that definitely is not the case in MariaDB.

Adding a corrupted page to the buffer pool would cause a crash anywhere when the page contents is accessed. Here is an example from CHECK TABLE:

10.3.10

 Version: '10.3.10-MariaDB-log' socket: '' port: 3306 mariadb.org binary distribution
2018-11-03 16:40:21 2467 [ERROR] InnoDB: Unable to decompress XXX.ibd[page id: space=1922, page number=865]
2018-11-03 16:40:21 2470 [ERROR] Got error 126 when reading table 'XXX'
2018-11-03 16:40:21 2473 [ERROR] Got error 126 when reading table 'XXX'
2018-11-03 16:40:21 2475 [ERROR] Got error 126 when reading table 'XXX'
2018-11-03 16:40:21 2467 [ERROR] Got error 192 when reading table 'XXX'
...
2018-11-03 16:40:23 2489 [ERROR] Got error 1877 when reading table 'XXX'
2018-11-03 16:46:47 29 [ERROR] InnoDB: Wrong owned count 8, 1, rec 129
2018-11-03 16:46:47 29 [ERROR] InnoDB: Apparent corruption in space 1922 page 865 index `PRIMARY`
2018-11-03 16:46:47 29 [ERROR] InnoDB: In page 865 of index `PRIMARY` of table `XXX`.`XXX`
Thread pointer: 0x1649b2d4d98
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
mysqld.exe!page_dir_find_owner_slot()[page0page.cc:99]
mysqld.exe!page_rec_get_prev_const()[page0page.ic:762]
mysqld.exe!btr_validate_level()[btr0btr.cc:5119]
mysqld.exe!btr_validate_index()[btr0btr.cc:5474]
mysqld.exe!ha_innobase::check()[ha_innodb.cc:14404]
mysqld.exe!handler::ha_check()[handler.cc:4199]
mysqld.exe!mysql_admin_table()[sql_admin.cc:803]
mysqld.exe!Sql_cmd_check_table::execute()[sql_admin.cc:1348]
mysqld.exe!mysql_execute_command()[sql_parse.cc:6295]
mysqld.exe!mysql_parse()[sql_parse.cc:8096]

Comment by Geoff Montee (Inactive) [ 2018-12-10 ]

InnoDB also currently crashes when a tablespace is missing or inconsistent with the data dictionary. For example, see the crash from MDEV-17947:

2018-12-08  2:19:46 139957174704896 [ERROR] InnoDB: Trying to do i/o to a tablespace which exists without .ibd data file. i/o type 10, space id 1355633, page no 0, i/o length 16384 bytes
2018-12-08  2:19:46 139957174704896 [ERROR] InnoDB: Trying to access tablespace [space=1355633: page=0] but the tablespace does not exist or is just being dropped.
2018-12-08  2:19:46 139957174704896 [ERROR] InnoDB: tablespace id is 1355633 in the data dictionary but in file ./DB1/TAB_TEMP.ibd it is 1354713!
 
2018-12-08  2:19:46 139957174704896 [ERROR] InnoDB: Trying to do i/o to a tablespace which exists without .ibd data file. i/o type 10, space id 1355633, page no 0, i/o length 16384 bytes
2018-12-08  2:19:46 139957174704896 [ERROR] InnoDB: Trying to access tablespace [space=1355633: page=0] but the tablespace does not exist or is just being dropped.
2018-12-08  2:19:46 139957174704896 [ERROR] InnoDB: Unable to read tablespace 1355633 page no 0 into the buffer pool after 100 attempts. The most probable cause of this error may be that the table has been corrupted. You can try to fix this problem by using innodb_force_recovery. Please see http://dev.mysql.com/doc/refman/5.6/en/ for more details. Aborting...
2018-12-08 02:19:46 7f4a51ae9b00  InnoDB: Assertion failure in thread 139957174704896 in file ha_innodb.cc line 22061
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
181208  2:19:46 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.35-MariaDB
key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=5
max_threads=2002
thread_count=4
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1595581 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f41af9ed008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4a51ae9140 thread_stack 0x48400
mysys/stacktrace.c:268(my_print_stacktrace)[0x55a87a7089bb]
sql/signal_handler.cc:168(handle_fatal_signal)[0x55a87a264d85]
/lib64/libpthread.so.0(+0x36aca0f7e0)[0x7f4a518167e0]
/lib64/libc.so.6(gsignal+0x35)[0x7f4a50086495]
/lib64/libc.so.6(abort+0x175)[0x7f4a50087c75]
handler/ha_innodb.cc:22034(ib_logf(ib_log_level_t, char const*, ...))[0x55a87a48fe53]
buf/buf0buf.cc:3268(buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*, dberr_t*))[0x55a87a5d90eb]
fil/fil0fil.cc:3921(fsp_flags_try_adjust(unsigned long, unsigned long))[0x55a87a6352b3]
fil/fil0fil.cc:5492(fil_space_for_table_exists_in_mem(unsigned long, char const*, bool, bool, bool, mem_block_info_t*, unsigned long, unsigned long))[0x55a87a635bc5]
dict/dict0load.cc:2409(dict_load_table(char const*, unsigned long, dict_err_ignore_t))[0x55a87a618fb4]
dict/dict0dict.cc:1155(dict_table_open_on_name(char const*, unsigned long, unsigned long, dict_err_ignore_t))[0x55a87a606e30]
handler/ha_innodb.cc:6364(ha_innobase::open(char const*, int, unsigned int))[0x55a87a494753]
sql/handler.cc:2533(handler::ha_open(TABLE*, char const*, int, unsigned int))[0x55a87a26928e]
sql/table.cc:2975(open_table_from_share(THD*, TABLE_SHARE*, char const*, unsigned int, unsigned int, unsigned int, TABLE*, bool))[0x55a87a190023]
sql/sql_base.cc:2565(open_table(THD*, TABLE_LIST*, Open_table_context*))[0x55a87a0a61dd]
sql/sql_base.cc:4091(open_and_process_table)[0x55a87a0aa732]
sql/sql_show.cc:1166(mysqld_show_create_get_fields(THD*, TABLE_LIST*, List<Item>*, String*))[0x55a87a14b762]
sql/sql_show.cc:1260(mysqld_show_create(THD*, TABLE_LIST*))[0x55a87a14c0a7]
sql/sql_parse.cc:3740(mysql_execute_command(THD*))[0x55a87a0ed485]
sql/sql_parse.cc:7463(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x55a87a0f22e5]
sql/sql_parse.cc:7268(wsrep_mysql_parse)[0x55a87a0f23f4]
sql/sql_parse.cc:1493(dispatch_command(enum_server_command, THD*, char*, unsigned int))[0x55a87a0f4807]
sql/sql_parse.cc:1126(do_command(THD*))[0x55a87a0f52b7]
sql/sql_connect.cc:1330(do_handle_one_connection(THD*))[0x55a87a1b699f]
sql/sql_connect.cc:1244(handle_one_connection)[0x55a87a1b6ad7]
/lib64/libpthread.so.0(+0x36aca07aa1)[0x7f4a5180eaa1]
/lib64/libc.so.6(clone+0x6d)[0x7f4a5013cbdd]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f41b68d0020): show create table `TAB_TEMP`
Connection ID (thread ID): 108197
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off

Would you also want to prevent crashes in situations like that?

Comment by Thirunarayanan Balathandayuthapani [ 2019-05-06 ]

There is another crash because of error case mishandling:

 
#6  0x00007f7afa8dabd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x564c4d8a7512 "bpage->buf_fix_count == 0", file=file@entry=0x564c4d8a5f08 "/home/thiru/mariarepo/10.4/storage/innobase/buf/buf0buf.cc", line=line@entry=5811, function=function@entry=0x564c4d8abfa0 <buf_mark_space_corrupt(buf_page_t*, fil_space_t const*)::__PRETTY_FUNCTION__> "void buf_mark_space_corrupt(buf_page_t*, const fil_space_t*)") at assert.c:92
#7  0x00007f7afa8dac82 in __GI___assert_fail (assertion=0x564c4d8a7512 "bpage->buf_fix_count == 0", file=0x564c4d8a5f08 "/home/thiru/mariarepo/10.4/storage/innobase/buf/buf0buf.cc", line=5811, function=0x564c4d8abfa0 <buf_mark_space_corrupt(buf_page_t*, fil_space_t const*)::__PRETTY_FUNCTION__> "void buf_mark_space_corrupt(buf_page_t*, const fil_space_t*)") at assert.c:101
#8  0x0000564c4d2cddf3 in buf_mark_space_corrupt (bpage=0x7f7aef82ec50, space=0x564c504c3bd0) at /home/thiru/mariarepo/10.4/storage/innobase/buf/buf0buf.cc:5811
#9  0x0000564c4d2ced38 in buf_page_io_complete (bpage=0x7f7aef82ec50, dblwr=false, evict=false) at /home/thiru/mariarepo/10.4/storage/innobase/buf/buf0buf.cc:6107
#10 0x0000564c4d2f62ea in buf_read_page_low (err=0x7f7ad6ffaef4, sync=true, type=0, mode=132, page_id=..., zip_size=0, unzip=false, ignore_missing_space=false) at /home/thiru/mariarepo/10.4/storage/innobase/buf/buf0rea.cc:202
#11 0x0000564c4d2f6bb7 in buf_read_page (page_id=..., zip_size=0) at /home/thiru/mariarepo/10.4/storage/innobase/buf/buf0rea.cc:410
#12 0x0000564c4d2c95bc in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=10, file=0x564c4d845878 "/home/thiru/mariarepo/10.4/storage/innobase/row/row0row.cc", line=1219, mtr=0x7f7ad6ffc890, err=0x7f7ad6ffb388) at /home/thiru/mariarepo/10.4/storage/innobase/buf/buf0buf.cc:4384
#13 0x0000564c4d28adb5 in btr_cur_search_to_nth_level_func (index=0x7f7ac8022318, level=0, tuple=0x7f7ac000c688, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x564c5054d3a0, ahi_latch=0x0, file=0x564c4d845878 "/home/thiru/mariarepo/10.4/storage/innobase/row/row0row.cc", line=1219, mtr=0x7f7ad6ffc890, autoinc=0) at /home/thiru/mariarepo/10.4/storage/innobase/btr/btr0cur.cc:1544
#14 0x0000564c4d1c8d4a in btr_pcur_open_low (index=0x7f7ac8022318, level=0, tuple=0x7f7ac000c688, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x564c5054d3a0, file=0x564c4d845878 "/home/thiru/mariarepo/10.4/storage/innobase/row/row0row.cc", line=1219, autoinc=0, mtr=0x7f7ad6ffc890) at /home/thiru/mariarepo/10.4/storage/innobase/include/btr0pcur.ic:459
#15 0x0000564c4d1cbf3b in row_search_on_row_ref (pcur=0x564c5054d3a0, mode=2, table=0x7f7ac8020cd8, ref=0x7f7ac000c688, mtr=0x7f7ad6ffc890) at /home/thiru/mariarepo/10.4/storage/innobase/row/row0row.cc:1219
#16 0x0000564c4d1c079c in row_purge_reposition_pcur (mode=2, node=0x564c5054d2f8, mtr=0x7f7ad6ffc890) at /home/thiru/mariarepo/10.4/storage/innobase/row/row0purge.cc:78
#17 0x0000564c4d1c26f2 in row_purge_reset_trx_id (node=0x564c5054d2f8, mtr=0x7f7ad6ffc890) at /home/thiru/mariarepo/10.4/storage/innobase/row/row0purge.cc:798
#18 0x0000564c4d1c3db0 in row_purge_record_func (node=0x564c5054d2f8, undo_rec=0x7f7ac80110c8 "", thr=0x564c5054d128, updated_extern=false) at /home/thiru/mariarepo/10.4/storage/innobase/row/row0purge.cc:1205
#19 0x0000564c4d1c3f80 in row_purge (node=0x564c5054d2f8, undo_rec=0x7f7ac80110c8 "", thr=0x564c5054d128) at /home/thiru/mariarepo/10.4/storage/innobase/row/row0purge.cc:1254
#20 0x0000564c4d1c41a9 in row_purge_step (thr=0x564c5054d128) at /home/thiru/mariarepo/10.4/storage/innobase/row/row0purge.cc:1315
#21 0x0000564c4d146e89 in que_thr_step (thr=0x564c5054d128) at /home/thiru/mariarepo/10.4/storage/innobase/que/que0que.cc:1042
#22 0x0000564c4d1470bf in que_run_threads_low (thr=0x564c5054d128) at /home/thiru/mariarepo/10.4/storage/innobase/que/que0que.cc:1104
#23 0x0000564c4d1472b1 in que_run_threads (thr=0x564c5054d128) at /home/thiru/mariarepo/10.4/storage/innobase/que/que0que.cc:1144
#24 0x0000564c4d201ee4 in srv_task_execute () at /home/thiru/mariarepo/10.4/storage/innobase/srv/srv0srv.cc:2437
#25 0x0000564c4d20209a in srv_worker_thread (arg=0x0) at /home/thiru/mariarepo/10.4/storage/innobase/srv/srv0srv.cc:2485

Comment by Marko Mäkelä [ 2019-05-21 ]

MDEV-19541 reports a crash during crash recovery, related to a slightly modified test of MDEV-12699.

Comment by Marko Mäkelä [ 2020-01-03 ]

For testing on Linux, we can create a block device with injected bad blocks, like this:

dd if=/dev/zero of=/var/lib/virtualblock.img bs=512 count=1048576
losetup /dev/loop0 /var/lib/virtualblock.img
 
dmsetup create errdev0 << EOF
0 261144 linear /dev/loop0 0
261144 5 error
261149 787427 linear /dev/loop0 261139
EOF

The device with the ‘hole’ in it will typically be created in /dev/mapper.

Comment by Marko Mäkelä [ 2020-10-28 ]

MDEV-23344 reports a crash when trying to access a page that does not exist in the tablespace.

Comment by Marko Mäkelä [ 2022-05-02 ]

When it comes to reading corrupted pages into the buffer pool, we seem to already do the right thing with regard to buf_page_is_corrupted() and buf_page_t::read_complete(). One relevant change was MDEV-19541.

It looks like the main part of the fix would be to check that error results from buf_page_get_gen() will be propagated from the caller.

I would exclude that in change buffer merge from this. We have seen crashes in btr_page_reorganize() that were invoked from ibuf_insert_to_index_page_low() during ibuf_merge_or_delete_for_page(). Due to the design of the change buffer, such crashes can also occur during CHECK TABLE, where some extra attention has been spent to avoid crashes due to corruption. We already disabled the change buffer by default (MDEV-27734) and deprecated it (MDEV-27735) due to corruption that we are unable to reproduce in our internal testing.

Comment by Marko Mäkelä [ 2022-05-02 ]

It looks like we already have a large part of these robustness fixes present in 10.5 (added by MDEV-15528 and related changes) or 10.6 (added in MDEV-25506 and changes related to atomic DDL). I already went through all direct callers of buf_page_get_gen(), and there is not that much to be fixed. I still have to check all callers of some 12 functions and then implement some fault injection in buf_page_get_gen() so that we can run some stress tests.

I would declare that those linked open bugs that are related to some other forms of corruption (say, a garbage BLOB pointer or DB_ROLL_PTR, or an incorrect key specified for decrypting an encrypted page) can be out of the scope of this fix. This fix should mainly be about handling failures to read a page.

Comment by Matthias Leich [ 2022-05-19 ]

The RQG testing on the development tree
     origin/bb-10.6-MDEV-13542 dca6937a30eeee56f46fc47ac85656243227e4d2 2022-05-16T16:31:15+03:00
showed frequent
- SEGV's with a backtrace like in the description on top of MDEV-25257
- cases where the server error log contains the message '[ERROR] mysqld got signal 11' but the server process
   did not disappear even after more than 300s waiting --> no core file nor rr trace
After applying Thiru's initial patch for MDEV-25257
    diff --git a/storage/innobase/dict/dict0load.cc b/storage/innobase/dict/dict0load.cc
index 20916c2c96e..2c3d48b9573 100644
--- a/storage/innobase/dict/dict0load.cc
+++ b/storage/innobase/dict/dict0load.cc
@@ -1383,6 +1383,7 @@ dict_load_columns(
                        the flag is set before the table is created. */
                        if (table->fts == NULL) {
                                table->fts = fts_create(table);
+                               table->fts->cache = fts_cache_create(table);
                        }
                        ut_a(table->fts->doc_col == ULINT_UNDEFINED);
 to the tree bb-10.6-MDEV-13542 both bad effects were no more seen again in testing.
 
Now some new dominant problem showed up.
[rr 3532891 209362]mysqld: /data/Server/bb-10.6-MDEV-13542G/storage/innobase/buf/buf0buf.cc:2682: buf_block_t* buf_page_get_low(page_id_t, ulint, ulint, buf_block_t*, ulint, mtr_t*, dberr_t*, bool): Assertion `mode == 11 || mode == 12 || block->zip_size() == zip_size' failed.
(rr) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fcf37482859 in __GI_abort () at abort.c:79
#2  0x00007fcf37482729 in __assert_fail_base (fmt=0x7fcf37618588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55996a841da0 "mode == 11 || mode == 12 || block->zip_size() == zip_size", 
    file=0x55996a83d680 "/data/Server/bb-10.6-MDEV-13542G/storage/innobase/buf/buf0buf.cc", line=2682, function=<optimized out>) at assert.c:92
#3  0x00007fcf37493f36 in __GI___assert_fail (assertion=0x55996a841da0 "mode == 11 || mode == 12 || block->zip_size() == zip_size", file=0x55996a83d680 "/data/Server/bb-10.6-MDEV-13542G/storage/innobase/buf/buf0buf.cc", 
    line=2682, function=0x55996a841880 "buf_block_t* buf_page_get_low(page_id_t, ulint, ulint, buf_block_t*, ulint, mtr_t*, dberr_t*, bool)") at assert.c:101
#4  0x00005599696fa35f in buf_page_get_low (page_id=..., zip_size=0, rw_latch=8, guess=0x0, mode=16, mtr=0x7fcf1a61aec0, err=0x7fcf1a61a770, allow_ibuf_merge=false)
    at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/buf/buf0buf.cc:2682
#5  0x00005599696fbaad in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=8, guess=0x0, mode=16, mtr=0x7fcf1a61aec0, err=0x7fcf1a61a770, allow_ibuf_merge=false)
    at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/buf/buf0buf.cc:2984
#6  0x00005599696ae711 in btr_cur_pess_upd_restore_supremum (block=0x7fcf2aea57e0, rec=0x7fcf2b49e781 "", mtr=0x7fcf1a61aec0) at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/btr/btr0cur.cc:4782
#7  0x00005599696b1f88 in btr_cur_pessimistic_update (flags=10, cursor=0x61100005f180, offsets=0x7fcf1a61aac0, offsets_heap=0x7fcf1a61abc0, entry_heap=0x619000bf9080, big_rec=0x7fcf1a61aaa0, update=0x62100026f3f0, cmpl_info=0, 
    thr=0x62100026f750, trx_id=1719, mtr=0x7fcf1a61aec0) at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/btr/btr0cur.cc:5243
#8  0x000055996956ff1b in row_upd_clust_rec (flags=0, node=0x62100026f2c0, index=0x616002f91a08, offsets=0x619000bf9850, offsets_heap=0x7fcf1a61abc0, thr=0x62100026f750, mtr=0x7fcf1a61aec0)
    at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/row/row0upd.cc:2456
#9  0x0000559969571812 in row_upd_clust_step (node=0x62100026f2c0, thr=0x62100026f750) at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/row/row0upd.cc:2729
#10 0x0000559969571f69 in row_upd (node=0x62100026f2c0, thr=0x62100026f750) at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/row/row0upd.cc:2792
#11 0x0000559969572a6b in row_upd_step (thr=0x62100026f750) at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/row/row0upd.cc:2934
#12 0x00005599694d46e5 in row_update_for_mysql (prebuilt=0x62100026e988) at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/row/row0mysql.cc:1691
#13 0x000055996914b046 in ha_innobase::update_row (this=0x61d0007f94b8, old_row=0x61a0003c80f8 <incomplete sequence \375>, new_row=0x61a0003c7eb8 "\265")
    at /data/Server/bb-10.6-MDEV-13542G/storage/innobase/handler/ha_innodb.cc:8668
#14 0x00005599688687dd in handler::ha_update_row (this=0x61d0007f94b8, old_data=0x61a0003c80f8 <incomplete sequence \375>, new_data=0x61a0003c7eb8 "\265") at /data/Server/bb-10.6-MDEV-13542G/sql/handler.cc:7602
#15 0x00005599683484c9 in mysql_update (thd=0x62b00009a218, table_list=0x62b0000a13f8, fields=..., values=..., conds=0x0, order_num=0, order=0x0, limit=18446744073709551602, ignore=false, found_return=0x7fcf1a61c2f0, 
    updated_return=0x7fcf1a61c310) at /data/Server/bb-10.6-MDEV-13542G/sql/sql_update.cc:1087
#16 0x000055996806794f in mysql_execute_command (thd=0x62b00009a218, is_called_from_prepared_stmt=false) at /data/Server/bb-10.6-MDEV-13542G/sql/sql_parse.cc:4423
#17 0x000055996807fd40 in mysql_parse (thd=0x62b00009a218, rawbuf=0x62b0000a1238 "UPDATE t2 SET col_text = REPEAT(SUBSTR(CAST( 2361 AS CHAR),1,1), @fill_amount)  /* E_R Thread2 QNO 1139 CON_ID 16 */", length=116, 
    parser_state=0x7fcf1a61cb20) at /data/Server/bb-10.6-MDEV-13542G/sql/sql_parse.cc:8045
#18 0x0000559968058133 in dispatch_command (command=COM_QUERY, thd=0x62b00009a218, packet=0x629000c53219 " UPDATE t2 SET col_text = REPEAT(SUBSTR(CAST( 2361 AS CHAR),1,1), @fill_amount)  /* E_R Thread2 QNO 1139 CON_ID 16 */ ", 
    packet_length=118, blocking=true) at /data/Server/bb-10.6-MDEV-13542G/sql/sql_parse.cc:1912
#19 0x0000559968055369 in do_command (thd=0x62b00009a218, blocking=true) at /data/Server/bb-10.6-MDEV-13542G/sql/sql_parse.cc:1409
#20 0x0000559968458c53 in do_handle_one_connection (connect=0x608000002eb8, put_in_cache=true) at /data/Server/bb-10.6-MDEV-13542G/sql/sql_connect.cc:1418
#21 0x00005599684584df in handle_one_connection (arg=0x608000002838) at /data/Server/bb-10.6-MDEV-13542G/sql/sql_connect.cc:1312
#22 0x00007fcf379ac609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#23 0x00007fcf3757f293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(rr)
pluto:/data/results/1652901598/TBR-1503/dev/shm/rqg/1652901598/57/1/rr
 
This problem disappeared after applying Thiru's cursor_restore_supremum.patch.

Comment by Marko Mäkelä [ 2022-05-23 ]

thiru, thank you for cursor_restore_supremum.patch. I hope that my simpler solution works as well.

Even today, the fault injection exposed a couple new crashes, which I fixed:

./mtr --parallel=auto --big-test --force --max-test-fail=0 --mysqld=--debug=d,intermittent_read_failure

My main effort today was to prevent crashes on change buffer merge failure (motivated by MDEV-28349) as well as on page split or merge. Such corruption should not be a direct consequence of a page checksum mismatch, but something else (say, MDEV-28312, a bug in Galera snapshot transfer).

Generated at Thu Feb 08 08:06:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.