[MDEV-28148] InnoDB slave crash / signal 6 Created: 2022-03-22  Updated: 2022-06-22  Resolved: 2022-06-22

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.5.15
Fix Version/s: 10.6.9, 10.7.5, 10.8.4, 10.9.2, 10.10.0

Type: Bug Priority: Critical
Reporter: Jarkko Hyvärinen Assignee: Marko Mäkelä
Resolution: Duplicate Votes: 0
Labels: crash, innodb
Environment:

CentOS 7


Issue Links:
Duplicate
duplicates MDEV-13542 Crashing on a corrupted page is unhel... Closed
duplicates MDEV-22739 !cursor->index->is_committed() in row... Closed
Relates
relates to MDEV-26977 mariadb 10.5.12 reboot loop in AWS | ... Closed
relates to MDEV-27734 Set innodb_change_buffering=none by d... Closed

 Description   

This happened yesterday:

mariadbd: 2022-03-21 13:15:30 13094 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 8485, cur2 offset 15780
mariadbd: 220321 13:15:30 [ERROR] mysqld got signal 6 ;
mariadbd: This could be because you hit a bug. It is also possible that this binary
mariadbd: or one of the libraries it was linked against is corrupt, improperly built,
mariadbd: or misconfigured. This error can also be caused by malfunctioning hardware.
mariadbd: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
mariadbd: We will try our best to scrape up some info that will hopefully help
mariadbd: diagnose the problem, but since we have already crashed,
mariadbd: something is definitely wrong and this may fail.
mariadbd: Server version: 10.5.15-MariaDB-log
mariadbd: key_buffer_size=134217728
mariadbd: read_buffer_size=131072
mariadbd: max_used_connections=4
mariadbd: max_threads=153
mariadbd: thread_count=3
mariadbd: It is possible that mysqld could use up to
mariadbd: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467872 K  bytes of memory
mariadbd: Hope that's ok; if not, decrease some variables in the equation.
mariadbd: Thread pointer: 0x7f42980029a8
mariadbd: Attempting backtrace. You can use the following information to find out
mariadbd: where mysqld died. If you see no messages after this, something went
mariadbd: terribly wrong...
mariadbd: stack_bottom = 0x7f446861a3e0 thread_stack 0x49000
mariadbd: ??:0(my_print_stacktrace)[0x55a751fffe3e]
mariadbd: ??:0(handle_fatal_signal)[0x55a7519f19c7]
mariadbd: sigaction.c:0(__restore_rt)[0x7f448224d630]
mariadbd: :0(__GI_raise)[0x7f4481698387]
mariadbd: :0(__GI_abort)[0x7f4481699a78]
mariadbd: ??:0(std::pair<std::_Rb_tree_iterator<unsigned int>, bool> std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_insert_unique<unsigned int const&>(unsigned int const&))[0x55a751e568b0]
mariadbd: ??:0(std::unique_lock<std::mutex>::unlock())[0x55a751d71579]
mariadbd: ??:0(std::pair<std::_Rb_tree_iterator<unsigned int>, bool> std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_insert_unique<unsigned int const&>(unsigned int const&))[0x55a751e5ce86]
mariadbd: ??:0(std::pair<std::_Rb_tree_iterator<unsigned int>, bool> std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_insert_unique<unsigned int const&>(unsigned int const&))[0x55a751e5dab7]
mariadbd: ??:0(std::pair<std::_Rb_tree_iterator<unsigned int>, bool> std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_insert_unique<unsigned int const&>(unsigned int const&))[0x55a751e7c63c]
mariadbd: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x55a751db5c46]
mariadbd: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x55a751db6932]
mariadbd: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x55a751db6e2a]
mariadbd: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x55a751dc9a14]
mariadbd: ??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55a751cf6d94]
mariadbd: ??:0(handler::ha_write_row(unsigned char const*))[0x55a7519feb4f]
mariadbd: ??:0(Rows_log_event::write_row(rpl_group_info*, bool))[0x55a751b0c2f5]
mariadbd: ??:0(Write_rows_log_event::do_exec_row(rpl_group_info*))[0x55a751b0c75d]
mariadbd: ??:0(Rows_log_event::do_apply_event(rpl_group_info*))[0x55a751b01b84]
mariadbd: ??:0(non-virtual thunk to Item_string_sys::~Item_string_sys())[0x55a75172f65f]
mariadbd: ??:0(handle_slave_sql)[0x55a75173984c]
mariadbd: ??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x55a751c471b7]
mariadbd: pthread_create.c:0(start_thread)[0x7f4482245ea5]
mariadbd: ??:0(__clone)[0x7f4481760b0d]
mariadbd: Trying to get some variables.
mariadbd: Some pointers may be invalid and cause the dump to abort.
mariadbd: Query (0x7f42980a540b): INSERT INTO ...
mariadbd: Status: NOT_KILLED
mariadbd: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=of
f,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
mariadbd: The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
mariadbd: information that should help you find out what is causing the crash.
mariadbd: Writing a core file...
mariadbd: Working directory at /var/lib/mysql
mariadbd: Resource Limits:
mariadbd: Limit                     Soft Limit           Hard Limit           Units
mariadbd: Max cpu time              unlimited            unlimited            seconds
mariadbd: Max file size             unlimited            unlimited            bytes
mariadbd: Max data size             unlimited            unlimited            bytes
mariadbd: Max stack size            8388608              unlimited            bytes
mariadbd: Max core file size        0                    unlimited            bytes
mariadbd: Max resident set          unlimited            unlimited            bytes
mariadbd: Max processes             31150                31150                processes
mariadbd: Max open files            32768                32768                files
mariadbd: Max locked memory         65536                65536                bytes
mariadbd: Max address space         unlimited            unlimited            bytes
mariadbd: Max file locks            unlimited            unlimited            locks
mariadbd: Max pending signals       31150                31150                signals
mariadbd: Max msgqueue size         819200               819200               bytes
mariadbd: Max nice priority         0                    0
mariadbd: Max realtime priority     0                    0
mariadbd: Max realtime timeout      unlimited            unlimited            us
mariadbd: Core pattern: |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h
abrt-hook-ccpp: Process 9433 (mariadbd) of user 987 killed by SIGABRT - dumping core
abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'xorg-2019-01-19-08:12:24-6238-1'
systemd: mariadb.service: main process exited, code=killed, status=6/ABRT
systemd: Unit mariadb.service entered failed state.
systemd: mariadb.service failed.
abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'xorg-2019-02-23-08:17:59-6276-1'
abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2017-09-13-18:37:14-4564-0'
abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2017-09-13-18:36:34-4511-0'
abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'Python-2017-09-13-17:53:58-1093'
abrt-server: Package 'MariaDB-server' isn't signed with proper key
abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2022-03-21-13:15:30-9433' exited with 1
abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2022-03-21-13:15:30-9433'
systemd: mariadb.service holdoff time over, scheduling restart.
systemd: Cannot add dependency job for unit firewalld.service, ignoring: Unit is masked.
systemd: Stopped MariaDB 10.5.15 database server.

I was able to get server back running by starting it with innodb_force_recovery = 1, dropping databases and finally recreating them from another slave.



 Comments   
Comment by Marko Mäkelä [ 2022-03-22 ]

Can you please try to produce stack traces from the core dump? The built-in stack trace generator for the current thread often produces garbage, like in this case.

The crash message matches MDEV-26977. In MDEV-27734 we disabled the InnoDB change buffer by default because of such bug reports and because of questionable performance benefit.

Since we have not been able to reproduce the corruption ourselves, we are unable to fix it, and the best course of action seems to be to disable the change buffer.

Comment by Marko Mäkelä [ 2022-03-22 ]

MariaDB Server 10.5.15 does disable the change buffering by default (MDEV-27734), but it is possible that some buffered changes existed in the change buffer when the server was started. The corruption would not be noticed until a change buffer merge occurs.

The change buffer covers changes to secondary indexes. Therefore, the statement

ALTER TABLE tablename FORCE;

should be able to fix this type of corruption by rebuilding the table.

If you restore the server from data files, any corruption in the persistent change buffer would be copied along. Is the crash reproducible after restoring from a logical dump, with the change buffer disabled? Can you provide a proper stack trace? Does it include ibuf_merge_or_delete_for_page()?

Comment by Jarkko Hyvärinen [ 2022-03-22 ]

Unfortunately it seems that I am not able to produce stack traces from the core dump because the databases were recreated and there is no server snapshot taken from the crash.

But here are some details what I was able to gather:

  • MariaDB had been running constantly since Mar 1 2022 (update from 10.5.13 to 10.5.15)
  • After the crash when I tried to start the server without innodb_force_recovery=1 it always crashed
  • The server is originally started in 2017 with MariaDB 10.1 or 10.2 and it has been updated every year to new major version (10.1 -> 10.2, 10.2 -> 10.3 etc.)
  • server.cnf file is down below

[mysqld]
log_bin                   = mysql-bin
binlog_do_db              = db1
binlog_do_db              = db2
binlog_format             = ROW
expire_logs_days          = 5
server_id                 = 2
 
innodb_buffer_pool_size   = 5GB
innodb_file_per_table     = 1
 
ssl
ssl-ca                    = /path/to/file
ssl-cert                  = /path/to/file
ssl-key                   = /path/to/file
 
local_infile              = 0
sql_mode                  = ''
symbolic_links            = 0
innodb_strict_mode        = 0
max_password_errors       = 10
 
relay-log                 = mysql-relay-bin.log
read-only                 = 1
replicate-do-db           = db1
replicate-do-db           = db2
log_slave_updates         = 1
 
plugin_load                  = server_audit=server_audit.so
server_audit_logging         = ON
server_audit_events          = 'CONNECT,TABLE,QUERY_DDL'
server_audit_file_rotate_now = ON
server_audit_file_rotations  = 30

We had simiral style crash in February to another slave (using MariaDB 10.5). In that case the error was "Page old data size xyz new data size xyz, page old max ins size xyz new max ins size xyz" and the server didn't start without innodb_force_recovery=4.

Comment by Marko Mäkelä [ 2022-04-20 ]

jarkko.hyvarinen, thank you. While I do not know the exact stack trace, the amount of stack frames could match InnoDB change buffer merge, which may invoke btr_page_reorganize(), which would output the fatal error message if the page is corrupted.

You wrote that you were able to break a crash loop by setting innodb_force_recovery=4. Actually, in MariaDB Server 10.5, thanks to MDEV-19514, that setting is equivalent to innodb_force_recovery=3. Such a crash loop was reported earlier in MDEV-26977.

I would need a repeatable test case in order to analyze and fix this bug. We do not hit this in our internal testing, possibly because we typically run rather short tests with rather small amounts of data, and by design, some data flow patterns of the change buffer would only be exercised with larger amounts of data over a longer period. The fact that the change buffer is persistent and that corruption may remain unnoticed for a long time makes this hard.

In MDEV-27734, the change buffer was disabled by default to prevent further corruption due to the change buffer.

Comment by Marko Mäkelä [ 2022-06-22 ]

The reported code that would intentionally cause a crash was removed in MDEV-13542.

Generated at Thu Feb 08 09:58:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.