[MDEV-28148] InnoDB slave crash / signal 6 - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Duplicate
Affects Version/s: 10.5.15
Fix Version/s: 10.7.5, 10.10.0, 10.6.9, 10.8.4, 10.9.2
Component/s: Server, Storage Engine - InnoDB
Labels:
- crash
- innodb
Environment:
CentOS 7

Description

This happened yesterday:

mariadbd: 2022-03-21 13:15:30 13094 [ERROR] [FATAL] InnoDB: Rec offset 99, cur1 offset 8485, cur2 offset 15780

mariadbd: 220321 13:15:30 [ERROR] mysqld got signal 6 ;

mariadbd: This could be because you hit a bug. It is also possible that this binary

mariadbd: or one of the libraries it was linked against is corrupt, improperly built,

mariadbd: or misconfigured. This error can also be caused by malfunctioning hardware.

mariadbd: To report this bug, see https://mariadb.com/kb/en/reporting-bugs

mariadbd: We will try our best to scrape up some info that will hopefully help

mariadbd: diagnose the problem, but since we have already crashed,

mariadbd: something is definitely wrong and this may fail.

mariadbd: Server version: 10.5.15-MariaDB-log

mariadbd: key_buffer_size=134217728

mariadbd: read_buffer_size=131072

mariadbd: max_used_connections=4

mariadbd: max_threads=153

mariadbd: thread_count=3

mariadbd: It is possible that mysqld could use up to

mariadbd: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467872 K  bytes of memory

mariadbd: Hope that's ok; if not, decrease some variables in the equation.

mariadbd: Thread pointer: 0x7f42980029a8

mariadbd: Attempting backtrace. You can use the following information to find out

mariadbd: where mysqld died. If you see no messages after this, something went

mariadbd: terribly wrong...

mariadbd: stack_bottom = 0x7f446861a3e0 thread_stack 0x49000

mariadbd: ??:0(my_print_stacktrace)[0x55a751fffe3e]

mariadbd: ??:0(handle_fatal_signal)[0x55a7519f19c7]

mariadbd: sigaction.c:0(__restore_rt)[0x7f448224d630]

mariadbd: :0(__GI_raise)[0x7f4481698387]

mariadbd: :0(__GI_abort)[0x7f4481699a78]

mariadbd: ??:0(std::pair<std::_Rb_tree_iterator<unsigned int>, bool> std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_insert_unique<unsigned int const&>(unsigned int const&))[0x55a751e568b0]

mariadbd: ??:0(std::unique_lock<std::mutex>::unlock())[0x55a751d71579]

mariadbd: ??:0(std::pair<std::_Rb_tree_iterator<unsigned int>, bool> std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_insert_unique<unsigned int const&>(unsigned int const&))[0x55a751e5ce86]

mariadbd: ??:0(std::pair<std::_Rb_tree_iterator<unsigned int>, bool> std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_insert_unique<unsigned int const&>(unsigned int const&))[0x55a751e5dab7]

mariadbd: ??:0(std::pair<std::_Rb_tree_iterator<unsigned int>, bool> std::_Rb_tree<unsigned int, unsigned int, std::_Identity<unsigned int>, std::less<unsigned int>, std::allocator<unsigned int> >::_M_insert_unique<unsigned int const&>(unsigned int const&))[0x55a751e7c63c]

mariadbd: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x55a751db5c46]

mariadbd: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x55a751db6932]

mariadbd: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x55a751db6e2a]

mariadbd: ??:0(void std::__introsort_loop<unsigned char**, long>(unsigned char**, unsigned char**, long))[0x55a751dc9a14]

mariadbd: ??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55a751cf6d94]

mariadbd: ??:0(handler::ha_write_row(unsigned char const*))[0x55a7519feb4f]

mariadbd: ??:0(Rows_log_event::write_row(rpl_group_info*, bool))[0x55a751b0c2f5]

mariadbd: ??:0(Write_rows_log_event::do_exec_row(rpl_group_info*))[0x55a751b0c75d]

mariadbd: ??:0(Rows_log_event::do_apply_event(rpl_group_info*))[0x55a751b01b84]

mariadbd: ??:0(non-virtual thunk to Item_string_sys::~Item_string_sys())[0x55a75172f65f]

mariadbd: ??:0(handle_slave_sql)[0x55a75173984c]

mariadbd: ??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x55a751c471b7]

mariadbd: pthread_create.c:0(start_thread)[0x7f4482245ea5]

mariadbd: ??:0(__clone)[0x7f4481760b0d]

mariadbd: Trying to get some variables.

mariadbd: Some pointers may be invalid and cause the dump to abort.

mariadbd: Query (0x7f42980a540b): INSERT INTO ...

mariadbd: Status: NOT_KILLED

mariadbd: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=of

f,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off

mariadbd: The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains

mariadbd: information that should help you find out what is causing the crash.

mariadbd: Writing a core file...

mariadbd: Working directory at /var/lib/mysql

mariadbd: Resource Limits:

mariadbd: Limit                     Soft Limit           Hard Limit           Units

mariadbd: Max cpu time              unlimited            unlimited            seconds

mariadbd: Max file size             unlimited            unlimited            bytes

mariadbd: Max data size             unlimited            unlimited            bytes

mariadbd: Max stack size            8388608              unlimited            bytes

mariadbd: Max core file size        0                    unlimited            bytes

mariadbd: Max resident set          unlimited            unlimited            bytes

mariadbd: Max processes             31150                31150                processes

mariadbd: Max open files            32768                32768                files

mariadbd: Max locked memory         65536                65536                bytes

mariadbd: Max address space         unlimited            unlimited            bytes

mariadbd: Max file locks            unlimited            unlimited            locks

mariadbd: Max pending signals       31150                31150                signals

mariadbd: Max msgqueue size         819200               819200               bytes

mariadbd: Max nice priority         0                    0

mariadbd: Max realtime priority     0                    0

mariadbd: Max realtime timeout      unlimited            unlimited            us

mariadbd: Core pattern: |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h

abrt-hook-ccpp: Process 9433 (mariadbd) of user 987 killed by SIGABRT - dumping core

abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'xorg-2019-01-19-08:12:24-6238-1'

systemd: mariadb.service: main process exited, code=killed, status=6/ABRT

systemd: Unit mariadb.service entered failed state.

systemd: mariadb.service failed.

abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'xorg-2019-02-23-08:17:59-6276-1'

abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2017-09-13-18:37:14-4564-0'

abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2017-09-13-18:36:34-4511-0'

abrtd: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'Python-2017-09-13-17:53:58-1093'

abrt-server: Package 'MariaDB-server' isn't signed with proper key

abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2022-03-21-13:15:30-9433' exited with 1

abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2022-03-21-13:15:30-9433'

systemd: mariadb.service holdoff time over, scheduling restart.

systemd: Cannot add dependency job for unit firewalld.service, ignoring: Unit is masked.

systemd: Stopped MariaDB 10.5.15 database server.

I was able to get server back running by starting it with innodb_force_recovery = 1, dropping databases and finally recreating them from another slave.

Attachments

Issue Links

duplicates

MDEV-13542 Crashing on a corrupted page is unhelpful

Closed

MDEV-22739 !cursor->index->is_committed() in row0ins.cc after update to 10.4.13 from 10.3.21

Closed

relates to

MDEV-26977 mariadb 10.5.12 reboot loop in AWS | [ERROR] InnoDB: Insert buffer insert fails | Corruption of an index tree

Closed

MDEV-27734 Set innodb_change_buffering=none by default

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2022-03-22 06:49

Can you please try to produce stack traces from the core dump? The built-in stack trace generator for the current thread often produces garbage, like in this case.

The crash message matches ~~MDEV-26977~~. In ~~MDEV-27734~~ we disabled the InnoDB change buffer by default because of such bug reports and because of questionable performance benefit.

Since we have not been able to reproduce the corruption ourselves, we are unable to fix it, and the best course of action seems to be to disable the change buffer.

Marko Mäkelä added a comment - 2022-03-22 06:49 Can you please try to produce stack traces from the core dump? The built-in stack trace generator for the current thread often produces garbage, like in this case. The crash message matches MDEV-26977 . In MDEV-27734 we disabled the InnoDB change buffer by default because of such bug reports and because of questionable performance benefit. Since we have not been able to reproduce the corruption ourselves, we are unable to fix it, and the best course of action seems to be to disable the change buffer.

Marko Mäkelä added a comment - 2022-03-22 06:56

MariaDB Server 10.5.15 does disable the change buffering by default (~~MDEV-27734~~), but it is possible that some buffered changes existed in the change buffer when the server was started. The corruption would not be noticed until a change buffer merge occurs.

The change buffer covers changes to secondary indexes. Therefore, the statement

ALTER TABLE tablename FORCE;

should be able to fix this type of corruption by rebuilding the table.

If you restore the server from data files, any corruption in the persistent change buffer would be copied along. Is the crash reproducible after restoring from a logical dump, with the change buffer disabled? Can you provide a proper stack trace? Does it include ibuf_merge_or_delete_for_page()?

Marko Mäkelä added a comment - 2022-03-22 06:56 MariaDB Server 10.5.15 does disable the change buffering by default ( MDEV-27734 ), but it is possible that some buffered changes existed in the change buffer when the server was started. The corruption would not be noticed until a change buffer merge occurs. The change buffer covers changes to secondary indexes. Therefore, the statement ALTER TABLE tablename FORCE ; should be able to fix this type of corruption by rebuilding the table. If you restore the server from data files, any corruption in the persistent change buffer would be copied along. Is the crash reproducible after restoring from a logical dump, with the change buffer disabled? Can you provide a proper stack trace? Does it include ibuf_merge_or_delete_for_page() ?

Jarkko Hyvärinen added a comment - 2022-03-22 08:49

Unfortunately it seems that I am not able to produce stack traces from the core dump because the databases were recreated and there is no server snapshot taken from the crash.

But here are some details what I was able to gather:

MariaDB had been running constantly since Mar 1 2022 (update from 10.5.13 to 10.5.15)
After the crash when I tried to start the server without innodb_force_recovery=1 it always crashed
The server is originally started in 2017 with MariaDB 10.1 or 10.2 and it has been updated every year to new major version (10.1 -> 10.2, 10.2 -> 10.3 etc.)
server.cnf file is down below

[mysqld]

log_bin                   = mysql-bin

binlog_do_db              = db1

binlog_do_db              = db2

binlog_format             = ROW

expire_logs_days          = 5

server_id                 = 2

innodb_buffer_pool_size   = 5GB

innodb_file_per_table     = 1

ssl

ssl-ca                    = /path/to/file

ssl-cert                  = /path/to/file

ssl-key                   = /path/to/file

local_infile              = 0

sql_mode                  = ''

symbolic_links            = 0

innodb_strict_mode        = 0

max_password_errors       = 10

relay-log                 = mysql-relay-bin.log

read-only                 = 1

replicate-do-db           = db1

replicate-do-db           = db2

log_slave_updates         = 1

plugin_load                  = server_audit=server_audit.so

server_audit_logging         = ON

server_audit_events          = 'CONNECT,TABLE,QUERY_DDL'

server_audit_file_rotate_now = ON

server_audit_file_rotations  = 30

We had simiral style crash in February to another slave (using MariaDB 10.5). In that case the error was "Page old data size xyz new data size xyz, page old max ins size xyz new max ins size xyz" and the server didn't start without innodb_force_recovery=4.

Jarkko Hyvärinen added a comment - 2022-03-22 08:49 Unfortunately it seems that I am not able to produce stack traces from the core dump because the databases were recreated and there is no server snapshot taken from the crash. But here are some details what I was able to gather: MariaDB had been running constantly since Mar 1 2022 (update from 10.5.13 to 10.5.15) After the crash when I tried to start the server without innodb_force_recovery=1 it always crashed The server is originally started in 2017 with MariaDB 10.1 or 10.2 and it has been updated every year to new major version (10.1 -> 10.2, 10.2 -> 10.3 etc.) server.cnf file is down below [mysqld] log_bin = mysql-bin binlog_do_db = db1 binlog_do_db = db2 binlog_format = ROW expire_logs_days = 5 server_id = 2 innodb_buffer_pool_size = 5GB innodb_file_per_table = 1 ssl ssl-ca = /path/to/file ssl-cert = /path/to/file ssl-key = /path/to/file local_infile = 0 sql_mode = '' symbolic_links = 0 innodb_strict_mode = 0 max_password_errors = 10 relay-log = mysql-relay-bin.log read-only = 1 replicate- do -db = db1 replicate- do -db = db2 log_slave_updates = 1 plugin_load = server_audit=server_audit.so server_audit_logging = ON server_audit_events = 'CONNECT,TABLE,QUERY_DDL' server_audit_file_rotate_now = ON server_audit_file_rotations = 30 We had simiral style crash in February to another slave (using MariaDB 10.5). In that case the error was "Page old data size xyz new data size xyz, page old max ins size xyz new max ins size xyz" and the server didn't start without innodb_force_recovery=4.

Marko Mäkelä added a comment - 2022-04-20 06:25

jarkko.hyvarinen, thank you. While I do not know the exact stack trace, the amount of stack frames could match InnoDB change buffer merge, which may invoke btr_page_reorganize(), which would output the fatal error message if the page is corrupted.

You wrote that you were able to break a crash loop by setting innodb_force_recovery=4. Actually, in MariaDB Server 10.5, thanks to ~~MDEV-19514~~, that setting is equivalent to innodb_force_recovery=3. Such a crash loop was reported earlier in ~~MDEV-26977~~.

I would need a repeatable test case in order to analyze and fix this bug. We do not hit this in our internal testing, possibly because we typically run rather short tests with rather small amounts of data, and by design, some data flow patterns of the change buffer would only be exercised with larger amounts of data over a longer period. The fact that the change buffer is persistent and that corruption may remain unnoticed for a long time makes this hard.

In ~~MDEV-27734~~, the change buffer was disabled by default to prevent further corruption due to the change buffer.

Marko Mäkelä added a comment - 2022-04-20 06:25 jarkko.hyvarinen , thank you. While I do not know the exact stack trace, the amount of stack frames could match InnoDB change buffer merge, which may invoke btr_page_reorganize() , which would output the fatal error message if the page is corrupted. You wrote that you were able to break a crash loop by setting innodb_force_recovery=4 . Actually, in MariaDB Server 10.5, thanks to MDEV-19514 , that setting is equivalent to innodb_force_recovery=3 . Such a crash loop was reported earlier in MDEV-26977 . I would need a repeatable test case in order to analyze and fix this bug. We do not hit this in our internal testing, possibly because we typically run rather short tests with rather small amounts of data, and by design, some data flow patterns of the change buffer would only be exercised with larger amounts of data over a longer period. The fact that the change buffer is persistent and that corruption may remain unnoticed for a long time makes this hard. In MDEV-27734 , the change buffer was disabled by default to prevent further corruption due to the change buffer.

Marko Mäkelä added a comment - 2022-06-22 09:18

The reported code that would intentionally cause a crash was removed in ~~MDEV-13542~~.

Marko Mäkelä added a comment - 2022-06-22 09:18 The reported code that would intentionally cause a crash was removed in MDEV-13542 .

MariaDB Server

InnoDB slave crash / signal 6

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration