[MDEV-25198] mariadb backup fails with signal 6 and signal 11 during prepare Created: 2021-03-19  Updated: 2021-09-30  Resolved: 2021-04-29

Status: Closed
Project: MariaDB Server
Component/s: mariabackup, Server, Storage Engine - InnoDB
Affects Version/s: 10.2.27
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Allen Lee (Inactive) Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-14229 Stack trace is not resolved for share... Closed
Relates
relates to MDEV-24449 Corruption of system tablespace or la... Closed

 Description   

A customer reported the following crash during the mariabackup --prepare stage.
The actual command used was:

/glide/mariadb/10.2.27snc2/bin/mariabackup --prepare --use-memory=32G --target-dir /glide/mysqld/acnnacioj_3402_peta/temp/restore_2021-03-16_0054271/accentureinternal_3400_db170016_s_2021-03-14_2210539 2> /glide/backup/log/restore.2021-03-16_0054271.accentureinternal_3400_db170016_s_2021-03-14_2210539.log

  • relevant parts of the configuration

    slave_parallel_threads=16
    #binlog_format=MIXED
    binlog_format=ROW
    slave_parallel_mode=conservative
    innodb_doublewrite=ON
    #innodb_max_dirty_pages_pct=5
    innodb_log_file_size=32768M
    innodb_change_buffer_max_size=5
    innodb_buffer_pool_instances=32
    innodb_purge_batch_size=5000
    table_definition_cache=524288
    table_open_cache=524288
    table_open_cache_instances=16
    back_log=1500
    innodb_adaptive_hash_index_partitions=16
    slave_compressed_protocol=OFF
    slave_parallel_max_queued=500000
    expire_logs_days=3
    innodb_max_dirty_pages_pct_lwm=50
    

  • error log

    2021-03-16 16:45:25 140315704616704 [Note] InnoDB: To recover: 532973 pages from log
    2021-03-16 16:45:40 140315713009408 [Note] InnoDB: To recover: 401586 pages from log
    2021-03-16 16:45:42 140315696224000 [ERROR] [FATAL] InnoDB: is_short 0, info_and_status_bits 0, offset 12896, o_offset 8, mismatch index 18446744073709551595, end_seg_len 33 parsed len 3
    210316 16:45:42 [ERROR] mysqld got signal 6 ;
    This could be because you hit a bug. It is also possible that this binary
    or one of the libraries it was linked against is corrupt, improperly built,
    or misconfigured. This error can also be caused by malfunctioning hardware.
     
    To report this bug, see https://mariadb.com/kb/en/reporting-bugs
     
    We will try our best to scrape up some info that will hopefully help
    diagnose the problem, but since we have already crashed,
    something is definitely wrong and this may fail.
     
    Server version: 10.2.27-MariaDB
    key_buffer_size=0
    read_buffer_size=131072
    max_used_connections=0
    max_threads=1
    thread_count=0
    It is possible that mysqld could use up to
    key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5419 K bytes of memory
    Hope that's ok; if not, decrease some variables in the equation.
     
    Thread pointer: 0x0
    Attempting backtrace. You can use the following information to find out
    where mysqld died. If you see no messages after this, something went
    terribly wrong...
    stack_bottom = 0x0 thread_stack 0x49000
    /glide/mariadb/10.2.27snc2/bin/mariabackup(my_print_stacktrace+0x2e)[0x5578aff47c7e]
    mysys/stacktrace.c:268(my_print_stacktrace)[0x5578af9ec54d]
    sigaction.c:0(__restore_rt)[0x7fa658b2e630]
    :0(__GI_raise)[0x7fa657938387]
    :0(__GI_abort)[0x7fa657939a78]
    ut/ut0ut.cc:645(ib::fatal::~fatal())[0x5578afd45823]
    page/page0cur.cc:1200(page_cur_parse_insert_rec(unsigned long, unsigned char const*, unsigned char const*, buf_block_t*, dict_index_t*, mtr_t*))[0x5578afc64f9b]
    log/log0recv.cc:1655(recv_parse_or_apply_log_rec_body(mlog_id_t, unsigned char*, unsigned char*, unsigned long, unsigned long, bool, buf_block_t*, mtr_t*))[0x5578afc431d1]
    log/log0recv.cc:2162(recv_recover_page(buf_block_t*, mtr_t&, recv_addr_t*, unsigned long))[0x5578afc43e19]
    log/log0recv.cc:2272(recv_recover_page(buf_page_t*))[0x5578af70278d]
    buf/buf0buf.cc:6164(buf_page_io_complete(buf_page_t*, bool, bool))[0x5578afb35b9c]
    fil/fil0fil.cc:5169(fil_aio_wait(unsigned long))[0x5578afb83461]
    srv/srv0start.cc:331(io_handler_thread)[0x5578afcf7da8]
    pthread_create.c:0(start_thread)[0x7fa658b26ea5]
    /lib64/libc.so.6(clone+0x6d)[0x7fa657a0096d]
    The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
    information that should help you find out what is causing the crash.
    Writing a core file...
    Working directory at /glide/mysqld/acnnacioj_3402_peta/temp/restore_2021-03-16_0054271/accentureinternal_3400_db170016_s_2021-03-14_2210539
    Resource Limits:
    Fatal signal 11 while backtracing
    



 Comments   
Comment by Marko Mäkelä [ 2021-03-19 ]

The crash within the stack trace reporter was probably caused by MDEV-14229. I suspect that the signal handler is invoking library functions that are not listed in man 7 signal-safety on GNU/Linux.

To analyze the original crash, we would need more information. Failures in recovery or mariabackup --prepare are usually deterministic, and thus there should be a good chance that the problem can be repeated with the original data files. It should be almost guaranteed if we had a copy of the files as they were before any recovery steps were applied.

I cannot say for sure, but I suspect that the bug that was fixed in MDEV-24449 can cause various kinds of corruption, both in the InnoDB system tablespace (any page in it) and in secondary index leaf pages of any tablespace. The output is from page_cur_parse_insert_rec(). When the mismatch_index is interpreted as a signed number, it is -21. The only plausible reason for that failure is that the index page contents are inconsistent with the redo log record. MDEV-24449 could explain it.

Does this occur with any server that was never corrupted by the bug that was fixed in MDEV-24449?

Generated at Thu Feb 08 09:35:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.