[MDEV-16866] InnoDB fails to start upon crash recovery with "[ERROR] InnoDB: Redo log crypto: failed to decrypt log block" Created: 2018-07-31  Updated: 2020-08-25  Resolved: 2019-06-12

Status: Closed
Project: MariaDB Server
Component/s: Encryption, Storage Engine - InnoDB
Affects Version/s: 10.1, 10.1.34, 10.1.39
Fix Version/s: 10.1.41

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 1
Labels: None


 Description   

The initial test flow was this:

  • start 10.1.34 server with the following options:

    `
    --no-defaults --file-key-management --file-key-management-filename=keys.txt --plugin-load-add=file_key_management.so --innodb-encrypt-tables --innodb-encrypt-log --innodb-encryption-threads=4 --aria-encrypt-tables=1 --encrypt-tmp-disk-tables=1 --encrypt-binlog --innodb-page-size=8K --innodb-compression-algorithm=none --loose-innodb-file-format=Barracuda --loose-innodb-file-per-table=1 --loose-max-statement-time=20 --loose-lock-wait-timeout=20 --loose-innodb-lock-wait-timeout=10 --loose-innodb_log_compressed_pages=on --loose-innodb-change-buffering=none
    

  • run a mix of DDL/DML on the server for a few minutes;
  • SIGKILL the server;
  • start the same server with the exact same options and additionally --innodb-force-recovery=3

InnoDB doesn't start:

10.1.34

2018-07-31 16:08:31 140279075214848 [Note] InnoDB:  innodb-page-size has been changed from the default value 16384 to 8192 .
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: The InnoDB memory heap is disabled
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Compressed tables use zlib 1.2.3
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Using Linux native AIO
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Using SSE crc32 instructions
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Completed initialization of buffer pool
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Highest supported file format is Barracuda.
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Starting crash recovery from checkpoint LSN=23790689
2018-07-31 16:08:31 140279075214848 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: redo log checkpoint: 0 [ chk key ]: [ 4 1 ] 
2018-07-31 16:08:32 140279075214848 [ERROR] InnoDB: Redo log crypto: failed to decrypt log block. Reason could be that requested key version is not found, required encryption key management plugin is not found or configured encryption algorithm and/or method does not match.
2018-07-31 16:08:32 140279075214848 [ERROR] Plugin 'InnoDB' init function returned error.
2018-07-31 16:08:32 140279075214848 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2018-07-31 16:08:32 140279075214848 [Note] Plugin 'FEEDBACK' is disabled.
2018-07-31 16:08:32 140279075214848 [ERROR] Unknown/unsupported storage engine: InnoDB
2018-07-31 16:08:32 140279075214848 [ERROR] Aborting
 
Error in my_thread_global_end(): 1 threads didn't exit

The key file is present and is the same for both starts.

The datadir and the key file are available, nothing is confidential, the datadir just doesn't fit into JIRA.



 Comments   
Comment by Elena Stepanova [ 2018-08-05 ]

The following has been observed on 10.2.16, might be related?

10.2.16

2018-08-05 18:20:13 139628292781952 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2018-08-05 18:20:13 139628292781952 [Note] InnoDB: Completed initialization of buffer pool
2018-08-05 18:20:13 139627563095808 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2018-08-05 18:20:13 139628292781952 [Note] InnoDB: Highest supported file format is Barracuda.
2018-08-05 18:20:13 139628292781952 [Note] InnoDB: Starting crash recovery from checkpoint LSN=34668996
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=16] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=17] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=18] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=19] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=20] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=21] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=22] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=23] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=24] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=25] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=26] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Recovered page [page id: space=16, page number=27] from the doublewrite buffer.
2018-08-05 18:20:14 139628292781952 [Note] InnoDB: Starting final batch to recover 1576 pages from redo log.
2018-08-05 18:20:14 139628036220672 [ERROR] InnoDB: Space id and page no stored in the page, read in are [page id: space=14, page number=1], should be [page id: space=16, page number=16]
2018-08-05 18:20:14 139628036220672 [ERROR] InnoDB: The page [page id: space=16, page number=16] in file './test/t004.ibd' cannot be decrypted.
2018-08-05 18:20:14 139628036220672 [Note] InnoDB: However key management plugin or used key_version 1 is not found or used encryption algorithm or method does not match.
2018-08-05 18:20:14 139628036220672 [Note] InnoDB: Marking tablespace as missing. You may drop this table or install correct key management plugin and key file.
2018-08-05 18:20:14 0x7efdaf7fe700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/rem/rem0rec.cc line 580
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/xtradbinnodb-recovery-modes/
InnoDB: about forcing recovery.
180805 18:20:14 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.2.16-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467246 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mysys/stacktrace.c:268(my_print_stacktrace)[0x557774f9c9ab]
sql/signal_handler.cc:168(handle_fatal_signal)[0x557774a6df65]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7efdbde42330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7efdbd495c37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7efdbd499028]
ut/ut0rbt.cc:461(rbt_eject_node)[0x55777483d88a]
rem/rem0rec.cc:580(rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**))[0x557774cb1686]
page/page0cur.cc:2267(page_cur_parse_delete_rec(unsigned char*, unsigned char*, buf_block_t*, dict_index_t*, mtr_t*))[0x557774c915d9]
log/log0recv.cc:1480(recv_parse_or_apply_log_rec_body)[0x557774c70912]
log/log0recv.cc:1900(recv_recover_page(bool, buf_block_t*))[0x557774c7290d]
buf/buf0buf.cc:6198(buf_page_io_complete(buf_page_t*, bool, bool))[0x557774db095c]
fil/fil0fil.cc:5291(fil_aio_wait(unsigned long))[0x557774e1fc62]
srv/srv0start.cc:334(io_handler_thread)[0x557774d21a98]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7efdbde3a184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7efdbd55cffd]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file at /home/travis/logs/current1_1/data/
Aborted (core dumped)

Comment by Elena Stepanova [ 2018-10-26 ]

https://travis-ci.org/elenst/travis-tests/jobs/446626672

10.2 76318d55aa0f5cb2ef5282538d8997b308c48204

2018-10-26 11:27:50 139718894253824 [ERROR] InnoDB: Space id and page no stored in the page, read in are [page id: space=48, page number=1], should be [page id: space=46, page number=30]

Comment by Marko Mäkelä [ 2018-10-29 ]

The problem that was initially reported for 10.1 appears to be that decrypting redo log fails.
Originally, the redo log encryption key rotation was broken, and it was disabled in 10.1.14 by MDEV-9840 and finally enabled in 10.4 by MDEV-12041.

The 10.2 problem could be duplicating MDEV-12699.

I would like to see the data set for 10.1.

Comment by Elena Stepanova [ 2018-10-29 ]

The datadir for the original 10.1 report is here: ftp://ftp.askmonty.org/public/mdev16866.tar.gz

The archive contains:

  • mdev16866/data – the datadir
  • mdev16866/keys.txt – the key file
  • cmd_options – command line options (for the reference)

Unpack into the basedir and start the current 10.1 server with

--datadir=`pwd`/mdev16866/data --file-key-management-filename=`pwd`/mdev16866/keys.txt --general-log --server-id=111 --file-key-management --plugin-load-add=file_key_management.so --innodb-encrypt-tables --innodb-encrypt-log --innodb-encryption-threads=4 --aria-encrypt-tables=1 --encrypt-tmp-disk-tables=1 --encrypt-binlog --innodb-page-size=8K --innodb-compression-algorithm=none --loose-innodb-file-format=Barracuda --loose-innodb-file-per-table=1 --loose-max-statement-time=20 --loose-lock-wait-timeout=20 --loose-innodb-lock-wait-timeout=10 --loose-innodb_log_compressed_pages=on --loose-innodb-change-buffering=none --innodb-force-recovery=3

Observe the error:

2018-10-29 13:42:06 139971190631360 [Note] InnoDB: Starting crash recovery from checkpoint LSN=23790689
2018-10-29 13:42:06 139971190631360 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: redo log checkpoint: 0 [ chk key ]: [ 4 1 ] 
2018-10-29 13:42:08 139971190631360 [ERROR] InnoDB: Redo log crypto: failed to decrypt log block. Reason could be that requested key version is not found, required encryption key management plugin is not found or configured encryption algorithm and/or method does not match.
2018-10-29 13:42:08 139971190631360 [ERROR] Plugin 'InnoDB' init function returned error.
2018-10-29 13:42:08 139971190631360 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2018-10-29 13:42:08 139971190631360 [Note] Plugin 'FEEDBACK' is disabled.
2018-10-29 13:42:08 139971190631360 [ERROR] Unknown/unsupported storage engine: InnoDB
2018-10-29 13:42:08 139971190631360 [ERROR] Aborting

Comment by Thirunarayanan Balathandayuthapani [ 2019-06-11 ]

I can able to repeat the issue from the given data directory. It looks like the recovery finds a zero-filled block and fails to decrypt it.
In 10.2, log_group_read_log_seg() has the condition check to whether the log block number matches with corresponding LSN.
That could happen when the server was killed while writing the redo log.

Comment by Marko Mäkelä [ 2019-06-11 ]

Looks OK to me.

Comment by Chris Calender (Inactive) [ 2019-09-05 ]

To be clear, this bug only affected 10.1, and not 10.2+.

Generated at Thu Feb 08 08:32:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.