[MDEV-15110] InnoDB crash on recovery Created: 2018-01-29  Updated: 2020-08-25  Resolved: 2018-03-13

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.1.22, 10.1.30
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Hartmut Holzgraefe Assignee: Marko Mäkelä
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

linux, backup taken by Zmanda


Issue Links:
Relates
relates to MDEV-21347 innodb_log_optimize_ddl=OFF is not cr... Closed

 Description   

Startup error log:

mysql-maria-10.1.22 > $(pwd)/bin/mysqld --defaults-file=./my.cnf --user=mysql
2018-01-29 13:18:42 140584210446592 [Note] /usr/local/mysql-maria-10.1.22/bin/mysqld (mysqld 10.1.22-MariaDB) starting as process 12876 ...
2018-01-29 13:18:43 7fdc4fecb900 InnoDB: Warning: Using innodb_additional_mem_pool_size is DEPRECATED. This option may be removed in future releases, together with the option innodb_use_sys_malloc and with the InnoDB's internal memory allocator.
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: The InnoDB memory heap is disabled
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Compressed tables use zlib 1.2.8
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Using Linux native AIO
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Using SSE crc32 instructions
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Initializing buffer pool, size = 2.0G
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Completed initialization of buffer pool
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Highest supported file format is Barracuda.
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Starting crash recovery from checkpoint LSN=685398305
2018-01-29 13:18:43 140584210446592 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2018-01-29 13:18:44 140584210446592 [Note] InnoDB: Starting final batch to recover 23 pages from redo log
180129 13:18:44 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.22-MariaDB
key_buffer_size=402653184
read_buffer_size=2097152
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1023025 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48400
/usr/local/mysql-maria-10.1.22/bin/mysqld(my_print_stacktrace+0x29)[0x5635d695f549]
/usr/local/mysql-maria-10.1.22/bin/mysqld(handle_fatal_signal+0x2f5)[0x5635d6547895]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fdc4e37f390]
/usr/local/mysql-maria-10.1.22/bin/mysqld(+0x6fe605)[0x5635d66e6605]
/usr/local/mysql-maria-10.1.22/bin/mysqld(+0x70007e)[0x5635d66e807e]
/usr/local/mysql-maria-10.1.22/bin/mysqld(+0x6e5e70)[0x5635d66cde70]
/usr/local/mysql-maria-10.1.22/bin/mysqld(+0x6e7a13)[0x5635d66cfa13]
/usr/local/mysql-maria-10.1.22/bin/mysqld(+0x7c595f)[0x5635d67ad95f]
/usr/local/mysql-maria-10.1.22/bin/mysqld(+0x80dbd9)[0x5635d67f5bd9]
/usr/local/mysql-maria-10.1.22/bin/mysqld(+0x7612e8)[0x5635d67492e8]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fdc4e3756ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fdc4da2041d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file
Segmentation fault (core dumped)

GDB backtrace:

(gdb) bt
#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=11) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62
#1  0x000055985134d15a in my_write_core (sig=sig@entry=11) at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/mysys/stacktrace.c:477
#2  0x0000559850f35938 in handle_fatal_signal (sig=11) at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/sql/signal_handler.cc:296
#3  <signal handler called>
#4  mach_read_from_1 (b=0xfffffffffffffffb <error: Cannot access memory at address 0xfffffffffffffffb>)
    at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/include/mach0data.ic:55
#5  rec_get_bit_field_1 (shift=0, mask=15, offs=5, rec=0x0) at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/include/rem0rec.ic:169
#6  rec_get_n_owned_new (rec=0x0) at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/include/rem0rec.ic:549
#7  page_rec_find_owner_rec (rec=0x7fa035ec807d "") at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/include/page0page.ic:934
#8  page_cur_insert_rec_low (current_rec=0x7fa035ec8179 "", index=index@entry=0x7f9fd2833468, rec=0x7f9fdbffc295 "", offsets=<optimized out>, mtr=mtr@entry=0x7f9fdbffc7c0)
    at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/page/page0cur.cc:1152
#9  0x00005598510d607e in page_cur_rec_insert (mtr=0x7f9fdbffc7c0, offsets=<optimized out>, index=0x7f9fd2833468, rec=<optimized out>, cursor=0x7f9fdbffbf60)
    at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/include/page0cur.ic:326
#10 page_cur_parse_insert_rec (is_short=is_short@entry=0, ptr=0x7fa08e530883 <incomplete sequence \330>, end_ptr=end_ptr@entry=0x7fa08e53089d "", 
    block=block@entry=0x7fa025ff1390, index=0x7f9fd2833468, mtr=mtr@entry=0x7f9fdbffc7c0)
    at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/page/page0cur.cc:926
#11 0x00005598510bbe70 in recv_parse_or_apply_log_rec_body (type=<optimized out>, ptr=<optimized out>, ptr@entry=0x7fa08e530870 "", end_ptr=0x7fa08e53089d "", 
    block=block@entry=0x7fa025ff1390, mtr=mtr@entry=0x7f9fdbffc7c0, space_id=<optimized out>)
    at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/log/log0recv.cc:1175
#12 0x00005598510bda13 in recv_recover_page_func (just_read_in=just_read_in@entry=1, block=block@entry=0x7fa025ff1390)
    at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/log/log0recv.cc:1741
#13 0x000055985119b95f in buf_page_io_complete (bpage=0x7fa025ff1390) at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/buf/buf0buf.cc:4856
#14 0x00005598511e3bd9 in fil_aio_wait (segment=segment@entry=2) at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/fil/fil0fil.cc:6323
#15 0x00005598511372e8 in io_handler_thread (arg=<optimized out>) at /home/hartmut/projects/mariadb/releases/mariadb-10.1.22/storage/xtradb/srv/srv0start.cc:546
#16 0x00007fa0902f96ba in start_thread (arg=0x7f9fdbffd700) at pthread_create.c:333
#17 0x00007fa08f9a441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109



 Comments   
Comment by Elena Stepanova [ 2018-01-31 ]

Might be related to MDEV-14420 and/or MDEV-13192.

Comment by Marko Mäkelä [ 2018-02-02 ]

I do not think that this is related to MDEV-14420, because this fails in a non-debug build and it is a B-tree page, not page number 0.
This is also unrelated to MDEV-13192, because the table is not using ROW_FORMAT=COMPRESSED.
It is the clustered index root page (page number 3) of tablespace 3.
When applying the redo log record, the page is almost empty (no data after the "infimum" and "supremum").
We have 2565 redo log records to that page according to --debug=d,ib_log. The problematic record appears to be right at the start of the recovery:

2018-02-02 20:25:26 140737353852736 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
innobase_init: ib_log: scan 685398305: log rec 24 len 9 page 0:301
innobase_init: ib_log: scan 685398314: log rec 20 len 14 page 0:301
innobase_init: ib_log: scan 685398328: log rec 38 len 64 page 4:1353
innobase_init: ib_log: scan 685398392: multi-log rec 8 len 11 page 4:1301
innobase_init: ib_log: scan 685398392: multi-log rec 38 len 14 page 4:1301
innobase_init: ib_log: scan 685398392: multi-log rec 31 len 1 page 4:1301
innobase_init: ib_log: scan 685398418: log rec 20 len 23 page 0:301
innobase_init: ib_log: scan 685398441: log rec 38 len 48 page 3:3

This could be a transaction that is writing undo log to page 0:301 and then modifying two tables, in tablespaces 3 and 4. The table in tablespace 3 should be almost empty, because the data appears to have been in the root page. The table name is mysql.gtid_slave_pos.
Here is the page dump:

root page of mysql.gtid_slave_pos

000000 81 60 7b 75 00 00 00 03 ff ff ff ff ff ff ff ff
000010 00 00 00 00 00 18 a1 2f 45 bf 00 00 00 00 00 00
000020 00 00 00 00 00 03 00 02 00 78 80 02 00 00 00 00
000030 00 00 00 05 00 00 00 00 00 00 00 00 00 00 00 00
000040 00 00 00 00 00 00 00 00 00 13 00 00 00 03 00 00
000050 00 02 00 f2 00 00 00 03 00 00 00 02 00 32 01 00
000060 02 00 0d 69 6e 66 69 6d 75 6d 00 01 00 0b 00 00
000070 73 75 70 72 65 6d 75 6d 00 00 00 00 00 00 00 00
000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
003ff0 00 00 00 00 00 70 00 63 d0 27 09 cc 00 18 a1 2f
004000

Now, the problem is that the MLOG_COMP_PAGE_INSERT record (type 38) refers to the insert position 0x179 from the start of the page, but as you can see, the page is empty; it does not contain anything after the page supremum record at 0x70. The next-record pointer of the page infimum is pointing to the page supremum (the 16-bit big endian field at 0x61 contains the relative offset 0x0d).

I see only one set of ib_logfile* in the provided mysql.zip, so I do not think that could have used the wrong redo logs for the data files.

Above, we see that the 64-bit field FIL_PAGE_LSN at offset 0x10 is 0x18a12f, or 1,614,127. It is much less than the redo log LSN of 685,398,305. These numbers are roughly bytes written to the redo log. It is possible that the LSN 1,614,127 was written when the database was originally initialized. All the nonempty (first 4) pages of the file contain the same FIL_PAGE_LSN.

It seems to me that something must have gone wrong when creating the backup. It looks like a much too old version of the data files was copied or restored. It is also possible that in case a file system snapshot was used, it did not work correctly.

This does not look like a bug to me, but rather incorrect usage.

My wild guess is that someone tried to replace the gtid_slave_pos.ibd with an empty file from a newly initialized instance, to work around some problem. In that case, it would be a better idea to delete the file altogether, and let InnoDB skip any redo log for it.

Generated at Thu Feb 08 08:18:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.