Details
Description
A test frequently fails like this:
10.10 63961a08a6203f4d58363a9321e4cf9c8b07a9fe
encryption.innodb-redo-nokeys 'cbc,innodb' w2 [ fail ]  Found warnings/errors in server log file!
Test ended at 2022-06-28 18:48:49
line
2022-06-28 18:48:45 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(60709) at 1800619
2022-06-28 18:48:45 0 [ERROR] InnoDB: Log scan aborted at LSN 1800619
For the above failure on kvm-asan, data.tar.xz is a copy of the data directory. To my surprise, it recovered fine both when restored to /dev/shm (using mmap(2)-based access to the log) and when restored to some other directory (reading the log with pread(2)). Moreover, recovery would start from a much later LSN:
2022-06-29 9:06:07 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1824911
The two checkpoint blocks contain the following:
001000 00 00 00 00 00 1b d8 8f 00 00 00 00 00 1d c3 81
001010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001030 00 00 00 00 00 00 00 00 00 00 00 00 27 3f 7c 88
001040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
002000 00 00 00 00 00 00 ed 25 00 00 00 00 00 1b 79 13
002010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
002030 00 00 00 00 00 00 00 00 00 00 00 00 3f 8a 1f cd
002040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
003000
That is, the first checkpoint is 0xed25 (60709) and the second one is 0x1bd88f (1824911). Even if I discard the second checkpoint, the recovery will succeed for me:
tar xJf data.tar.xz
od -Ax -t x1 -j 0x1000 -N 0x2000 data/ib_logfile0
mariadbd --innodb-log-file-size=10M --datadir $(pwd)/data/
2022-06-29 9:13:27 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=60709
2022-06-29 9:13:27 0 [Note] InnoDB: Starting final batch to recover 6 pages from redo log.
2022-06-29 9:13:27 0 [Note] InnoDB: 128 rollback segments are active.
2022-06-29 9:13:27 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2022-06-29 9:13:27 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2022-06-29 9:13:27 0 [Note] InnoDB: log sequence number 1952036; transaction id 67
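To cross-check dumps like the one above, the two checkpoint blocks can be decoded with a short script. The field layout here is only my reading of the dump (a big-endian 64-bit checkpoint LSN at offset 0 of each block, a second big-endian 64-bit field after it, and what appears to be a CRC-32C stored at offset 0x3c of the block), not an authoritative description of the format:

```python
import os
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def read_checkpoint(block: bytes):
    """Decode one 64-byte checkpoint header (field layout assumed, see above)."""
    checkpoint_lsn, second_field = struct.unpack_from('>QQ', block, 0)
    stored_crc, = struct.unpack_from('>I', block, 60)
    return checkpoint_lsn, second_field, stored_crc, crc32c(block[:60])

# Only runs if the extracted data directory from data.tar.xz is present.
if os.path.exists('data/ib_logfile0'):
    with open('data/ib_logfile0', 'rb') as f:
        for offset in (0x1000, 0x2000):
            f.seek(offset)
            lsn, second, stored, computed = read_checkpoint(f.read(64))
            print(f'block at {offset:#x}: checkpoint LSN={lsn} ({lsn:#x}), '
                  f'CRC {"ok" if stored == computed else "mismatch"}')
```

Applied to the dump above, such a decoder would report 1824911 (0x1bd88f) for the block at 0x1000 and 60709 (0xed25) for the block at 0x2000.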
At LSN 1800619 (0x1b79ab), which the failure message indicates, I see the end of the mini-transaction that the checkpoint pointed to (starting at 0x1b7913). In the MDEV-14425 log format, the LSN is conveniently equal to the byte offset in an ib_logfile0 that was created at database initialization and has not wrapped around yet. Here is the mini-transaction in question:
od -Ax -t x1 -t a -j 0x1b7913 -N 152 data/ib_logfile0
1b7913 b0 01 05 00 2e 2f 74 65 73 74 2f 74 31 2e 69 62
        0 soh enq nul . / t e s t / t 1 . i b
1b7923 64 b0 12 01 00 2e 2f 6d 79 73 71 6c 2f 69 6e 6e
        d 0 dc2 soh nul . / m y s q l / i n n
1b7933 6f 64 62 5f 74 61 62 6c 65 5f 73 74 61 74 73 2e
        o d b _ t a b l e _ s t a t s .
1b7943 69 62 64 b0 12 02 00 2e 2f 6d 79 73 71 6c 2f 69
        i b d 0 dc2 stx nul . / m y s q l / i
1b7953 6e 6e 6f 64 62 5f 69 6e 64 65 78 5f 73 74 61 74
        n n o d b _ i n d e x _ s t a t
1b7963 73 2e 69 62 64 b0 01 06 00 2e 2f 74 65 73 74 2f
        s . i b d 0 soh ack nul . / t e s t /
1b7973 74 32 2e 69 62 64 b0 01 07 00 2e 2f 74 65 73 74
        t 2 . i b d 0 soh bel nul . / t e s t
1b7983 2f 74 33 2e 69 62 64 b0 01 08 00 2e 2f 74 65 73
        / t 3 . i b d 0 soh bs nul . / t e s
1b7993 74 2f 74 34 2e 69 62 64 fa 00 00 00 00 00 00 00
        t / t 4 . i b d z nul nul nul nul nul nul nul
1b79a3 00 ed 25 01 9e 41 cb 2d
        nul m % soh rs A K -
1b79ab
This must have been written by fil_names_clear(). We have a number of FILE_MODIFY records of files that were modified since the checkpoint (0xb0, encoded length, encoded tablespace ID, 0, file name), and a FILE_CHECKPOINT record (0xfa, 0, 0, big-endian 64-bit LSN=0xed25), and the end-of-mini-transaction marker (sequence bit=1) and the CRC-32C checksum 0x9e41cb2d. Everything looks perfectly fine.
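The FILE_CHECKPOINT record can be picked out of such a dump mechanically. The scanner below is only a heuristic based on the byte pattern visible above (0xfa, two 0x00 bytes, then a big-endian 64-bit LSN); it is not a full parser of the record format:

```python
import struct

def find_file_checkpoint(mtr: bytes):
    """Heuristically locate a FILE_CHECKPOINT record in mini-transaction bytes.

    Looks for the byte pattern seen in the dump above: 0xfa, 0x00, 0x00,
    followed by a big-endian 64-bit LSN.  Returns the LSN, or None if the
    pattern is absent or truncated.
    """
    pos = mtr.find(b'\xfa\x00\x00')
    if pos < 0 or pos + 11 > len(mtr):
        return None
    return struct.unpack_from('>Q', mtr, pos + 3)[0]

# The tail of the mini-transaction above, reassembled from the dump.
tail = bytes.fromhex('fa0000000000000000ed25')
print(hex(find_file_checkpoint(tail)))  # 0xed25, i.e. checkpoint LSN 60709
```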
Most failures occur on FreeBSD, but the above is the first case for which a copy of the data directory is available. Here are two failures for 10.5:
10.5 35f2cdcb99f4f8b39895070fde3f825329106361
encryption.innodb-redo-nokeys 'cbc,innodb' w3 [ fail ]  Found warnings/errors in server log file!
Test ended at 2022-06-23 04:51:11
line
2022-06-23 4:51:08 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1381251 between the checkpoint 52621 and the end 1446400.
^ Found warnings in /tmp/var/3/log/mysqld.1.err
…
encryption.innodb-redo-nokeys 'ctr,innodb' w4 [ fail ]  Found warnings/errors in server log file!
Test ended at 2022-06-23 04:53:50
line
2022-06-23 4:53:47 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1171764 between the checkpoint 52621 and the end 1236992.
^ Found warnings in /tmp/var/4/log/mysqld.1.err
In each case, it seems possible that the latest checkpoint was from the shutdown of the server bootstrap.
serg, to me this looks like a race condition in mtr. I suspect that the new server process was started while the old one was still being killed by the test (while writing a checkpoint). Before MDEV-24393, such a race would have been prevented by advisory locks on the files. If that is the case, fixing MDEV-28662 should fix this bug.
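For illustration, this is the kind of advisory locking whose removal is at issue, here sketched with flock(2) rather than the fcntl(2) record locks that InnoDB actually uses; the point is only that a second attempt to lock the same data file fails immediately instead of the file being silently shared:

```python
import fcntl
import os

def try_lock(path: str):
    """Take an exclusive, non-blocking advisory lock on a data file.

    Returns the open file descriptor on success, or None if another holder
    (even another fd in the same process, with flock semantics) has the
    lock.  This mimics the effect of InnoDB's os_file_lock(), not its
    implementation.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except OSError:
        os.close(fd)
        return None
```

With such a lock in place, the second server in the suspected mtr race would refuse to start instead of overwriting the log behind the first one's back.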
Issue Links
is caused by:
- MDEV-24393 table_open_cache creates same number of advisory /proc/locks. Is it worth it? (Closed)
relates to:
- MDEV-28495 Corruption due to attempting to start up multiple servers on the same data (Closed)
- MDEV-31568 innodb protection against dual processes accessing data insufficient (Closed)
- MDEV-28662 [ERROR] mysqld got signal 11 after updating from 10.4.8 to 10.4.24 (Closed)
An attempted fix of MDEV-28495 confirmed my hypothesis that the old, supposedly-to-be-killed server was still running:
bb-10.3-MDEV-28495 66d7599c2792555b52c570c3047bea310e59fc3b
encryption.innodb-read-only 'cbc,innodb' w3 [ fail ]
Test ended at 2022-07-27 13:41:59
CURRENT_TEST: encryption.innodb-read-only
Server [mysqld.1 - pid: 643356, winpid: 643356, exit: 256] failed during test run
…
2022-07-27 13:41:58 0 [Note] InnoDB: Completed initialization of buffer pool
2022-07-27 13:41:58 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 9
2022-07-27 13:41:58 0x7f409aaa5d80 InnoDB: Assertion failure in file /mariadb/10.3/storage/innobase/fil/fil0fil.cc line 1781
InnoDB: Failing assertion: !os_file_lock(node->handle, node->name)
I think that I will add some retry loop there, like we did when InnoDB did not obey --skip-external-locking. It is not ideal that the test harness is not reliable when it comes to killing and restarting processes, but since that is outside my control, I think that it is easiest to work around that deficiency in the InnoDB startup code.
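A minimal sketch of the kind of retry loop meant here; the try_lock callable, the attempt count, and the delay are placeholders for illustration, not the actual InnoDB patch:

```python
import time

def lock_with_retry(try_lock, path, attempts=60, delay=1.0):
    """Retry acquiring the data-file lock, giving a predecessor process
    that is still being killed time to release it.

    try_lock is any callable returning a handle on success and None on
    failure (a hypothetical interface standing in for os_file_lock()).
    """
    for attempt in range(1, attempts + 1):
        handle = try_lock(path)
        if handle is not None:
            return handle
        time.sleep(delay)
    raise RuntimeError(f'could not lock {path} after {attempts} attempts')
```

The loop converts the transient "file still locked by a dying process" state into a bounded wait, and only gives up with a hard error once the predecessor has clearly not gone away.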