    Description

      A test frequently fails like this:

      10.10 63961a08a6203f4d58363a9321e4cf9c8b07a9fe

      encryption.innodb-redo-nokeys 'cbc,innodb' w2 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-28 18:48:49
      line
      2022-06-28 18:48:45 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(60709) at 1800619
      2022-06-28 18:48:45 0 [ERROR] InnoDB: Log scan aborted at LSN 1800619
      

      For the above failure on kvm-asan, data.tar.xz is a copy of the data directory. To my surprise, it recovered fine both when restored to /dev/shm (using mmap(2)-based access to the log) and to some other directory (reading the log by pread(2)).

      Moreover, the recovery started from a much later LSN:

      2022-06-29  9:06:07 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1824911
      

      The two checkpoint blocks contain the following:

      001000 00 00 00 00 00 1b d8 8f 00 00 00 00 00 1d c3 81
      001010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      001030 00 00 00 00 00 00 00 00 00 00 00 00 27 3f 7c 88
      001040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      002000 00 00 00 00 00 00 ed 25 00 00 00 00 00 1b 79 13
      002010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      002030 00 00 00 00 00 00 00 00 00 00 00 00 3f 8a 1f cd
      002040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      003000
      
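      For reference, the fields of these blocks can be extracted with a short script. The following is a minimal sketch, assuming the layout suggested by the dump: an 8-byte big-endian checkpoint LSN at the start of each block (file offsets 0x1000 and 0x2000), followed by another 8-byte LSN that appears to be the log position at the time the checkpoint was written. The field names are mine, not the server's.

      # Sketch only: reads the two checkpoint blocks of an MDEV-14425 format
      # ib_logfile0 under the layout assumptions stated above.
      def read_checkpoint_blocks(path="data/ib_logfile0"):
          with open(path, "rb") as f:
              for offset in (0x1000, 0x2000):
                  f.seek(offset)
                  block = f.read(64)
                  checkpoint_lsn = int.from_bytes(block[0:8], "big")
                  paired_lsn = int.from_bytes(block[8:16], "big")
                  print(f"block at {offset:#x}: checkpoint LSN={checkpoint_lsn}"
                        f" ({checkpoint_lsn:#x}), paired LSN {paired_lsn:#x}")

      read_checkpoint_blocks()

      Recovery evidently prefers the block with the larger checkpoint LSN, which matches the recovery above starting from LSN=1824911.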

      That is, the block at 0x1000 holds the newer checkpoint, LSN 0x1bd88f (1824911), and the block at 0x2000 holds the older one, LSN 0xed25 (60709). Even if I discard the newer checkpoint, the recovery will succeed for me:

      # extract the copy of the data directory
      tar xJf data.tar.xz
      # dump the two checkpoint blocks (0x1000 and 0x2000)
      od -Ax -t x1 -j 0x1000 -N 0x2000 data/ib_logfile0
      # attempt recovery with the original redo log size
      mariadbd --innodb-log-file-size=10M --datadir $(pwd)/data/
      

      2022-06-29  9:13:27 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=60709
      2022-06-29  9:13:27 0 [Note] InnoDB: Starting final batch to recover 6 pages from redo log.
      2022-06-29  9:13:27 0 [Note] InnoDB: 128 rollback segments are active.
      2022-06-29  9:13:27 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
      2022-06-29  9:13:27 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
      2022-06-29  9:13:27 0 [Note] InnoDB: log sequence number 1952036; transaction id 67
      

      At LSN 1800619 (0x1b79ab), which the failure message indicates, I see the end of the mini-transaction that the checkpoint pointed to (starting at 0x1b7913). In the MDEV-14425 log format, the LSN is conveniently equal to the byte offset in an ib_logfile0 that was created at database initialization and has not wrapped around yet. Here is the mini-transaction in question:

      od -Ax -t x1 -t a -j 0x1b7913 -N 152 data/ib_logfile0
      

      1b7913  b0  01  05  00  2e  2f  74  65  73  74  2f  74  31  2e  69  62
               0 soh enq nul   .   /   t   e   s   t   /   t   1   .   i   b
      1b7923  64  b0  12  01  00  2e  2f  6d  79  73  71  6c  2f  69  6e  6e
               d   0 dc2 soh nul   .   /   m   y   s   q   l   /   i   n   n
      1b7933  6f  64  62  5f  74  61  62  6c  65  5f  73  74  61  74  73  2e
               o   d   b   _   t   a   b   l   e   _   s   t   a   t   s   .
      1b7943  69  62  64  b0  12  02  00  2e  2f  6d  79  73  71  6c  2f  69
               i   b   d   0 dc2 stx nul   .   /   m   y   s   q   l   /   i
      1b7953  6e  6e  6f  64  62  5f  69  6e  64  65  78  5f  73  74  61  74
               n   n   o   d   b   _   i   n   d   e   x   _   s   t   a   t
      1b7963  73  2e  69  62  64  b0  01  06  00  2e  2f  74  65  73  74  2f
               s   .   i   b   d   0 soh ack nul   .   /   t   e   s   t   /
      1b7973  74  32  2e  69  62  64  b0  01  07  00  2e  2f  74  65  73  74
               t   2   .   i   b   d   0 soh bel nul   .   /   t   e   s   t
      1b7983  2f  74  33  2e  69  62  64  b0  01  08  00  2e  2f  74  65  73
               /   t   3   .   i   b   d   0 soh  bs nul   .   /   t   e   s
      1b7993  74  2f  74  34  2e  69  62  64  fa  00  00  00  00  00  00  00
               t   /   t   4   .   i   b   d   z nul nul nul nul nul nul nul
      1b79a3  00  ed  25  01  9e  41  cb  2d
             nul   m   % soh  rs   A   K   -
      1b79ab
      

      This must have been written by fil_names_clear(). We have a number of FILE_MODIFY records for files that were modified since the checkpoint (0xb0, encoded length, encoded tablespace ID, 0, file name), followed by a FILE_CHECKPOINT record (0xfa, 0, 0, big-endian 64-bit LSN=0xed25), the end-of-mini-transaction marker (sequence bit=1), and the CRC-32C checksum 0x9e41cb2d. Everything looks perfectly fine.
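      To cross-check this interpretation, the records can be walked with a small decoder. Below is a minimal sketch, assuming the encoding described above: the high nibble of the first byte is the record type, the low nibble is the remaining record length, and a low nibble of 0 means that a length byte follows and the remaining length (counting that byte) is 15 plus its value. It also assumes single-byte tablespace IDs, which holds for this small data set but not in general; this is an illustration, not the server's parser.

      def decode_mtr(buf):
          """Walk the records of one mini-transaction (sketch; see caveats)."""
          pos = 0
          while pos < len(buf):
              first = buf[pos]
              pos += 1
              rtype = first & 0xf0
              rlen = first & 0x0f
              if rlen == 0:
                  # extended length: 15 + next byte, counting the length byte,
                  # which we consume here
                  rlen = 15 + buf[pos] - 1
                  pos += 1
              payload = buf[pos:pos + rlen]
              pos += rlen
              if rtype == 0xb0:    # FILE_MODIFY: tablespace ID, 0, file name
                  print(f"FILE_MODIFY space={payload[0]}"
                        f" name={payload[2:].decode()}")
              elif rtype == 0xf0:  # FILE_CHECKPOINT: 0, 0, 8-byte LSN
                  lsn = int.from_bytes(payload[2:10], "big")
                  print(f"FILE_CHECKPOINT lsn={lsn} ({lsn:#x})")
                  break            # the sequence-bit byte and CRC-32C follow

      with open("data/ib_logfile0", "rb") as f:
          f.seek(0x1b7913)
          decode_mtr(f.read(152))

      Against the dump above, this should print the six FILE_MODIFY file names followed by FILE_CHECKPOINT lsn=60709 (0xed25).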

      Most failures occur on FreeBSD, but the above is the first case for which a copy of the data directory is available. Here are two failures for 10.5:

      10.5 35f2cdcb99f4f8b39895070fde3f825329106361

      encryption.innodb-redo-nokeys 'cbc,innodb' w3 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-23 04:51:11
      line
      2022-06-23  4:51:08 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1381251 between the checkpoint 52621 and the end 1446400.
      ^ Found warnings in /tmp/var/3/log/mysqld.1.err
      encryption.innodb-redo-nokeys 'ctr,innodb' w4 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-23 04:53:50
      line
      2022-06-23  4:53:47 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1171764 between the checkpoint 52621 and the end 1236992.
      ^ Found warnings in /tmp/var/4/log/mysqld.1.err
      

      In each case, it seems possible that the latest checkpoint was written during the shutdown at the end of the server bootstrap.

      serg, to me this looks like a race condition in mtr. I suspect that the new server process was started while the old one was still being killed by the test (while it was writing a checkpoint). Before MDEV-24393, such a race would have been prevented by advisory locks on the files. If that is the case, fixing MDEV-28662 should fix this bug.

    Activity

            An attempted fix of MDEV-28495 confirmed my hypothesis that the old, supposedly-to-be-killed server was still running:

            bb-10.3-MDEV-28495 66d7599c2792555b52c570c3047bea310e59fc3b

            encryption.innodb-read-only 'cbc,innodb' w3 [ fail ]
                    Test ended at 2022-07-27 13:41:59
             
            CURRENT_TEST: encryption.innodb-read-only
             
             
            Server [mysqld.1 - pid: 643356, winpid: 643356, exit: 256] failed during test run
            2022-07-27 13:41:58 0 [Note] InnoDB: Completed initialization of buffer pool
            2022-07-27 13:41:58 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 9
            2022-07-27 13:41:58 0x7f409aaa5d80  InnoDB: Assertion failure in file /mariadb/10.3/storage/innobase/fil/fil0fil.cc line 1781
            InnoDB: Failing assertion: !os_file_lock(node->handle, node->name)
            

            I think that I will add a retry loop there, like we did when InnoDB did not obey --skip-external-locking (see the sketch after this comment). It is not ideal that the test harness is not reliable when it comes to killing and restarting processes, but since that is outside my control, I think that it is easiest to work around that deficiency in the InnoDB startup code.

            marko Marko Mäkelä added a comment
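            For illustration, here is the retry idea expressed in Python terms; the actual change would be in InnoDB's C++ startup code around os_file_lock(), and the names and parameters below are purely illustrative.

            import errno
            import fcntl
            import time

            def lock_with_retry(fd, attempts=100, delay=0.1):
                """Retry a non-blocking advisory lock until the previous
                (being-killed) process releases it, instead of failing at once."""
                for _ in range(attempts):
                    try:
                        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                        return True
                    except OSError as e:
                        if e.errno not in (errno.EACCES, errno.EAGAIN):
                            raise    # a real error, not "lock still held"
                        time.sleep(delay)
                return False         # the old process never released the file

            with open("data/ibdata1", "r+b") as f:
                if not lock_with_retry(f.fileno()):
                    raise SystemExit("another server still holds the lock")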

            Already in 10.3, several kill-and-restart tests were affected by this, also outside the encryption suite. Apparently, the probability of a log checkpoint occurring near the server kill is much higher in the encryption tests, which write a relatively large amount of data right before the server is killed.

            marko Marko Mäkelä added a comment

            MDEV-28495 should hopefully fix this, by ensuring (via the lock on the system tablespace file) that the being-killed server process has actually been killed.

            marko Marko Mäkelä added a comment

            An additional fix could help with those cases that are not ‘rescued’ by InnoDB advisory file locks. I suspect that some Spider tests could have failed due to a new server being started before the killed server has properly terminated.

            marko Marko Mäkelä added a comment - edited

            MDEV-31568 was filed for the observation that the advisory file locking on (only) the InnoDB system tablespace does not work, starting with the MariaDB Server 10.5 release series.

            marko Marko Mäkelä added a comment
