Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-28976

InnoDB: Missing FILE_CHECKPOINT

    XMLWordPrintable

    Details

      Description

      A test frequently fails like this:

      10.10 63961a08a6203f4d58363a9321e4cf9c8b07a9fe

      encryption.innodb-redo-nokeys 'cbc,innodb' w2 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-28 18:48:49
      line
      2022-06-28 18:48:45 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(60709) at 1800619
      2022-06-28 18:48:45 0 [ERROR] InnoDB: Log scan aborted at LSN 1800619
      

      For the above failure on kvm-asan, data.tar.xz is a copy of the data directory. To my surprise, it recovered fine both when restored to /dev/shm (using mmap(2) based access to the log) and to some other directory (reading the log by pread(2)).

      To my surprise, the recovery would start from a much later LSN:

      2022-06-29  9:06:07 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1824911
      

      The 2 checkpoint blocks contain the following:

      001000 00 00 00 00 00 1b d8 8f 00 00 00 00 00 1d c3 81
      001010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      001030 00 00 00 00 00 00 00 00 00 00 00 00 27 3f 7c 88
      001040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      002000 00 00 00 00 00 00 ed 25 00 00 00 00 00 1b 79 13
      002010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      002030 00 00 00 00 00 00 00 00 00 00 00 00 3f 8a 1f cd
      002040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      003000
      

      That is, the first checkpoint is 0xed25 (60709) and the second one is 0x1bd88f (1824911). Even if I discard the second checkpoint, the recovery will succeed for me:

      tar xJf data.tar.xz
      od -Ax -t x1 -j 0x1000 -N 0x2000 data/ib_logfile0 
      mariadbd --innodb-log-file-size=10M --datadir $(pwd)/data/
      

      2022-06-29  9:13:27 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=60709
      2022-06-29  9:13:27 0 [Note] InnoDB: Starting final batch to recover 6 pages from redo log.
      2022-06-29  9:13:27 0 [Note] InnoDB: 128 rollback segments are active.
      2022-06-29  9:13:27 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
      2022-06-29  9:13:27 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
      2022-06-29  9:13:27 0 [Note] InnoDB: log sequence number 1952036; transaction id 67
      

      At the indicated LSN 1800619 (0x1b79ab) of the failure message, I see the end of a mini-transaction that the checkpoint pointed to (0x1b7913). In the MDEV-14425 log format, the LSN conveniently is equal to the byte offset in the ib_logfile0 that was created at database initialization and did not wrap around yet. Here is the mini-transaction in question:

      od -Ax -t x1 -t a -j 0x1b7913 -N 152 data/ib_logfile0
      

      1b7913  b0  01  05  00  2e  2f  74  65  73  74  2f  74  31  2e  69  62
               0 soh enq nul   .   /   t   e   s   t   /   t   1   .   i   b
      1b7923  64  b0  12  01  00  2e  2f  6d  79  73  71  6c  2f  69  6e  6e
               d   0 dc2 soh nul   .   /   m   y   s   q   l   /   i   n   n
      1b7933  6f  64  62  5f  74  61  62  6c  65  5f  73  74  61  74  73  2e
               o   d   b   _   t   a   b   l   e   _   s   t   a   t   s   .
      1b7943  69  62  64  b0  12  02  00  2e  2f  6d  79  73  71  6c  2f  69
               i   b   d   0 dc2 stx nul   .   /   m   y   s   q   l   /   i
      1b7953  6e  6e  6f  64  62  5f  69  6e  64  65  78  5f  73  74  61  74
               n   n   o   d   b   _   i   n   d   e   x   _   s   t   a   t
      1b7963  73  2e  69  62  64  b0  01  06  00  2e  2f  74  65  73  74  2f
               s   .   i   b   d   0 soh ack nul   .   /   t   e   s   t   /
      1b7973  74  32  2e  69  62  64  b0  01  07  00  2e  2f  74  65  73  74
               t   2   .   i   b   d   0 soh bel nul   .   /   t   e   s   t
      1b7983  2f  74  33  2e  69  62  64  b0  01  08  00  2e  2f  74  65  73
               /   t   3   .   i   b   d   0 soh  bs nul   .   /   t   e   s
      1b7993  74  2f  74  34  2e  69  62  64  fa  00  00  00  00  00  00  00
               t   /   t   4   .   i   b   d   z nul nul nul nul nul nul nul
      1b79a3  00  ed  25  01  9e  41  cb  2d
             nul   m   % soh  rs   A   K   -
      1b79ab
      

      This must have been written by fil_names_clear(). We have a number of FILE_MODIFY records of files that were modified since the checkpoint (0xb0, encoded length, encoded tablespace ID, 0, file name), and a FILE_CHECKPOINT record (0xfa, 0, 0, big-endian 64-bit LSN=0xed25), and the end-of-mini-transaction marker (sequence bit=1) and the CRC-32C checksum 0x9e41cb2d. Everything looks perfectly fine.

      Most failures occur on FreeBSD, but the above is the first case where a copy of the data directory is available yet. Here are two failures for 10.5:

      10.5 35f2cdcb99f4f8b39895070fde3f825329106361

      encryption.innodb-redo-nokeys 'cbc,innodb' w3 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-23 04:51:11
      line
      2022-06-23  4:51:08 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1381251 between the checkpoint 52621 and the end 1446400.
      ^ Found warnings in /tmp/var/3/log/mysqld.1.err
      encryption.innodb-redo-nokeys 'ctr,innodb' w4 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-23 04:53:50
      line
      2022-06-23  4:53:47 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1171764 between the checkpoint 52621 and the end 1236992.
      ^ Found warnings in /tmp/var/4/log/mysqld.1.err
      

      In each case, it seems possible that the latest checkpoint was from the shutdown of the server bootstrap.

      Sergei Golubchik, to me, this looks like a race condition in mtr. I suspect that the new server process was started while the old one is still being killed by the test (while writing a checkpoint). Before MDEV-24393, such a race would have been prevented by advisory locks on the files. If it is the case, fixing MDEV-28662 should fix this bug.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              marko Marko Mäkelä
              Reporter:
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.