    Description

      A test frequently fails like this:

      10.10 63961a08a6203f4d58363a9321e4cf9c8b07a9fe

      encryption.innodb-redo-nokeys 'cbc,innodb' w2 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-28 18:48:49
      line
      2022-06-28 18:48:45 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT(60709) at 1800619
      2022-06-28 18:48:45 0 [ERROR] InnoDB: Log scan aborted at LSN 1800619
      

      For the above failure on kvm-asan, data.tar.xz is a copy of the data directory. To my surprise, it recovered fine both when restored to /dev/shm (using mmap(2)-based access to the log) and to some other directory (reading the log by pread(2)).

      Moreover, the recovery started from a much later LSN:

      2022-06-29  9:06:07 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1824911
      

      The two checkpoint blocks contain the following:

      001000 00 00 00 00 00 1b d8 8f 00 00 00 00 00 1d c3 81
      001010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      001030 00 00 00 00 00 00 00 00 00 00 00 00 27 3f 7c 88
      001040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      002000 00 00 00 00 00 00 ed 25 00 00 00 00 00 1b 79 13
      002010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      002030 00 00 00 00 00 00 00 00 00 00 00 00 3f 8a 1f cd
      002040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      003000
      
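      For reference, the fields of these blocks can be extracted with a short script. The following is a minimal sketch, assuming the layout suggested by the dump: an 8-byte big-endian checkpoint LSN at the start of each block (file offsets 0x1000 and 0x2000), followed by another 8-byte LSN that appears to be the log position at the time the checkpoint was written. The field names are mine, not the server's.

      # Sketch only: reads the two checkpoint blocks of an MDEV-14425 format
      # ib_logfile0 under the layout assumptions stated above.
      def read_checkpoint_blocks(path="data/ib_logfile0"):
          with open(path, "rb") as f:
              for offset in (0x1000, 0x2000):
                  f.seek(offset)
                  block = f.read(64)
                  checkpoint_lsn = int.from_bytes(block[0:8], "big")
                  paired_lsn = int.from_bytes(block[8:16], "big")
                  print(f"block at {offset:#x}: checkpoint LSN={checkpoint_lsn}"
                        f" ({checkpoint_lsn:#x}), paired LSN {paired_lsn:#x}")

      read_checkpoint_blocks()

      Recovery evidently prefers the block with the larger checkpoint LSN, which matches the recovery above starting from LSN=1824911.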

      That is, the block at 0x1000 holds the newer checkpoint, LSN 0x1bd88f (1824911), and the block at 0x2000 holds the older one, LSN 0xed25 (60709). Even if I discard the newer checkpoint, the recovery will succeed for me:

      # extract the copy of the data directory
      tar xJf data.tar.xz
      # dump the two checkpoint blocks (0x1000 and 0x2000)
      od -Ax -t x1 -j 0x1000 -N 0x2000 data/ib_logfile0
      # attempt recovery with the original redo log size
      mariadbd --innodb-log-file-size=10M --datadir $(pwd)/data/
      

      2022-06-29  9:13:27 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=60709
      2022-06-29  9:13:27 0 [Note] InnoDB: Starting final batch to recover 6 pages from redo log.
      2022-06-29  9:13:27 0 [Note] InnoDB: 128 rollback segments are active.
      2022-06-29  9:13:27 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
      2022-06-29  9:13:27 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
      2022-06-29  9:13:27 0 [Note] InnoDB: log sequence number 1952036; transaction id 67
      

      At LSN 1800619 (0x1b79ab), which the failure message indicates, I see the end of the mini-transaction that the checkpoint pointed to (starting at 0x1b7913). In the MDEV-14425 log format, the LSN is conveniently equal to the byte offset in an ib_logfile0 that was created at database initialization and has not wrapped around yet. Here is the mini-transaction in question:

      od -Ax -t x1 -t a -j 0x1b7913 -N 152 data/ib_logfile0
      

      1b7913  b0  01  05  00  2e  2f  74  65  73  74  2f  74  31  2e  69  62
               0 soh enq nul   .   /   t   e   s   t   /   t   1   .   i   b
      1b7923  64  b0  12  01  00  2e  2f  6d  79  73  71  6c  2f  69  6e  6e
               d   0 dc2 soh nul   .   /   m   y   s   q   l   /   i   n   n
      1b7933  6f  64  62  5f  74  61  62  6c  65  5f  73  74  61  74  73  2e
               o   d   b   _   t   a   b   l   e   _   s   t   a   t   s   .
      1b7943  69  62  64  b0  12  02  00  2e  2f  6d  79  73  71  6c  2f  69
               i   b   d   0 dc2 stx nul   .   /   m   y   s   q   l   /   i
      1b7953  6e  6e  6f  64  62  5f  69  6e  64  65  78  5f  73  74  61  74
               n   n   o   d   b   _   i   n   d   e   x   _   s   t   a   t
      1b7963  73  2e  69  62  64  b0  01  06  00  2e  2f  74  65  73  74  2f
               s   .   i   b   d   0 soh ack nul   .   /   t   e   s   t   /
      1b7973  74  32  2e  69  62  64  b0  01  07  00  2e  2f  74  65  73  74
               t   2   .   i   b   d   0 soh bel nul   .   /   t   e   s   t
      1b7983  2f  74  33  2e  69  62  64  b0  01  08  00  2e  2f  74  65  73
               /   t   3   .   i   b   d   0 soh  bs nul   .   /   t   e   s
      1b7993  74  2f  74  34  2e  69  62  64  fa  00  00  00  00  00  00  00
               t   /   t   4   .   i   b   d   z nul nul nul nul nul nul nul
      1b79a3  00  ed  25  01  9e  41  cb  2d
             nul   m   % soh  rs   A   K   -
      1b79ab
      

      This must have been written by fil_names_clear(). We have a number of FILE_MODIFY records for files that were modified since the checkpoint (0xb0, encoded length, encoded tablespace ID, 0, file name), followed by a FILE_CHECKPOINT record (0xfa, 0, 0, big-endian 64-bit LSN=0xed25), the end-of-mini-transaction marker (sequence bit=1), and the CRC-32C checksum 0x9e41cb2d. Everything looks perfectly fine.
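      To cross-check this interpretation, the records can be walked with a small decoder. Below is a minimal sketch, assuming the encoding described above: the high nibble of the first byte is the record type, the low nibble is the remaining record length, and a low nibble of 0 means that a length byte follows and the remaining length (counting that byte) is 15 plus its value. It also assumes single-byte tablespace IDs, which holds for this small data set but not in general; this is an illustration, not the server's parser.

      def decode_mtr(buf):
          """Walk the records of one mini-transaction (sketch; see caveats)."""
          pos = 0
          while pos < len(buf):
              first = buf[pos]
              pos += 1
              rtype = first & 0xf0
              rlen = first & 0x0f
              if rlen == 0:
                  # extended length: 15 + next byte, counting the length byte,
                  # which we consume here
                  rlen = 15 + buf[pos] - 1
                  pos += 1
              payload = buf[pos:pos + rlen]
              pos += rlen
              if rtype == 0xb0:    # FILE_MODIFY: tablespace ID, 0, file name
                  print(f"FILE_MODIFY space={payload[0]}"
                        f" name={payload[2:].decode()}")
              elif rtype == 0xf0:  # FILE_CHECKPOINT: 0, 0, 8-byte LSN
                  lsn = int.from_bytes(payload[2:10], "big")
                  print(f"FILE_CHECKPOINT lsn={lsn} ({lsn:#x})")
                  break            # the sequence-bit byte and CRC-32C follow

      with open("data/ib_logfile0", "rb") as f:
          f.seek(0x1b7913)
          decode_mtr(f.read(152))

      Against the dump above, this should print the six FILE_MODIFY file names followed by FILE_CHECKPOINT lsn=60709 (0xed25).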

      Most failures occur on FreeBSD, but the above is the first case for which a copy of the data directory is available. Here are two failures for 10.5:

      10.5 35f2cdcb99f4f8b39895070fde3f825329106361

      encryption.innodb-redo-nokeys 'cbc,innodb' w3 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-23 04:51:11
      line
      2022-06-23  4:51:08 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1381251 between the checkpoint 52621 and the end 1446400.
      ^ Found warnings in /tmp/var/3/log/mysqld.1.err
      encryption.innodb-redo-nokeys 'ctr,innodb' w4 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2022-06-23 04:53:50
      line
      2022-06-23  4:53:47 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 1171764 between the checkpoint 52621 and the end 1236992.
      ^ Found warnings in /tmp/var/4/log/mysqld.1.err
      

      In each case, it seems possible that the latest checkpoint was written during the shutdown at the end of the server bootstrap.

      serg, to me this looks like a race condition in mtr. I suspect that the new server process was started while the old one was still being killed by the test (while it was writing a checkpoint). Before MDEV-24393, such a race would have been prevented by advisory locks on the files. If that is the case, fixing MDEV-28662 should fix this bug.

    Activity

            An attempted fix of MDEV-28495 confirmed my hypothesis that the old, supposedly-to-be-killed server was still running:

            bb-10.3-MDEV-28495 66d7599c2792555b52c570c3047bea310e59fc3b

            encryption.innodb-read-only 'cbc,innodb' w3 [ fail ]
                    Test ended at 2022-07-27 13:41:59
             
            CURRENT_TEST: encryption.innodb-read-only
             
             
            Server [mysqld.1 - pid: 643356, winpid: 643356, exit: 256] failed during test run
            2022-07-27 13:41:58 0 [Note] InnoDB: Completed initialization of buffer pool
            2022-07-27 13:41:58 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 9
            2022-07-27 13:41:58 0x7f409aaa5d80  InnoDB: Assertion failure in file /mariadb/10.3/storage/innobase/fil/fil0fil.cc line 1781
            InnoDB: Failing assertion: !os_file_lock(node->handle, node->name)
            

            I think that I will add a retry loop there, like we did when InnoDB did not obey --skip-external-locking (see the sketch after this comment). It is not ideal that the test harness is not reliable when it comes to killing and restarting processes, but since that is outside my control, I think that it is easiest to work around that deficiency in the InnoDB startup code.

            marko Marko Mäkelä added a comment
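            For illustration, here is the retry idea expressed in Python terms; the actual change would be in InnoDB's C++ startup code around os_file_lock(), and the names and parameters below are purely illustrative.

            import errno
            import fcntl
            import time

            def lock_with_retry(fd, attempts=100, delay=0.1):
                """Retry a non-blocking advisory lock until the previous
                (being-killed) process releases it, instead of failing at once."""
                for _ in range(attempts):
                    try:
                        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                        return True
                    except OSError as e:
                        if e.errno not in (errno.EACCES, errno.EAGAIN):
                            raise    # a real error, not "lock still held"
                        time.sleep(delay)
                return False         # the old process never released the file

            with open("data/ibdata1", "r+b") as f:
                if not lock_with_retry(f.fileno()):
                    raise SystemExit("another server still holds the lock")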

            Already in 10.3, several kill-and-restart tests were affected by this, also outside the encryption suite. Apparently, the probability of a log checkpoint occurring near the server kill is much higher in the encryption tests, which write a relatively large amount of data right before the server is killed.

            marko Marko Mäkelä added a comment

            MDEV-28495 should hopefully fix this, by ensuring (via the lock on the system tablespace file) that the being-killed server process has actually been killed.

            marko Marko Mäkelä added a comment

            An additional fix could help with those cases that are not ‘rescued’ by InnoDB advisory file locks. I suspect that some Spider tests could have failed due to a new server being started before the killed server has properly terminated.

            marko Marko Mäkelä added a comment - edited

            MDEV-31568 was filed for the observation that the advisory file locking on (only) the InnoDB system tablespace does not work, starting with the MariaDB Server 10.5 release series.

            marko Marko Mäkelä added a comment
