Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35577

Broken recovery after SET GLOBAL innodb_log_file_size

Details

    Description

      While executing the innodb.log_file_size_online test case, it fails with

      CURRENT_TEST: innodb.log_file_size_online
      mysqltest: At line 59: query 'SELECT * FROM t WHERE a<10' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'

      2024-12-05 16:19:56 0 [ERROR] InnoDB: OPT_PAGE_CHECKSUM mismatch on [page id: space=5, page number=345]
      2024-12-05 16:19:56 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      2024-12-05 16:19:56 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages. 
      

      Debugged the test case failure, while startup fails with

      Thread 2 hit Breakpoint 1, log_phys_t::page_checksum (block=..., 
          l=0x78eeebcf335f "\267H\251N")
          at /home/thiru/source_file/server/10.11/storage/innobase/log/log0recv.cc:197
      197	    ib::error() << "OPT_PAGE_CHECKSUM mismatch on " << block.page.id();
      (rr) f 2
      p#2  0x000063c80564908d in recv_recover_page (block=0x78eee99011e0, 
          mtr=..., recs=..., space=0x63c807c73c68, init=0x0)
          at /home/thiru/source_file/server/10.11/storage/innobase/log/log0recv.cc:3248
      3248			log_phys_t::apply_status a= l->apply(*block, recs.last_offset);
      (rr) p *l
      $2 = {<log_rec_t> = {next = 0x78eeebcf3368, lsn = 7415422}, 
        start_lsn = 7415393}
      

      Checksum fails to match with page id is the following

      (rr) p mach_read_from_4(l)
      $1 = 3074992462
      

      During recovery, these are the things to be noticed.

      (rr) p checkpoint_lsn
      $5 = 7294363
      (rr) p end_lsn
      $6 = 7415362
      (rr) p first_lsn
      $7 = 7292882
      

      Let's visit before restart:
      =====================
      page_checksum calculated for page_id 21474836825 is 3074992462 during m_commit_lsn: 7323774

      mtr::page_checksum calculated 7323774 for the page_id
      (rr) p bpage.id_.m_id
      $1 = 21474836825
      (rr) p checksum
      $2 = 3074992462
      Thread 2 hit Hardware watchpoint 3: -location m_commit_lsn
      Old value = 0
      New value = 7323774
      mtr_t::finish_writer<false, false> (mtr=0x7673079ff4e0, len=29)
          at /home/thiru/source_file/server/10.11/storage/innobase/mtr/mtr0mtr.cc:1391
      1391	  return {start.first, log_close(mtr->m_commit_lsn)};
       
      (rr) f 0
      #0  log_t::append (
          d=@0x7673079ff0b8: 0x76730c80348f "G\005\200ŮŽ\022\006",
          s=0x7673079ff658, size=24)
          at /home/thiru/source_file/server/10.11/storage/innobase/mtr/mtr0m
      tr.cc:1308
      (rr) x/24b d
      0x76730c80348f: 0x47    0x05    0x80    0xd9    0x8e    0x12    0x06  0x00
      0x76730c803497: 0xb2    0x00    0x80    0xc3    0x00    0x06    0x00  0x78
      0x76730c80349f: 0x05    0x80    0xd9    0x00    0xb7    0x48    0xa9  0x4e
      

      Above is the redo log written for checksum for the page

      After that, InnoDB does flushes the log thrice. There are 3 log_t::write_buf() calls

      At the end of the first log_t::write_buf()

      (rr) p write_lsn
      $2 = 7353588
      

      checksum redo log was flushed to disk during first write_buf() call itself.
      Note: Every write_buf() swaps resize_buf with resize_flush_buf

      At the end of the 2nd log_t::write_buf():

      (rr) p write_lsn
      $4 = 7401982
      

      let's have detailed look at the last log_t::write_buf():

      At the start
       
      (rr) p write_lsn
      $16 = 7401982
      (rr) p lsn
      $17 = 7415393
      

      Initially, length is 13455. We copy the last non-aligned
      block to resize_flush_buf. Round off the length to 13824 (multiple of 512 bytes)
      InnoDB writes the old resize_buffer garbage value to the log file for the remaining
      349 bytes. Since InnoDb fails to mark the end of the log in resize_buf. During
      recovery, InnoDB reads the old mtr record and fails.

       
      (rr) x/24b resize_buf + 13455
      0x76730420348f: 0x47    0x05    0x80    0xd9    0x8e    0x12    0x06  0x00
      0x767304203497: 0xb2    0x00    0x80    0xc3    0x00    0x06    0x00  0x78
      0x76730420349f: 0x05    0x80    0xd9    0x00    0xb7    0x48    0xa9  0x4e
      
      

      Above bytes matches with mtr checksum value which was written during 7323774

      Attachments

        Issue Links

          Activity

            Attached the correct iblogfile after fix will look like by doing
            dd if=/dev/zero of=ib_logfile0 bs=1 count=1 conv=notrunc seek=134799

            thiru Thirunarayanan Balathandayuthapani added a comment - Attached the correct iblogfile after fix will look like by doing dd if=/dev/zero of=ib_logfile0 bs=1 count=1 conv=notrunc seek=134799

            Thank you for the great analysis.

            First of all, I see that MDEV-34062 seems to have broken the logic "pretend that /dev/shm is PMEM" on Linux. That could be what allowed this bug to be found in the first place. For more convenient debugging with rr record and rr replay, I fixed that:

            diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
            index 3c2257aeab4..82f9e1ffe19 100644
            --- a/storage/innobase/log/log0log.cc
            +++ b/storage/innobase/log/log0log.cc
            @@ -255,6 +255,10 @@ static void *log_mmap(os_file_t file,
               if (!read_only)
             #  ifdef HAVE_PMEM
                 prot= PROT_READ | PROT_WRITE;
            +
            +#   ifdef __linux__ /* On Linux, we pretend that /dev/shm is PMEM */
            +remap:
            +#   endif
             #  else
                 return MAP_FAILED;
             #  endif
            @@ -270,8 +274,9 @@ static void *log_mmap(os_file_t file,
             
             #  ifdef HAVE_PMEM
             #   ifdef __linux__ /* On Linux, we pretend that /dev/shm is PMEM */
            -  if (srv_operation < SRV_OPERATION_BACKUP)
            +  if (flags != MAP_SHARED && srv_operation < SRV_OPERATION_BACKUP)
               {
            +    flags= MAP_SHARED;
                 struct stat st;
                 if (!fstat(file, &st))
                 {
            @@ -283,6 +288,7 @@ static void *log_mmap(os_file_t file,
                     is_pmem= st.st_dev == st_dev;
                     if (!is_pmem)
                       return ptr; /* MAP_FAILED */
            +        goto remap;
                   }
                 }
               }
            

            As far as I understand, the suggested fix introduces some writes of a terminating NUL byte. Without that write, which the dd command simulated, the end of the write-ahead log would look as follows:

            od -Ax -t x1 -j 134799 var/log/innodb.log_file_size_online-slow/mysqld.1/data/ib_logfile0
            

            020e8f 47 05 80 d9 8e 12 06 00 b2 00 80 c3 00 06 00 78
            020e9f 05 80 d9 00 b7 48 a9 4e 01 54 04 2e 8c 47 05 80
            020eaf d9 8f 27 06 00 b2 00 80 c3 00 06 00 78 05 80 d9
            020ebf 00 9e 9a bd dc 01 6a f6 03 11 47 05 80 d9 90 3c
            020ecf 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 12 3b
            020e8f 47 05 80 d9 8e 12 06 00 b2 00 80 c3 00 06 00 78
            020e9f 05 80 d9 00 b7 48 a9 4e 01 54 04 2e 8c 47 05 80
            020eaf d9 8f 27 06 00 b2 00 80 c3 00 06 00 78 05 80 d9
            020ebf 00 9e 9a bd dc 01 6a f6 03 11 47 05 80 d9 90 3c
            020ecf 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 12 3b
            020edf b2 2d 01 7a 98 06 4e 47 05 80 d9 91 51 06 00 b2
            020eef 00 80 c3 00 06 00 78 05 80 d9 00 a6 3c 5f 8f 01
            020eff fb 0e 65 a0 47 05 80 d9 92 66 06 00 b2 00 80 c3
            020f0f 00 06 00 78 05 80 d9 00 43 ad 18 78 01 25 c8 d4
            020f1f 7d 47 05 80 d9 93 7b 06 00 b2 00 80 c3 00 06 00
            020f2f 78 05 80 d9 00 90 38 70 b5 01 fe 96 87 67 47 05
            020f3f 80 d9 94 90 06 00 b2 00 80 c3 00 06 00 78 05 80
            020f4f d9 00 a4 68 55 38 01 a7 9c 43 91 47 05 80 d9 95
            020f5f a5 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 4e
            020f6f 1d 1d 07 01 12 f6 f5 47 47 05 80 d9 96 ba 06 00
            020f7f b2 00 80 c3 00 06 00 78 05 80 d9 00 83 7e 8b dc
            020f8f 01 81 7b 7d d1 47 05 80 d9 97 cf 06 00 b2 00 80
            020f9f c3 00 06 00 78 05 80 d9 00 0e c1 2f a6 01 76 f2
            020faf b1 86 47 05 80 d9 98 e4 06 00 b2 00 80 c3 00 06
            020fbf 00 78 05 80 d9 00 62 10 94 02 01 27 2c c6 ab 47
            020fcf 05 80 d9 99 f9 06 00 b2 00 80 c3 00 06 00 78 05
            020fdf 80 d9 00 1a 8a e8 c2 01 6f 57 24 f8 be 05 00 2e
            020fef 2f 74 65 73 74 2f 74 2e 69 62 64 fa 00 00 00 00
            020fff 00 00 00 6f 4d 9b 01 c0 c5 fa c6 00 b2 00 80 c3
            02100f 00 06 00 78 05 80 ca 00 98 b5 f8 3a 01 36 fc 95
            …
            0211ef 1c 47 05 80 ca 94 90 06 00 b2 00 80 c3 00 06 00
            0211ff 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            02120f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            *
            4fffff 00
            500000
            

            On my system, the recovery would fail as follows:

            build/sql/mariadbd  --datadir "$(pwd)"/var/log/innodb.log_file_size_online-slow/mysqld.1/data
            

            10.11 3d23adb766607e5d404b06d3b8cd5de23ab672bb

            2024-12-09 10:11:23 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=7294363
            2024-12-09 10:11:23 0 [Note] InnoDB: End of log at LSN=7415772
            2024-12-09 10:11:23 0 [Note] InnoDB: To recover: 94 pages
            2024-12-09 10:11:23 0 [ERROR] InnoDB: OPT_PAGE_CHECKSUM mismatch on [page id: space=5, page number=345]
            2024-12-09 10:11:23 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
            2024-12-09 10:11:23 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
            

            The checkpoint LSN is 7294363 or 0x6f4d9b in hexadecimal. At byte offset 8 of ib_logfile0 we can find the big endian 64-bit integer 0x6f47d2, which identifies the LSN that the byte offset 0x3000 of the file corresponds to. This should roughly correspond to the current LSN when the log resizing had been started. It also tells the following byte offsets:

            LSN description offset=0x3000+LSN-0x6f47d2
            0x6f4d9b checkpoint LSN=7294363 0x35c9
            0x712661 final log_sys.write_lsn=7415393 according to thiru 0x20e8f
            0x7127dc End of log at LSN=7415772 0x2100a

            We will come to the file offset 0x20e8f later.

            Based on both this output and my debugging of the recovery with rr replay, recovery was able to parse the log up to the byte offset 0x2100a. I decoded the above hexdump from 0x20e8f to 0x2100f, one line per record, and an empty line after the end of each mini-transaction (0x01 followed by a CRC-32C):

             47 05 80 d9 8e 12 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 b7 48 a9 4e
             01 54 04 2e 8c
             
             47 05 80 d9 8f 27 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 9e 9a bd dc
             01 6a f6 03 11
             
             47 05 80 d9 90 3c 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 12 3b b2 2d
             01 7a 98 06 4e
             
             47 05 80 d9 91 51 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 a6 3c 5f 8f
             01 fb 0e 65 a0
             
             47 05 80 d9 92 66 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 43 ad 18 78
             01 25 c8 d4 7d
             
             47 05 80 d9 93 7b 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 90 38 70 b5
             01 fe 96 87 67
             
             47 05 80 d9 94 90 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 a4 68 55 38
             01 a7 9c 43 91
             
             47 05 80 d9 95 a5 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 4e 1d 1d 07
             01 12 f6 f5 47
             
             47 05 80 d9 96 ba 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 83 7e 8b dc
             01 81 7b 7d d1
             
             47 05 80 d9 97 cf 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 0e c1 2f a6
             01 76 f2 b1 86
             
             47 05 80 d9 98 e4 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 62 10 94 02
             01 27 2c c6 ab
             
             47 05 80 d9 99 f9 06 00
             b2 00 80
             c3 00 06 00
             78 05 80 d9 00 1a 8a e8 c2
             01 6f 57 24 f8
             
             be 05 00 2e 2f 74 65 73 74 2f 74 2e 69 62 64
             fa 00 00 00 00 00 00 00 6f 4d 9b
             01 c0 c5 fa c6
             
             (error at 0x2100a) 00 b2 00 80 c3
            

            The last successfully parsed mini-transaction consists of a FILE_MODIFY record for tablespace 5 (test/t.ibd) followed by FILE_CHECKPOINT pointing to the checkpoint LSN 0x6f4d9b, which indeed is "our" checkpoint LSN.

            Each mini-transaction quoted above the final one consists of a MEMSET record that is always writing to the same page (tablespace 5, encoded page number 0x80d9, increasing offset, 6 bytes of NUL. Each MEMSET is followed by a WRITE of the byte 0x80 and a MEMSET of 6 more NUL bytes. Therefore, these mini-transactions must have originated from row_purge_reset_trx_id(), and the final one from fil_names_clear().

            The working hypothesis is that during the execution of SET GLOBAL innodb_log_file_size, the log_sys.resize_lsn had been adjusted forward (and the write position "jumped backwars"). That is, at the end of the log we would have some redundant log records, which should be present in a later written section of the ib_logfile101 that ultimately replaced the ib_logfile0 when the log resizing completed.

            For the bug to be triggered, at the start of the "garbage" there must
            be a valid mini-transaction, and the server must be killed in such a
            way that the actual log records up to that point are intact.

            There are 5 occurrences of the file name ./test/t.ibd in the ib_logfile0, each time as part of a FILE_MODIFY record for tablespace 5. The last 3 occurrences are followed by a FILE_CHECKPOINT record, for an earlier checkpoint at 0x6f477d (at file offset 0x1da1c) and for "our" LSN 0x6f4d9b (at file offsets 0x20e7f and 0x20ffa).

            I don’t have access to the rr replay trace of the intentionally killed server. According to thiru, the final written LSN was 7415393 (0x712661), corresponding to the file offset 0x20e8f. That would be right after the above mentioned first FILE_CHECKPOINT record:

            020e6f ac be 05 00 2e 2f 74 65 73 74 2f 74 2e 69 62 64
            020e7f fa 00 00 00 00 00 00 00 6f 4d 9b 01 c0 c5 fa c6
            020e8f 47 05 80 d9 8e 12 06 00 b2 00 80 c3 00 06 00 78
            020e9f 05 80 d9 00 b7 48 a9 4e 01 54 04 2e 8c 47 05 80
            

            That is, at 0x20e70 we have the FILE_MODIFY for ./t/test.ibd, the FILE_CHECKPOINT for our checkpoint, and the end-of-mini-transaction marker 0x01 followed by the CRC-32C of this mini-transaction.

            Immediately after that, we have the garbage that unfortunately looked like real log records. First I was suspecting that the contents had simply been shifted by 379 bytes, but it is not quite as simple as that if we look at the output of the following:

            od -An -w379 -t x1 -j 134420 -N 379 ib_logfile0|
            diff - <(od -An -w379 -t x1 -j 134799 -N 379 ib_logfile0)
            

            The simplest fix that I can think of would be to "destroy the garbage" in log_t::write_checkpoint() by writing a NUL byte right at the end of the resized log, after the CRC-32C checksum of the FILE_CHECKPOINT record. This is guaranteed to work for log resizing, because the sequence bit (which would toggle each time the ib_logfile0 wraps from its end to byte offset 0x3000) will always be written as 1 to the ib_logfile101 during resizing. The proposed fix does something else, modifying a much more frequently invoked function log_t::write_buf(). That might fix it too, but it could introduce a slight performance regression.

            We must ensure that the log is properly terminated also when the last actual payload byte is at the very end of a block. I see that we already do that in log_t::write_buf() when no log resizing is in progress; the condition length <= write_size_1 would only hold when there is one spare byte at the end of the buffer. Instead of writing a NUL byte, I think that it would be safer to write 1 or 0, always the opposite of the expected sequence byte.

            marko Marko Mäkelä added a comment - Thank you for the great analysis. First of all, I see that MDEV-34062 seems to have broken the logic "pretend that /dev/shm is PMEM" on Linux. That could be what allowed this bug to be found in the first place. For more convenient debugging with rr record and rr replay , I fixed that: diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc index 3c2257aeab4..82f9e1ffe19 100644 --- a/storage/innobase/log/log0log.cc +++ b/storage/innobase/log/log0log.cc @@ -255,6 +255,10 @@ static void *log_mmap(os_file_t file, if (!read_only) # ifdef HAVE_PMEM prot= PROT_READ | PROT_WRITE; + +# ifdef __linux__ /* On Linux, we pretend that /dev/shm is PMEM */ +remap: +# endif # else return MAP_FAILED; # endif @@ -270,8 +274,9 @@ static void *log_mmap(os_file_t file, # ifdef HAVE_PMEM # ifdef __linux__ /* On Linux, we pretend that /dev/shm is PMEM */ - if (srv_operation < SRV_OPERATION_BACKUP) + if (flags != MAP_SHARED && srv_operation < SRV_OPERATION_BACKUP) { + flags= MAP_SHARED; struct stat st; if (!fstat(file, &st)) { @@ -283,6 +288,7 @@ static void *log_mmap(os_file_t file, is_pmem= st.st_dev == st_dev; if (!is_pmem) return ptr; /* MAP_FAILED */ + goto remap; } } } As far as I understand, the suggested fix introduces some writes of a terminating NUL byte. Without that write, which the dd command simulated, the end of the write-ahead log would look as follows: od -Ax -t x1 -j 134799 var/log/innodb.log_file_size_online-slow/mysqld.1/data/ib_logfile0 020e8f 47 05 80 d9 8e 12 06 00 b2 00 80 c3 00 06 00 78 020e9f 05 80 d9 00 b7 48 a9 4e 01 54 04 2e 8c 47 05 80 020eaf d9 8f 27 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 020ebf 00 9e 9a bd dc 01 6a f6 03 11 47 05 80 d9 90 3c 020ecf 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 12 3b 020e8f 47 05 80 d9 8e 12 06 00 b2 00 80 c3 00 06 00 78 020e9f 05 80 d9 00 b7 48 a9 4e 01 54 04 2e 8c 47 05 80 020eaf d9 8f 27 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 020ebf 00 9e 9a bd dc 01 6a f6 03 11 47 05 80 d9 90 3c 020ecf 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 12 3b 020edf b2 2d 01 7a 98 06 4e 47 05 80 d9 91 51 06 00 b2 020eef 00 80 c3 00 06 00 78 05 80 d9 00 a6 3c 5f 8f 01 020eff fb 0e 65 a0 47 05 80 d9 92 66 06 00 b2 00 80 c3 020f0f 00 06 00 78 05 80 d9 00 43 ad 18 78 01 25 c8 d4 020f1f 7d 47 05 80 d9 93 7b 06 00 b2 00 80 c3 00 06 00 020f2f 78 05 80 d9 00 90 38 70 b5 01 fe 96 87 67 47 05 020f3f 80 d9 94 90 06 00 b2 00 80 c3 00 06 00 78 05 80 020f4f d9 00 a4 68 55 38 01 a7 9c 43 91 47 05 80 d9 95 020f5f a5 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 4e 020f6f 1d 1d 07 01 12 f6 f5 47 47 05 80 d9 96 ba 06 00 020f7f b2 00 80 c3 00 06 00 78 05 80 d9 00 83 7e 8b dc 020f8f 01 81 7b 7d d1 47 05 80 d9 97 cf 06 00 b2 00 80 020f9f c3 00 06 00 78 05 80 d9 00 0e c1 2f a6 01 76 f2 020faf b1 86 47 05 80 d9 98 e4 06 00 b2 00 80 c3 00 06 020fbf 00 78 05 80 d9 00 62 10 94 02 01 27 2c c6 ab 47 020fcf 05 80 d9 99 f9 06 00 b2 00 80 c3 00 06 00 78 05 020fdf 80 d9 00 1a 8a e8 c2 01 6f 57 24 f8 be 05 00 2e 020fef 2f 74 65 73 74 2f 74 2e 69 62 64 fa 00 00 00 00 020fff 00 00 00 6f 4d 9b 01 c0 c5 fa c6 00 b2 00 80 c3 02100f 00 06 00 78 05 80 ca 00 98 b5 f8 3a 01 36 fc 95 … 0211ef 1c 47 05 80 ca 94 90 06 00 b2 00 80 c3 00 06 00 0211ff 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02120f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 4fffff 00 500000 On my system, the recovery would fail as follows: build/sql/mariadbd --datadir "$(pwd)"/var/log/innodb.log_file_size_online-slow/mysqld.1/data 10.11 3d23adb766607e5d404b06d3b8cd5de23ab672bb 2024-12-09 10:11:23 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=7294363 2024-12-09 10:11:23 0 [Note] InnoDB: End of log at LSN=7415772 2024-12-09 10:11:23 0 [Note] InnoDB: To recover: 94 pages 2024-12-09 10:11:23 0 [ERROR] InnoDB: OPT_PAGE_CHECKSUM mismatch on [page id: space=5, page number=345] 2024-12-09 10:11:23 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption. 2024-12-09 10:11:23 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages. The checkpoint LSN is 7294363 or 0x6f4d9b in hexadecimal. At byte offset 8 of ib_logfile0 we can find the big endian 64-bit integer 0x6f47d2, which identifies the LSN that the byte offset 0x3000 of the file corresponds to. This should roughly correspond to the current LSN when the log resizing had been started. It also tells the following byte offsets: LSN description offset=0x3000+LSN-0x6f47d2 0x6f4d9b checkpoint LSN=7294363 0x35c9 0x712661 final log_sys.write_lsn=7415393 according to thiru 0x20e8f 0x7127dc End of log at LSN=7415772 0x2100a We will come to the file offset 0x20e8f later. Based on both this output and my debugging of the recovery with rr replay , recovery was able to parse the log up to the byte offset 0x2100a. I decoded the above hexdump from 0x20e8f to 0x2100f, one line per record, and an empty line after the end of each mini-transaction (0x01 followed by a CRC-32C): 47 05 80 d9 8e 12 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 b7 48 a9 4e 01 54 04 2e 8c   47 05 80 d9 8f 27 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 9e 9a bd dc 01 6a f6 03 11   47 05 80 d9 90 3c 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 12 3b b2 2d 01 7a 98 06 4e   47 05 80 d9 91 51 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 a6 3c 5f 8f 01 fb 0e 65 a0   47 05 80 d9 92 66 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 43 ad 18 78 01 25 c8 d4 7d   47 05 80 d9 93 7b 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 90 38 70 b5 01 fe 96 87 67   47 05 80 d9 94 90 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 a4 68 55 38 01 a7 9c 43 91   47 05 80 d9 95 a5 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 4e 1d 1d 07 01 12 f6 f5 47   47 05 80 d9 96 ba 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 83 7e 8b dc 01 81 7b 7d d1   47 05 80 d9 97 cf 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 0e c1 2f a6 01 76 f2 b1 86   47 05 80 d9 98 e4 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 62 10 94 02 01 27 2c c6 ab   47 05 80 d9 99 f9 06 00 b2 00 80 c3 00 06 00 78 05 80 d9 00 1a 8a e8 c2 01 6f 57 24 f8   be 05 00 2e 2f 74 65 73 74 2f 74 2e 69 62 64 fa 00 00 00 00 00 00 00 6f 4d 9b 01 c0 c5 fa c6   (error at 0x2100a) 00 b2 00 80 c3 The last successfully parsed mini-transaction consists of a FILE_MODIFY record for tablespace 5 ( test/t.ibd ) followed by FILE_CHECKPOINT pointing to the checkpoint LSN 0x6f4d9b, which indeed is "our" checkpoint LSN. Each mini-transaction quoted above the final one consists of a MEMSET record that is always writing to the same page (tablespace 5, encoded page number 0x80d9, increasing offset, 6 bytes of NUL. Each MEMSET is followed by a WRITE of the byte 0x80 and a MEMSET of 6 more NUL bytes. Therefore, these mini-transactions must have originated from row_purge_reset_trx_id() , and the final one from fil_names_clear() . The working hypothesis is that during the execution of SET GLOBAL innodb_log_file_size , the log_sys.resize_lsn had been adjusted forward (and the write position "jumped backwars"). That is, at the end of the log we would have some redundant log records, which should be present in a later written section of the ib_logfile101 that ultimately replaced the ib_logfile0 when the log resizing completed. For the bug to be triggered, at the start of the "garbage" there must be a valid mini-transaction, and the server must be killed in such a way that the actual log records up to that point are intact. There are 5 occurrences of the file name ./test/t.ibd in the ib_logfile0 , each time as part of a FILE_MODIFY record for tablespace 5. The last 3 occurrences are followed by a FILE_CHECKPOINT record, for an earlier checkpoint at 0x6f477d (at file offset 0x1da1c) and for "our" LSN 0x6f4d9b (at file offsets 0x20e7f and 0x20ffa). I don’t have access to the rr replay trace of the intentionally killed server. According to thiru , the final written LSN was 7415393 (0x712661), corresponding to the file offset 0x20e8f. That would be right after the above mentioned first FILE_CHECKPOINT record: 020e6f ac be 05 00 2e 2f 74 65 73 74 2f 74 2e 69 62 64 020e7f fa 00 00 00 00 00 00 00 6f 4d 9b 01 c0 c5 fa c6 020e8f 47 05 80 d9 8e 12 06 00 b2 00 80 c3 00 06 00 78 020e9f 05 80 d9 00 b7 48 a9 4e 01 54 04 2e 8c 47 05 80 That is, at 0x20e70 we have the FILE_MODIFY for ./t/test.ibd , the FILE_CHECKPOINT for our checkpoint, and the end-of-mini-transaction marker 0x01 followed by the CRC-32C of this mini-transaction. Immediately after that, we have the garbage that unfortunately looked like real log records. First I was suspecting that the contents had simply been shifted by 379 bytes, but it is not quite as simple as that if we look at the output of the following: od -An -w379 -t x1 -j 134420 -N 379 ib_logfile0| diff - <(od -An -w379 -t x1 -j 134799 -N 379 ib_logfile0) The simplest fix that I can think of would be to "destroy the garbage" in log_t::write_checkpoint() by writing a NUL byte right at the end of the resized log, after the CRC-32C checksum of the FILE_CHECKPOINT record. This is guaranteed to work for log resizing, because the sequence bit (which would toggle each time the ib_logfile0 wraps from its end to byte offset 0x3000) will always be written as 1 to the ib_logfile101 during resizing. The proposed fix does something else, modifying a much more frequently invoked function log_t::write_buf() . That might fix it too, but it could introduce a slight performance regression. We must ensure that the log is properly terminated also when the last actual payload byte is at the very end of a block. I see that we already do that in log_t::write_buf() when no log resizing is in progress; the condition length <= write_size_1 would only hold when there is one spare byte at the end of the buffer. Instead of writing a NUL byte, I think that it would be safer to write 1 or 0, always the opposite of the expected sequence byte.

            Sorry, I realized that log_t::write_buf() is fine after all. The log record parser does not allow any empty mini-transactions. Therefore, writing a NUL byte (0x00) right after an end-of-mini-transaction marker 0x00 or 0x01 and the 4-byte CRC-32C is guaranteed to "break the garbage".

            In other words, it should suffice to ensure that an extra 0x00 byte is written to the log_sys.resize_buf and ultimately to ib_logfile101 in log_t::write_checkpoint().

            marko Marko Mäkelä added a comment - Sorry, I realized that log_t::write_buf() is fine after all. The log record parser does not allow any empty mini-transactions. Therefore, writing a NUL byte (0x00) right after an end-of-mini-transaction marker 0x00 or 0x01 and the 4-byte CRC-32C is guaranteed to "break the garbage". In other words, it should suffice to ensure that an extra 0x00 byte is written to the log_sys.resize_buf and ultimately to ib_logfile101 in log_t::write_checkpoint() .

            I filed MDEV-35608 for the regression that /dev/shm would no longer be treated as persistent memory.

            marko Marko Mäkelä added a comment - I filed MDEV-35608 for the regression that /dev/shm would no longer be treated as persistent memory.

            I am done with the review and added two minor suggestions. The changes look good to me otherwise.

            debarun Debarun Banerjee added a comment - I am done with the review and added two minor suggestions. The changes look good to me otherwise.

            Thanks, the latest changes looks good to me.

            debarun Debarun Banerjee added a comment - Thanks, the latest changes looks good to me.

            Unlike the bug that was fixed in MDEV-35411, this one should be harder to hit. The scenario is that the server is killed right after SET GLOBAL innodb_log_file_size completed setting the size to a rather small value, before there was any further write to the newly created ib_logfile0

            marko Marko Mäkelä added a comment - Unlike the bug that was fixed in MDEV-35411 , this one should be harder to hit. The scenario is that the server is killed right after SET GLOBAL innodb_log_file_size completed setting the size to a rather small value, before there was any further write to the newly created ib_logfile0

            People

              marko Marko Mäkelä
              thiru Thirunarayanan Balathandayuthapani
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.