Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31642

Upgrade from 10.7 or earlier may crash if innodb_log_file_buffering=OFF

Details

    Description

      Upgrading from Debian 11 to Debian 12 yields an exception from which mariadb doesn't seem to recover from. This can reproduced with a clean installation and no data inserted whatsoever. It seems as if it has to do something with the InnoDB structure that was generated with version 10.5.19. I tested this with both Debian 11 (5.10.0.-23) and Debian 12 (6.1.0-9) kernels with both failing.

      The only fix for this seems to be to delete /var/lib/mysql in its entirety and regenerate all of the data structures by starting up mariadb with an empty dir. From this point on the exception is gone and everything works as expected.

      I attached the error log that comes up when /usr/sbin/mariadbd is started with version 10.11.3 with data generated from 10.5.19.

      Attachments

        Issue Links

          Activity

            Can you run mariadbd under strace or a debugger to find out which parameters had been passed to the failing pread64 system call? I think that the recovery should first try to read the 12888 bytes starting at offset 0, and then read some pages starting at the one that includes the byte offset that the latest checkpoint LSN evaluates to. I should have covered that on a device whose physical block size is 4096 bytes. The NVMe and SSD that I have at home are with 512 bytes physical block size, but one HDD is 4096 bytes.

            Would this same system have trouble accessing tables like this when you are using the MDEV-24854 default innodb_flush_method=O_DIRECT?

            CREATE TABLE t(…) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;
            

            This should create a table with 4096-byte physical page size in the storage.

            marko Marko Mäkelä added a comment - Can you run mariadbd under strace or a debugger to find out which parameters had been passed to the failing pread64 system call? I think that the recovery should first try to read the 12888 bytes starting at offset 0, and then read some pages starting at the one that includes the byte offset that the latest checkpoint LSN evaluates to. I should have covered that on a device whose physical block size is 4096 bytes. The NVMe and SSD that I have at home are with 512 bytes physical block size, but one HDD is 4096 bytes. Would this same system have trouble accessing tables like this when you are using the MDEV-24854 default innodb_flush_method=O_DIRECT ? CREATE TABLE t(…) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4; This should create a table with 4096-byte physical page size in the storage.

            So I ran mariadbd with the broken data through strace. The following pread64 calls should be of importance:

            pread64(4, "\376\376\f\1\22\326\30&\34\32\21\356\267l\2\1!\0\19\36\0\26\0\0 \316\273D\271\fr"..., 52, 0) = 52
            pread64(6, "\2\0\0\1\0\0\0\300\r\3\0\0\0\0\0\0\377\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377"..., 8192, 16384) = 8192
            pread64(6, "\376\376\v\1MARIALOG\345\n\357\377\323\377\5\0\20'\0\0\247\210\1\0\0\0\0\0"..., 44, 0) = 44
            pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 65536, 0) = 65536
            pread64(11, "\0\0\0\0\0\0\0\5\377\377\377\377\377\377\377\377\0\0\0\0\0\0r\306\0\7\0\0\0\0\0\0"..., 16384, 81920) = 16384
            pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 1048576, 1048576) = 1048576
            pread64(11, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 2097152) = 1048576
            pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 16384, 0) = 16384
            pread64(12, "PHYS\0\0\0\2\0\0\0\0\0\0\"\0MariaDB 10.5.19\0"..., 4096, 0) = 4096
            pread64(12, 0x2aa35e0c000, 512, 38400)  = -1 EINVAL (Invalid argument)
            

            For reference I attached the full strace to this issue as a file.

            Further, I created the table with the options you mentioned on a working dataset with --innodb-log-file-buffering=TRUE and the default option innodb_flush_method=O_DIRECT, inserted and queried data which worked flawlessly.

            leo_ Leonardo Martinho added a comment - So I ran mariadbd with the broken data through strace. The following pread64 calls should be of importance: pread64(4, "\376\376\f\1\22\326\30&\34\32\21\356\267l\2\1!\0\19\36\0\26\0\0 \316\273D\271\fr"..., 52, 0) = 52 pread64(6, "\2\0\0\1\0\0\0\300\r\3\0\0\0\0\0\0\377\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377"..., 8192, 16384) = 8192 pread64(6, "\376\376\v\1MARIALOG\345\n\357\377\323\377\5\0\20'\0\0\247\210\1\0\0\0\0\0"..., 44, 0) = 44 pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 65536, 0) = 65536 pread64(11, "\0\0\0\0\0\0\0\5\377\377\377\377\377\377\377\377\0\0\0\0\0\0r\306\0\7\0\0\0\0\0\0"..., 16384, 81920) = 16384 pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 1048576, 1048576) = 1048576 pread64(11, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 2097152) = 1048576 pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 16384, 0) = 16384 pread64(12, "PHYS\0\0\0\2\0\0\0\0\0\0\"\0MariaDB 10.5.19\0"..., 4096, 0) = 4096 pread64(12, 0x2aa35e0c000, 512, 38400) = -1 EINVAL (Invalid argument) For reference I attached the full strace to this issue as a file. Further, I created the table with the options you mentioned on a working dataset with --innodb-log-file-buffering=TRUE and the default option innodb_flush_method=O_DIRECT, inserted and queried data which worked flawlessly.
            danblack Daniel Black added a comment - - edited

            512 bytes < 4096, +38400 byte offset / 4096 blocksize is 9.375 blocks, so it doesn't align so that's why the O_DIRECT isn't working.

            danblack Daniel Black added a comment - - edited 512 bytes < 4096, +38400 byte offset / 4096 blocksize is 9.375 blocks, so it doesn't align so that's why the O_DIRECT isn't working.

            Thank you, leo_. This is specific to the upgrade code path. Before MDEV-14425 in MariaDB Server 10.8, the only possible InnoDB log block size was 512 bytes. In the new format, the size can be basically anything (one mini-transaction is one block), but we try to read and write using the physical block size.

            Once the redo log has been upgraded (it happens before any connections are accepted), you should safely be able to use innodb_log_file_buffering=OFF. I realize that we never tested upgrade with a 4096-byte block size device.

            marko Marko Mäkelä added a comment - Thank you, leo_ . This is specific to the upgrade code path. Before MDEV-14425 in MariaDB Server 10.8, the only possible InnoDB log block size was 512 bytes. In the new format, the size can be basically anything (one mini-transaction is one block), but we try to read and write using the physical block size. Once the redo log has been upgraded (it happens before any connections are accepted), you should safely be able to use innodb_log_file_buffering=OFF . I realize that we never tested upgrade with a 4096-byte block size device.

            I was able to observe the misaligned read using the test case of this fix if I revert the code change:

            ./mtr --rr innodb.log_corruption
            rr replay var/log/mysqld.1.rr/mariadbd-25
            

            I get to see how this artificial redo log sample (with the empty log starting at byte offset 0x120c instead of the previous 0x80c) is processed. The relevant code is here:

            static dberr_t recv_log_recover_10_5(lsn_t lsn_offset)
            {
              byte *buf= const_cast<byte*>(field_ref_zero);
             
              if (lsn_offset < (log_sys.is_pmem() ? log_sys.file_size : 4096))
                memcpy_aligned<512>(buf, &log_sys.buf[lsn_offset & ~511], 512);
              else
                recv_sys.read(lsn_offset & ~lsn_t{511}, {buf, 512});
            

            With the original test, we would copy the data from log_sys.buf, because its file offset (0x80c) is less than 4096 (0x1000), and thus present in the buffer from a previous read.

            With the modified test, the record needs to be read from file offset 0x120c. The else branch is using only 512-byte alignment for that, so it will use the offset 4608 (0x1200) and size 512. My fix is to submit a 4096-byte aligned read and adjust buf+=lsn_offset & 0xe00 so that it will point to the correct logical 512-byte block.

            To my surprise, even if I reverted the code change, the INTEL SSDSC2KG019T8 on the system where I tested this would happily accept the misaligned read (reading 0x200 bytes at 0x1200). This probably is because smartctl -a /dev/sda reports the following:

            Sector Sizes:     512 bytes logical, 4096 bytes physical
            

            The drive /dev/sdb (a 7,200 rpm SATA 3.3 hard disk) is equally forgiving. Also an Intel PMEM device mounted in regular mode would report a 4096-byte physical block size, but it would happily accept the 512-byte aligned read. (I think its native block size is 256 bytes. In PMEM mode, the interface is 64 bytes, or one cache line.)

            marko Marko Mäkelä added a comment - I was able to observe the misaligned read using the test case of this fix if I revert the code change: ./mtr --rr innodb.log_corruption rr replay var/log/mysqld.1.rr/mariadbd-25 I get to see how this artificial redo log sample (with the empty log starting at byte offset 0x120c instead of the previous 0x80c) is processed. The relevant code is here: static dberr_t recv_log_recover_10_5(lsn_t lsn_offset) { byte *buf= const_cast <byte*>(field_ref_zero);   if (lsn_offset < (log_sys.is_pmem() ? log_sys.file_size : 4096)) memcpy_aligned<512>(buf, &log_sys.buf[lsn_offset & ~511], 512); else recv_sys.read(lsn_offset & ~lsn_t{511}, {buf, 512}); With the original test, we would copy the data from log_sys.buf , because its file offset (0x80c) is less than 4096 (0x1000), and thus present in the buffer from a previous read. With the modified test, the record needs to be read from file offset 0x120c. The else branch is using only 512-byte alignment for that, so it will use the offset 4608 (0x1200) and size 512. My fix is to submit a 4096-byte aligned read and adjust buf+=lsn_offset & 0xe00 so that it will point to the correct logical 512-byte block. To my surprise, even if I reverted the code change, the INTEL SSDSC2KG019T8 on the system where I tested this would happily accept the misaligned read (reading 0x200 bytes at 0x1200). This probably is because smartctl -a /dev/sda reports the following: Sector Sizes: 512 bytes logical, 4096 bytes physical The drive /dev/sdb (a 7,200 rpm SATA 3.3 hard disk) is equally forgiving. Also an Intel PMEM device mounted in regular mode would report a 4096-byte physical block size, but it would happily accept the 512-byte aligned read. (I think its native block size is 256 bytes. In PMEM mode, the interface is 64 bytes, or one cache line.)

            People

              marko Marko Mäkelä
              leo_ Leonardo Martinho
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.