[MDEV-31642] Upgrade from 10.7 or earlier may crash if innodb_log_file_buffering=OFF - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.11.3, 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
Fix Version/s: 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2, 11.2.1
Component/s: Storage Engine - InnoDB
Labels:
- crash
- upgrade
Environment:
s390x arch, debian 12, 5.10.0-23-s390x kernel, 6.1.0-9-s390x kernel

Description

Upgrading from Debian 11 to Debian 12 yields an exception from which mariadb doesn't seem to recover from. This can reproduced with a clean installation and no data inserted whatsoever. It seems as if it has to do something with the InnoDB structure that was generated with version 10.5.19. I tested this with both Debian 11 (5.10.0.-23) and Debian 12 (6.1.0-9) kernels with both failing.

The only fix for this seems to be to delete /var/lib/mysql in its entirety and regenerate all of the data structures by starting up mariadb with an empty dir. From this point on the exception is gone and everything works as expected.

I attached the error log that comes up when /usr/sbin/mariadbd is started with version 10.11.3 with data generated from 10.5.19.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

error_log
5 kB
2023-07-07 07:46
mariadbd_strace
149 kB
2023-07-07 11:26

Issue Links

is duplicated by

MDEV-29617 MariaDB server crashes on startup after 10.9.3 on ARM

Closed

relates to

MDEV-14425 Change the InnoDB redo log format to reduce write amplification

Closed

MDEV-28766 MDEV-28111 breaks innodb_flush_log_at_trx_commit=2

Closed

Activity

Ascending order - Click to sort in descending order

View 7 older comments

Marko Mäkelä added a comment - 2023-07-07 10:54

Can you run mariadbd under strace or a debugger to find out which parameters had been passed to the failing pread64 system call? I think that the recovery should first try to read the 12888 bytes starting at offset 0, and then read some pages starting at the one that includes the byte offset that the latest checkpoint LSN evaluates to. I should have covered that on a device whose physical block size is 4096 bytes. The NVMe and SSD that I have at home are with 512 bytes physical block size, but one HDD is 4096 bytes.

Would this same system have trouble accessing tables like this when you are using the ~~MDEV-24854~~ default innodb_flush_method=O_DIRECT?

CREATE TABLE t(…) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;

This should create a table with 4096-byte physical page size in the storage.

Marko Mäkelä added a comment - 2023-07-07 10:54 Can you run mariadbd under strace or a debugger to find out which parameters had been passed to the failing pread64 system call? I think that the recovery should first try to read the 12888 bytes starting at offset 0, and then read some pages starting at the one that includes the byte offset that the latest checkpoint LSN evaluates to. I should have covered that on a device whose physical block size is 4096 bytes. The NVMe and SSD that I have at home are with 512 bytes physical block size, but one HDD is 4096 bytes. Would this same system have trouble accessing tables like this when you are using the MDEV-24854 default innodb_flush_method=O_DIRECT ? CREATE TABLE t(…) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4; This should create a table with 4096-byte physical page size in the storage.

Leonardo Martinho added a comment - 2023-07-07 11:26

So I ran mariadbd with the broken data through strace. The following pread64 calls should be of importance:

pread64(4, "\376\376\f\1\22\326\30&\34\32\21\356\267l\2\1!\0\19\36\0\26\0\0 \316\273D\271\fr"..., 52, 0) = 52

pread64(6, "\2\0\0\1\0\0\0\300\r\3\0\0\0\0\0\0\377\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377"..., 8192, 16384) = 8192

pread64(6, "\376\376\v\1MARIALOG\345\n\357\377\323\377\5\0\20'\0\0\247\210\1\0\0\0\0\0"..., 44, 0) = 44

pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 65536, 0) = 65536

pread64(11, "\0\0\0\0\0\0\0\5\377\377\377\377\377\377\377\377\0\0\0\0\0\0r\306\0\7\0\0\0\0\0\0"..., 16384, 81920) = 16384

pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 1048576, 1048576) = 1048576

pread64(11, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 2097152) = 1048576

pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 16384, 0) = 16384

pread64(12, "PHYS\0\0\0\2\0\0\0\0\0\0\"\0MariaDB 10.5.19\0"..., 4096, 0) = 4096

pread64(12, 0x2aa35e0c000, 512, 38400)  = -1 EINVAL (Invalid argument)

For reference I attached the full strace to this issue as a file.

Further, I created the table with the options you mentioned on a working dataset with --innodb-log-file-buffering=TRUE and the default option innodb_flush_method=O_DIRECT, inserted and queried data which worked flawlessly.

Leonardo Martinho added a comment - 2023-07-07 11:26 So I ran mariadbd with the broken data through strace. The following pread64 calls should be of importance: pread64(4, "\376\376\f\1\22\326\30&\34\32\21\356\267l\2\1!\0\19\36\0\26\0\0 \316\273D\271\fr"..., 52, 0) = 52 pread64(6, "\2\0\0\1\0\0\0\300\r\3\0\0\0\0\0\0\377\0\377\377\377\377\377\377\377\377\377\377\377\377\377\377"..., 8192, 16384) = 8192 pread64(6, "\376\376\v\1MARIALOG\345\n\357\377\323\377\5\0\20'\0\0\247\210\1\0\0\0\0\0"..., 44, 0) = 44 pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 65536, 0) = 65536 pread64(11, "\0\0\0\0\0\0\0\5\377\377\377\377\377\377\377\377\0\0\0\0\0\0r\306\0\7\0\0\0\0\0\0"..., 16384, 81920) = 16384 pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 1048576, 1048576) = 1048576 pread64(11, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576, 2097152) = 1048576 pread64(11, "\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\255!\0\10\0\0\0\0\0\0"..., 16384, 0) = 16384 pread64(12, "PHYS\0\0\0\2\0\0\0\0\0\0\"\0MariaDB 10.5.19\0"..., 4096, 0) = 4096 pread64(12, 0x2aa35e0c000, 512, 38400) = -1 EINVAL (Invalid argument) For reference I attached the full strace to this issue as a file. Further, I created the table with the options you mentioned on a working dataset with --innodb-log-file-buffering=TRUE and the default option innodb_flush_method=O_DIRECT, inserted and queried data which worked flawlessly.

Daniel Black added a comment - 2023-07-07 12:55 - edited

512 bytes < 4096, +38400 byte offset / 4096 blocksize is 9.375 blocks, so it doesn't align so that's why the O_DIRECT isn't working.

Daniel Black added a comment - 2023-07-07 12:55 - edited 512 bytes < 4096, +38400 byte offset / 4096 blocksize is 9.375 blocks, so it doesn't align so that's why the O_DIRECT isn't working.

Marko Mäkelä added a comment - 2023-07-07 15:14

Thank you, leo_. This is specific to the upgrade code path. Before ~~MDEV-14425~~ in MariaDB Server 10.8, the only possible InnoDB log block size was 512 bytes. In the new format, the size can be basically anything (one mini-transaction is one block), but we try to read and write using the physical block size.

Once the redo log has been upgraded (it happens before any connections are accepted), you should safely be able to use innodb_log_file_buffering=OFF. I realize that we never tested upgrade with a 4096-byte block size device.

Marko Mäkelä added a comment - 2023-07-07 15:14 Thank you, leo_ . This is specific to the upgrade code path. Before MDEV-14425 in MariaDB Server 10.8, the only possible InnoDB log block size was 512 bytes. In the new format, the size can be basically anything (one mini-transaction is one block), but we try to read and write using the physical block size. Once the redo log has been upgraded (it happens before any connections are accepted), you should safely be able to use innodb_log_file_buffering=OFF . I realize that we never tested upgrade with a 4096-byte block size device.

Marko Mäkelä added a comment - 2023-07-10 08:58

I was able to observe the misaligned read using the test case of this fix if I revert the code change:

./mtr --rr innodb.log_corruption

rr replay var/log/mysqld.1.rr/mariadbd-25

I get to see how this artificial redo log sample (with the empty log starting at byte offset 0x120c instead of the previous 0x80c) is processed. The relevant code is here:

static dberr_t recv_log_recover_10_5(lsn_t lsn_offset)

  byte *buf= const_cast<byte*>(field_ref_zero);

  if (lsn_offset < (log_sys.is_pmem() ? log_sys.file_size : 4096))

    memcpy_aligned<512>(buf, &log_sys.buf[lsn_offset & ~511], 512);

  else

    recv_sys.read(lsn_offset & ~lsn_t{511}, {buf, 512});

With the original test, we would copy the data from log_sys.buf, because its file offset (0x80c) is less than 4096 (0x1000), and thus present in the buffer from a previous read.

With the modified test, the record needs to be read from file offset 0x120c. The else branch is using only 512-byte alignment for that, so it will use the offset 4608 (0x1200) and size 512. My fix is to submit a 4096-byte aligned read and adjust buf+=lsn_offset & 0xe00 so that it will point to the correct logical 512-byte block.

To my surprise, even if I reverted the code change, the INTEL SSDSC2KG019T8 on the system where I tested this would happily accept the misaligned read (reading 0x200 bytes at 0x1200). This probably is because smartctl -a /dev/sda reports the following:

Sector Sizes:     512 bytes logical, 4096 bytes physical

The drive /dev/sdb (a 7,200 rpm SATA 3.3 hard disk) is equally forgiving. Also an Intel PMEM device mounted in regular mode would report a 4096-byte physical block size, but it would happily accept the 512-byte aligned read. (I think its native block size is 256 bytes. In PMEM mode, the interface is 64 bytes, or one cache line.)

Marko Mäkelä added a comment - 2023-07-10 08:58 I was able to observe the misaligned read using the test case of this fix if I revert the code change: ./mtr --rr innodb.log_corruption rr replay var/log/mysqld.1.rr/mariadbd-25 I get to see how this artificial redo log sample (with the empty log starting at byte offset 0x120c instead of the previous 0x80c) is processed. The relevant code is here: static dberr_t recv_log_recover_10_5(lsn_t lsn_offset) { byte *buf= const_cast <byte*>(field_ref_zero); if (lsn_offset < (log_sys.is_pmem() ? log_sys.file_size : 4096)) memcpy_aligned<512>(buf, &log_sys.buf[lsn_offset & ~511], 512); else recv_sys.read(lsn_offset & ~lsn_t{511}, {buf, 512}); With the original test, we would copy the data from log_sys.buf , because its file offset (0x80c) is less than 4096 (0x1000), and thus present in the buffer from a previous read. With the modified test, the record needs to be read from file offset 0x120c. The else branch is using only 512-byte alignment for that, so it will use the offset 4608 (0x1200) and size 512. My fix is to submit a 4096-byte aligned read and adjust buf+=lsn_offset & 0xe00 so that it will point to the correct logical 512-byte block. To my surprise, even if I reverted the code change, the INTEL SSDSC2KG019T8 on the system where I tested this would happily accept the misaligned read (reading 0x200 bytes at 0x1200). This probably is because smartctl -a /dev/sda reports the following: Sector Sizes: 512 bytes logical, 4096 bytes physical The drive /dev/sdb (a 7,200 rpm SATA 3.3 hard disk) is equally forgiving. Also an Intel PMEM device mounted in regular mode would report a 4096-byte physical block size, but it would happily accept the 512-byte aligned read. (I think its native block size is 256 bytes. In PMEM mode, the interface is 64 bytes, or one cache line.)

MariaDB Server

Upgrade from 10.7 or earlier may crash if innodb_log_file_buffering=OFF

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration