[MDEV-31642] Upgrade from 10.7 or earlier may crash if innodb_log_file_buffering=OFF Created: 2023-07-07 Updated: 2023-09-19 Resolved: 2023-07-10 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.11.3, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2 |
| Fix Version/s: | 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2, 11.2.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Leonardo Martinho | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | crash, upgrade | ||
| Environment: |
s390x arch, debian 12, 5.10.0-23-s390x kernel, 6.1.0-9-s390x kernel |
||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Description |
|
Upgrading from Debian 11 to Debian 12 yields an exception from which mariadb doesn't seem to recover. This can be reproduced with a clean installation and no data inserted whatsoever. It seems to have something to do with the InnoDB structure that was generated with version 10.5.19. I tested this with both the Debian 11 (5.10.0-23) and Debian 12 (6.1.0-9) kernels, and both fail. The only fix seems to be to delete /var/lib/mysql in its entirety and regenerate all of the data structures by starting mariadb with an empty directory. From that point on the exception is gone and everything works as expected. I attached the error log produced when /usr/sbin/mariadbd version 10.11.3 is started with data generated by 10.5.19. |
| Comments |
| Comment by Daniel Black [ 2023-07-07 ] | ||||||||||||
|
Thanks for the bug report, leo_. In As 10.5 didn't have liburing support, it's just the 10.6+ instances. Can you attempt an upgrade with innodb_use_native_aio=0 set, and see if this reoccurs? I think this crash occurred before any change was made in the datadir, so the data may still be OK. | ||||||||||||
| Comment by Marko Mäkelä [ 2023-07-07 ] | ||||||||||||
|
From error_log
Relevant code:
What is the type of the file system? Would the server start up if you specify innodb_log_file_buffering=OFF? See | ||||||||||||
| Comment by Marko Mäkelä [ 2023-07-07 ] | ||||||||||||
|
danblack, thank you for your hints, but the redo log file always uses synchronous reads, so I do not think that io_uring ( | ||||||||||||
| Comment by Leonardo Martinho [ 2023-07-07 ] | ||||||||||||
|
The file system that /var/lib/mysql is on is ext4. I tried both options, innodb_log_file_buffering=OFF and innodb_use_native_aio=0, but the error still seems to occur. | ||||||||||||
| Comment by Leonardo Martinho [ 2023-07-07 ] | ||||||||||||
|
I tried to explicitly launch /usr/sbin/mariadbd --innodb-log-file-buffering=TRUE (as in | ||||||||||||
| Comment by Marko Mäkelä [ 2023-07-07 ] | ||||||||||||
|
leo_, sorry, I got it the wrong way around. In any case, disabling the use of O_DIRECT (enabling reads and writes via the Linux kernel’s file system cache) works. Can you post details on the block device where the file system is mounted? Possibly, the detection of the physical block size is incorrect, and it is actually larger than 4096 bytes. | ||||||||||||
| Comment by Leonardo Martinho [ 2023-07-07 ] | ||||||||||||
|
I checked the block size on the device the fs is mounted on - it is actually 4096 bytes. cat /sys/block/dasdb/queue/physical_block_size | ||||||||||||
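The same check can be scripted. The following is a minimal sketch, not part of MariaDB: the function name `queue_block_sizes` and its `sysfs_root` parameter are hypothetical, and it assumes the standard Linux sysfs layout `/sys/block/<device>/queue/`. It also reads `logical_block_size`, since on Linux the alignment that direct I/O expects is usually tied to the logical block size of the device.

```python
from pathlib import Path


def queue_block_sizes(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the logical and physical block sizes (in bytes) of a block device.

    `device` is a kernel device name such as "dasdb" or "sda".
    `sysfs_root` is only a parameter so the function can be pointed at a
    test directory; on a real system the default is what you want.
    """
    queue = Path(sysfs_root) / device / "queue"
    return {
        name: int((queue / name).read_text().strip())
        for name in ("logical_block_size", "physical_block_size")
    }
```

Calling `queue_block_sizes("dasdb")` on the affected system would be the scripted equivalent of the `cat` command above.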
| Comment by Marko Mäkelä [ 2023-07-07 ] | ||||||||||||
|
Can you run mariadbd under strace or a debugger to find out which parameters had been passed to the failing pread64 system call? I think that the recovery should first try to read the 12888 bytes starting at offset 0, and then read some pages starting at the one that includes the byte offset that the latest checkpoint LSN evaluates to. I should have covered that on a device whose physical block size is 4096 bytes. The NVMe and SSD that I have at home have a 512-byte physical block size, but one HDD has 4096 bytes. Would this same system have trouble accessing tables like this when you are using the
This should create a table with 4096-byte physical page size in the storage. | ||||||||||||
| Comment by Leonardo Martinho [ 2023-07-07 ] | ||||||||||||
|
So I ran mariadbd with the broken data through strace. The following pread64 calls should be of importance:
For reference I attached the full strace to this issue as a file. Further, I created the table with the options you mentioned on a working dataset with --innodb-log-file-buffering=TRUE and the default option innodb_flush_method=O_DIRECT, inserted and queried data which worked flawlessly. | ||||||||||||
| Comment by Daniel Black [ 2023-07-07 ] | ||||||||||||
|
512 bytes < 4096, and the 38400-byte offset / 4096-byte block size is 9.375 blocks, so the read doesn't align; that's why O_DIRECT isn't working. | ||||||||||||
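The arithmetic in the comment above can be checked with a short sketch. The helper name `is_direct_io_aligned` is hypothetical, and the rule it encodes (both the file offset and the length must be multiples of the block size) is a simplified assumption about what O_DIRECT typically requires on Linux; buffer address alignment is ignored here. The values 38400, 512 and 4096 are taken from the thread.

```python
def is_direct_io_aligned(offset: int, size: int, block_size: int) -> bool:
    """True if both the file offset and the length are multiples of block_size."""
    return offset % block_size == 0 and size % block_size == 0


offset, size = 38400, 512

# On a device with 512-byte blocks the request is fine: 38400 = 75 * 512.
print(is_direct_io_aligned(offset, size, 512))   # True

# On a device with 4096-byte blocks it is not: 38400 / 4096 = 9.375,
# so the read starts 1536 bytes into the tenth block.
print(offset / 4096)                              # 9.375
print(offset % 4096)                              # 1536
print(is_direct_io_aligned(offset, size, 4096))  # False
```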
| Comment by Marko Mäkelä [ 2023-07-07 ] | ||||||||||||
|
Thank you, leo_. This is specific to the upgrade code path. Before Once the redo log has been upgraded (it happens before any connections are accepted), you should safely be able to use innodb_log_file_buffering=OFF. I realize that we never tested upgrade with a 4096-byte block size device. | ||||||||||||
| Comment by Marko Mäkelä [ 2023-07-10 ] | ||||||||||||
|
I was able to observe the misaligned read using the test case of this fix if I revert the code change:
I get to see how this artificial redo log sample (with the empty log starting at byte offset 0x120c instead of the previous 0x80c) is processed. The relevant code is here:
With the original test, we would copy the data from log_sys.buf, because its file offset (0x80c) is less than 4096 (0x1000), and thus present in the buffer from a previous read. With the modified test, the record needs to be read from file offset 0x120c. The else branch is using only 512-byte alignment for that, so it will use the offset 4608 (0x1200) and size 512. My fix is to submit a 4096-byte aligned read and adjust buf+=lsn_offset & 0xe00 so that it will point to the correct logical 512-byte block. To my surprise, even if I reverted the code change, the INTEL SSDSC2KG019T8 on the system where I tested this would happily accept the misaligned read (reading 0x200 bytes at 0x1200). This probably is because smartctl -a /dev/sda reports the following:
The drive /dev/sdb (a 7,200 rpm SATA 3.3 hard disk) is equally forgiving. Also an Intel PMEM device mounted in regular mode would report a 4096-byte physical block size, but it would happily accept the 512-byte aligned read. (I think its native block size is 256 bytes. In PMEM mode, the interface is 64 bytes, or one cache line.) |
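The offset arithmetic behind the fix described above can be illustrated as follows. This is a sketch of the calculation only, not the MariaDB source: the function name `aligned_read_plan` and the constant `BLOCK` are hypothetical, while the mask `0xe00` and the sample offsets 0x80c and 0x120c come from the comment.

```python
BLOCK = 4096  # physical block size that the read must be aligned to


def aligned_read_plan(lsn_offset: int) -> tuple:
    """Plan a 4096-byte aligned read covering the 512-byte block at lsn_offset.

    Returns (read_offset, read_size, buf_skip), where buf_skip is how far
    into the 4096-byte buffer the wanted 512-byte block starts,
    i.e. lsn_offset & 0xe00.
    """
    read_offset = lsn_offset & ~(BLOCK - 1)  # round down to a 4096-byte boundary
    buf_skip = lsn_offset & 0xe00            # bits 9..11 select the 512-byte block
    return read_offset, BLOCK, buf_skip


# Modified test: the record is at 0x120c. The old code would read 512 bytes
# at 0x1200; the aligned plan reads 4096 bytes at 0x1000 and skips 0x200.
print([hex(v) for v in aligned_read_plan(0x120c)])  # ['0x1000', '0x1000', '0x200']

# Original test: 0x80c falls inside the first 4096-byte block.
print([hex(v) for v in aligned_read_plan(0x80c)])   # ['0x0', '0x1000', '0x800']
```

In both cases `read_offset + buf_skip` equals the 512-byte aligned offset that the old code would have used, so the logical block being parsed is unchanged; only the request submitted to the device differs.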