[MDEV-25121] innodb_flush_method=O_DIRECT fails on compressed tables
| Created: | 2021-03-12 | Updated: | 2021-06-29 | Resolved: | 2021-03-18 |
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server, Storage Engine - InnoDB |
| Affects Version/s: | 10.5.9 |
| Fix Version/s: | 10.2.38, 10.3.29, 10.4.19, 10.5.10, 10.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Xiaobo Luo | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
[root@localhost~]# cat /etc/redhat-release
[root@localhost~]# free -g
The SSD device is an Intel P4610 3.2TB, and the database version is MariaDB 10.5.9 Community Edition. |
| Description |
| Comments |
| Comment by Marko Mäkelä [ 2021-03-12 ] | |||||||||||||||||||||||||||
|
Thank you for the report. The crash occurs due to the following:
This code was refactored in

I see that you are using innodb_flush_method=O_DIRECT, which should ensure that DMA is being used. Without it, the Linux kernel could use more CPU in the io_submit() call. In 10.6, we finally changed that to be the default (

I suspect that this is somehow related to the use of page_compressed=1 tables. We have tested MariaDB on various hardware (including NVMe). I think that I ran the ./mtr regression test suite on my NVMe (Intel Optane 960, INTEL SSDPED1D960GAY) when I implemented

I think that we need more information to fix this. Could you provide some strace output that could hint at what went wrong? Did I get it right that this does not occur if you are not using page_compressed tables? What about ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1 (using 1KiB block size)? I understood that we should avoid setting O_DIRECT on the files in either case. The strace output would help verify that.

If I have understood it correctly, for the ScaleFlux hardware, we would probably want to change the page_compressed code so that the various IORequest::PUNCH will never be used, but instead sequences of NUL bytes will be written. That is, we would want to let the file system treat the data files as regular files, and the smart storage would transparently compress the individual sectors. | |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-12 ] | |||||||||||||||||||||||||||
|
It occurred to me that my NVMe supports a 512-byte block size, so I would be unable to repeat the problem on my hardware:
It could be that all our test environments have SSDs with 512-byte sector size. | |||||||||||||||||||||||||||
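One way to check which sector size a test machine actually exposes is to read the logical block size that the Linux kernel publishes in sysfs. The following is a minimal sketch, assuming a Linux host with the usual `/sys/block/<device>/queue/logical_block_size` layout. The function name and the `sysfs_root` parameter are hypothetical helpers for illustration, not part of MariaDB.

```python
from pathlib import Path


def logical_block_size(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the logical block size (in bytes) reported for a block device.

    `device` is a kernel device name such as "nvme0n1" (illustrative only).
    `sysfs_root` exists only so that the function can be pointed at a fake
    directory tree in tests; on a real Linux system keep the default.
    """
    path = Path(sysfs_root) / device / "queue" / "logical_block_size"
    return int(path.read_text().strip())
```

On a real system, a call such as `logical_block_size("nvme0n1")` would be expected to return 512 or 4096, depending on how the namespace is formatted.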
| Comment by Marko Mäkelä [ 2021-03-12 ] | |||||||||||||||||||||||||||
|
My attempts to configure a SATA SSD as well as my Intel NVMe drive to use 4096-byte sector size instead of 512-byte size failed. Apparently, the proprietary Intel tool isdct has been replaced with intelmas. | |||||||||||||||||||||||||||
| Comment by Xiaobo Luo [ 2021-03-13 ] | |||||||||||||||||||||||||||
|
I have tried to provide more information on this issue, as follows:

1. When InnoDB tables are created with ROW_FORMAT=COMPRESSED, the MariaDB Server process does not crash, regardless of whether the logical block size of the SSD device is set to 4KB or 512 bytes.
2. I set the logical sector size to 4K on the Intel P4610 and the ScaleFlux CSD 2000, and reproduced the crash of the MariaDB Server process. At the same time, I collected the strace output and the core dump of the process. For details, please refer to the zip file "core_dump_and_strace_file.tar.

It can be installed using the software package "isdct-3.0.26.400-1.x86_64.rpm" or "isdct_3.0.26.400-1_amd64.deb"
| |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-13 ] | |||||||||||||||||||||||||||
|
Thank you. In the strace_intel_nvmessd_p4610-4k.txt, I see the following:
I believe that all these calls except the last one are part of is_linux_native_aio_supported(). This "probe" write is for innodb_page_size (default 16384) bytes, using an aligned buffer, and I do not think that it can be related to the crash. I would expect there to be further io_submit() and io_getevents() calls in the trace before the crash, and I am not seeing that. In fact, I am not seeing any other io_ system calls between the above and the SIGABRT.

As far as I can tell, no .ibd files should be open at the time of the crash. A probe is opening and closing all files. The only file descriptors that could be open in O_DIRECT mode at the time of the crash are "./ibtmp1" (file descriptor 12) and "./ibdata1" (file descriptor 10). I found only synchronous pread64() for file descriptor 10, and no asynchronous callback.

For the other trace, strace_scaleflux-csd2000-4k.txt, it looks similar (including the file descriptor numbers for the data files).

I did not check the core dumps yet, because in order to interpret those, I would need a copy of the mariadbd executable that you were using, as well as a copy of all shared libraries (listed by ldd mariadbd). Maybe it would be easier if you posted the output of the following:
For this to be useful, debug symbols for mariadbd should be available. If you are using a MariaDB package instead of compiling it from the source code, then the debugging symbols are usually available in a separate package that needs to be additionally installed. We need the debugging symbols in order to see the local variables and the function parameter values.

I am not familiar with the internals of strace. Maybe it can lose a trace of some system calls if the monitored process is killed? After all, there was no evidence of any write to a data file in the traces, and if I remember correctly, neither the system tablespace nor the temporary tablespace should ever be written in blocks smaller than innodb_page_size.

Because the crash occurs very early after startup, this should be easy to fix, once I get remote access to a system. We are working on that. | |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-16 ] | |||||||||||||||||||||||||||
|
I got remote access to a system, but possibly used the wrong device because I was unable to repeat the failure yet. This will take some time, because the sysadmin is in a different time zone. | |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-17 ] | |||||||||||||||||||||||||||
|
I am able to repeat the problems. In the test innodb.innodb-page_compression_zip, another surprise emerged. An error is being reported for a 512-byte write to the following table:
This looks like a possible regression due to

After I worked around that failure by modifying the test case, a 512-byte write was attempted on another file, on which we had enabled O_DIRECT earlier:
Either we must ensure that O_DIRECT is never enabled on page_compressed files, or we must make the write size a multiple of the underlying block size. | |||||||||||||||||||||||||||
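The second option, making the write size a multiple of the underlying block size, comes down to simple rounding arithmetic. The following is a minimal sketch in Python rather than the server's C++; the function names are hypothetical and are not taken from the InnoDB source. It assumes the usual Linux O_DIRECT expectation that both the file offset and the length of a transfer are multiples of the logical block size.

```python
def round_up_to_block(length: int, block_size: int) -> int:
    """Return the smallest multiple of block_size that is >= length."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a positive power of two")
    # Adding block_size - 1 and masking off the low bits rounds upwards.
    return (length + block_size - 1) & ~(block_size - 1)


def is_direct_io_aligned(offset: int, length: int, block_size: int) -> bool:
    """Check whether an (offset, length) pair is block-aligned and non-empty."""
    return length > 0 and offset % block_size == 0 and length % block_size == 0


# A 512-byte compressed payload on a device with 4096-byte blocks:
payload = 512
print(is_direct_io_aligned(0, payload, 4096))                           # False
print(is_direct_io_aligned(0, round_up_to_block(payload, 4096), 4096))  # True
```

With 512-byte blocks the same 512-byte write is already aligned, which is consistent with the failure being repeatable only on storage that uses a 4096-byte block size.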
| Comment by Marko Mäkelä [ 2021-03-17 ] | |||||||||||||||||||||||||||
|
There is a data field fil_node_t::block_size that is supposed to store the file system block size. But, it is only being assigned in fil_node_t::read_page0(), that is, when accessing pre-existing data files for the first time since the server started up. Here is a minimal test case to demonstrate this.
If the line to restart the server is present, the server will not crash, because after restart, we would invoke fil_node_t::read_page0() to correctly initialize fil_node_t::block_size. If that line is missing, the server will crash during the flush table t1 for export statement, because we left fil_node_t::block_size at the value 0 when creating the file.

xiaoboluo768, I think that you can apply the idea of my above test case to work around the bug: First create the tables normally (without a page_compression attribute), then restart the server and execute ALTER TABLE to enable compression, and finally insert the data.

I should have a fix for this soon. | |||||||||||||||||||||||||||
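The initialization gap described above can be illustrated with a toy model. This is a sketch in Python, not the server's C++ code: the class `FileNode` and the function `compressed_write_size` are hypothetical stand-ins, and the explicit exception only marks the point where an unset block size becomes a problem; it does not reproduce the server's actual failure path.

```python
class FileNode:
    """Toy stand-in for fil_node_t; only the block_size field is modeled."""

    def __init__(self, name: str):
        self.name = name
        # A freshly created file never gets its block size assigned.
        self.block_size = 0

    def read_page0(self, fs_block_size: int) -> None:
        # Models the only place where block_size is assigned: the first
        # access to a pre-existing file after server startup.
        self.block_size = fs_block_size


def compressed_write_size(payload_len: int, node: FileNode) -> int:
    """Round a compressed payload up to the node's block size."""
    if node.block_size == 0:
        raise RuntimeError(f"{node.name}: block size was never initialized")
    blocks = -(-payload_len // node.block_size)  # ceiling division
    return blocks * node.block_size


node = FileNode("t1.ibd")          # created in the current server run
try:
    compressed_write_size(512, node)
except RuntimeError as e:
    print("before restart:", e)

node.read_page0(4096)              # what a restart would trigger
print("after restart:", compressed_write_size(512, node))
```

The workaround of creating the table, restarting, and only then enabling compression corresponds to calling `read_page0()` before the first compressed write in this model.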
| Comment by Xiaobo Luo [ 2021-03-17 ] | |||||||||||||||||||||||||||
|
Okay, thank you Marko | |||||||||||||||||||||||||||
| Comment by Xiaobo Luo [ 2021-03-17 ] | |||||||||||||||||||||||||||
|
I think I should wait for this problem to be fixed and then I will run the stress test script for testing, because the test is not very urgent at the moment | |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-17 ] | |||||||||||||||||||||||||||
|
There was an earlier attempt to fix this bug: | |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-03-17 ] | |||||||||||||||||||||||||||
|
I have a fix for 10.5 and 10.6 that makes the tests pass on the remote system. The earlier attempt of fixing this (

Before my fix, I got failures also for ROW_FORMAT=COMPRESSED tables using KEY_BLOCK_SIZE=1 (1024 bytes) or KEY_BLOCK_SIZE=2 (2048 bytes). It is easiest to refuse O_DIRECT for them. I intend to deprecate and remove that format.

wlad is now checking that after my fix, everything will work correctly on Microsoft Windows, and then I will have to port and test the fix on 10.2, 10.3, 10.4. I believe that all major versions will differ a little in this area. | |||||||||||||||||||||||||||
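The idea of refusing O_DIRECT for small ROW_FORMAT=COMPRESSED pages can be sketched as a per-file decision. The exact condition used in the fix is not shown in this ticket; the rule below (refuse O_DIRECT whenever the physical page size is not a multiple of the storage block size) is an assumption for illustration, and both function names are hypothetical. The sketch is in Python, not the server's C++.

```python
def physical_page_size(key_block_size_kib: int, innodb_page_size: int = 16384) -> int:
    """Physical page size in bytes; KEY_BLOCK_SIZE=0 means "not compressed"."""
    if key_block_size_kib == 0:
        return innodb_page_size
    return key_block_size_kib * 1024


def may_use_o_direct(page_size: int, block_size: int) -> bool:
    """Assumed rule: allow O_DIRECT only if every page write is block-aligned."""
    return block_size > 0 and page_size % block_size == 0


# Which KEY_BLOCK_SIZE values would be refused on 4096-byte blocks?
for kbs in (1, 2, 4, 8, 0):
    size = physical_page_size(kbs)
    print(f"KEY_BLOCK_SIZE={kbs}: {size} bytes, O_DIRECT allowed: "
          f"{may_use_o_direct(size, 4096)}")
```

Under this assumed rule, KEY_BLOCK_SIZE=1 and KEY_BLOCK_SIZE=2 are refused on 4096-byte blocks, matching the two cases that failed before the fix, while the same tables would pass on 512-byte blocks.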
| Comment by Marko Mäkelä [ 2021-03-17 ] | |||||||||||||||||||||||||||
|
I verified that MariaDB 10.2 is already broken. Furthermore, I tested 10.2, 10.4 and 10.6 based branches without and with my fix. I will keep this ticket open until the fix has been merged up to 10.6. On 10.4, I used the following invocation:
while mysql-test/var was a symlink to a directory in the SSD with 4KiB block size. 10.4 would normally use innodb_checksum_algorithm=crc32; the default was changed for 10.5 in |