[MDEV-27593] Crashing on I/O error is unhelpful Created: 2022-01-24 Updated: 2023-12-08 Resolved: 2023-08-01 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5.13, 10.8.1 |
| Fix Version/s: | 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2, 11.2.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Daniel Black | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Environment: | fc35 + uring |
| Description |
|
When the disk ran out of space, an assertion failure was generated, like:
|
| Comments |
| Comment by Daniel Black [ 2022-01-24 ] | |
|
PR 1995. Mapping the file descriptor to a file name is a bit of work, as wlad mentioned in a former PR, so I'm hoping a number is sufficient for now. | |
| Comment by Marko Mäkelä [ 2022-01-24 ] | |
|
I think that we should distinguish page read and write errors.

On a page write error, we do not have much choice other than to eventually terminate the process. I think that the minimum necessary handling is to simply leave the page dirty in the buffer pool, which will correctly prevent a log checkpoint from occurring.

Page read errors should be handled in the same way as a checksum mismatch: evict the page frame from the buffer pool, and possibly propagate the error in some way to the caller (as noted in […]).

I think that at least the write side can be improved. There is no need to immediately crash the server on a page write failure. | |
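The distinction drawn above between read and write failures can be sketched as a small model. This is a hypothetical Python illustration, not InnoDB code: the `Page`, `BufferPool`, and `on_io_complete` names are invented for the example, and the returned strings merely label the outcome.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    page_id: int
    dirty: bool = False


@dataclass
class BufferPool:
    # Hypothetical stand-in for the buffer pool: page_id -> Page.
    pages: dict = field(default_factory=dict)

    def on_io_complete(self, page_id: int, is_write: bool, err: int) -> str:
        """Handle an I/O completion; err == 0 means success."""
        page = self.pages[page_id]
        if err == 0:
            if is_write:
                page.dirty = False  # a successful write makes the page clean
            return "ok"
        if is_write:
            # Write failure: keep the page dirty so that a log checkpoint
            # cannot advance past its changes; the write can be retried.
            return "write failed, page stays dirty"
        # Read failure: treat it like a checksum mismatch, evict the frame
        # and report the error to the caller.
        del self.pages[page_id]
        return "read failed, page evicted"
```

In this model neither failure path terminates the process; only the bookkeeping differs between the two cases.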
| Comment by Daniel Black [ 2022-02-18 ] | |
| |
| Comment by Marko Mäkelä [ 2022-02-18 ] | |
|
I can repeat crashes with io_uring, but not with libaio, when running the following:
On NVMe instead of /dev/shm, it took a bit longer to crash. The cb->m_err was always 195 (not 28, which is ENOSPC on Linux). I repeated it on Linux 5.17-rc3 and on Debian’s 5.16-0-1-amd64 (5.16.7). | |
| Comment by Daniel Black [ 2022-07-12 ] | |
|
NFS note from an LWN article: EIO is returned when accessing a file descriptor whose NFS lock has been lost. | |
| Comment by suresh ramagiri [ 2023-06-26 ] | |
|
One of our customers, running MariaDB 10.5.10, hit the same assertion reported here. The disk was not full, but we can see I/O errors in the dmesg error logs:

dmesg: […]

It looks to me like the I/O error is related to the floppy drive, which is not present; the system still tries to access it and gets the I/O error. | |
| Comment by Marko Mäkelä [ 2023-06-26 ] | |
|
While 10.5 only uses libaio and not liburing, fixing MDEV-29610 could be helpful. | |
| Comment by Marko Mäkelä [ 2023-08-01 ] | |
|
This is conceptually related to […].

When a page write fails, we should do everything that we do for a successful page write, except that we must not mark the page as clean (or remove it from buf_pool.flush_list or buf_pool.LRU). A subsequent iteration of buf_flush_page_cleaner() should retry the write later. If all subsequent writes fail, then buf_flush_page_cleaner() will fail to make any progress, and all writer threads will eventually be blocked in log_free_check() because the log checkpoint cannot be advanced to make room in the circular write-ahead log ib_logfile0.

When an asynchronous page read fails, we can simply discard the corrupted page from the buffer pool. If the page is requested via buf_page_get_low(), then a synchronous read will be invoked via buf_page_read(). In |
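The write-retry behaviour described in this comment can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not the actual InnoDB implementation: `FlushModel` and its methods are hypothetical, each modification consumes exactly one LSN unit, and `can_write_log()` merely stands in for the role that log_free_check() plays.

```python
from collections import OrderedDict


class FlushModel:
    """Toy model: dirty pages keyed by the LSN of their oldest change."""

    def __init__(self, log_capacity: int):
        self.flush_list = OrderedDict()  # page_id -> oldest modification LSN
        self.checkpoint_lsn = 0
        self.current_lsn = 0
        self.log_capacity = log_capacity

    def can_write_log(self) -> bool:
        # Stand-in for log_free_check(): writers must wait when the circular
        # log has no room left beyond the last checkpoint.
        return self.current_lsn - self.checkpoint_lsn < self.log_capacity

    def modify(self, page_id: int) -> bool:
        if not self.can_write_log():
            return False  # the writer would block here
        self.current_lsn += 1
        # Remember only the oldest change of a page that is already dirty.
        self.flush_list.setdefault(page_id, self.current_lsn)
        return True

    def page_cleaner_pass(self, write_page) -> None:
        # One iteration of the page cleaner: try to write every dirty page.
        for page_id in list(self.flush_list):
            if write_page(page_id):
                del self.flush_list[page_id]  # success: page becomes clean
            # Failure: the page stays in flush_list; a later pass retries.
        # The checkpoint may only advance up to the oldest unwritten change.
        oldest = min(self.flush_list.values(), default=self.current_lsn + 1)
        self.checkpoint_lsn = oldest - 1
```

With `log_capacity=3`, three modifications fill the log. A cleaner pass whose writes all fail leaves the checkpoint at 0 and writers blocked, while a later successful pass advances the checkpoint and unblocks them, with no crash in between.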