Details

    Description

      The writing of modified InnoDB data pages to data files should be overhauled. See some of the comments in MDEV-15058.

      Attachments

        Issue Links

          Activity

            Page flushing performance should be improved by removing the unsorted buf_pool->flush_list, and always sorting the writes in the same way as buf_pool->flush_rbt does. In this way, there could be more progress from the page writes. This should also allow crash recovery to proceed in the background (MDEV-14481) while the server is already accepting connections.

            marko Marko Mäkelä added a comment

            We should look at implementing direct I/O. Where applicable, perhaps we could also replace `fsync()` with asynchronous `IOCB_CMD_FDSYNC` operations.

            marko Marko Mäkelä added a comment

            Craig Ringer on the PostgreSQL mailing list shared a link to his fsync() test program, which suggests fault injection by dmsetup.

            I think that we should perform this kind of fault injection testing for InnoDB.

            marko Marko Mäkelä added a comment
            The FOSDEM 2019 talk on PostgreSQL vs. fsync is worth watching.

            marko Marko Mäkelä added a comment

            In the function buf_flush_write_block_low(), there is a call log_write_up_to(bpage->newest_modification, true), which seems out of place. Yes, we must ensure that we will not write out any pages before writing the corresponding redo log (write-ahead logging). But maybe we could just move to the next block in the flush_list whose newest modification is old enough to be written out. Only once we have run out of all such pages would it seem to make sense to wait for a flush of the redo log. In this way, perhaps the page flushing could be managed by fewer threads.

            marko Marko Mäkelä added a comment

            As noted in MDEV-19356, it might be worthwhile for fil_system.LRU to prefer closing those files for which there are no pending changes in the buffer pool.

            marko Marko Mäkelä added a comment

            As pointed out in MySQL Bug #94912, writes to O_DIRECT files do not imply fdatasync(). We must keep this in mind.

            Perhaps all persistent InnoDB files (data and redo log files) should be written to in the same way.

            marko Marko Mäkelä added a comment
            anjumnaveed81 Anjum Naveed added a comment -

            I would love to contribute to this. Please let me know how I can be of help.


            anjumnaveed81, we are in the early planning stage, and the plan is to completely overhaul the InnoDB buffer pool I/O interface in 10.5.
            If you have some design ideas, feel free to write them here.

            marko Marko Mäkelä added a comment
            zhaiwx1987 zhai weixiang added a comment -

            It would be great if we could flush a page without acquiring an S lock on the block. For example, we could copy out the page before the I/O.


            zhaiwx1987, such ‘shadow page’ flushing is indeed being used by other databases. MariaDB already uses separate write buffers for page_compressed and encrypted tables, but probably unnecessarily keeps holding the page latch. It is definitely worth trying, and could make log checkpoint faster (even after we have addressed MDEV-14462).

            wlad, as noted in MDEV-16264, I think that SRV_MAX_N_IO_THREADS and any related code and variables should be removed.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            The goal of this task is to ensure that a single buffer pool instance will not perform worse in a write-heavy benchmark than multiple buffer pools. This is a potential concern with MDEV-15058, and should already be helped by MDEV-15053.

            When running the benchmark, you should disable the InnoDB doublewrite buffer, because that is an obvious bottleneck.


            I think that as part of this task, we should try to simplify buf_flush_write_block_low(). Can we avoid acquiring log_sys.mutex? Instead of calling log_write_up_to() in that function, perhaps the page flush can simply skip those pages whose buf_page_t::newest_modification (which by the way should actually be removed, and replaced with direct access to FIL_PAGE_LSN) is newer than what has been persisted in redo log (log_sys.write_lsn)? Then, between flush batches, trigger the log flush (if it cannot be done by the dedicated log writer task; see MDEV-14462) to ensure progress. I think that we should change the type of log_sys.write_lsn and log_sys.lsn to std::atomic<lsn_t>.

            Furthermore, I think that we should try to remove buf_pool_t::flush_rbt (whose purpose is to ensure during redo log apply that buf_pool_t::flush_list is kept in a more reasonable order) and try to always keep the flush_list sorted in a way that tries to ensure optimal progress. Maybe use std::priority_queue?

            We should remove the BUF_FLUSH_SINGLE_PAGE, but I think that we may have to keep both LRU and flush_list flushing. I may be mistaken, but based on MDEV-14550 I have the impression that in the worst case, the LRU and flush_list batches are taking too much time, blocking the progress of the other type of flushing batch. We should try to minimize mutex contention while keeping the I/O system saturated with useful page flushes (with large values of innodb_max_dirty_pages_pct, try to avoid repeated writes of a frequently changed page).

            marko Marko Mäkelä added a comment

            As part of this task, I think that the following threads that MDEV-16264 failed to convert to tasks should be removed or refactored:

            • fil_crypt_thread (its sole purpose is to make pages dirty, so that they will be re-encrypted on page flush)
            • recv_writer_thread and buf_flush_page_cleaner_coordinator have somewhat overlapping functionality. Not having a separate ‘crash recovery’ mode would seem to be a requirement of MDEV-14481.
            marko Marko Mäkelä added a comment

            Mark Callaghan’s blog (re)post Historical - InnoDB IO Performance provides some historical overview that could be relevant for this task.

            marko Marko Mäkelä added a comment

            It appears that XtraDB used to disable the single-page flushing when using a larger buffer pool, via the Boolean-disguised-as-enum parameter innodb_empty_free_list_algorithm=BACKOFF. The logic would sometimes fail, as reported in MDEV-16339.

            I think that we should consider removing the single-page flushing nevertheless. Perhaps there could be active signaling between the page cleaner and the buf_LRU_get_free_block(), instead of passive sleeping?

            marko Marko Mäkelä added a comment

            Now that most of the work has been completed in MDEV-23399 and MDEV-23855, I think that the main work to be done will be in MDEV-12227, MDEV-23756 and possibly other tickets that have been linked as related.

            Because this is mostly an umbrella task, having an estimate is not too meaningful.

            marko Marko Mäkelä added a comment

            I adjusted the priority to "Major", even though it is probably "Minor" for the rest of the things left to do here.

            wlad Vladislav Vaintroub added a comment

            I think that most of this has already been done. One remaining tweak is MDEV-26827. We failed to observe any improvement with it; in fact, a small regression was observed. So, it will need some additional work.

            marko Marko Mäkelä added a comment

            After MDEV-26055 and MDEV-26827 have been completed, there is not much that can be improved in this area. I can only think of MDEV-11378, submitting scatter-gather write requests instead of multiple single-page write requests.

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes: 1
              Watchers: 12

