[MDEV-11659] Move the InnoDB doublewrite buffer to flat files Created: 2016-12-23  Updated: 2023-04-11

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: None

Type: Task Priority: Major
Reporter: Marko Mäkelä Assignee: Unassigned
Resolution: Unresolved Votes: 4
Labels: innodb, space

Issue Links:
Blocks
blocks MDEV-11633 Make the InnoDB system tablespace opt... Open
is blocked by MDEV-23855 InnoDB log checkpointing causes regre... Closed
Relates
relates to MDEV-11658 Simpler, faster IMPORT of InnoDB tables Open

 Description   

The purpose of the InnoDB doublewrite buffer is to prevent corruption when the database server process is killed in the middle of a write operation that is replacing an already initialized page in a data file. The write logic is like this:

  1. Append to the redo log a record that covers the change(s) to the page.
  2. Write a copy of the modified page to the doublewrite buffer.
  3. Write the modified page to the data file.

If the system is killed during the last write, the page would be corrupted in the data file, and no redo log can be applied. The copy in the doublewrite buffer would save the situation.

Currently the doublewrite buffer is physically located in the InnoDB system tablespace, in pages 64 to 191. It covers all data files. The small number of pages could become a performance bottleneck, and these 128 pages are basically unnecessary garbage when the server has been shut down cleanly.

We should move the doublewrite buffer to flat files, maybe one file for each physical page size. These files would be created on startup unless innodb_read_only is set, and deleted on shutdown. The files would only be consulted on redo log apply. The size of the files (number of pages) could be derived from some flushing parameters.
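The proposed lifecycle could be sketched roughly like this. The file naming and sizing below are pure assumptions for illustration, not a design decision:

```python
# Sketch of the proposed lifecycle: create flat doublewrite files at startup
# (unless running read-only), delete them on clean shutdown.
# File names and the sizing rule are hypothetical.
import os

def dblwr_file_name(dir_path: str, page_size: int) -> str:
    # Hypothetical naming: one file per physical page size.
    return os.path.join(dir_path, f"ib_doublewrite_{page_size}")

def dblwr_create(dir_path: str, page_sizes, pages_per_file: int,
                 read_only: bool = False):
    """Create one pre-allocated doublewrite file per page size."""
    if read_only:
        return []
    paths = []
    for ps in page_sizes:
        path = dblwr_file_name(dir_path, ps)
        with open(path, "wb") as f:
            f.truncate(ps * pages_per_file)   # pre-allocate the slots
        paths.append(path)
    return paths

def dblwr_delete(paths):
    """On clean shutdown the files are simply removed."""
    for path in paths:
        os.remove(path)
```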



 Comments   
Comment by Laurynas Biveinis [ 2017-02-09 ]

FWIW Percona Server 5.7 parallel doublewrite (https://www.percona.com/doc/percona-server/5.7/performance/xtradb_performance_improvements_for_io-bound_highly-concurrent_workloads.html) buffer implements this with the following:

  • the batch flush mode doublewrite is moved out to a separate file (xb_doublewrite);
  • we haven't bothered to remove the system doublewrite buffer though - rather, we make it available as 128 slots (instead of 8 current ones) for single-page flushing should that ever be needed. Percona Server has mostly removed single page flushing, but there is an option to get legacy behavior. Otherwise it sits unused.
  • I am not sure what's there to be gained by having a per-page-size doublewrite buffer. Our current implementation still inherits the upstream bug where compressed pages are padded to 16K needlessly, but fixing it by allowing mixed-length pages in the same buffer seems trivial.
  • Sizing is automated and tied to our parallel LRU flushing algorithm and the number of buffer pool instances.
  • Our parallel flushing algorithm means that each doublewrite buffer is private to one thread only, eliminating lots of locking.
Comment by Marko Mäkelä [ 2017-02-10 ]

laurynas, thank you for the comment.
It was also pointed out that there existed a parameter innodb_doublewrite_file in Percona Server 5.5. In Percona Server 5.7, the parameter has been renamed to innodb_parallel_doublewrite_path, defaulting to datadir.

I have been playing with the thought of writing metadata separately from the page data. Currently the doublewrite recovery expects to find the tablespace ID, page number, and log sequence number within each page header. If we wrote the tablespace ID and page number separately, it would be possible to repurpose these fields for something else. But always writing aligned pages seems to be a good idea.
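The idea of writing metadata separately from the page payload might look roughly like this; the fixed-size record layout here is purely hypothetical:

```python
# Sketch of storing (tablespace id, page number, LSN) in a separate metadata
# area of the doublewrite file, instead of reading them from the page header.
# The record layout is purely hypothetical.
import struct

META_FMT = ">IIQ"   # space_id (4), page_no (4), lsn (8): one record per slot

def pack_meta(space_id: int, page_no: int, lsn: int) -> bytes:
    return struct.pack(META_FMT, space_id, page_no, lsn)

def unpack_meta(raw: bytes):
    return struct.unpack(META_FMT, raw)
```

With the metadata in its own area, the corresponding fields in the page header could be repurposed, as suggested above.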

Comment by Anjum Naveed [ 2017-07-27 ]

I have been working on this issue for some time. Like Percona Server implementation, I have created multiple flush threads, matching the number of buffer pool instances. Each thread writes into its own doublewrite buffer file (number of files matching number of instances). I want the files to be on different physical drives which I believe will give me better write performance.

Since I am very new to MariaDB development, I want to have a reasonable level of confidence in my code before I contribute it.

Comment by Anjum Naveed [ 2019-05-06 ]

I am now working on this issue again and have significant time available for it.
There are a few things I notice. Doublewrite buffer creation is highly coupled with the creation of a new system database (for obvious reasons). I believe this should eventually be decoupled, so that doublewrite buffer creation depends only on whether the start is normal or abnormal.
I have been able to use a separate file for the doublewrite buffer, using a new system tablespace for the purpose. I am now converting it to a parallel doublewrite buffer, linking the number of files to the number of buffer pool instances as explained in my previous post. An issue with this approach is that the number of buffer pool instances can change without creating a new system database. For the time being I am working under the assumption that this will not happen; I will update the code once I decouple the creation from system database creation.
I will need someone to review my code as well.

Comment by Marko Mäkelä [ 2019-05-06 ]

anjumnaveed81, this is interesting.

I would prefer a minimal change that only refactors the doublewrite buffer. We have a separate task MDEV-16526 for improving the page flushing.

We might indeed want to have multiple flat files. Also, unless we remove write support for ROW_FORMAT=COMPRESSED tables (which I think we should do at some point, to simplify the buffer pool code), we should consider having separate doublewrite buffers (and files) for different compressed page sizes. But, I am keen to remove the write support for ROW_FORMAT=COMPRESSED tables.

Please try to base your work on the MariaDB 10.5 branch. I do not think that we can radically change persistent data structures within a GA release series, and 10.4 is already very close to GA release.

Comment by Anjum Naveed [ 2019-05-06 ]

Thanks for pointing out. Issue MDEV-16526 is of a lot of interest for me. The reason I started working on InnoDB and doublewrite buffer was performance enhancement. Moving the ibdata file to RAM disk (only for testing purposes) resulted in significant performance improvement, part of which can be attributed to faster doublewrite buffer writes.

For the current issue, I will focus on "These files would be created on startup unless innodb_read_only is set, and deleted on shutdown. The files would only be consulted on redo log apply." To begin with, I will use a single file in this way and then expand to multiple files. Noted your suggestion regarding the MariaDB 10.5 branch.

Do we really need a file per page? Is it because it is easy to know the number of files, or are there other reasons? So many files will not give any performance advantage in terms of disk writes.

Let me look more into ROW_FORMAT=COMPRESSED write support and I will get back on this later.

Comment by Anjum Naveed [ 2019-05-14 ]

I can now move the doublewrite buffer out of trx_sys_space into separate file(s), which are deleted on clean shutdown. Upon server start, I check that the files do not exist once we reach the normal creation point of the doublewrite buffer. New file(s) are created on every restart. The changes are done in the MariaDB 10.5 branch.

There are three layout options: one file per page, one file per buf_pool_instance, or a single file.
  • With the existing page cleaner implementation and the buf_dblwr_t memory structure, a single file is the best option, because the batch flush writes multiple pages into the doublewrite buffer file using one I/O.
  • One file per page is not a suitable option for batch writing unless we call fil_io repeatedly for each page.
  • One file per buf_pool_instance works with batch writing but needs more modifications to the buf_dblwr_t data structure.
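The batch-flush argument for a single file can be illustrated with a toy sketch (this is not InnoDB's fil_io path; the names are made up): the batch is laid out contiguously and issued with one write call instead of one call per page.

```python
# Toy sketch: a batch flush into a single doublewrite file with one I/O.
import io

def batch_write(dblwr_file, pages, page_size: int) -> int:
    """Write all pages of one flush batch with a single write call."""
    buf = b"".join(pages)
    assert len(buf) == page_size * len(pages)   # pages are uniform in size
    dblwr_file.seek(0)
    return dblwr_file.write(buf)   # one call covers the whole batch
```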

What would be a good test to confirm that everything is working? The InnoDB tests in the existing mysql-test suite pass.

Comment by Mark Callaghan [ 2019-05-14 ]

Sadly I can't figure out how to get user mentions working to tag Laurynas.

Limiting doublewrite buffer writes to sizeof(page) rather than rounding up to 16kb doesn't matter for compressed pages because not many people use InnoDB compression, but it is a bigger deal for uncompressed pages when sizeof(page) < 16kb. The doublewrite buffer is half of the write-amp from InnoDB.
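The write-amplification arithmetic behind this point can be made concrete with a small sketch (assumed numbers, counting only the doublewrite copy plus the data-file write itself):

```python
# Rough write-amplification arithmetic: an 8 KiB page padded to a 16 KiB
# doublewrite slot costs 24 KiB of writes per flush, versus 16 KiB if the
# doublewrite copy matched the page size.

def bytes_per_flush(page_size: int, dblwr_slot_size: int) -> int:
    """Doublewrite copy plus the data-file write itself."""
    return dblwr_slot_size + page_size
```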

Comment by Laurynas Biveinis [ 2019-05-15 ]

Mark, I am following this issue.

It seems trivial to support innodb_page_size != 16KB, because it is a constant for an instance. It would be a mistake to do 16KB I/Os for smaller page sizes.

In general, the doublewrite design, especially the in-memory data structure part, is tightly coupled with the flushing design. We went one direction by embracing buffer pool instances to get shared-nothing per-LRU/flush-list doublewrite with trivial locking; it seems that MariaDB wants to go the opposite direction with MDEV-15058, which wishes to remove multiple buffer pool instances and cleaner threads.

Comment by Anjum Naveed [ 2019-05-15 ]

Laurynas, Thanks for your interest.

As far as I can understand, the I/O is not 16KB in size; it uses srv_page_size as a parameter, so whatever page size the instance uses will be applied. I think what Marko pointed out was that to support compressed pages, there should be some way to write variable page sizes within the same instance. If I understood this correctly, then it is non-trivial to implement and would significantly complicate the I/O. I will go with padding the compressed pages to the page size (whatever it is), which gives much cleaner I/O functionality.

With reference to MDEV-15058, I went through the Percona implementation. It is a nice piece of code, completely redesigned from scratch. I was hoping to use the existing page cleaner code and achieve the same per-instance separation as the Percona code does. This is non-trivial but doable. Only one issue remains: how to handle a change in the number of buffer pool instances between different executions of the MySQL server on the same database instance. I think this is allowed, and if so, it creates a problem.

Comment by Federico Razzoli [ 2019-05-15 ]

@Anjum that is a special case, and non-advanced users shouldn't play with certain variables. If you can avoid complexity just by setting some limitations (for example, changing the instance number requires a shutdown with innodb_fast_shutdown=0 and innodb_buffer_pool_dump_at_shutdown=0), as a user I would consider it completely acceptable.

Comment by Marko Mäkelä [ 2019-05-17 ]

anjumnaveed81, my suggestion to remove write support for ROW_FORMAT=COMPRESSED met some resistance, so we will have to keep that. I think that we can keep padding compressed pages to the page size, just like we currently do.

So, it should suffice to have one doublewrite buffer file per page cleaner thread (or if really needed, buffer pool instance). I would like to go back to one buffer pool instance (MDEV-15058) and to simplify I/O side (MDEV-16526), for example by mostly using synchronous I/O for reads, and by removing the I/O handler threads except when asynchronous I/O is unavailable. (That is, try to submit and collect the requests from the same thread, to have fewer context switches.)

I would really appreciate your contributions. Unlike MDEV-16526 and MDEV-15058, this doublewrite buffer rewrite is something that we will not be actively working on. Therefore, it could be useful for you to start from this feature.

Another improvement related to doublewrite would be to skip it when a page is being (re)initialized. Thanks to MDEV-12699, we can recover such pages from the redo log.

Comment by Anjum Naveed [ 2019-05-18 ]

marko, Got it. I have to keep the compressed pages padding the way it is right now.

Using the code of releases 10.3(.14) and 10.4, I created a separate tablespace and filespace for a single doublewrite buffer file along the lines of trx_sys_space. I even kept the page numbers the same to be sure I do not break anything. That code works fine and all tests pass. I believe I will need to write another test that specifically covers the changes?

Using the 10.5 code base, I have used only a filespace and changed the page numbering. This works fine as well. With reference to MDEV-15058, I believe the lack of improvement from increased buffer pool instances was attributed to the single doublewrite file coupled with the system tablespace. Therefore, I want to test with multiple doublewrite files, separate from the system tablespace. If I get improvements, I will report on MDEV-15058; otherwise I will restrict the changes to a single file and submit them for review.

I will also check doublewrite buffer for page re-initialization part and add the changes where needed.

Comment by Anjum Naveed [ 2019-05-22 ]

Hi marko,
I have completed the implementation and done basic testing. The number of doublewrite buffer files is linked to the number of buffer pool instances and separated from the trx_sys tablespace and files. Please see the detailed comments in MDEV-15058.

Please suggest whether I should modify the page cleaner and do more testing, or whether we should use a single doublewrite buffer file, one file per instance, or any other option.

Comment by Marko Mäkelä [ 2021-03-26 ]

anjumnaveed81, sorry, I had missed your replies until now.

The performance bottleneck may have been fixed in MDEV-23855, where I implemented wlad’s suggestion to use asynchronous writes for the doublewrite buffer batches. Instead of waiting for synchronous writes of 64 pages, we submit an asynchronous write of 128 pages while filling another buffer of 128 pages in memory. This could have mostly addressed the performance concerns.
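The double-buffering scheme described above can be sketched as follows. This is a conceptual sketch only, using a worker thread to stand in for asynchronous I/O; the class and its names are hypothetical, not the actual MDEV-23855 code:

```python
# Sketch of the MDEV-23855 idea: while one batch is being written out
# asynchronously, the page cleaner keeps filling the other buffer.
from concurrent.futures import ThreadPoolExecutor

class DoubleBufferedDblwr:
    def __init__(self, sink, capacity=128):
        self.sink = sink                 # anything with a write() method
        self.capacity = capacity
        self.active = []                 # buffer currently being filled
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = None              # in-flight "asynchronous" write

    def add_page(self, page: bytes):
        self.active.append(page)
        if len(self.active) == self.capacity:
            self.flush()

    def flush(self):
        if self.pending:                 # wait only for the previous batch
            self.pending.result()
        batch, self.active = self.active, []          # swap buffers
        self.pending = self.pool.submit(self.sink.write, b"".join(batch))

    def close(self):
        self.flush()
        if self.pending:
            self.pending.result()
        self.pool.shutdown()
```

The key property is that filling one buffer overlaps with writing the other, instead of stalling on a synchronous write of every batch.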

I would welcome a pull request for an option to have the doublewrite buffer in a separate file. Technically, this could even be doable in GA releases, because I do not think that this would count as a file format change. It is only a minor change to crash recovery (not even to Mariabackup; it would not access the doublewrite buffer at all).

Comment by Anjum Naveed [ 2021-03-26 ]

@Marko,

I got heavily occupied by one of my projects and did not actively follow development on these threads.

I will have a look at my code and related changes in existing version. I will get back in 2-3 days.

Comment by VAROQUI Stephane [ 2021-10-27 ]

Hi, any news? This looks promising: with a dedicated ZFS file system for the doublewrite buffer, with a policy set to cache data, this may solve the famous read-on-write issue. My best ZFS benchmark results use O_DIRECT (which has only worked for a short while) and metadata caching for the tablespaces, but this is not friendly to the doublewrite I/O pattern, which would benefit from being inside the redo log header or in a separate path.

Comment by Marko Mäkelä [ 2021-10-27 ]

stephane@skysql.com, this has not been actively worked on. If anjumnaveed81 or anyone else can submit a patch for review, I am happy to look at it.

But, isn’t ZFS a copy-on-write file system? Would you really need the doublewrite buffer on it? It should be theoretically possible for any copy-on-write or journal-based file system to guarantee that writes are atomic. In practice, we know that it is not the case on Linux. That blog post author claimed that Linux ext4fs with the data=journal mount option would make the doublewrite buffer redundant. But, at least back in 2015 it was the case that when a process is killed in the middle of a write() or pwrite() or similar system call, that write could be truncated to a multiple of 4096 bytes. Possibly we do not really need the doublewrite buffer when using innodb_page_size=4k, but I have not seen that documented anywhere for Linux. On Microsoft Windows and NTFS we might safely rely on that.

I think that the doublewrite buffer is a work-around for an operating system bug that I would like to see eventually fixed. But, I am not against moving the doublewrite buffer into a separate file. It would be one step towards eliminating the InnoDB system tablespace.

Comment by VAROQUI Stephane [ 2021-10-27 ]

Sorry, forget my comment; I'm a dummy, down with fever. Indeed, my reference setup for ZFS already disables the doublewrite buffer. Stupid me.

Comment by Marko Mäkelä [ 2021-10-28 ]

I am looking forward to the idea presented in the LWN article A way to do atomic writes becoming reality. The article does not mention asynchronous I/O at all, so I am afraid that it may be a few years ahead. That would make the doublewrite buffer obsolete, with no special hardware required.

If that is going to happen in the not too distant future, then it might not be a good idea to change the InnoDB doublewrite buffer format. Because of this, I am myself somewhat reluctant to spend my time on changing the file format. But I might not reject a code contribution.

Comment by Vladislav Vaintroub [ 2021-11-09 ]

On Windows, doublewrite can be safely disabled if the page size is 4K and the disk sector size is 4K. Most of the disks I have found still have 512-byte sectors, although 4K is not too exotic now.

Comment by Rick James [ 2023-04-11 ]

While at it, consider the issues with consumer-grade SSDs and "wear leveling". If the doublewrite buffer were written cyclically to, say, 8 different spots on disk, a cheap SSD would last almost 8 times as long. Assuming there is some kind of sequence number in the page, recovery would involve looking for the latest of the 8. The 8 spots could be 8 blocks in the proposed flat file.
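The rotation idea above can be sketched in a few lines. The slot count, the layout, and the use of a per-write sequence number are taken from the comment; everything else is hypothetical:

```python
# Sketch of cyclic doublewrite slot rotation for SSD wear leveling.
# Each batch goes to the next of N slots; recovery picks the newest slot
# by comparing sequence numbers.

N_SLOTS = 8

def slot_for(seq_no: int) -> int:
    """Successive batches cycle through the N slots."""
    return seq_no % N_SLOTS

def latest_slot(slots):
    """slots: list of (seq_no, payload) pairs; recovery reads the highest."""
    return max(range(len(slots)), key=lambda i: slots[i][0])
```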

Generated at Thu Feb 08 07:51:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.