MariaDB Server
MDEV-14425

Change the InnoDB redo log format to reduce write amplification

Details

    Description

      The InnoDB redo log format is not optimal in many respects:

      • At the start of ib_logfile0, there are two log checkpoint blocks, only 1024 bytes apart, while there exist devices with 4096-byte block size. The rest of the log file is written in a circular fashion.
      • On log checkpoint, some file name information needs to be appended to the log.
      • The names of files that were changed for the first time since the latest checkpoint must be appended to the log. The bookkeeping causes some contention on log_sys.mutex and fil_system.mutex. Edit: The contention on fil_system.mutex was practically removed in MDEV-23855, and the contention on log_sys.mutex due to this is minimal.
      • The log file was unnecessarily split into multiple files, logically treated as one big circular file. (MDEV-20907 in MariaDB Server 10.5.0 changed the default to 1 file, and later the parameter was deprecated and ignored.)
      • Log records are divided into tiny blocks of 512 bytes, with 12+4 bytes of header and footer (12+8 bytes with MDEV-12041 innodb_encrypt_log (10.4.0)).
      • We are holding a mutex while zero-filling unused parts of log blocks, encrypting log blocks, or computing checksums.
      • We were holding an exclusive latch while copying log blocks; this was fixed in MDEV-27774.
      • Mariabackup cannot copy the log without having access to the encryption keys. (It can copy data file pages without encrypting them.)

      We had some ideas to move to an append-only file and to partition the log into multiple files, but it turned out that a single fixed-size circular log file would perform best in typical scenarios.

      To address the fil_system.mutex contention whose root cause was later fixed in MDEV-23855, we were considering splitting the log as follows:

      • ib_logfile0 (after the 512-byte header) will be append-only, unencrypted, for records containing file names and checkpoint information. A checkpoint record will comprise an LSN and a byte offset in a separate, optionally encrypted, circular log file ib_logdata. The length of each record is explicitly tagged and the payload will be followed by CRC-32C.
      • The ib_logdata file can be append-only or circular. If it is circular, its fixed size must be an integer multiple of 512 bytes.

      One problem would have had to be solved: When would the ib_logfile0 be shrunk? No storage is unlimited.

      We will retain the ib_logfile0 and the basic format of its first 512 bytes for compatibility purposes, but other features could be improved.

      • We remove log block headers and footers. All we really need is to detect the logical end of the circular log. That can be achieved by making sure that mini-transactions are terminated by a sequence number (at least one bit) and a checksum. When the circular file wraps around, the sequence number will be incremented (or the sequence bit toggled).
      • For page-aligned I/O, we allow dummy records to be written, to indicate that the next bytes (until the end of the physical block, no matter what the I/O block size is) must be ignored. (The log parser will ignore these padding records, but we do not currently write them; we will keep overwriting the last physical block until it has been completely filled like we used to do until now.)
      • Encrypt and compute checksum on mtr_t::m_log before initiating a write to the circular log file. The log can be copied and checksum validated without access to encryption keys.
      • If the log is on a memory-mapped persistent memory device, then we will make log_sys.buf point directly to the persistent memory.

      Some old InnoDB redo log parameters were removed in MDEV-23397 (MariaDB 10.6.0). Some more parameters will be removed or changed here:

      • innodb_log_write_ahead_size: Removed. On Linux and Microsoft Windows, we will detect and use the physical block size of the underlying storage. We will also remove the log_padded counter from INFORMATION_SCHEMA.INNODB_METRICS.
      • innodb_log_file_buffering: Added (MDEV-28766). This controls the use of O_DIRECT on the ib_logfile0 when the physical block size can be determined.
      • innodb_log_buffer_size: The minimum value is raised to 2MiB and the granularity increased from 1024 to 4096 bytes. This buffer will also be used during recovery. Ignored when the log is memory-mapped (on PMEM or /dev/shm).
      • innodb_log_file_size: The allocation granularity is reduced from 1MiB to 4KiB.

      Some status variables will be adjusted as well:

      • Innodb_os_log_fsyncs: Removed. This will be included in Innodb_data_fsyncs.
      • Innodb_os_log_pending_fsyncs: Removed. This was limited to at most 1 by design.
      • Innodb_log_pending_writes: Removed. This was limited to at most 1 by design.

      The circular log file ib_logfile0

      The file name ib_logfile0 and the existing format of the first 512 bytes will be retained for the purpose of upgrading and preventing downgrading. In the first 512 bytes of the file, the following information will be present:

      • InnoDB redo log format version identifier (in the format introduced by MySQL 5.7.9/MariaDB 10.2.2)
      • CRC-32C checksum

      After the first 512 bytes, there will be two 64-byte checkpoint blocks at the byte offsets 4096 and 8192, containing:

      • The checkpoint LSN
      • The LSN at the time the checkpoint was created, pointing to an optional sequence of FILE_MODIFY records and a FILE_CHECKPOINT record

      The circular redo log record area starts at offset 12288 and extends to the end of the file. Unless the file was created by mariadb-backup, the file size will be a multiple of 4096 bytes.
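
      To make the layout concrete, here is a minimal sketch of the offsets described above, together with a hypothetical mapping from an LSN to a position in the circular area (the name base_lsn and the exact mapping formula are illustrative assumptions, not the server's actual code):

      #include <cstdint>

      // Layout of ib_logfile0, in bytes, as described above.
      constexpr uint64_t HEADER_SIZE  = 512;    // format version + CRC-32C
      constexpr uint64_t CHECKPOINT_1 = 4096;   // first 64-byte checkpoint block
      constexpr uint64_t CHECKPOINT_2 = 8192;   // second 64-byte checkpoint block
      constexpr uint64_t START_OFFSET = 12288;  // circular record area starts here

      // Map an LSN to a byte offset in the circular record area; base_lsn is
      // assumed to be the LSN that corresponds to START_OFFSET.
      constexpr uint64_t lsn_to_offset(uint64_t lsn, uint64_t base_lsn,
                                       uint64_t file_size)
      {
        const uint64_t capacity = file_size - START_OFFSET;
        return START_OFFSET + (lsn - base_lsn) % capacity;
      }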

      All writes to ib_logfile0 will be synchronous and durable (O_DSYNC, fdatasync() or O_SYNC, fsync() or pmem_persist()).

      Payload encoding

      The payload area will contain records in the MDEV-12353 format. Each mini-transaction will be followed by a sequence byte 0x00 or 0x01 (the value of the sequence bit), optionally (if the log is encrypted) an 8-byte nonce, and a CRC-32C of all the bytes (except the sequence byte), so that backup can avoid recomputing the checksum while copying the log to a new file.

      We want to be able to avoid overwriting the last log block, so we cannot have an explicit 'end of log' marker. We must associate each mini-transaction (atomic sequence of log records) with a sequence number (at the minimum, a sequence bit) and a checksum. The 4-byte CRC-32C is a good candidate, because it is already being used in data page checksums.
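
      As an illustration, here is a minimal sketch of validating one such frame and thereby detecting the logical end of the log. It assumes the layout described above (payload, sequence byte, optional 8-byte nonce, CRC-32C of everything except the sequence byte); the bitwise CRC-32C, the little-endian placement of the stored checksum, and the way payload_len is obtained are assumptions for illustration only:

      #include <cstddef>
      #include <cstdint>

      // Bitwise reference CRC-32C (Castagnoli), chained zlib-style: pass the
      // previous return value as crc to continue a checksum over more bytes.
      static uint32_t crc32c(uint32_t crc, const uint8_t *p, size_t n)
      {
        crc = ~crc;
        while (n--) {
          crc ^= *p++;
          for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
        }
        return ~crc;
      }

      // Validate one frame: [payload][sequence byte][8-byte nonce if encrypted][CRC].
      // A sequence-bit or checksum mismatch means we reached the logical end of
      // the log. In a real parser, payload_len is derived from the records.
      static bool mtr_frame_valid(const uint8_t *frame, size_t payload_len,
                                  bool encrypted, uint8_t expected_seq)
      {
        if (frame[payload_len] != expected_seq)
          return false;                        // wrapped into stale log
        const size_t nonce_len = encrypted ? 8 : 0;
        // The CRC covers all bytes except the sequence byte.
        uint32_t crc = crc32c(0, frame, payload_len);
        if (encrypted)
          crc = crc32c(crc, frame + payload_len + 1, nonce_len);
        const uint8_t *c = frame + payload_len + 1 + nonce_len;
        const uint32_t stored = uint32_t(c[0]) | uint32_t(c[1]) << 8 |
                                uint32_t(c[2]) << 16 | uint32_t(c[3]) << 24;
        return crc == stored;
      }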

      Padding

      We might want to introduce a special mini-transaction 'Skip the next N bytes', encoded in sizeof(CRC)+2+log(N) bytes: CRC, record type and length, subtype and the value of the sequence bit, and variable-length encoded N. However, for a compressed storage device, it would be helpful to not have any garbage bytes in the log file. It would be better to initialize all those N bytes.

      If we need to pad a block with fewer bytes than the minimum size, we would write a record to skip the minimum size.

      This has been implemented with arbitrary-length FILE_CHECKPOINT mini-transactions whose payload consists of NUL bytes. The parser will ignore such records. We are not currently writing such records, but instead overwriting the last incomplete log block when more log is being appended, just like InnoDB always did.
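
      A sketch of the padding-length arithmetic under this scheme follows; min_pad, the smallest encodable padding record, is a hypothetical parameter whose exact value would follow from the record encoding:

      #include <cstdint>

      // Bytes of padding needed so that the next mini-transaction starts at a
      // physical block boundary. If the gap is smaller than the smallest
      // encodable padding record, we skip the minimum size instead, as
      // described above.
      constexpr uint64_t pad_to_block(uint64_t end_offset, uint64_t block_size,
                                      uint64_t min_pad)
      {
        const uint64_t gap = (block_size - end_offset % block_size) % block_size;
        return (gap && gap < min_pad) ? min_pad : gap;
      }

      For example, with a 4096-byte block and a mini-transaction ending at offset 12300, pad_to_block(12300, 4096, 7) returns 4084.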

      Mini-transaction encoding: Prepending or appending a CRC to each MDEV-12353 mini-transaction

      In the MDEV-12353 encoding, a record cannot start with the bytes 0x00 or 0x01. Mini-transactions are currently being terminated by the byte 0x00. We could store the sequence bit in the terminating byte of the mini-transaction. The checksum would exclude the terminating byte.

      Only the payload bytes would be encrypted (not record types or lengths, and not page identifiers either). In that way, records can be parsed and validated efficiently. Decryption would only have to be invoked when the log really needs to be applied on the page. The initialization vector for encryption and decryption can include the unencrypted record header bytes.

      It could be best to store the CRC before the mini-transaction payload, because the CRC of non-zero bytes cannot be 0. Hence, we can detect the end of the log without even parsing the mini-transaction bytes.

      Pros: Minimal overhead: sizeof(CRC) bytes per mini-transaction.
      Cons: Recovery may have to parse a lot of log before determining that the end of the log was reached.

      In the end, the CRC was written after the mini-transaction. The log parser can flag an inconsistency if the maximum mini-transaction size would be exceeded.

      Alternative encoding (scrapped idea): Prepending a mini-transaction header with length and CRC

      We could encapsulate MDEV-12353 records (without the mini-transaction terminating NUL byte) in the following structure:

      • variable-length encoded integer of total_length << 2 | sequence_bit
      • CRC of the data payload and the variable-length encoded integer
      • the data payload (MDEV-12353 records); could be encrypted in their entirety

      Skipped bytes (at least 5) would be indicated by the following:

      • variable-length encoded integer of skipped_length << 2 | 1 << 1 | sequence_bit
      • CRC of the variable-length encoded integer (not including the skipped bytes)

      Pros: Recovery can determine more quickly that the end of the circular log was reached, thanks to the length, sequence bit and (nonzero) CRC being stored at the start.
      Pros: More of the log could be encrypted (at the cost of recovery and backup restoration speed)
      Cons: Increased storage overhead: sizeof(CRC)+log(length * 4) bytes. For length<32 bytes, no change of overhead.
      Cons: If the encryption is based on the current LSN, then both encryption and the checksum would have to be computed while holding log_sys.mutex.
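
      For illustration, a minimal sketch of the header encoding of this scrapped scheme, assuming a little-endian base-128 varint (the actual MDEV-12353 variable-length integer encoding differs; the point here is only the length << 2 | flags packing):

      #include <cstddef>
      #include <cstdint>

      // Encode (length << 2 | skip_flag << 1 | sequence_bit) as a varint.
      // Returns the number of bytes written to out.
      static size_t encode_header(uint8_t *out, uint64_t length,
                                  bool skip, bool sequence_bit)
      {
        uint64_t v = length << 2 | uint64_t(skip) << 1 | uint64_t(sequence_bit);
        size_t n = 0;
        do {
          uint8_t b = v & 0x7F;
          v >>= 7;
          out[n++] = b | (v ? 0x80 : 0);  // high bit: more bytes follow
        } while (v);
        return n;
      }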

      Log writing and synchronous flushing

      For the bulk of the changes done by mini-transactions, we do not care about flushing. The file system can write log file blocks as it pleases.

      Some state changes of the database must be made durable at a specific time. Examples include user transaction COMMIT, XA PREPARE, XA ROLLBACK, and (in case the binlog is not enabled) XA COMMIT.

      Whenever we want to make a certain change durable, we must flush all log files up to the LSN of the mini-transaction commit that made the change.
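
      A minimal sketch of such a 'flush up to LSN' operation, assuming a single log file descriptor and a hypothetical flushed_lsn watermark; a real implementation would also have to write out the log buffer before syncing:

      #include <atomic>
      #include <cstdint>
      #include <mutex>
      #include <unistd.h>

      std::atomic<uint64_t> flushed_lsn{0};
      std::mutex flush_mutex;

      void flush_up_to(int log_fd, uint64_t target_lsn)
      {
        if (flushed_lsn.load(std::memory_order_acquire) >= target_lsn)
          return;                      // already durable; nothing to do
        std::lock_guard<std::mutex> g(flush_mutex);
        if (flushed_lsn.load(std::memory_order_relaxed) >= target_lsn)
          return;                      // another thread flushed while we waited
        fdatasync(log_fd);             // or O_DSYNC write, or pmem_persist()
        // Everything written before the fdatasync() is now durable; a real
        // implementation would record the actual end-of-write LSN here.
        flushed_lsn.store(target_lsn, std::memory_order_release);
      }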

      If redo log is physically replicated to the buffer pools of physical replicas (like in Amazon Aurora or Alibaba PolarDB), then we should first write to the local log and only then to the replicas, and we should assume that the writes to the files will always eventually be durable. If that assumption is broken, then all servers would have to be restarted and perform crash recovery.

      Crash recovery and backup

      The previous two-stage parsing (log block validation and log record parsing) was replaced with a single stage. The separate 2-megabyte buffer recv_sys.buf is no longer needed, because the bytes of the log records will be stored contiguously, except when the log file wraps around from its end to the offset 12288.

      When the log file is memory-mapped, we will parse records directly from log_sys.buf that contains a view of the entire log file. For parsing the mini-transaction that wraps from the end of the file to the start, the record parser will use a special pointer wrapper. When not using memory-mapping, we will read from the log file to log_sys.buf in such a way that the records of each mini-transaction will be contiguous.
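
      A minimal sketch of such a pointer wrapper over the memory-mapped circular record area (the names are illustrative):

      #include <cstddef>
      #include <cstdint>

      // Present the circular record area as a logically contiguous byte
      // sequence, so the parser can read a mini-transaction that wraps from
      // the end of the file back to offset 12288 without copying.
      class wrap_ptr
      {
        const uint8_t *buf_;   // start of the circular record area
        size_t capacity_;      // size of the circular record area
        size_t pos_;           // logical position; may exceed capacity_

      public:
        wrap_ptr(const uint8_t *buf, size_t capacity, size_t pos)
          : buf_(buf), capacity_(capacity), pos_(pos) {}

        uint8_t operator*() const { return buf_[pos_ % capacity_]; }
        wrap_ptr &operator++() { ++pos_; return *this; }
      };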

      Crash-upgrade from earlier versions will not be supported. Before upgrading, the old server must have been shut down, or mariadb-backup --prepare must have been executed using an appropriate older version of the backup tool.

      Starting up without ib_logfile0 will no longer be supported; see also MDEV-27199.

      Attachments

        1. 81cf92e9471.pdf
          29 kB
        2. append.c
          0.6 kB
        3. MDEV-14425.pdf
          29 kB
        4. NUMA_1.pdf
          37 kB
        5. NUMA_1vs2.pdf
          29 kB
        6. NUMA_2.pdf
          38 kB
        7. preallocate.c
          0.6 kB


          Activity

            As part of this work, the function log_buffer_extend() will be removed.

            marko Marko Mäkelä added a comment

            As part of MDEV-14425 the recovery logic should be improved so that when a redo log block is corrupted, only the mini-transaction(s) that are (partly or fully) contained in the block will be skipped. This would augment MDEV-12699, which is about improving the recovery of corrupted data pages.

            marko Marko Mäkelä added a comment

            The function log_free_check() will have to be replaced.
            With the original circularly-written InnoDB redo log, the function sought to prevent a situation where the tail of the log overwrites the head before the head is logically truncated by a redo log checkpoint. If such overwriting happens, InnoDB will be unable to recover from a crash. The situation would be normalized by a redo log checkpoint.

            With these append-only log files, the overwriting issue is replaced with another one: running out of space in the file system. So, we will continue to need a function similar to log_free_check(). If the file system of the current thread’s log file is about to run out of space, the replacement of log_free_check() would return an ‘out of space’ error, which would be returned all the way up the call stack. If the ultimate caller is a client connection, the error would be reported as ER_DISK_FULL.

            Running out of space should only be an issue when log archiving is enabled (innodb_log_max_size is overridden from its default value).
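
            A minimal sketch of the free-space probe that such a log_free_check() replacement could use on POSIX systems (the function name and the threshold handling are assumptions):

            #include <cstdint>
            #include <sys/statvfs.h>

            // Return true if the file system holding the log directory still
            // has at least reserve_bytes available; the caller would map
            // 'false' to an ER_DISK_FULL-style error as described above.
            static bool log_space_available(const char *log_dir,
                                            uint64_t reserve_bytes)
            {
              struct statvfs vfs;
              if (statvfs(log_dir, &vfs) != 0)
                return false;   // treat an unreadable file system as full
              return uint64_t(vfs.f_bavail) * vfs.f_frsize >= reserve_bytes;
            }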

            marko Marko Mäkelä added a comment

            For the record: monty asked me to compare the relative performance of writing to a preallocated file and appending to a file.
            On my HP ZBook 15u G3 laptop, I ran the test programs append.c and preallocate.c to write a 2GiB file on ext4fs, and the results are clear.
            I used 2 SSD devices: 220GiB NVMe that is encrypted by dm-crypt, and a 450GiB SATA SSD that is not encrypted. With 4 programs running in parallel like this:

            time ./append foo1&time ./append foo2&time ./append foo3&time ./append foo4
            

            the reported real time was like this:

                          append    preallocate
            NVMe          18.7s     24.2s
            SATA          21.0s     66.7s

            It would be interesting to test other relevant file systems as well as HDD.

            marko Marko Mäkelä added a comment
            danblack Daniel Black added a comment -

            Some server hardware stats:

            Power Systems S822LC "Firestone", Power8 - 10 cores
            Ubuntu-16.04 userspace, Kernel 4.14.0-27552-gadb89ad
            nvme - IBM PCIe3 1.6TB NVMe Flash Adapter
            disk - ST1000NX0313, 7200rpm - SATA 3.1, 2 disks in software raid1 (changed to 4.15.0-10668-g3527799 kernel - managed to hit kernel thread that hung)

            medium                                 append       preallocate
            nvme, xfs bsize=4096 (default)         6.827s       6.706s
            nvme, xfs bsize=4096 (using ix10)      57.257s      57.650s
            nvme, xfs bsize=16384 (using ix10)     56.621s      55.880s
            nvme, ext3                             9.191s       9.220s
            nvme, ext3 (using ix10)                54.680s      1m45.147s
            nvme, ext4 (using ix10)                55.242s      53.415s
            disk, ext4 on LVM                      1m39.221s    1m42.376s

            ix10 means the i variable in the source was multiplied by 10; at ~6 seconds, the original file size felt too small.

            The error margin seems to be about 5% based on repetition (most results within a batch of 4 were identical; between batches they varied). This is assumed to be an effect of journalling.

            So ext3 is worse at preallocation; otherwise the results are comparable. Multiple runs were performed.

            junsu Jun Su added a comment -

            Your preallocate code doesn't reflect reality well. Since a log file can be written many times, it contains random data rather than just a freshly fallocate'd extent. Your code simulates the situation when the log file is written for the first time (a brand-new server). Please consider adding a command-line parameter that opens the existing file from a previous run and skips the fallocate, to simulate the log file being overwritten.


            danblack, thanks for your benchmarks!

            junsu, thank you for your comment. Another problem with the test program is that there is only one fdatasync() call at the end, while in reality there would be an fdatasync() whenever we need to make something durable.

            I am not actively working on this task yet. I would welcome improved versions of the test programs, as well as more benchmarks, especially ones that would seem to indicate that preallocating is faster than appending.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            In addition to the proposed parameter innodb_log_max_size, there should perhaps also be some time limit after which excessive old logs will be discarded. Perhaps innodb_log_min_age?
            To ease maintenance and backups, maybe we should have an option to have strictly append-only log files, and should revise the design so that log files are never renamed. Most notably, there should be no header in the partitioned log files; the checkpoint block offsets would be written into the single-block control file ib_logfile0 only.

            On a related note, last Tuesday I gave a high-level view of the InnoDB internals in a M18 talk Deep Dive: InnoDB Transactions and Write Paths (video). The slides with the diagram on mini-transactions and describing the log checkpoints could be useful background information.

            inaamrana Inaam Rana added a comment -

            Marko,

            This is a very interesting idea. I believe we should think of the redo log as nothing but a bunch of ordered changes to the pages. This implies that if for a given page we have all the changes available, we should be able to deal with gaps in the log (obviously, as long as we do flush all log files at checkpoint).

            How about we map pages to log partitions based on space_id:page_no, i.e. all changes to a page must always go to the same log file. If this invariant is maintained, then we don't need to flush anything on mtr_commit(). As a trx touches different pages, it will keep track of which log files it needs to flush. For small trxs like a single-row DML it will be a single file. This calculation can be done at mtr_commit(). At trx_commit() we only flush the relevant files.

            I think with above scheme we can allow gaps in LSN. Conceptually this design feels intuitive. Redo log records changes to pages. Redo logs are partitioned based on page number. Log buffer can also be partitioned extending the same logic.

            As an aside, if we redesign the format, maybe we should add information during mtr_commit() about a back pointer to the last redo record that changed the page. If we are able to follow the chain of changes to a page, that might actually be quite helpful. There might not be an immediate use case for MariaDB, but log format changes are cumbersome to orchestrate.


            inaamrana, thank you for the valuable feedback.

            Your suggestion to flush a subset of the log files at transaction commit and to ignore gaps in log recovery seems to imply a partial ordering of events. With the added invariant that modifications of a certain page must always go to a certain log file, I cannot see any obvious correctness problem. We would only need an additional field "total number of log files" for the mini-transaction, so that recovery can reject a mini-transaction that was not fully written to all log files. What if this is combined with physical replication? We should only replicate each mini-transaction log up to the latest log flush. Here it could be tricky to guarantee the atomicity of the replicated mini-transactions without flushing all logs up to our mini-transaction commit LSN.

            This architecture would seem to require some scatter-gather operation or partitioning of the local mini-transaction buffer, so that the log records for pages are written to the appropriate log files. Maybe the easiest way to arrange that would be by copying log snippets from local mini-transaction buffers to a per-log global buffer. This would also imply that the mini-transaction log for a given mini-transaction (identified by logical LSN) can exist in multiple log files.

            For maximum commit concurrency, I think that in this scheme, there should be 1 redo log for each rollback segment (now that with MDEV-15132 and MDEV-15158 commit only writes to the rollback segment header and undo log header pages, not the TRX_SYS page). For maximal concurrency, we could experiment with dedicated redo logs for transaction metadata, and separate redo logs for data file changes.

            This is definitely worth trying. I think that we should prototype and benchmark both approaches before committing to a solution.

            The back-pointer to the last record that changed the page sounds like a good idea to me, and certainly useful for troubleshooting. When the full log is archived, this could allow faster point-in-time recovery. In your suggested scheme, this could be a byte offset from the start of the file. In my original scheme, it should probably be LSN, and some searching would be needed to find the record among the log files.

            marko Marko Mäkelä added a comment
            micai Minshen Cai added a comment -

            In my opinion, there are issues on Inaam's idea. The below is a use case.
            1. Mini-transaction m1 updates page A via log file 1, and updates page B via log file 2.
            2. Mini-transaction m2 updates page A via log file 1.
            3. All write to log file 1 completes.
            4. m2 is in the user transaction tr2. tr2 commits successfully.
            5. But at this time, the write to log file 2 for m1 hasn't finished.
            6. The system is killed.
            7. Recovery sees the commit of m2. The change of page A made by m2 is replayed.
            8. Recovery doesn't see the complete log events of m1 in log file 2. m1 isn't complete, so m1 is discarded. The change to page A made by m1 is ignored. As a result, page A may be inconsistent.

            In short, there might be uncommitted transactions that modify the same pages as our transaction. The mini-transactions of such uncommitted transactions may write to log files other than the ones written by our transaction. So, to avoid the above issue, when committing our transaction we need to flush not only the redo log files the current transaction writes, but also all redo log files written by such uncommitted transactions.

            Because of this, Inaam's idea might not perform well in practice.


            I agree with micai that likely the only practical solution for preventing the scenario is to flush all redo log files whenever a state change needs to be made durable in the database. This would seem to remove any performance benefits of partitioning the log file.

            inaamrana’s idea should still work in special cases, such as when each mini-transaction modifies its private set of pages, or if a (short) user transaction keeps page locks for the whole duration of the transaction. Maybe we could consider having separate groups of undo-redo-log or rseg-redo-log files, and allowing some level of partial ordering among related mini-transaction commits.

            I was also thinking about extending the checkpoint information. Perhaps we should store all checkpoints in a separate sequential log file, pointing to the individual log files that contain changes since the start of the checkpoint. Perhaps all MLOG_FILE_ entries should be written to the checkpoint log file, while the log record files would only contain page-level log.

            marko Marko Mäkelä added a comment
            inaamrana Inaam Rana added a comment -

            micai, you are right. I haven't thought it through enough. Back to the drawing board.

            marko Marko Mäkelä added a comment - edited

            Regarding the preallocate.c vs append.c, today I found out that there still exist file systems where writing to a preallocated file is faster than appending to a file. This means that we should continue to offer an option where the log file is preallocated and written in circular fashion, instead of being written in append-only mode.

            I also ran a modified version of preallocate.c that would omit the O_CREAT flag and the posix_fallocate() call. On ext4, fallocate -l 2g file completed in virtually no time, and using the modified test program to write to the preallocated file took around 25 seconds on the same hardware where I previously reported 24.2 seconds.


            MDEV-15914 (and its main fix) showed that a small change to the redo log volume can have a huge impact on performance.
            I think that the redo log record format must allow multiple byte strings to be written to the same page without repeating the tablespace identifier or page number.

            marko Marko Mäkelä added a comment

            The VCDIFF format implemented in Xdelta could be a good starting point for a new redo log format.

            marko Marko Mäkelä added a comment

            The bsdiff format has a different design goal: using a lot of RAM and CPU, create a minimal "binary patch" that can be distributed to a large number of clients. In InnoDB, there usually is at most 1 "consumer" of the redo log: the InnoDB crash recovery.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            Related to this work, MDEV-18115 will stop creating a fil_space_t object and using the fil_io() and fil_flush(SRV_LOG_SPACE_FIRST_ID) interfaces for writing to the redo log files. This should reduce contention on fil_system.mutex.


            While implementing this, we should ensure that the latest redo log block is never being overwritten. InnoDB is currently doing that, and mariabackup compensates for it by re-reading the latest redo log block if it was not completely filled. Rewriting the latest redo log block feels crash-unsafe: if the server is killed during the write, you could end up with a corrupted log block and lose a few redo log records. In the worst case, you would lose an already durable transaction commit, or some pages would already have been flushed with the LSN of a mini-transaction that was lost because of the log block corruption.

            marko Marko Mäkelä added a comment

            I think that it should be simplest to exclusively use synchronous I/O for the redo log. Currently, the log checkpoint write uses asynchronous I/O.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            My reading of man 2 write suggests that whether or not the individual redo log files are append-only or written in a circular fashion, we should not need any mutex to guard concurrent writes from multiple threads to a file through a shared file descriptor:

            For a seekable file (i.e., one to which lseek(2) may be applied, for example, a regular file) writing takes place at the file offset, and the file offset is incremented by the number of bytes actually written. If the file was open(2)ed with O_APPEND, the file offset is first set to the end of the file before writing. The adjustment of the file offset and the write operation are performed as an atomic step.

            BUGS

            According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 ("Thread Interactions with Regular File Operations"):

            "All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: ..."

            Among the APIs subsequently listed are write() and writev(2). And among the effects that should be atomic across threads (and processes) are updates of the file offset. However, on Linux before version 3.14, this was not the case: if two processes that share an open file description (see open(2)) perform a write() (or writev(2)) at the same time, then the I/O operations were not atomic with respect to updating the file offset, with the result that the blocks of data output by the two processes might (incorrectly) overlap. This problem was fixed in Linux 3.14.

            The mentioned Linux kernel bug should not affect InnoDB, because InnoDB would be writing to the log files from multiple threads of the same process.

            The key seems to be to invoke write() or a similar function that uses and updates the current position of the file descriptor. pwrite() would require the caller to keep track of a position, and we do not want that. All the log of a mini-transaction should be written with a single system call. We will probably want some framing with explicit length and checksum around each mini-transaction log snippet, instead of forcing the log to be structured as blocks.

            Edit: An open problem with multiple concurrent threads writing to a file is that each write can be truncated into a partial write if the write is interrupted by a signal. If such an interrupting signal can be sent only to a subset of the file-writing threads, the log could easily be corrupted. Having the very last write truncated due to the server being killed is tolerable, but continuing writes to the log after a truncated write is not.

            Hence, I believe that it could be cleaner to have O_DIRECT synchronous write requests from a single thread, writing full, aligned blocks. Partly filled blocks would only be written on log_write_up_to(), just like it is now. The main difference would be that log could be written into multiple files in parallel.

            We might also employ some form of Lempel-Ziv compression on the log data that is going to be written. This would require identifying some ‘restart points’ for parsing the log.
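
            A minimal sketch of such a single-writer block write, assuming the log file was opened with O_DIRECT|O_DSYNC and that BLOCK is the detected physical block size; posix_memalign() satisfies the memory-alignment requirement of O_DIRECT:

            #include <cstdlib>
            #include <cstring>
            #include <unistd.h>

            constexpr size_t BLOCK = 4096;  // assumed physical block size

            // Write one full, aligned block at an explicit offset. pwrite()
            // avoids any shared file-offset state, so only the single writer
            // thread issues log writes.
            static ssize_t write_block(int fd, const void *src, size_t len,
                                       off_t offset)
            {
              void *buf;
              if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
                return -1;
              memset(buf, 0, BLOCK);   // NUL-fill the unused tail of the block
              memcpy(buf, src, len);   // len <= BLOCK: the partly filled case
              const ssize_t ret = pwrite(fd, buf, BLOCK, offset);
              free(buf);
              return ret;
            }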


            MDEV-12353 will improve the format of individual redo log records. This task will implement some framing around them, such as compression, division into blocks, LSN assignment. I think that for now, we will write the entire log of a single mini-transaction into a single log file.

            marko Marko Mäkelä added a comment

            Some ideas for improving the redo log block format:

            • Use a variable block size, so that a mini-transaction will never be split between blocks. Write the size at the start of the block and the checksum at the end. In this way, we can make the collection of log records point directly to the parse buffer, and also remove recv_data_copy_to_buf(). (If log blocks are compressed, decompression would output to the parse buffer.)
            • Each time a new block is seen, it will mark the start of a new mini-transaction. In this way, the log blocks can avoid encoding the LSN (which would grow by one per mini-transaction commit).
            • Allow NUL-padding of short blocks to the physical block size, so that if a log flush is needed, read-modify-write on the file system can be avoided.

            When it comes to file operations and log checkpoints, I think that it could be worthwhile to have a separate sequential log file that keeps a number of latest checkpoints as well as all the file operations. The checkpoint information in the checkpoint log file would point to the data log files, which would only contain page-level redo log records. This would remove the need to call the equivalent of fil_names_clear() at log checkpoint. Only when emptying or rotating the checkpoint log file we would write the equivalent of MLOG_FILE_NAME records to the new checkpoint log file.

            marko Marko Mäkelä added a comment

            I think that we must abandon the idea of partitioning the redo log. I would go with 3 files:

            • mostly dummy ib_logfile0 to identify the file format
            • an append-only, binlog-style-rotated file for checkpoints and file-level operations (create, delete, rename, modify)
            • a page-level redo log file, 2 variants: block-oriented circular, or append-only
            • LSN will be logical, incremented by 1 on mtr_t::commit() when redo log records were generated.

            Even in the circular log file format, page-level redo log records will never be interrupted by log block trailers or headers. That is, we will write variable-size blocks, with the LSN at the start of the block (to detect the end of "new" log).

            In the byte-oriented append-only log file format, if a persistent write is requested (on user transaction commit), we will write an extra record that contains a checksum of the bytes that were written since the previous persistent write. The payload of the stream could even be encrypted.

            To alleviate the log_sys.mutex and log_sys.write_mutex bottleneck, we will introduce a dedicated log writing task, which will:

            • Collect mtr_t::log records and encode them into a private buffer of this task
            • In case of a circular log file, issue a log checkpoint if the log tail would overwrite the head. (Other calls to log_free_check() can be removed!)
            • Before a log checkpoint is issued, any background crash recovery (MDEV-14481) must be finished.
            • If persistence is requested, issue a synchronous write of the data to the log file.

            Because page-level log records of any single mini-transaction will be continuous streams of bytes in the log file, on recovery we can avoid copying log record snippets to recv_sys.pages. Instead, we can simply attach pointers to log_sys.buf that was read from the page-level redo log file. This should automatically fix MDEV-19176 when using the MDEV-12353 format. Note: compressing the log record stream could prevent such optimization, so we will not introduce any compression for now.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            Currently, on recovery, redo log records are being copied twice from memory to memory:

            1. from redo log file blocks to contiguous strings of bytes
            2. from contiguous strings of bytes to recv_t (in limited-size chunks that are allocated from the buffer pool)

            We have a 2MiB recv_sys.buf for the initial buffering. The minimum size of log_sys.buf would be 16MiB, and that buffer should be practically unused during recovery. If the buffer pool size is measured in gigabytes, it would indeed make sense to use the buffer pool for the recovered log records.

            I updated MDEV-19176 with a suggested design how to improve the buffer pool utilization during crash recovery. It is independent of the redo log record or file format.


            When innodb_scrub_log=ON, log checkpoint should clear the unused part of the redo log, simply by invoking fallocate(fd, FALLOC_FL_PUNCH_HOLE, offset, len) or its Windows equivalent. There is no need for a separate log_scrub_thread. The hole-punching should work with both circular and append-only log files.

            Removing the log_scrub_thread should fix MDEV-18370, MDEV-20474, MDEV-20475.
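
            A minimal sketch of the hole-punching call; note that on Linux, FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE so that the apparent file size stays unchanged:

            #include <fcntl.h>          // fallocate() (Linux-specific)
            #include <linux/falloc.h>   // FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE

            // Deallocate (zero out) the unused [offset, offset+len) range of
            // the log file at checkpoint time, keeping the file size intact.
            static int scrub_log_range(int fd, off_t offset, off_t len)
            {
              return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                               offset, len);
            }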

            marko Marko Mäkelä added a comment

            kevg Eugene Kosov (Inactive) added a comment -

            This is my test program, which was created to test the performance of writing to a circular file versus appending to a file. I've run the test on my laptop with an HDD and ext4.

            Results without fsync():

            File size 524288 Kb
            Writing 4194304 Kb to it
             
            Simple cyclic file
            Took 6 seconds 714 milliseconds
             
            Mmapped cyclic file
            Took 0 seconds 532 milliseconds
             
            O_APPEND append file
            Took 25 seconds 260 milliseconds
             
            Simple append file
            Took 24 seconds 623 milliseconds
            

            Results with fsync():

            File size 1024 Kb
            Writing 4096 Kb to it
             
            Simple cyclic file
            Took 59 seconds 111 milliseconds
             
            O_DSYNC cyclic file
            Took 59 seconds 267 milliseconds
             
            Mmapped cyclic file
            Took 58 seconds 516 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            Took 107 seconds 6 milliseconds
             
            O_APPEND append file
            Took 202 seconds 492 milliseconds
             
            Simple append file
            Took 249 seconds 301 milliseconds
            

            And the program itself:

            #include <sys/types.h>
            #include <sys/mman.h>
            #include <sys/stat.h>
            #include <fcntl.h>
            #include <unistd.h>
             
            #include <cerrno>
            #include <cassert>
            #include <cstdlib>
            #include <cstring>
            #include <cstdint> /* uintptr_t, used below for page alignment */
             
            #include <array>
            #include <chrono>
            #include <string>
            #include <iostream>
             
            static const size_t kWriteTotal = 4 * 1024 * 1024;
            static const size_t kFileSize = 1 * 1024 * 1024;
            static const std::array<unsigned char, 1024> kBuf = { 33 };
            alignas(512) static const std::array<unsigned char, 1024> kAlignedBuf = { 33 };
            static_assert(kFileSize % kBuf.size() == 0, "");
            static const std::string kPath = "test_file";
             
            void ShowErrnoAndExit(std::string function_name) {
              std::cerr << function_name << " returned errno " << strerror(errno) << "\n";
              std::exit(EXIT_FAILURE);
            }
             
            struct File {
              File(std::string path, int additional_flags = 0) : path_(std::move(path)) {
                fd_ = open(path_.c_str(),
            	       O_CREAT | O_TRUNC | O_RDWR | additional_flags,
            	       S_IRUSR | S_IWUSR);
                if (fd_ == -1)
                  ShowErrnoAndExit("open()");
              }
             
              void Resize(off_t length) {
                assert(fd_ != -1);
             
                if (int ret = posix_fallocate(fd_, 0, length)) {
                  errno = ret;
                  ShowErrnoAndExit("posix_fallocate()");
                }
              }
             
              void Write(const void *buf, size_t count) {
                assert(fd_ != -1);
             
                ssize_t ret = write(fd_, buf, count);
                if (ret == -1)
                  ShowErrnoAndExit("write()");
             
                if (static_cast<size_t>(ret) != count) {
                  std::cerr << "write() partial write\n";
                  std::exit(EXIT_FAILURE);
                }
              }
             
              void PWrite(const void *buf, size_t count, off_t offset) {
                assert(fd_ != -1);
             
                ssize_t ret = pwrite(fd_, buf, count, offset);
                if (ret == -1)
                  ShowErrnoAndExit("pwrite()");
             
                if (static_cast<size_t>(ret) != count) {
                  std::cerr << "pwrite() partial write\n";
                  std::exit(EXIT_FAILURE);
                }
              }
             
              void Fsync() {
                assert(fd_ != -1);
             
                if (fsync(fd_) == -1)
                  ShowErrnoAndExit("fsync()");
              }
             
              void Fdatasync() {
                assert(fd_ != -1);
             
                if (fdatasync(fd_) == -1)
                  ShowErrnoAndExit("fdatasync()");
              }
             
              size_t Size() {
              assert(fd_ != -1);
             
                struct stat s;
                if (fstat(fd_, &s) == -1)
                  ShowErrnoAndExit("fstat()");
             
                return s.st_size;
              }
             
              void Mmap() {
                assert(fd_ != -1);
             
                size_t size = Size();
                mapped_ = mmap(nullptr, size,
            		   PROT_READ | PROT_WRITE,
            		   MAP_SHARED | MAP_POPULATE,
            		   fd_, 0);
                if (mapped_ == MAP_FAILED)
                  ShowErrnoAndExit("mmap()");
              }
             
              unsigned char *GetMmappedRegion() {
                assert(fd_ != -1);
                assert(mapped_);
             
                return static_cast<unsigned char *>(mapped_);
              }
             
              void Msync(void *addr, size_t len) {
                if (msync(addr, len, MS_SYNC) == -1) {
                  ShowErrnoAndExit("msync()");
                }
              }
             
              void Munmap() {
                assert(fd_ != -1);
                assert(mapped_ != nullptr);
             
                if (munmap(mapped_, Size()) == -1)
                  ShowErrnoAndExit("munmap()");
             
                mapped_ = nullptr;
              }
             
              ~File() {
                if (mapped_)
                  Munmap();
             
                if (fd_ != -1 && close(fd_) == -1)
                  ShowErrnoAndExit("close()");
             
             
                if (unlink(path_.c_str()) == -1)
                  ShowErrnoAndExit("unlink()");
              }
             
              int fd_{-1};
              void *mapped_{nullptr};
              std::string path_;
            };
             
            struct SimpleCyclicWriter {
              SimpleCyclicWriter(std::string path) : file_(path) {
                file_.Resize(kFileSize);
                file_.Fsync();
              }
             
              static std::string Name()  { return "Simple cyclic file"; }
             
              void Write() {
                const off_t offset = offset_ % kFileSize;
                file_.PWrite(kBuf.data(), kBuf.size(), offset);
                offset_ += kBuf.size();
              }
             
              void Flush() { file_.Fdatasync(); }
             
              File file_;
              off_t offset_{0};
            };
             
            struct DSyncCyclicWriter {
              DSyncCyclicWriter(std::string path) : file_(path, O_DSYNC) {
                file_.Resize(kFileSize);
                file_.Fsync();
              }
             
              static std::string Name()  { return "O_DSYNC cyclic file"; }
             
              void Write() {
                const off_t offset = offset_ % kFileSize;
                file_.PWrite(kBuf.data(), kBuf.size(), offset);
                offset_ += kBuf.size();
              }
             
              void Flush() { }
             
              File file_;
              off_t offset_{0};
            };
             
            struct ODirectODsyncCyclicWriter {
              ODirectODsyncCyclicWriter(std::string path) :
                  file_(path, O_DIRECT | O_DSYNC) {
                file_.Resize(kFileSize);
                file_.Fsync();
              }
             
              static std::string Name()  { return "O_DIRECT|O_DSYNC cyclic file"; }
             
              void Write() {
                const off_t offset = offset_ % kFileSize;
                file_.PWrite(kAlignedBuf.data(), kAlignedBuf.size(), offset);
                offset_ += kAlignedBuf.size();
              }
             
              void Flush() { }
             
              File file_;
              off_t offset_{0};
            };
             
            struct MmappedCyclicWriter {
              MmappedCyclicWriter(std::string path) : file_(path) {
                file_.Resize(kFileSize);
                file_.Fsync();
                file_.Mmap();
              }
             
              static std::string Name()  { return "Mmapped cyclic file"; }
             
              void Write() {
                const off_t offset = offset_ % kFileSize;
                memcpy(file_.GetMmappedRegion() + offset, kBuf.data(), kBuf.size());
                prev_offset_ = offset;
                offset_ += kBuf.size();
              }
             
              void Flush() {
                auto *start = file_.GetMmappedRegion() + prev_offset_;
                auto *end = start + kBuf.size();
                start = start - reinterpret_cast<std::uintptr_t>(start) % page_size_;
             
                assert(start >= file_.GetMmappedRegion());
                assert(end <= file_.GetMmappedRegion() + file_.Size());
                file_.Msync(start, end - start);
              }
             
              File file_;
              uintptr_t page_size_{static_cast<uintptr_t>(sysconf(_SC_PAGE_SIZE))};
              off_t prev_offset_{0};
              off_t offset_{0};
            };
             
            struct OAppendAppendWriter {
              OAppendAppendWriter(std::string path) : file_(path, O_APPEND) {}
             
              static std::string Name()  { return "O_APPEND append file"; }
             
              void Write() { file_.Write(kBuf.data(), kBuf.size()); }
              void Flush() { file_.Fsync(); }
             
              File file_;
            };
             
            struct SimpleAppendWriter {
              SimpleAppendWriter(std::string path) : file_(path) {}
             
              static std::string Name()  { return "Simple append file"; }
             
              void Write() { file_.Write(kBuf.data(), kBuf.size()); }
              void Flush() { file_.Fsync(); }
             
              File file_;
            };
             
            struct Timer {
              using Clock = std::chrono::steady_clock;
             
              ~Timer() {
                auto duration = Clock::now() - now_;
                auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(duration);
                auto s = std::chrono::duration_cast<std::chrono::seconds>(ms);
                ms = ms - s;
                std::cout << "Took " << s.count() << " seconds " << ms.count() <<
                    " milliseconds\n";
              }
             
              Clock::time_point now_ = Clock::now();
            };
             
            template <class Writer>
            void Test() {
              std::cout << Writer::Name() << "\n";
              Writer file_(kPath);
             
              {
                Timer timer;
                size_t write_total = kWriteTotal;
                while (write_total) {
                  file_.Write();
                  file_.Flush();
                  write_total -= kBuf.size();
                }
              }
             
              std::cout << "\n";
            }
             
            int main() {
              std::cout << "File size " << kFileSize / 1024 << " Kb\n";
              std::cout << "Writing " << kWriteTotal / 1024 << " Kb to it\n";
              std::cout << "\n";
             
              Test<SimpleCyclicWriter>();
              Test<DSyncCyclicWriter>();
              Test<MmappedCyclicWriter>();
              Test<ODirectODsyncCyclicWriter>();
              Test<OAppendAppendWriter>();
              Test<SimpleAppendWriter>();
             
              return EXIT_SUCCESS;
            }
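
            For reference, the program builds with any C++11 compiler; assuming the source is saved as write_test.cc:

            g++ -O2 -std=c++11 write_test.cc -o write_test && ./write_test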
            

            So, writing to a circular file is clearly much faster than appending to a file.


            kevg Eugene Kosov (Inactive) added a comment -

            junsu danblack could you please run my program on some server hardware? I have no access to anything but my not-so-up-to-date laptop.

            The biggest question is to decide what's faster: appending to a file or writing to a circular file. Sorry, no CLI interface for my program, but you can tweak the file size and the amount of data to write by changing the globals.

            baotiao zongzhi chen added a comment -

            Hey guys. I have done almost the same work: changing the redo log from a circular file to appending to a new file.
            Of course, writing to a circular file is much faster than appending to a file, since appending needs to modify the inode and allocate an extent for the file, which takes about 8 times as long as writing to a circular file. I have shown the results in this slide https://www.slideshare.net/baotiao/polardb-percona19 from page 18.
            However, writing to a circular file needs to solve the "read-on-write" issue, while appending to a file does not.

            So the way we use it in our environment is: when there is no stale redo log file, we allocate a new redo log file and fill it with zeroes. In the background, when some stale redo log file is no longer needed, we don't delete it directly; we rename the stale file to become a new redo log file. In InnoDB, we also pad writes to 4k to avoid the "read-on-write" issue. (A sketch of this scheme follows below.)
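
            A hedged sketch of that recycling scheme (hypothetical file names and helpers, not the actual PolarDB code): pre-zero a log file once so that later appends only overwrite already-allocated extents, and rename stale logs into place instead of creating fresh ones:

            #include <fcntl.h>
            #include <unistd.h>
            #include <cstdio>    /* rename() */
            #include <algorithm>
            #include <vector>
             
            /* Create a fully zero-filled log file, paying the extent-allocation
               cost once up front instead of on every append. */
            static int create_prezeroed_log(const char *path, size_t size)
            {
              int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
              if (fd == -1) return -1;
              std::vector<char> zeroes(1 << 20, 0);
              for (size_t written = 0; written < size; ) {
                ssize_t n = write(fd, zeroes.data(),
                                  std::min(zeroes.size(), size - written));
                if (n <= 0) { close(fd); return -1; }
                written += size_t(n);
              }
              if (fsync(fd) || close(fd)) return -1;
              return 0;
            }
             
            /* Recycle a stale, no-longer-needed log file as the next log file:
               its extents are already allocated, so no zero-filling is needed. */
            static int recycle_log(const char *stale_path, const char *next_path)
            {
              return rename(stale_path, next_path);
            }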

            danblack Daniel Black added a comment -

            'expanded size'

            static const size_t kFileSize = 1 * 128 * 1024 * 1024;
             
            static const size_t kWriteTotal = 4 * kFileSize;
            

            'nvme, ext4 (rw,relatime), kernel 5.3.0 - POWER9'

            $ ~/write_test
            File size 131072 Kb
            Writing 524288 Kb to it
             
            Simple cyclic file
            Took 132 seconds 538 milliseconds
             
            O_DSYNC cyclic file
            Took 61 seconds 86 milliseconds
             
            Mmapped cyclic file
            Took 248 seconds 543 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            pwrite() returned errno Invalid argument
             
            O_APPEND append file
            Took 959 seconds 463 milliseconds
             
            Simple append file
            Took 770 seconds 324 milliseconds
            

            '2x 12G SAS disks, raid0 - ServeRAID M5210, lvm (1 linear continuous map), xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota), 5.3.0-24-generic (ubuntu), x86_64'

            ./write_test 
            File size 131072 Kb
            Writing 524288 Kb to it
             
            Simple cyclic file
            Took 98 seconds 359 milliseconds
             
            O_DSYNC cyclic file
            Took 99 seconds 659 milliseconds
             
            Mmapped cyclic file
            Took 100 seconds 107 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            Took 42 seconds 208 milliseconds
             
            O_APPEND append file
            Took 300 seconds 383 milliseconds
             
            Simple append file
            Took 303 seconds 216 milliseconds
            

            Looking why NVMe was slow - seems to be 4k LBA:

            'smartctl -a /dev/nvme0n1'

            smartctl 6.6 2016-05-31 r4324 [ppc64le-linux-5.3.0] (local build)
            Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
             
            === START OF INFORMATION SECTION ===
            Model Number:                       PCIe3 1.6TB NVMe Flash Adapter
            Serial Number:                      CJH0010003EE
            Firmware Version:                   KMIPP107
            PCI Vendor ID:                      0x1c58
            PCI Vendor Subsystem ID:            0x1014
            IEEE OUI Identifier:                0x000cca
            Controller ID:                      1269
            Number of Namespaces:               1
            Namespace 1 Size/Capacity:          1,600,321,314,816 [1.60 TB]
            Namespace 1 Formatted LBA Size:     4096
            Local Time is:                      Wed Jan  8 13:22:03 2020 AEDT
            Firmware Updates (0x08):            4 Slots
            Optional Admin Commands (0x0006):   Format Frmw_DL
            Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
             
            Supported Power States
            St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
             0 +    25.00W       -        -    0  0  0  0    15000   15000
             1 +    20.00W       -        -    1  1  1  1    15000   15000
             2 +    15.00W       -        -    2  2  2  2    15000   15000
             3 +    10.00W       -        -    3  3  3  3    15000   15000
             4 -    10.00W       -        -    3  3  3  3    15000   15000
             
            Supported LBA Sizes (NSID 0x1)
            Id Fmt  Data  Metadt  Rel_Perf
             0 +    4096       0         0
             1 -    4096       8         1
             
            === START OF SMART DATA SECTION ===
            SMART overall-health self-assessment test result: PASSED
             
            SMART/Health Information (NVMe Log 0x02, NSID 0x1)
            Critical Warning:                   0x00
            Temperature:                        36 Celsius
            Available Spare:                    100%
            Available Spare Threshold:          10%
            Percentage Used:                    0%
            Data Units Read:                    3,643,068 [1.86 TB]
            Data Units Written:                 6,647,820 [3.40 TB]
            Host Read Commands:                 83,729,047
            Host Write Commands:                81,415,498
            Controller Busy Time:               841
            Power Cycles:                       1,569
            Power On Hours:                     22,345
            Unsafe Shutdowns:                   516
            Media and Data Integrity Errors:    0
            Error Information Log Entries:      0
             
            Error Information (NVMe Log 0x01, max 63 entries)
            No Errors Logged
            

            'dumpe2fs'

            sudo dumpe2fs -h /dev/nvme0n1p1 | more
            dumpe2fs 1.44.1 (24-Mar-2018)
            Filesystem volume name:   scratch
            Last mounted on:          /scratch
            Filesystem UUID:          356af640-5d8e-4256-9dce-a0983c9e0e43
            Filesystem magic number:  0xEF53
            Filesystem revision #:    1 (dynamic)
            Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
            Filesystem flags:         unsigned_directory_hash 
            Default mount options:    user_xattr acl
            Filesystem state:         clean
            Errors behavior:          Continue
            Filesystem OS type:       Linux
            Inode count:              65052672
            Block count:              260208384
            Reserved block count:     13010419
            Free blocks:              132171164
            Free inodes:              62683469
            First block:              0
            Block size:               4096
            Fragment size:            4096
            Reserved GDT blocks:      961
            Blocks per group:         32768
            Fragments per group:      32768
            Inodes per group:         8192
            Inode blocks per group:   512
            Flex block group size:    16
            Filesystem created:       Wed Nov 22 15:34:07 2017
            Last mount time:          Tue Jan  7 20:38:12 2020
            Last write time:          Tue Jan  7 20:38:12 2020
            Mount count:              293
            Maximum mount count:      -1
            Last checked:             Wed Nov 22 15:34:07 2017
            Check interval:           0 (<none>)
            Lifetime writes:          20 TB
            Reserved blocks uid:      0 (user root)
            Reserved blocks gid:      0 (group root)
            First inode:              11
            Inode size:	          256
            Required extra isize:     32
            Desired extra isize:      32
            Journal inode:            8
            Default directory hash:   half_md4
            Directory Hash Seed:      b68c7dc9-1e4b-46d2-b6d2-6f2d0128b12a
            Journal backup:           inode blocks
            Checksum type:            crc32c
            Checksum:                 0xb0abf812
            Journal features:         journal_incompat_revoke journal_checksum_v3
            Journal size:             1024M
            Journal length:           262144
            Journal sequence:         0x0050f6ac
            Journal start:            153997
            Journal checksum type:    crc32c
            Journal checksum:         0xc52e4213
            

            '4k test'

            static const size_t kFileSize = 1 * 128 * 1024 * 1024; 
             
            static const size_t kWriteTotal = 4 * kFileSize;
             
            static const size_t kBuffSize = 4096;
             
            static const std::array<unsigned char, kBuffSize> kBuf = { 33 };
             
            static const std::array<unsigned char, kBuffSize> kAlignedBuf alignas(kBuffSize) = { 33 };
            

            'nvme, ext4 (rw,relatime), kernel 5.3.0 - POWER9' - 4k writes

            File size 131072 Kb
            Writing 524288 Kb to it
             
            Simple cyclic file
             
            Took 49 seconds 300 milliseconds
             
            O_DSYNC cyclic file
            Took 57 seconds 536 milliseconds
             
            Mmapped cyclic file
            Took 75 seconds 627 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            Took 64 seconds 600 milliseconds
             
            O_APPEND append file
            Took 231 seconds 47 milliseconds
             
            Simple append file
            Took 229 seconds 604 milliseconds
            
            

            '2x 12G SAS disks, raid0 - ServeRAID M5210, lvm (1 linear continuous map), xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota), 5.3.0-24-generic (ubuntu), x86_64' - 4K writes

            File size 131072 Kb
            Writing 524288 Kb to it
             
            Simple cyclic file
            Took 16 seconds 5 milliseconds
             
            O_DSYNC cyclic file
            Took 15 seconds 664 milliseconds
             
            Mmapped cyclic file
            Took 17 seconds 846 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            Took 14 seconds 823 milliseconds
             
            O_APPEND append file
            Took 48 seconds 328 milliseconds
             
            Simple append file
            Took 48 seconds 207 milliseconds
            
            

            'xfs_info'

            xfs_info /var
            meta-data=/dev/mapper/ka4_disks-ka4_var isize=512    agcount=4, agsize=73119744 blks
                     =                       sectsz=4096  attr=2, projid32bit=1
                     =                       crc=1        finobt=1 spinodes=0 rmapbt=0
                     =                       reflink=0
            data     =                       bsize=4096   blocks=292478976, imaxpct=5
                     =                       sunit=0      swidth=0 blks
            naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
            log      =internal               bsize=4096   blocks=142812, version=2
                     =                       sectsz=4096  sunit=1 blks, lazy-count=1
            realtime =none                   extsz=4096   blocks=0, rtextents=0
            

            Seems this was set up for 4k too.

            baotiao zongzhi chen added a comment -

            @marko why did you abandon the work on partitioning the redo log? We know that AWS Aurora must have done this work; otherwise they could not partition the data by space_id:page_id across multiple storage nodes. The redo log that modifies a page must stay in the same partition as the page. This is the basic design that lets them apply the redo log and run crash recovery in parallel.

            I really think that a partitioned redo log is a good idea: in a compute-storage separation architecture, InnoDB needs to support much larger data sizes, such as 20T or 100T. POLARDB has met this case. If we don't partition the redo log and the data pages, we can't parallelize well.


            kevg Eugene Kosov (Inactive) added a comment -

            baotiao thank you for your answer! I think that to solve the `read-on-write` issue we can use posix_fadvise() or posix_madvise(). Did you try that? As I understand it, you now fill the file with zeroes to force the OS to cache it. Can you instead pre-read it somehow? At first glance that looks less invasive than writing.

            baotiao zongzhi chen added a comment -

            No. The root cause is that if the write size isn't aligned to the 4k block, the OS needs to read the whole 4k block and then modify the data you want; the write operation requires an extra read operation.

            Filling the file with zeroes solves the allocation of extents when appending to a file: if an address in the file hasn't been written before, an extent needs to be allocated from the filesystem.
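
            In other words, padding every log write up to the device block size turns a read-modify-write into a plain overwrite. A minimal sketch of that rounding (assuming a 4096-byte block size):

            #include <cstddef>
             
            static const size_t kBlockSize = 4096;
             
            /* Round a write length up to the device block size so that a
               direct write never touches a partial block and therefore
               never triggers a read-on-write. */
            static size_t pad_to_block(size_t len)
            {
              return (len + kBlockSize - 1) & ~(kBlockSize - 1);
            }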


            kevg Eugene Kosov (Inactive) added a comment -

            danblack hi. We have become interested in the FUA write optimization (https://bobsql.com/sql-server-on-linux-forced-unit-access-fua-internals/). It's implemented at least on XFS. And you performed benchmarks where `O_DIRECT|O_DSYNC` was the fastest option. Do you have FUA enabled? You can probably check it like this:

            $ dmesg | grep -i fua
            [    1.549434] sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
            

            O_DIRECT|O_DSYNC cyclic file
            pwrite() returned errno Invalid argument
            

            Do you think it was a bug in the testing program?

            Also, maybe you know what `fdatasync()` does for file descriptors opened with `O_DSYNC`? In my understanding it would be a no-op.

            danblack Daniel Black added a comment - - edited

            pwrite() returned errno Invalid argument - I assume this was writing a 512-byte-aligned block when the underlying layer was 4k, as changing to 4k writes got a result for this. Calling fstat on the file and using `st_blksize` probably avoids this (see the sketch after this comment).

            FUA seems to be a standard part of NVMe (which I used), and the Linux NVMe driver has some concept of it based on its codebase.

            There appears to be nothing special about `O_DSYNC` (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/sync.c?h=v5.5#n196); nothing seemingly special in the ext4/xfs implementations either.
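
            A small sketch of the st_blksize probe suggested above (the test program could call this instead of hard-coding the buffer size; the fallback value is an assumption):

            #include <sys/stat.h>
            #include <cstddef>
             
            /* Return the filesystem's preferred I/O block size for an open
               descriptor; direct writes should be a multiple of this. */
            static size_t PreferredBlockSize(int fd)
            {
              struct stat s;
              return fstat(fd, &s) == 0 ? size_t(s.st_blksize) : 4096;
            }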


            marko Marko Mäkelä added a comment -

            baotiao, sorry, I missed your question.

            I think that if we eliminate the undo log pages and the TRX_SYS page and write all user transaction data only to the redo log, we will remove an artificial synchronization point between independent transactions. I have been toying with the idea ever since junsu challenged me to think about how to make more efficient use of NVDIMM or PMEM (byte-addressable persistent storage). The idea would be to write undo log records into the redo log, as many databases do. The MDEV-12353 redo log format does allow this easily. We could even do memory-mapped I/O and let the DB_ROLL_PTR be a direct pointer into the redo log, to speed up MVCC and ROLLBACK. I have not come up with any good solution for redo log checkpointing, though: in this scheme, an old read view or an active transaction can prevent a log checkpoint from being made. (Alternatively, we would have to append old undo log information to the redo log and patch all the DB_ROLL_PTR that point to them.)

            If transactions are truly independent due to not sharing any undo log pages, then I think that the partitioned log should work.

            marko Marko Mäkelä added a comment - - edited

            The current design idea is as follows:

            Checkpoint information & file operations

            There will be a separate file that contains information about log checkpoints and data file names. This file can contain information about multiple checkpoints. (The old ib_logfile0 only has room for 2 checkpoints.)

            The checkpoint log file makes it possible to construct the mapping between numeric tablespace identifiers and file names.
            The checkpoint log file is never encrypted. This allows mariabackup --backup to work without having access to the encryption keys. Because file names are not encrypted in the file system either, and because LSNs appear unencrypted in the diagnostic output, encrypting this file would not offer any security benefit.

            A checkpoint log record comprises the checkpoint LSN and a byte offset in the circular log file, pointing to the log right after the LSN. It will also include the value of the sequence_bit that is described below.
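
            Put concretely, a checkpoint record would carry something like the following fields (an illustrative layout only; the field names are not from the actual design):

            #include <cstdint>
             
            /* Illustrative shape of a checkpoint log record: recovery starts
               scanning the circular log file at file_offset, which corresponds
               to lsn, expecting sequence_bit at that position. */
            struct log_checkpoint_record
            {
              uint64_t lsn;          /* checkpoint LSN (counts mini-transactions) */
              uint64_t file_offset;  /* byte offset in the circular log file */
              bool     sequence_bit; /* expected sequence bit right after the LSN */
            };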

            The circular log

            If any parameters of the log change, the redo log will be rebuilt:

            • innodb_log_file_size
            • innodb_encrypt_log, or the encryption key (key rotation will require the log to be rebuilt)
            • innodb_log_checksums (if we choose to revive this deprecated parameter and implement a variant that lacks checksums)

            There will be no fixed block structure in the circular log file. The LSN will count InnoDB mini-transactions, not bytes. This allows some flexibility: For example, a future version of mariabackup --backup --incremental could inject records to the backup of the main log file, instead of writing separate .delta files.

            The circular log file will consist of length-tagged sequences of bytes:

            byte *append_log(byte *log, const void *payload, size_t size, bool skip_bit, bool sequence_bit)
            {
              size_t length= size;
              if (!skip_bit && innodb_log_checksums)
                length+= 4; /* CRC-32C at the end of the payload */
              byte * const start= log;
              log= mlog_encode_varint(log, length << 2 | skip_bit << 1 | sequence_bit);
              if (!skip_bit)
                memcpy(log, payload, size);
              log+= size;
              if (!skip_bit && innodb_log_checksums)
              {
                /* Always compute the checksum without the sequence_bit. */
                log[-size - 1]&= 0xfe;
                mach_write_to_4(log, ut_crc32(start, log - start));
                log[-size - 1]|= sequence_bit;
                log+= 4;
              }
              return log;
            }
            

            Explanation:

            • If encryption is enabled, the payload will have been encrypted before the log is written. The length and the checksum will not be encrypted.
            • The sequence_bit will be toggled whenever the write position jumps from the end of the circular log file to the beginning.
            • The skip_bit allows us to write a partially filled log block of any size. If we need to persist the log (due to user transaction commit) and we are L bytes into a N-byte block (this depends on the underlying storage!), we can write a special record to say ‘skip the next N-L bytes’. There is no need to initialize (memset()) any skipped garbage bytes.
            • We assume that the log ends when we get a CRC-32C mismatch or the sequence_bit of the next record differs from what we expect. (The last log record could end exactly at a byte offset where a log record before the last wrap-around had been stored, and that record would have a valid checksum.)
            • Note: due to the skip_bit and the lack of memset(), it may be necessary to always store checksums, to reliably detect the end of the log.
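
            For illustration, the reader side could detect the logical end of the log roughly as follows. This is a sketch under the same assumptions as append_log() above: mlog_decode_varint is the hypothetical counterpart of mlog_encode_varint, and the CRC-32C is validated with the sequence bit masked out of the last varint byte, mirroring the writer.

            /* Returns a pointer past one record, or nullptr at the logical end of
               the log (unexpected sequence bit, truncation, or CRC-32C mismatch). */
            const byte *read_record(const byte *log, const byte *end, bool sequence_bit)
            {
              byte *const start= const_cast<byte*>(log);
              uint64_t tag;
              log= mlog_decode_varint(log, &tag); /* hypothetical decoder */
              if (!log || bool(tag & 1) != sequence_bit)
                return nullptr;                  /* wrapped into pre-wrap-around log */
              const bool skip_bit= tag & 2;
              const uint64_t length= tag >> 2;   /* payload, plus CRC if present */
              if (log + length > end)
                return nullptr;                  /* truncated record */
              if (!skip_bit && innodb_log_checksums)
              {
                if (length < 4)
                  return nullptr;
                /* The writer computed the checksum with the sequence bit cleared
                   in the last varint byte; mirror that here. */
                byte *const seq= const_cast<byte*>(log) - 1;
                *seq&= 0xfe;
                const uint32_t crc= ut_crc32(start, (log + length - 4) - start);
                *seq|= sequence_bit;
                if (crc != mach_read_from_4(log + length - 4))
                  return nullptr;                /* logical end of the log */
              }
              return log + length;               /* start of the next record */
            }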
            baotiao zongzhi chen added a comment -

            @marko Marko Mäkelä

            The design right now is that the partitioning is still based on space_id:page_no, and if an mtr modifies multiple pages, then we need to flush all the redo logs that it modified. Is that right? Please correct me if I am misunderstanding something.
            Even though this solution looks rough, I think it is a practical one, since most mtrs modify only a single page; the mtrs that modify multiple pages are mostly SMO operations and undo operations. We could write the undo log into the redo log directly, so that the undo log stays in the same file as the redo log. And the SMO operations are rare.

            However, how can we keep the operation of flushing multiple files atomic?


            kevg Eugene Kosov (Inactive) added a comment -

            > However, how can we keep the operation of flushing multiple files atomic?

            I don't think we have such a problem at all. Only one file with redo data will exist. Writing to it looks like this: the thread which owns the mtr_t prepares a buffer to write to a file (prepends its size, computes a CRC-32C and appends it to the end), then takes the log mutex to write() the buffer, releases the mutex and performs fsync(). And that's it. Then, after some LSN has been flushed to the redo log, we can write(O_DIRECT|O_APPEND) the corresponding checkpoint to another file. Writes to the log file and the checkpoint file are not required to be atomic; thus, it is safe to crash right after flushing the redo log and before writing a checkpoint.
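            A minimal sketch of that commit path, assuming a hypothetical log_fd and log_mutex and a CRC-32C routine my_crc32c() similar to the one InnoDB uses; error handling is omitted:

                #include <cstdint>
                #include <cstring>
                #include <mutex>
                #include <vector>
                #include <unistd.h>

                extern uint32_t my_crc32c(uint32_t crc, const void *data, size_t len); // assumed

                static std::mutex log_mutex; // hypothetical stand-in for log_sys.mutex
                static int log_fd;           // the single file with redo data

                // Durably append the log of one mini-transaction: length prefix, payload,
                // CRC-32C suffix. Only the write() is serialized; the fsync() is not.
                void commit_mtr(const void *payload, uint32_t size)
                {
                  std::vector<uint8_t> buf(4 + size + 4);
                  memcpy(buf.data(), &size, 4);            // prepend the size
                  memcpy(buf.data() + 4, payload, size);   // the mtr_t log records
                  const uint32_t crc = my_crc32c(0, buf.data(), 4 + size);
                  memcpy(buf.data() + 4 + size, &crc, 4);  // append the checksum
                  {
                    std::lock_guard<std::mutex> guard(log_mutex);
                    write(log_fd, buf.data(), buf.size()); // serialized append
                  }
                  fsync(log_fd);                           // durability, outside the mutex
                }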

            baotiao zongzhi chen added a comment -

            No. If the mtr is an SMO operation, then it will include multi-page operations.
            When such an mtr commits, it needs to change more than one data page, and those pages may live in two different redo files if the log is partitioned by space_id:page_no. When we commit the trx, we then need to guarantee that the write to the two redo logs is atomic. In the original version we only write to one file, so it is easy to guarantee that the write is atomic and we don't have this problem.


            kevg Eugene Kosov (Inactive) added a comment -

            Sorry, I think you have some misunderstanding. There are not several redo log files partitioned by space_id:page_no. Before 10.5 it was possible to have several redo log files, but they were used as one logical circular file. In current 10.5 it is already impossible to have several log files. And the current design for the new file format still assumes just one circular redo log file.

            baotiao zongzhi chen added a comment -

            Sorry, I saw the design document
            "The idea: Partition the log into append-only, truncate-the-start files"

            so I supposed we were talking about partitioning the redo log into multiple redo log files.


            kevg Eugene Kosov (Inactive) added a comment -

            baotiao, this initial design was changed. You can find the recent version in the comments on this issue. We decided to do the simplest possible thing: a metadata file with the creator version and similar info is one separate file, checkpoints + file operations (create, delete, rename) go into a separate append-only file, and the actual redo log data is in a third, separate circular file.

            baotiao zongzhi chen added a comment -

            @Eugene Kosov OK, I got it.

            Let me summarize; the changes are:
            1. separate the checkpoint information from the ib_logfile
            2. add the file operations to the checkpoint file
            3. store the undo data in the redo log, as commented by marko
            4. the LSN counts the number of mtrs, not bytes

            Is that right?

            However, I am really interested in separating the data into multiple redo log files, since in architectures like Aurora and PolarDB there will exist B-trees of about 100 TB; if we don't partition the log, there will be only one redo log file, and we cannot make full use of the underlying storage of the file system.

            marko Marko Mäkelä added a comment - - edited

            baotiao, I have been thinking of the following format:

            • Maintain a 512-byte header in ib_logfile0. Add information needed by innodb_encrypt_log there.
            • After the first 512 bytes, write an append-only log consisting only of file name records (similar to the MDEV-12353 FILE_ records) and fixed-length checkpoint records. Each record will be followed by a CRC-32C checksum. The ib_logfile0 will never be encrypted. (File names and LSNs appear unencrypted in the file system and in the logs anyway.)
            • We write a separate circular file ib_logdata that may be encrypted. The checkpoint records point to a byte offset within this file. This file can support any underlying physical sector size.
            • LSN will be in bytes, just like before. But, encryption might no longer use LSN as part of the initialization vector, so that we can encrypt mtr_t::m_log before acquiring any mutex.
            • Rebuilding the redo log file will not affect the LSN.
            • In the future, mariabackup --backup --incremental could get rid of .delta files and instead write the information to the ib_logfile0 file. Any amount of data can be written to that file without affecting the LSN.

            The circular log file could technically be split into multiple files, but we did not see a need for that. I think 128 TiB should suffice for quite some time in the future. The log checkpoint record would be 1+8+6+4=19 bytes.

            At this point, we will not write undo log data to the redo log. I do not know if we will ever do that. I only mentioned the idea and challenges around it.
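            For illustration, a sketch of how such a 19-byte checkpoint record could be encoded, assuming a hypothetical type tag byte and big-endian fields (the 1+8+6+4 split above: tag, LSN, byte offset, CRC-32C):

                #include <cstdint>
                #include <cstring>

                extern uint32_t my_crc32c(uint32_t crc, const void *data, size_t len); // assumed

                static void write_be(uint8_t *p, uint64_t v, int n) // big-endian helper
                {
                  for (int i = n; i--; v >>= 8) p[i] = uint8_t(v);
                }

                // Encode a 1+8+6+4 = 19-byte checkpoint record for the append-only log.
                size_t encode_checkpoint(uint8_t *rec, uint64_t lsn, uint64_t data_file_offset)
                {
                  rec[0] = 0x01;                          // hypothetical "checkpoint" tag byte
                  write_be(rec + 1, lsn, 8);              // checkpoint LSN
                  write_be(rec + 9, data_file_offset, 6); // byte offset in the circular file
                  write_be(rec + 15, my_crc32c(0, rec, 15), 4); // CRC-32C over the first 15 bytes
                  return 19;
                }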


            kevg Eugene Kosov (Inactive) added a comment -

            I did some research on writing to a file from different threads. My testing code is in https://github.com/kevgs/redo/
            Here are the results for a single thread, for commit 0400fe0cc3829177c05341413e555c1f09f81b54, on my weak laptop with an HDD:

            File size: 134217728, threads: 1, duration: 20s
             
            Circular file:
            RedoSyncTLSBuffer handled 841 commits
            RedoSync handled 368 commits
            RedoSyncBuffer handled 801 commits
            RedoODirectSparse handled 390 commits
            RedoODirectBuffer handled 513 commits
            RedoODirectTwoBuffers handled 747 commits
            RedoOverlappedFsync handled 670 commits
            RedoOverlappedMsync handled 3235 commits
            RedoGroupCommit handled 744 commits
             
            Append-only file:
            RedoSyncTLSBuffer handled 758 commits
            RedoSync handled 361 commits
            RedoSyncBuffer handled 823 commits
            RedoODirectSparse handled 365 commits
            RedoODirectBuffer handled 756 commits
            RedoODirectTwoBuffers handled 753 commits
            RedoOverlappedFsync handled 786 commits
            RedoGroupCommit handled 766 commits
            

            And for 64 threads:

            File size: 134217728, threads: 64, duration: 20s
             
            Circular file:
            RedoSyncTLSBuffer handled 825 commits
            RedoSync handled 434 commits
            RedoSyncBuffer handled 823 commits
            RedoODirectSparse handled 487 commits
            RedoODirectBuffer handled 851 commits
            RedoODirectTwoBuffers handled 863 commits
            RedoOverlappedFsync handled 11242 commits
            RedoOverlappedMsync handled 58714 commits
            RedoGroupCommit handled 1035 commits
             
            Append-only file:
            RedoSyncTLSBuffer handled 879 commits
            RedoSync handled 514 commits
            RedoSyncBuffer handled 855 commits
            RedoODirectSparse handled 477 commits
            RedoODirectBuffer handled 830 commits
            RedoODirectTwoBuffers handled 896 commits
            RedoOverlappedFsync handled 11318 commits
            RedoGroupCommit handled 1361 commits
            

            In a multithreaded environment, the append-only file has the same performance as the circular one.


            kevg Eugene Kosov (Inactive) added a comment -

            Actually, with fdatasync() instead of fsync(), the picture is different now:

            File size: 134217728, threads: 64, duration: 20s
             
            Circular file:
            RedoOverlappedFsync handled 50247 commits
             
            Append-only file:
            RedoOverlappedFsync handled 14399 commits
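            A plausible explanation is that a preallocated, fixed-size circular file never changes its length, so fdatasync() does not have to flush file-size metadata on every commit, while an append-only file grows on every write. A minimal sketch of a circular append that benefits from this (hypothetical helper; error handling omitted):

                #include <cstdint>
                #include <unistd.h>

                // Append len bytes at logical position *pos of a fixed-size circular file,
                // splitting the write at the wrap-around point, then persist with fdatasync().
                void circular_append(int fd, const uint8_t *buf, size_t len,
                                     uint64_t *pos, uint64_t file_size)
                {
                  uint64_t off = *pos % file_size;
                  size_t first = len;
                  if (off + len > file_size)
                    first = size_t(file_size - off);         // the part before the wrap
                  pwrite(fd, buf, first, off_t(off));
                  if (first < len)
                    pwrite(fd, buf + first, len - first, 0); // the wrapped remainder
                  *pos += len;
                  fdatasync(fd); // the file size is constant, so no metadata flush is needed
                }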
            
            

            nunop Nuno added a comment -

            Hi,

            If " innodb_log_files_in_group " is being removed/ignored, you may want to update the documentation at:
            https://mariadb.com/kb/en/innodb-redo-log/

            Which advices to configure that variable as required.

            Thank you.

            greenman Ian Gilfillan added a comment -

            Thanks nunop, the documentation has been expanded to mention the 10.5 change in each place where it could be relevant, so hopefully it's not misleading any longer.


            marko Marko Mäkelä added a comment -

            I am sorry, but it does not look like this can be completed in the 10.6 release. At the time when I was finally ready to resume this, we were already close to a feature freeze, and I was reluctant to start changing the file format and rewriting the recovery code that late. So, instead I spent the time on addressing another bottleneck: lock_sys.mutex. With MDEV-23855, MDEV-20612 and MDEV-24738 completed, the major remaining scalability bottleneck in InnoDB is log_sys.mutex, which will be addressed by this task, hopefully very early during the 10.7 development cycle.

            marko Marko Mäkelä added a comment - - edited

            I experimented with whether it makes sense to eliminate FILE_MODIFY and similar records from the normal redo log. The idea was to introduce a separate append-only file exclusively for checkpoint and file name information. I ran Sysbench oltp_update_index using 80×10,000 rows and innodb_log_file_size=2G before and after the change, with the server process pinned to a single Intel® Xeon® E5-2630 processor. During the benchmark, the LSN grew to about 12.6 GiB, that is, 6 times the log file size.
            I observed the following numbers of transactions per second with different numbers of concurrent connections:

            server         10/tps  20/tps  30/tps
            10.7           100909  157561  159615
            10.7-modified  100806  159263  160353

            We would seem to need a run with significantly more log checkpoints, because log checkpoints are where I would expect the FILE_MODIFY bookkeeping (the fil_system.named_spaces) to make the most difference. Here is another test with innodb_log_file_size=256M (1/8 of the original log file size):

            server         10/tps  20/tps  30/tps
            10.7            97272  151572  153554
            10.7-modified   99020  154903  157719

            The 2-minute benchmark runs are probably too short for us to draw any conclusions.

            If this change does not appear to consistently lead to a significant improvement, then I think that it would be best to keep the single circular log file, with a slightly changed structure that I think should be friendly for both persistent memory (PMEM) and computational storage drives (such as those by ScaleFlux):

            • 2 checkpoint (and file format) information blocks of 4096 bytes each
            • Circular log file, with arbitrary block size (64 to 4096 bytes); padded with NUL bytes that are not encrypted nor checksummed

            marko Marko Mäkelä added a comment -

            Another benchmark for assessing the impact of eliminating the FILE_MODIFY records showed some improvement at 32 concurrent connections, and virtually no improvement at 16 concurrent connections.

            It might still turn out that changing the log block format and switching to asynchronous O_DIRECT|O_DSYNC writes of log blocks will reduce contention on log_sys.mutex so much that eliminating the FILE_MODIFY records would not bring significant additional benefit. For this reason, it may be wise to first develop the concurrency-friendlier log block format and then test the removal of the FILE_MODIFY records on top of that.

            marko Marko Mäkelä added a comment - - edited

            I am currently debugging a prototype that will retain the FILE_MODIFY records and a single ib_logfile0, to keep backups and log resizing simple.

            Changes to log block format

            1. innodb_encrypt_log metadata (encryption key information) will be moved to the 512-byte log file header block.
            2. The 2 checkpoint blocks will move to 64 bytes at the start of 4096-byte blocks at offsets 4096 and 8192. This should allow O_DIRECT writes in all file systems as well as allow efficient writing of checkpoints on PMEM.
            3. Redo log record data will start at byte offset 12288, right after the 2 checkpoint blocks.
            4. The 512-byte log block structure will be eliminated. Basically, every mini-transaction will be an arbitrary-sized log block.

            It will be easier to read and write the log file, because each mini-transaction will be a contiguous stream of bytes (except when the mini-transaction wraps around from the last byte of ib_logfile0 to byte offset 12288 at the start of the file). See the sketch below for how an LSN maps to a file offset.
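            A sketch of that mapping, assuming the layout above: record data starting at byte offset 12288, LSNs still counting bytes, and start_lsn being the LSN of the first record written at offset 12288:

                #include <cstdint>

                constexpr uint64_t START_OFFSET = 12288; // first byte of log record data

                // Map an LSN to its byte offset in ib_logfile0. The area from offset
                // 12288 to the end of the file is treated as one circular buffer.
                uint64_t lsn_to_offset(uint64_t lsn, uint64_t start_lsn, uint64_t file_size)
                {
                  return START_OFFSET + (lsn - start_lsn) % (file_size - START_OFFSET);
                }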

            The checkpoint block format

            Bytes 4096 to 12287 will be filled with NUL bytes, except for the 64-byte checkpoint blocks, which will contain the following information (a sketch of filling such a block follows the list):

            1. 64-bit checkpoint log sequence number (LSN)
            2. 64-bit LSN of log with optional FILE_MODIFY records and a FILE_CHECKPOINT record pointing to the checkpoint
            3. 64-bit offset in ib_logfile0, pointing to the log record at the checkpoint LSN
            4. 36 bytes of NUL (reserved for future extension)
            5. 32-bit CRC-32C checksum of the 64-byte block
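            A sketch of filling one such 64-byte block, assuming big-endian fields and that the trailing CRC-32C covers the first 60 bytes of the block:

                #include <cstdint>
                #include <cstring>

                extern uint32_t my_crc32c(uint32_t crc, const void *data, size_t len); // assumed

                static void write_be(uint8_t *p, uint64_t v, int n) // big-endian helper
                {
                  for (int i = n; i--; v >>= 8) p[i] = uint8_t(v);
                }

                // Fill a 64-byte checkpoint block: 8+8+8 bytes of payload, 36 reserved
                // NUL bytes, and a 4-byte CRC-32C.
                void make_checkpoint_block(uint8_t block[64], uint64_t checkpoint_lsn,
                                           uint64_t file_checkpoint_lsn, uint64_t offset)
                {
                  memset(block, 0, 64);                        // covers the reserved bytes
                  write_be(block, checkpoint_lsn, 8);          // checkpoint LSN
                  write_be(block + 8, file_checkpoint_lsn, 8); // LSN of the FILE_CHECKPOINT record
                  write_be(block + 16, offset, 8);             // offset of the checkpoint LSN record
                  write_be(block + 60, my_crc32c(0, block, 60), 4);
                }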

            Changes to log record format

            The 1-byte mini-transaction trailer (a NUL byte) will be replaced with the following:

            1. A byte 0 or 1, corresponding to a "sequence bit" that replaces the 31-bit LOG_BLOCK_HDR_NO field of the log header.
              Each time the log wraps around from the end to offset 12288, this bit will be toggled. The value of the bit is computed based on the log header field LOG_HEADER_START_LSN, which is the LSN of the very first record that was written to the file, at offset 12288.
            2. 32 bits of CRC-32C checksums from the start of the mini-transaction to the end, excluding the sequence bit.
            3. Only for innodb_encrypt_log=ON: 64 bits of nonce that is used as part of the initialization vector.

            This format will allow us to simply memcpy() log records and the checksum to the log buffer, or directly to a redo log that resides in PMEM, and thus reduce contention on log_sys.mutex. In the old format with 512-byte blocks, some memset(), my_crc32c() and encryption_crypt() calls are executed while holding log_sys.mutex. A sketch of writing the new trailer follows.
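            A sketch of appending that trailer to a buffer holding one (possibly already encrypted) mini-transaction, assuming big-endian byte order for the checksum and nonce:

                #include <cstdint>

                extern uint32_t my_crc32c(uint32_t crc, const void *data, size_t len); // assumed

                // Append the trailer: 1 sequence-bit byte, 4 bytes of CRC-32C computed
                // from the start of the mini-transaction up to (but excluding) the
                // sequence bit, and, with innodb_encrypt_log=ON, an 8-byte nonce.
                // Returns the new end of the buffer.
                uint8_t *append_trailer(uint8_t *begin, uint8_t *end, bool sequence_bit,
                                        bool encrypted, uint64_t nonce)
                {
                  *end++ = sequence_bit;
                  const uint32_t crc = my_crc32c(0, begin, size_t(end - 1 - begin));
                  for (int i = 4; i--; )
                    *end++ = uint8_t(crc >> (i * 8));
                  if (encrypted)
                    for (int i = 8; i--; )
                      *end++ = uint8_t(nonce >> (i * 8));
                  return end;
                }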

            Encryption

            To allow the data to be backed up without requiring any decryption, encryption will be limited to the payload bytes of page-level redo log records. That is, checkpoint information, file names, tablespace identifiers and page numbers will be in clear text. It can be argued that this information is already mostly available in clear text even in encrypted data files. Neither the file names in the file system nor the FIL_PAGE_LSN in data pages was ever encrypted. Also the tablespace ID is stored in clear text in the first page of each data file.

            This means that INIT_PAGE and FREE_PAGE records (which lack any payload) will be entirely unencrypted. For applying backed up log to backed up data files, the ability to decrypt the log will be needed.

            Padding

            When we want to write an incomplete log block, we can pad the log to the desired block size (be it 64 or 4096 or any other number of bytes) by writing special FILE_CHECKPOINT records whose payload is filled with NUL bytes. The minimum padding size would be 7 bytes: 0xf1, sequence bit, checksum. If innodb_encrypt_log=ON, each record will be 8 bytes longer, due to a "nonce" being added to each mini-transaction. Normal FILE_CHECKPOINT records cannot be confused with these, because the checkpoint payload will never be 0.

            Because the padding records have to be written while holding log_sys.mutex, we will use pre-computed checksums. To minimize the cache impact, we will use 15 distinct record sizes. For example, 22 bytes could be padded using 2 records when innodb_encrypt_log=OFF and the value of the sequence bit is 1:

            f1 00 01 a6 59 c1 db
            f9 00 00 00 00 00 00 00 00 00 01 ba 73 b2 a3
            

            When innodb_encrypt_log=ON, each record would be 8 bytes longer, and the pad record sizes will range from 15 to 29 bytes. Thus, 22 bytes would be padded using a single record:

            f8 00 00 00 00 00 00 00 00 01 eb 20 12 33 00 00 00 00 00 00 00 00
            

            The log parser will handle any pad record size up to 65536 bytes, but we do not want to compute checksums on pad records while holding log_sys.mutex, and it could be detrimental to performance to have a larger checksum lookup table. A sketch of the size arithmetic follows.
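            A sketch of that size arithmetic, assuming (consistent with the examples above) that the 15 precomputed sizes are 7 to 21 bytes with innodb_encrypt_log=OFF and 15 to 29 bytes with it ON:

                #include <cstddef>

                // Split a gap of `gap` bytes into pad record sizes, using the 15
                // precomputed sizes: 7..21 bytes (innodb_encrypt_log=OFF) or
                // 15..29 bytes (ON). Returns the number of records, or 0 if the
                // gap cannot be represented.
                size_t pad_record_sizes(size_t gap, bool encrypted,
                                        size_t *sizes_out, size_t max_out)
                {
                  const size_t min_rec = encrypted ? 15 : 7;
                  const size_t max_rec = min_rec + 14;   // 15 distinct sizes
                  size_t n = 0;
                  while (gap && n < max_out)
                  {
                    size_t rec = gap > max_rec ? max_rec : gap;
                    if (rec < min_rec)
                      return 0;                          // gap smaller than the minimum record
                    if (gap != rec && gap - rec < min_rec)
                      rec = gap - min_rec;               // keep the remainder representable
                    sizes_out[n++] = rec;
                    gap -= rec;
                  }
                  return gap ? 0 : n;
                }

            For a 22-byte gap this yields record sizes {15, 7} unencrypted and a single 22-byte record encrypted, matching the two examples above.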


            marko Marko Mäkelä added a comment -

            A few mariadb-backup tests are disabled for now (in particular all tests that use --incremental), and there are some crash recovery bugs, for which rr traces are needed. Also, it is possible (and intended) to improve performance later, based on what this format change allows.

            InnoDB will refuse to start up without ib_logfile0, unless innodb_force_recovery=6 is set. This allows MDEV-27199 to stop the inherently risky updates of the field FIL_PAGE_FILE_FLUSH_LSN in the first page of the system tablespace file.


            marko Marko Mäkelä added a comment -

            Most incremental backup tests work now. The final issue was that after the incremental log apply, the dummy log file was being created in the wrong directory (not --target-dir) and thus the backup was being restored with too old a log sequence number in the dummy log file.

            axel Axel Schwenke added a comment - - edited

            preview-10.8-MDEV-14425 commit fe030b137f4 looks promising


            marko Marko Mäkelä added a comment -

            axel, thank you. The branch preview-10.8-MDEV-14425-innodb was updated at least twice since you tested it, to fix some mariadb-backup tests as well as an issue with innodb_encrypt_log: I had made a wrong assumption that a string may be encrypted piecewise in multiple calls to encryption_crypt().

            There still remains some room for performance improvement. In particular, we are writing unaligned data to the ib_logfile0 and never padding it. We have some test failures on Microsoft Windows, possibly related to that. I think that we should enable O_DIRECT writes wherever possible, and write data that is aligned to the physical sector size (or 4096 bytes if the physical sector size cannot be determined); see the sketch below.

            Furthermore, I did not have time to simplify the PMEM interface yet. On PMEM, the physical sector size would be CPU_LEVEL1_DCACHE_LINESIZE (64 bytes on AMD64).
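            The alignment itself is simple arithmetic; a sketch, with the pad records described earlier filling the rounded-up tail:

                #include <cstdint>

                // Round the end of a log write up to the physical sector size (a power
                // of two, e.g. 512 or 4096), as O_DIRECT requires; the gap up to the
                // aligned boundary would be filled with the pad records described earlier.
                uint64_t align_up(uint64_t end_offset, uint64_t sector_size)
                {
                  return (end_offset + sector_size - 1) & ~(sector_size - 1);
                }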


            marko Marko Mäkelä added a comment -

            We never claimed to support (or test) downgrades between major versions. Users who are desperate to downgrade could try the following:

            1. Perform a clean shutdown. Note the log sequence number.
            2. Ensure that the last LSN in the new-format ib_logfile0 matches the one in the shutdown message.
            3. Back up the data directory.
            4. Write that LSN to the system tablespace (see the test mariabackup.huge_lsn for how to do that) and delete the ib_logfile0 file.
            5. Start the older version of MariaDB.

            A mandatory step in the above is that the LSN in the first page of the system tablespace needs to be updated. If that is neglected, the old server will start with a too old LSN (see MDEV-27199), and all InnoDB files that will be modified may become corrupted. An indicator of that are messages like "Page … log sequence number … is in the future".

            mleich Matthias Leich added a comment - - edited

            Preliminary results of RQG testing on origin/preview-10.8-MDEV-14425-innodb 23849209738153bed4ea60f39830305840ee4025 2021-12-19T17:28:12+02:00
            1. The ASAN failures around crc32 (seen on previous tree) have disappeared.
            2. Failure pattern TBR-1310
                kill DB server when being under load, the restart attempt fails with
                mysqld: storage/innobase/rem/rem0rec.cc:304: void rec_init_offsets_comp_ordinary(const rec_t*, ...): Assertion `n_fields <= ulint(index->n_fields) + 1' failed.
                sdp:/data/results/1639941539/TBR-1310/dev/shm/rqg/1639941539/181/1/rr
                The rr trace (mysqld-1), which runs until the injected SIGSEGV, has trouble around its end.
                /data/results/1639941539/TBR-1310/dev/shm/rqg/1639941539/181/1/data_copy
                          Copy of the data dir before restart attempt.
            3. Failure pattern TBR-1311
                kill DB server when being under load, restart with success, SELECT ... FROM ... FORCE INDEX .... harvests 1030,
                [ERROR] InnoDB indexes are inconsistent with what defined in .frm for table ./test/t4
                sdp:/data/results/1639941539/TBR-1311/dev/shm/rqg/1639941539/48/1/rr
                Both rr traces work well.
                /data/results/1639941539/TBR-1311/dev/shm/rqg/1639941539/48/1/data_copy/
                          Copy of the data dir before restart attempt.
            4. Failure pattern TBR-1312
                kill DB server when being under load, the restart attempt fails with
                [ERROR] [FATAL] InnoDB: Page 3242543642:134 name ./test/t6.ibd page_type 32770 key_version 1 lsn 78387097 compressed_len 55514
                sdp:/data/results/1639941539/TBR-1312
                gdb -c dev/shm/rqg/1639941539/158/1/data/core /data/Server_bin/preview-10.8-MDEV-14425-innodbA_asan/bin/mysqld
                           Core at end of restart attempt.
                /data/results/1639941539/TBR-1312/dev/shm/rqg/1639941539/158/1/data_copy/
                           Copy of the data dir before restart attempt.
            5. Most if not all other failures observed occur on the actual main trees 10.6 - 10.8 too
             
            Upgrade (stop is initiated by SIGTERM) from
            10.5.14 origin/10.5 2776635cb98d35867447d375fdc04a44ef11a697 2021-12-16
            to
            10.8.0 origin/preview-10.8-MDEV-14425-innodb 23849209738153bed4ea60f39830305840ee4025 2021-12-19
            Failure patterns (TBR-1313 - TBR-1315)
            1. The restart with preview-10.8... fails with trouble like
                 - InnoDB: Background Page read failed to read, uncompress, or decrypt
                 - InnoDB: Failed to read page ... from file ....: Table is compressed or encrypted but uncompress or decrypt failed
             2. The mysql_upgrade script fails like
                  - MariaDB tried to use the .{1,10} compression, but its provider plugin is not loaded
                  or
                  - Table ... is compressed with ..., which is not currently loaded. Please ... the bzip2 provider plugin to open the table'
                  or
                  -  # ERROR 2013 (HY000) at line 795: Lost connection to server during query
                     # ERROR 2006 (HY000) at line 796: Server has gone away
                     # ERROR: AddressSanitizer: heap-buffer-overflow on address ...
                     # READ of size 19 at 0x602000010bb7 thread T16
                    #0 0x7f9cd1091cff  (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xdacff)
                    #1 0x557eb5408a05 in cmp_data(unsigned long, unsigned long, unsigned char const*, unsigned long, unsigned char const*, unsigned long) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/rem/rem0cmp.cc:322
                #2 0x557eb5404137 in cmp_data_data(unsigned long, unsigned long, unsigned char const*, unsigned long, unsigned char const*, unsigned long) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/rem/rem0cmp.cc:378
                #3 0x557eb57fa0e8 in cmp_dfield_dfield /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/include/rem0cmp.ic:49
                #4 0x557eb57fb12a in eval_cmp(func_node_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/eval/eval0eval.cc:183
                #5 0x557eb57fc3c4 in eval_func(func_node_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/eval/eval0eval.cc:595
                #6 0x557eb57fd04e in eval_exp /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/include/eval0eval.ic:117
                #7 0x557eb57fd522 in if_step(que_thr_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/eval/eval0proc.cc:48
                #8 0x557eb53f497a in que_thr_step /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/que/que0que.cc:611
                #9 0x557eb53f50e1 in que_run_threads_low /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/que/que0que.cc:709
                #10 0x557eb53f5283 in que_run_threads(que_thr_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/que/que0que.cc:729
                #11 0x557eb53f55a9 in que_eval_sql(pars_info_t*, char const*, trx_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/que/que0que.cc:768
                #12 0x557eb511c76c in innodb_drop_database /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/handler/ha_innodb.cc:1506
                # Query (0x62900004b2d0): DROP DATABASE IF EXISTS performance_schema
                ==> MDEV-27336
            3. There are other failures too, like schema or data content mismatches between the state
                 before and after upgrade. But these were observed on main trees too.
             
            Pseudoupgrade preview-10.8-MDEV-14425-innodb -> preview-10.8-MDEV-14425-innodb
            (origin/preview-10.8-MDEV-14425-innodb 23849209738153bed4ea60f39830305840ee4025)
            1. The failures seen when running this pseudoupgrade on a previous version of a MDEV-14425 development tree are gone.
            2. Other failures observed are known for the main trees too.
             
            
            


            marko Marko Mäkelä added a comment -

            mleich, I would expect the upgrade failures for page_compressed tables to be related to MDEV-12933, and to affect already an upgrade to 10.7. Can you check that? This branch only includes changes to the log file format, not to any data page encryption or compression. However, it is theoretically possible that if recovery fails to find and process some INIT_PAGE records due to a wrongly detected EOF, we would attempt to read a corrupted page that was not supposed to be read during recovery (MDEV-19738).

            Today, I did some cleanup and enabled O_DIRECT access to the log file on Linux when the physical block size is 512 bytes. After my Sysbench based test, the Linux file system cache no longer grew to the size of the ib_logfile0, like it used to do. We really should replace the constant log_sys.BLOCK_SIZE with a variable that we will determine from the operating system.


            mleich Matthias Leich added a comment -

            Upgrade (stop is initiated by SIGTERM) from
            10.5.14 origin/10.5 2776635cb98d35867447d375fdc04a44ef11a697 2021-12-16
            to
            10.7.2 origin/10.7 92a4e76a2c1c15fb44dc0cb05e06d5aa408a8e35 2021-12-14
            The failure patterns TBR-1313 - TBR-1315 were observed.
            This suggests that there are no preview-10.8-MDEV-14425-innodb specific upgrade failures.
            


            marko Marko Mäkelä added a comment -

            I made one more change today, which missed the preview releases. On Linux and Microsoft Windows, we will bypass the file system cache for the redo log if the physical block size is 64 to 4096 bytes. The environments where it was tested had 512-byte or 4096-byte sectors. When the cache is bypassed, you would see a message like this in the server message log:

            2021-12-21 14:02:49 0 [Note] InnoDB: File system buffers for log disabled (block size=4096 bytes)
            

            A final change that I plan to implement is a more efficient PMEM interface, to make log_sys.buf point directly to the persistent memory.


            marko Marko Mäkelä added a comment -

            There now is a new PMEM (MDEV-25090) interface that I have tested on Linux. On Linux, it is also used if innodb_log_group_home_dir (or datadir) points to /dev/shm. A start-up message will identify this interface as follows:

            2022-01-05  7:44:52 0 [Note] InnoDB: Memory-mapped log (10485760 bytes)
            

            It is still possible to avoid using mmap() on tmpfs if you use any other tmpfs mount point, such as --innodb-log-group-home-dir=/run/user/$UID.

            The Linux mmap() based interface for PMEM will only work if the file system has been mounted with -o dax. If the option is missing, conventional file I/O will be used. In this case, I saw a start-up message like this:

            2021-12-21 14:02:49 0 [Note] InnoDB: File system buffers for log disabled (block size=4096 bytes)
            


            marko Marko Mäkelä added a comment -

            A user-visible change is that this is bundled with MDEV-27199. We will require the ib_logfile0 to always exist. Previously, if the file was empty or missing, InnoDB would create a new log file, assuming that all data files are clean and that the field FIL_PAGE_FILE_FLUSH_LSN in the first page of the system tablespace (ibdata1) contains the most recent log sequence number. mariadb-backup --prepare will create a minimal ib_logfile0 file.

            See also my previous note about downgrades to earlier versions (which we do not support). Because with MDEV-27199, we would no longer update the FIL_PAGE_FILE_FLUSH_LSN in the InnoDB system tablespace on shutdown, a simple approach of removing ib_logfile0 and starting up an older version would likely result in a disaster, caused by a rewind of the log sequence number.


             marko Marko Mäkelä added a comment -

             Related to MDEV-27437, I realized that a backup restored from an older version would include an ib_logfile0 whose size is 0 bytes. We must allow upgrade straight from a backup. In that case, we will recover the log sequence number from the FIL_PAGE_FILE_FLUSH_LSN field. If that field contains 0 (like it will after MDEV-27199), we will refuse to start up.
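
             For illustration only, a minimal sketch (not the server's recovery code) of reading that field: FIL_PAGE_FILE_FLUSH_LSN is the 8-byte big-endian value at byte offset 26 of the first page of the system tablespace.

                 #include <cstdio>
                 #include <cstdint>

                 int main()
                 {
                   std::FILE *f= std::fopen("ibdata1", "rb");
                   if (!f)
                     return 1;
                   unsigned char hdr[34]; /* covers offset 26 + 8 bytes */
                   if (std::fread(hdr, 1, sizeof hdr, f) != sizeof hdr)
                     return 1;
                   uint64_t lsn= 0;
                   for (int i= 0; i < 8; i++)
                     lsn= lsn << 8 | hdr[26 + i]; /* big-endian decode */
                   std::printf("FIL_PAGE_FILE_FLUSH_LSN=%llu\n",
                               (unsigned long long) lsn);
                   std::fclose(f);
                   return 0;
                 }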


             marko Marko Mäkelä added a comment -

             After the preview release, in preparation for the PMEM interface, I had removed flush_lock and had unconditionally enabled O_DSYNC on the redo log. This can result in a performance regression on some drives. So, we will only attempt to use O_DIRECT on the log file (if a compatible physical block size is detected on Linux or Windows).

             However, I’d change innodb_flush_method=O_DSYNC to enable O_DIRECT on data files as well. I see no reason to disable O_DIRECT.
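
             As a sketch of the intended logic (the helper function and its fallback are illustrative assumptions, not the actual server code): O_DIRECT is attempted only when a compatible physical block size was detected, with a fallback to buffered I/O.

                 #define _GNU_SOURCE /* for O_DIRECT on Linux with a C compiler */
                 #include <fcntl.h>

                 /* blk_size: the physical block size detected for the device;
                    O_DIRECT requires all I/O to be aligned to it. */
                 int open_log(const char *path, unsigned blk_size)
                 {
                   if (blk_size >= 64 && blk_size <= 4096)
                   {
                     int fd= open(path, O_RDWR | O_DIRECT);
                     if (fd != -1)
                       return fd; /* file system cache bypassed */
                   }
                   return open(path, O_RDWR); /* buffered I/O */
                 }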

            Some counters related to os_file_flush() (fsync(), fdatasync() or similar) will be cleaned up. I do not think that it makes sense to have a counter of pending log fsync operations, or a separate counter of log flush operations.

            marko Marko Mäkelä added a comment - - edited

            There was a performance problem with the mmap() based interface when the redo log is located in /dev/shm or a mount -o dax PMEM device:

                if (log_sys.buf_free >= log_sys.max_buf_free)
                  log_sys.set_check_flush_or_checkpoint();
            

            The field log_sys.max_buf_free is only applicable to the pwrite() based interface. That code must not be executed for the mmap() based log, because it will cause other threads to acquire log_sys.mutex very frequently, to ensure that a pwrite() will be issued.
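
             A sketch of the corresponding fix (is_pmem() is an assumed name for however the mmap() based log is distinguished, not necessarily the actual member):

                 /* Only the pwrite() based log has a meaningful max_buf_free;
                    skip the check entirely for the memory-mapped log. */
                 if (!log_sys.is_pmem() &&
                     log_sys.buf_free >= log_sys.max_buf_free)
                   log_sys.set_check_flush_or_checkpoint();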

             In a quick local test on /dev/shm, the mmap() interface (including the possible overhead of pmem_persist()) gave about 5% better throughput than the pwrite() and fdatasync() based log.

            On the PMEM device that I tested, the pmem_deep_persist() introduced a slowdown of several orders of magnitude, compared to pmem_persist(), which did not incur any significant overhead. The old code used pmem_memcpy_persist(), which would seem to pair with pmem_persist(). Both should be fine if the PMEM device guarantees durable writes in the event of a sudden power loss.
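
             For reference, a sketch of the libpmem idioms discussed above (log_buf, offset, rec and len are illustrative names):

                 /* Copy, then flush the CPU caches for the written range. */
                 memcpy(log_buf + offset, rec, len);
                 pmem_persist(log_buf + offset, len);

                 /* The same in a single call, as the old code did. */
                 pmem_memcpy_persist(log_buf + offset, rec, len);

                 /* pmem_deep_persist(log_buf + offset, len) would additionally
                    flush to the deepest power-fail safe domain; on the tested
                    device it was orders of magnitude slower. */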

            axel Axel Schwenke added a comment -

             Benchmark results for commit 81cf92e9471 (81cf92e9471.pdf) show good results for the UPDATE workload. The 90:10 numbers show an anomaly that I cannot explain yet: the baseline (vanilla 10.8, commit a81c75f5a96) behaves better at 128 and 256 threads. That was not the case for previous 10.8 commits.


             marko Marko Mäkelä added a comment -

             axel, thank you. The anomaly that you observed could be because O_DIRECT was enabled on the log file. The preview release, which you had tested earlier, did not enable O_DIRECT, and it was issuing unaligned writes to the log file.

             On my system, with Linux kernel 5.15.5 and ext4fs on an NVMe drive with 512-byte block size, I observed yesterday that O_DIRECT|O_DSYNC is slightly faster than O_DIRECT, and that the fastest option was to open the log file without O_DIRECT and to issue an explicit fdatasync() for durability. On Microsoft Windows, we have used FILE_FLAG_NO_BUFFERING (the equivalent of O_DIRECT) already since MDEV-16264 (10.5) if the physical block size is 512 bytes.

            Later yesterday, I updated the branch to only use O_DIRECT on the log file together with O_DSYNC, that is, when innodb_flush_method=O_DSYNC is specified. In this branch, that setting will also enable O_DIRECT (along with O_DSYNC) for data files. I hope that this will fix the anomaly for you as well.

            wlad Vladislav Vaintroub added a comment - - edited

             A correction: we always used FILE_FLAG_NO_BUFFERING on Windows, on the redo log only, for innodb_flush_log_at_trx_commit=1, until 10.8; MySQL did that, too. MDEV-16264 did not change any logic in that regard.

             Now, O_DIRECT|O_DSYNC might appear faster because it only actually flushes the hardware disk buffers in rare cases. The rare cases are "FUA"-capable hardware, which is probably not the case for marko's setup, or hardware where disk write buffering is disabled. So, always flushing should be the way to go, unless the user explicitly sets innodb_flush_method=O_DSYNC, in which case we assume the user knows what they are doing. Otherwise, ACID durability can well be compromised.

             I'd also like to ask axel to benchmark the thread pool under heavy write workloads, mostly because there was a big improvement for that case in 10.6, and I'd like to see on premium hardware that it was not nullified by the patch.


             stephane@skysql.com VAROQUI Stephane added a comment -

             Is there a way to load data into an empty table and have the tablespace page-compressed right away, to save space on disk? From experimentation on 10.6, this only works if I set the maximum dirty page percentage to 0, so that all dirty pages are flushed right away. I remember something about minimal flushing activity to disk when I/O bandwidth is available; is there an MDEV for that feature?


             marko Marko Mäkelä added a comment -

             stephane@skysql.com, data page flushing is not directly related to these changes, other than the fact that I would change innodb_flush_method=O_DSYNC to behave like innodb_flush_method=O_DIRECT for data files, only adding the O_DSYNC attribute. axel tested page_compressed some time ago; I thought it was related to MDEV-11068, but I did not find the graphs. In the end, we found that writing uncompressed tables to a thinly provisioned SSD (ScaleFlux computational storage device) was not only fastest, but also resulted in the best compression. Related to page flushing when the server is idle, you might want to check MDEV-24949.


             marko Marko Mäkelä added a comment -

             If the performance regression occurs because both buf_pool.mutex and log_sys.mutex are heavily contended at a large number of concurrent connections, it could help to disable the adaptive spinning (MY_MUTEX_INIT_FAST) for buf_pool.mutex. We currently enable it on log_sys.mutex only on ARMv8 (see MDEV-26855). I tried enabling the spinning for log_sys.mutex on my AMD64 system a couple of days ago, and the throughput nearly halved. So, I would suggest disabling the spinning:

            diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
            index e7fc3264d60..1127a191f7e 100644
            --- a/storage/innobase/buf/buf0buf.cc
            +++ b/storage/innobase/buf/buf0buf.cc
            @@ -1175,7 +1175,7 @@ bool buf_pool_t::create()
               while (++chunk < chunks + n_chunks);
             
               ut_ad(is_initialised());
            -  mysql_mutex_init(buf_pool_mutex_key, &mutex, MY_MUTEX_INIT_FAST);
            +  mysql_mutex_init(buf_pool_mutex_key, &mutex, nullptr);
             
               UT_LIST_INIT(LRU, &buf_page_t::LRU);
               UT_LIST_INIT(withdraw, &buf_page_t::list);
            

             I believe that this format change could enable scalability improvements, such as asynchronous log writes, or interleaving of flush_lock and write_lock. Such attempts did not help in the past, possibly because the old format requires log_sys.mutex to be held during the memset() or my_crc32c() of each 512-byte log block.


             marko Marko Mäkelä added a comment -

             In addition to the buf_pool.mutex spinloop removal, I wanted to see whether applying MDEV-26827 on top would help. MDEV-26827 is expected to reduce contention on buf_pool.mutex, but in the past it caused a performance regression. According to axel's tests, that still seems to be the case.


             mleich Matthias Leich added a comment -

             The tree
             origin/bb-10.8-MDEV-14425 614e46b89ffe7357e5b72ea0d0fd3f490567a384 2022-01-13T20:32:56+02:00
             behaved well in RQG testing. The bad effects observed exist in the main trees too and are known. Test batteries used:
             - InnoDB standard test battery, covering a broad range of functionality
             - upgrade test battery (10.5 -> bb-10.8-MDEV-14425)
             - test battery for crash recovery


             marko Marko Mäkelä added a comment -

             mleich, thank you. I have since then rebased the tree for final testing.

             • The two commits of MDEV-26827 are omitted. (Including them had been an attempt to see if performance would improve.)
             • No change to the buf_pool.mutex initialization is made (the adaptive spinloop will be allowed).
             • The redo log will be opened in O_DIRECT mode when the physical block size can be determined.
             • Some fixes of 10.5 or 10.6 bugs that were found during testing are included.
            axel Axel Schwenke added a comment -

             Added latest benchmark results:

             • using both NUMA nodes: NUMA_2.pdf
             • using only one NUMA node: NUMA_1.pdf
             • comparison between the two: NUMA_1vs2.pdf

            All runs with PFS disabled and using SSD storage.

             The graphs show a significant difference in variance. The 2-NUMA-node results are in general nearer to each other; this could mean that the test was maxing out the SSD storage (two SATA SSDs in RAID 0). With 1 NUMA node, the differences between commits were bigger. Recommended final configuration: commit 81cf92e9471 with spinning on buf_pool.mutex disabled (the light blue line).

             When comparing 2-NUMA-node performance to 1 NUMA node, scalability looks good. The second NUMA domain adds ~60% (write-only) to ~80% (all other workloads) to performance. Only when operating in AUTO_COMMIT mode does it fail to scale.


             mleich Matthias Leich added a comment -

             The tree
             origin/bb-10.8-MDEV-14425 7f75466f539b61d3dc8696e72a2d715c59aa04d6 2022-01-14T19:52:40+02:00
             behaved well in RQG testing. The bad effects observed happen on other trees too.


             marko Marko Mäkelä added a comment -

             I rebased the bb-10.8-MDEV-14425 branch once more, so that krunalbauskar can test it. Previously, his tests were contaminated by MDEV-27499.


             marko Marko Mäkelä added a comment -

             An observation was made during testing: rr record mariadb-backup --backup cannot work reliably if the redo log file was opened via mmap(). The reason is that rr assumes that mmap()ed file contents may only be changed by the traced process(es). To avoid bogus failures while running backup under rr, there are a few possible solutions:

            • Build without libpmem: rm CMakeCache.txt; cmake -DCMAKE_DISABLE_FIND_PACKAGE_PMEM=1 /path/to/source
            • Place the server’s redo log somewhere else than /dev/shm or a PMEM device mounted with -o dax.
             • Patch log_t::attach() so that mmap() will not be attempted if srv_operation == SRV_OPERATION_BACKUP (see the sketch after this list).
            • Implement server-side backup (MDEV-14992).
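
             A minimal sketch of the third option (the placement inside log_t::attach() and the local flag are assumptions for illustration):

                 /* In log_t::attach(), before attempting to mmap() the log file: */
                 bool use_mmap= mmap_is_available; /* hypothetical condition */
                 if (srv_operation == SRV_OPERATION_BACKUP)
                   use_mmap= false; /* keep rr-traced backup on regular file I/O */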

             mleich Matthias Leich added a comment -

             The tree
                 origin/bb-10.8-MDEV-14425 14eaeb68e60626a9b1e4f4b611f0bc23a79f7202 2022-01-20T13:35:18+02:00
             performed sufficiently well in RQG testing focused on Mariabackup.
             There was a surprising number of bad effects (unknown to me, but maybe already in JIRA).
             But the same test battery applied to
                 origin/10.8 baef53a70c675da6d19ac3c7f23c7b8b4ed8458c 2022-01-20T16:01:10+01:00
             showed nearly the same bad effects, and in sum not fewer.

             Hence I stop testing now and vote for integrating MDEV-14425 into 10.8 if
             the corresponding MTR tests pass.


             marko Marko Mäkelä added a comment -

             Thank you to everyone who tested this and provided feedback.

            As recommended by axel, spinning on buf_pool.mutex was disabled, except on ARMv8.

             Based on performance tests on 512-byte block devices by myself and krunalbauskar, we will not enable O_DIRECT on the redo log on Linux by default. With the setting innodb_flush_method=O_DSYNC we will enable O_DIRECT on the log file as well as on data files. On Microsoft Windows, buffering had already been disabled for the redo log.

             On operating systems other than Linux and Microsoft Windows, writes to the log will keep using a block size of 512 bytes and will not bypass any file system cache. Changing that would require implementing a way to detect the physical block size.
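
             For comparison, a sketch of how the detection can work on Linux for a block device file descriptor (the 512-byte fallback is an assumption for illustration):

                 #include <sys/ioctl.h>
                 #include <linux/fs.h>

                 /* Physical block size of the block device behind fd,
                    or 512 if the ioctl is unavailable. */
                 unsigned phys_block_size(int fd)
                 {
                   unsigned size;
                   return ioctl(fd, BLKPBSZGET, &size) ? 512 : size;
                 }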


             People

               Assignee: marko Marko Mäkelä
               Reporter: marko Marko Mäkelä
               Votes: 8
               Watchers: 33
