  1. MariaDB Server
  2. MDEV-33585

The maximum innodb_log_buffer_size is too large

Details

    Description

      MariaDB 10.11.7 is configured with an InnoDB log file size of 70GB and 516GB of buffer pool memory. After being stopped, it fails to restart because it is unable to apply the redo log records, and it reports the errors below:

      [Note] Starting MariaDB 10.11.7-MariaDB-log source revision  as process 8497
      [Note] InnoDB: Compressed tables use zlib 1.3.1
      [Note] InnoDB: Number of transaction pools: 1
      [Note] InnoDB: Using crc32 + pclmulqdq instructions
      [Note] InnoDB: Using Linux native AIO
      [Note] InnoDB: Initializing buffer pool, total size = 0.516TiB, chunk size = 8.000GiB
      [Note] InnoDB: Setting NUMA memory policy to MPOL_INTERLEAVE
      [Note] InnoDB: Setting NUMA memory policy to MPOL_DEFAULT
      [Note] InnoDB: Completed initialization of buffer pool
      [Note] InnoDB: Buffered log writes (block size=512 bytes)
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 2147479552. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 4294959104. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 6442438656. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 8589918208. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 10737397760. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 12884877312. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 15032356864. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 17179836416. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 19327315968. Retrying.
      [Warning] InnoDB: 22531489280 bytes should have been read at 47565664768 from (unknown file), but got only 21474795520. Retrying.
      [Warning] InnoDB: Retry attempts for reading partial data failed.
      [ERROR] InnoDB: Failed to read log at 47565664768: I/O error
      [Note] InnoDB: Read redo log up to LSN=15096222607497
      [ERROR] InnoDB: Log scan aborted at LSN 15096222607497
      [ERROR] InnoDB: Plugin initialization aborted with error Generic error
      [Note] InnoDB: Starting shutdown...
      [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
      [Note] Plugin 'FEEDBACK' is disabled.
      [Note] Semi-sync replication enabled on the master.
      [ERROR] Unknown/unsupported storage engine: InnoDB
      [ERROR] Aborting
      

      Any guidance on how this can be resolved?

      Context:
      Initially we observed a performance impact with a smaller redo log file of 4G. The background tasks were under more load because adaptive flushing was active all the time. We increased the log file size after reading recommendations in blog posts and in MDEV-30501, and we saw better performance as flushing was reduced.

      At the same time we wanted to test the trade-off in restart time, where the redo log has to be reapplied. `innodb_fast_shutdown` is set to `1`, so it will not apply the changes during shutdown/crash. The restart resulted in the above error, and MariaDB does not start.

          Activity

            I think the Linux documentation explains it well enough. If it says one cannot pread more than 2GB at once, and Linus personally thinks that is the right thing to do, InnoDB needs to work around this "right thing to do".

            wlad Vladislav Vaintroub added the comment above.
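The warnings in the description appear consistent with that limit: each "got only" value grows by 2147479552 (0x7ffff000) bytes, and the log shows ten warnings before the read is abandoned. The following is a small arithmetic sketch of those numbers, not MariaDB code; the helper names `calls_needed` and `remaining_after` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Maximum number of bytes one read()/pread() call transfers on Linux,
// according to "man 2 read".
constexpr uint64_t linux_max_rw = 0x7ffff000;

// Number of calls needed to transfer `total` bytes when each call moves
// at most `limit` bytes (ceiling division). Hypothetical helper.
constexpr uint64_t calls_needed(uint64_t total, uint64_t limit)
{
  return (total + limit - 1) / limit;
}

// Bytes still missing after `calls` full-sized transfers.
// Precondition: calls * limit <= total. Hypothetical helper.
constexpr uint64_t remaining_after(uint64_t total, uint64_t limit,
                                   uint64_t calls)
{
  return total - calls * limit;
}

// Size of the single read request reported in the warnings.
constexpr uint64_t requested = 22531489280;

static_assert(linux_max_rw == 2147479552, "first 'got only' value");
static_assert(10 * linux_max_rw == 21474795520, "last 'got only' value");
static_assert(calls_needed(requested, linux_max_rw) == 11,
              "one more call than the ten warnings in the log");
static_assert(remaining_after(requested, linux_max_rw, 10) == 1056693760,
              "bytes still unread when the retries stopped");
```

Under these numbers, an eleventh read would have completed the request, which fits the later suggestion that a 0x7ffff000-byte transfer should not count against the retry limit.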

            Thank you both. I think that the message needs to be suppressed and the 0x7ffff000 bytes read handled as a special case, without affecting the retry count. I compared the 10.6 and 10.11 releases to see if MDEV-14425 could have introduced this problem after the 10.6 release. In 10.6, the call stack should be something like the following:

            10.6 ccb7a1e9a15e6a47aba97f9bdbfab2e4bf64c447

            #0  __libc_pread64 (fd=8, buf=0x7fd997f00000, count=65536, offset=51200) at ../sysdeps/unix/sysv/linux/pread64.c:25
            #1  0x000055b1b5f7dd2a in SyncFileIO::execute (this=this@entry=0x7ffe07eba410, request=@0x7ffe07eba3f0: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC})
                at /mariadb/10.6/storage/innobase/os/os0file.cc:685
            #2  0x000055b1b5f7f181 in os_file_io (in_type=@0x55b1b5421bc0: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=file@entry=8, buf=buf@entry=0x7fd997f00000, n=n@entry=65536, 
                offset=offset@entry=51200, err=err@entry=0x7ffe07eba64c) at /mariadb/10.6/storage/innobase/os/os0file.cc:2755
            #3  0x000055b1b5f7f438 in os_file_pread (type=@0x55b1b5421bc0: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=file@entry=8, buf=buf@entry=0x7fd997f00000, n=n@entry=65536, 
                offset=offset@entry=51200, err=err@entry=0x7ffe07eba64c) at /mariadb/10.6/storage/innobase/os/os0file.cc:2913
            #4  0x000055b1b5f818d5 in os_file_read_func (type=@0x55b1b5421bc0: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=8, buf=0x7fd997f00000, offset=51200, n=65536, o=o@entry=0x0)
                at /mariadb/10.6/storage/innobase/os/os0file.cc:2943
            #5  0x000055b1b5f5b7e8 in file_os_io::read (this=<optimized out>, offset=<optimized out>, buf=<optimized out>) at /mariadb/10.6/storage/innobase/log/log0log.cc:282
            #6  0x000055b1b5f5bfa8 in log_file_t::read (this=0x55b1b87e3fb0, offset=51200, buf={data_ = 0x7fd997f00000 "", size_ = 65536}) at /mariadb/10.6/storage/innobase/log/log0log.cc:455
            #7  0x000055b1b5f67af2 in recv_sys_t::read (this=this@entry=0x55b1b62f74c0 <recv_sys>, total_offset=total_offset@entry=51200, buf={data_ = 0x7fd997f00000 "", size_ = 65536})
                at /mariadb/10.6/storage/innobase/log/log0recv.cc:1256
            #8  0x000055b1b5f67fef in log_t::file::read_log_seg (this=0x55b1b6bd2780 <log_sys+512>, start_lsn=start_lsn@entry=0x7ffe07eba918, end_lsn=123392) at /mariadb/10.6/storage/innobase/log/log0recv.cc:1599
            #9  0x000055b1b5f6cf5b in recv_group_scan_log_recs (checkpoint_lsn=58231, contiguous_lsn=contiguous_lsn@entry=0x7ffe07eba970, last_phase=last_phase@entry=false)
                at /mariadb/10.6/storage/innobase/log/log0recv.cc:4129
            #10 0x000055b1b5f6d460 in recv_recovery_from_checkpoint_start (flush_lsn=<optimized out>) at /mariadb/10.6/storage/innobase/log/log0recv.cc:4518
            

            In 10.6, the above read covers RECV_SCAN_SIZE or 4*innodb_page_size bytes, or 65536 bytes by default. I also observed a read from recv_synchronize_groups() after that, but it is always reading 512 bytes, which is the size of a log block before MDEV-14425.

            In 10.11 (the oldest maintained major version that includes MDEV-14425), the corresponding call stack would be as follows:

            10.11 a79fb66a98ee44c6e5570ff31db581228def7032

            #0  __libc_pread64 (fd=fd@entry=10, buf=buf@entry=0x7fb866600000, count=count@entry=2097152, offset=offset@entry=62464) at ../sysdeps/unix/sysv/linux/pread64.c:25
            #1  0x0000560c1d3f5b52 in SyncFileIO::execute (this=<optimized out>, request=<optimized out>) at /mariadb/10.11/storage/innobase/os/os0file.cc:686
            #2  os_file_io (in_type=@0x560c1c693f08: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=10, buf=<optimized out>, n=n@entry=2097152, offset=offset@entry=62464, 
                err=err@entry=0x7ffcf9668f04) at /mariadb/10.11/storage/innobase/os/os0file.cc:2560
            #3  0x0000560c1d3f2b41 in os_file_pread (type=@0x560c1c693f08: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, n=2097152, offset=62464, err=0x7ffcf9668f04, file=<optimized out>, 
                buf=<optimized out>) at /mariadb/10.11/storage/innobase/os/os0file.cc:2718
            #4  os_file_read_func (type=@0x560c1c693f08: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=<optimized out>, buf=<optimized out>, offset=offset@entry=62464, n=2097152, o=o@entry=0x0)
                at /mariadb/10.11/storage/innobase/os/os0file.cc:2748
            #5  0x0000560c1d3bef9b in log_file_t::read (this=<optimized out>, offset=62464, buf={data_ = 0x7fb866600000 "", size_ = 2097152}) at /mariadb/10.11/storage/innobase/log/log0log.cc:166
            #6  0x0000560c1d3ca67b in recv_scan_log (last_phase=false) at /mariadb/10.11/storage/innobase/log/log0recv.cc:4068
            #7  0x0000560c1d3c95e8 in recv_recovery_from_checkpoint_start () at /mariadb/10.11/storage/innobase/log/log0recv.cc:4585
            #8  0x0000560c1d4f9ac9 in srv_start (create_new_db=false) at /mariadb/10.11/storage/innobase/srv/srv0start.cc:1444
            #9  0x0000560c1d1f7a9c in innodb_init (p=<optimized out>) at /mariadb/10.11/storage/innobase/handler/ha_innodb.cc:4213
            

            The length is simply based on log_sys.buf_size, or innodb_log_buffer_size. Its default value is 16 MiB, and the minimum is 2 MiB. The maximum is SIZE_T_MAX, which looks like overkill in this case.

            keshshan, would it be acceptable to you if we just reduced the maximum of innodb_log_buffer_size to slightly below 2 GiB, to keep Linux happy? Which value are you currently using?

            marko Marko Mäkelä added the comment above.

            Initially, I thought that the following should fix this with minimal impact:

            diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
            index 6b6a686823c..67e7e38f234 100644
            --- a/storage/innobase/log/log0recv.cc
            +++ b/storage/innobase/log/log0recv.cc
            @@ -4064,6 +4064,13 @@ static bool recv_scan_log(bool last_phase)
             
                   if (source_offset + size > log_sys.file_size)
                     size= static_cast<size_t>(log_sys.file_size - source_offset);
            +#ifdef __linux__
            +      /* man 2 read: On Linux, read() (and similar system calls) will
            +      transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the
            +      number of bytes actually transferred.  (This is true on both
            +      32-bit and 64-bit systems.) */
            +      size= std::min<size_t>(size, 0x7ffff000);
            +#endif
             
                   if (dberr_t err= log_sys.log.read(source_offset,
                                                     {log_sys.buf + recv_sys.len, size}))
            

            To ensure that the log parsing will not be broken, I successfully ran our regression test suite with a maximum read request size limited to 0x1000 (4096) bytes instead of the above.

            According to man 2 write there is a similar limitation on writes as well. I will have to check whether we could get a similar error when the log is being written, I guess when using a huge innodb_buffer_pool_size and innodb_log_file_size as well as innodb_flush_log_at_trx_commit=0. The call stack would be as follows:

            10.11 a79fb66a98ee44c6e5570ff31db581228def7032

            #6  0x0000559046c92397 in log_write_buf (buf=buf@entry=0x7fcec4200000 "", len=len@entry=1536, offset=140526030422016) at /mariadb/10.11/storage/innobase/log/log0log.cc:596
            #7  0x0000559046c93aea in log_t::write_buf<true> (this=0x5590479b0d80 <log_sys>) at /mariadb/10.11/storage/innobase/log/log0log.cc:881
            #8  0x0000559046c90d73 in log_write_up_to (lsn=<optimized out>, lsn@entry=59269, durable=true, callback=<optimized out>, callback@entry=0x0) at /mariadb/10.11/storage/innobase/log/log0log.cc:967
            

            This is again limited by innodb_log_buffer_size. Having to limit the usable size of the buffer during crash recovery is one thing. For the write part, I think that it is easiest to just limit the maximum size of the buffer on Linux, because the rest of the buffer would be wasted in any case.

            For the record, I tested the write side on 10.6 as well:

            10.6 ccb7a1e9a15e6a47aba97f9bdbfab2e4bf64c447

            #8  0x0000555c2f423237 in log_write_buf (buf=buf@entry=0x7f68da500000 "\200", len=len@entry=512, pad_len=pad_len@entry=0, start_lsn=53760, new_data_offset=new_data_offset@entry=304)
                at /mariadb/10.6/storage/innobase/log/log0log.cc:635
            #9  0x0000555c2f42341a in log_write (rotate_key=rotate_key@entry=false) at /mariadb/10.6/storage/innobase/log/log0log.cc:776
            #10 0x0000555c2f42387e in log_write_up_to (lsn=54076, flush_to_disk=flush_to_disk@entry=true, rotate_key=rotate_key@entry=false, callback=<optimized out>, callback@entry=0x0)
                at /mariadb/10.6/storage/innobase/log/log0log.cc:841
            

            Before MDEV-14425, the counterpart of log_sys.buf_size was srv_log_buffer_size, with an identical maximum size on LP64 systems (but only 2GiB on LLP64 systems, such as Microsoft Windows). The logic for writing log in older versions is quite a bit more convoluted than it is after MDEV-14425, so I would rather not touch that.

            My suggested fix would be along the following lines. I will try to find out if similar limitations exist in other operating systems.

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 0b6eb2e0259..e0abe48ac67 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -19316,10 +19316,18 @@ static MYSQL_SYSVAR_ULONG(page_size, srv_page_size,
               NULL, NULL, UNIV_PAGE_SIZE_DEF,
               UNIV_PAGE_SIZE_MIN, UNIV_PAGE_SIZE_MAX, 0);
             
            +#ifdef __linux__
            +/* According to "man 2 read" and "man 2 write" this is the maximum size of
            +a read or write request, both on 32-bit and 64-bit systems. */
            +static constexpr size_t innodb_log_buffer_size_max= 0x7ffff000;
            +#else
            +static constexpr size_t innodb_log_buffer_size_max= SIZE_T_MAX;
            +#endif
            +
             static MYSQL_SYSVAR_SIZE_T(log_buffer_size, log_sys.buf_size,
               PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
               "Redo log buffer size in bytes.",
            -  NULL, NULL, 16U << 20, 2U << 20, SIZE_T_MAX, 4096);
            +  NULL, NULL, 16U << 20, 2U << 20, innodb_log_buffer_size_max, 4096);
             
             #if defined __linux__ || defined _WIN32
             static MYSQL_SYSVAR_BOOL(log_file_buffering, log_sys.log_buffered,
            

            marko Marko Mäkelä added the comment above.
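The first patch in the comment above clamps a single read request to 0x7ffff000 bytes, and the regression run used a 0x1000-byte limit to check that smaller requests do not break log parsing. As a generic illustration of that idea (not the InnoDB code), the sketch below reads a range with POSIX `pread()` in clamped chunks and treats short reads as normal. `read_fully` and `demo_small_requests` are hypothetical names, and the temporary-file path is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

// Maximum size of one read request on Linux, per "man 2 read".
constexpr size_t linux_max_rw = 0x7ffff000;

// Read exactly `size` bytes at `offset`, issuing requests of at most
// `max_request` bytes and continuing after short reads.
// Returns false on an I/O error or on premature end of file.
inline bool read_fully(int fd, char *buf, size_t size, off_t offset,
                       size_t max_request = linux_max_rw)
{
  while (size)
  {
    const size_t chunk = std::min(size, max_request);
    const ssize_t n = pread(fd, buf, chunk, offset);
    if (n < 0)
    {
      if (errno == EINTR)
        continue;               // interrupted before any data: try again
      return false;             // genuine I/O error
    }
    if (n == 0)
      return false;             // end of file before `size` bytes
    buf += n;
    offset += n;
    size -= static_cast<size_t>(n);
  }
  return true;
}

// Round trip through a temporary file while forcing 4096-byte requests,
// similar in spirit to the 0x1000-byte test mentioned above.
inline bool demo_small_requests()
{
  char path[] = "/tmp/read_fully_XXXXXX";
  const int fd = mkstemp(path);
  if (fd < 0)
    return false;
  unlink(path);                 // the file disappears when fd is closed
  std::vector<char> out(10000), in(10000);
  for (size_t i = 0; i < out.size(); i++)
    out[i] = static_cast<char>(i % 251);
  const bool ok =
    write(fd, out.data(), out.size()) == static_cast<ssize_t>(out.size()) &&
    read_fully(fd, in.data(), in.size(), 0, 4096) && in == out;
  close(fd);
  return ok;
}
```

Calling `demo_small_requests()` exercises three requests (4096 + 4096 + 1808 bytes) against a 10000-byte file. A loop of this shape makes the per-call limit invisible to the caller, whereas the second patch avoids the situation by capping the buffer size itself.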

            On Microsoft Windows, ReadFile() as well as WriteFile() limit the size of the request to DWORD, which is 32 bits (4 GiB) also on 64-bit systems.

            On FreeBSD, sysctl debug.iosize_max_clamp could limit the size of a write request to INT_MAX. The size of a read request is always limited to INT_MAX. This would allow the request size to be 4095 bytes more than on Linux.

            On OpenBSD, Solaris and possibly NetBSD, the read request size is limited to SSIZE_T_MAX, which would be half the current maximum innodb_log_buffer_size. This is not much of an issue anyway, because on contemporary 64-bit platforms, the virtual addresses are limited to 48 bits.

            IBM AIX documentation mentions a limit of OFF_MAX that would apply when a 64-bit application is running on a 32-bit kernel.

            I think that we’d better declare innodb_log_buffer_size as 32-bit unsigned and make the maximum 0x7ffff000. Limiting every platform to the least common denominator should not hurt too much here.

            marko Marko Mäkelä added the comment above.
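The per-platform limits listed in the comment above can be checked with a few compile-time assertions. This is a sketch under stated assumptions: the constants restate the figures from the comment, `int` is assumed to be 32 bits wide, and `clamp_log_buffer_size` is a hypothetical helper that uses the 2 MiB minimum from the sysvar definition in the earlier diff. How the server itself treats out-of-range settings is not shown here.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

constexpr uint64_t linux_max_rw = 0x7ffff000;    // man 2 read, man 2 write
constexpr uint64_t freebsd_max_rw = INT_MAX;     // read requests on FreeBSD
constexpr uint64_t windows_max_rw = UINT32_MAX;  // DWORD request length

constexpr uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

// Least common denominator of the three limits above.
constexpr uint64_t common_max_rw =
  min_u64(linux_max_rw, min_u64(freebsd_max_rw, windows_max_rw));

static_assert(INT_MAX == 0x7fffffff, "this sketch assumes a 32-bit int");
static_assert(freebsd_max_rw - linux_max_rw == 4095,
              "FreeBSD allows 4095 bytes more than Linux");
static_assert(common_max_rw == linux_max_rw, "Linux is the tightest limit");
static_assert(common_max_rw <= UINT32_MAX, "fits in 32-bit unsigned");
static_assert(common_max_rw % 4096 == 0,
              "multiple of the 4096-byte block size in the sysvar");

// Hypothetical helper: clamp a requested size to [2 MiB, 0x7ffff000].
inline uint32_t clamp_log_buffer_size(uint64_t requested)
{
  constexpr uint64_t min_size = 2U << 20;
  if (requested < min_size)
    return static_cast<uint32_t>(min_size);
  return static_cast<uint32_t>(min_u64(requested, common_max_rw));
}
```

For example, `clamp_log_buffer_size(70ULL << 30)` yields 0x7ffff000, and the default of `16U << 20` passes through unchanged.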

            While working on this, I noticed that innodb_sort_buffer_size is not an issue, because its maximum size is 64 MiB.

            In mariadb-backup, some code is copying files using os_file_read() or os_file_write(), but those sizes seem to be limited well enough.

            Last but not least, I found a related regression MDEV-33809, which is about the buffering of BLOB data during bulk insert.

            marko Marko Mäkelä added the comment above.

            People

              marko Marko Mäkelä
              keshshan Keshan Nageswaran