Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34450

os_file_write_func() is an overkill for ib_logfile0

Details

    Description

      While working on MDEV-33894, I noticed a rather deep call stack:

      #0  __libc_pwrite64 (fd=7, buf=0x7f0d44b15000, count=count@entry=8192, offset=104448) at ../sysdeps/unix/sysv/linux/pwrite64.c:24
      #1  0x0000564264c10e07 in SyncFileIO::execute_low (this=this@entry=0x7f0d1729bc20, request=..., n=8192) at /data/Server/10.11-MDEV-33894/storage/innobase/os/os0file.cc:692
      #2  0x0000564264c10f17 in SyncFileIO::execute (this=this@entry=0x7f0d1729bc20, request=...) at /data/Server/10.11-MDEV-33894/storage/innobase/os/os0file.cc:715
      #3  0x0000564264c12402 in os_file_io (in_type=..., file=file@entry=7, buf=buf@entry=0x7f0d44b15000, n=n@entry=8192, offset=offset@entry=104448, err=err@entry=0x7f0d1729be7c)
          at /data/Server/10.11-MDEV-33894/storage/innobase/os/os0file.cc:2584
      #4  0x0000564264c125d9 in os_file_pwrite (type=..., file=file@entry=7, buf=buf@entry=0x7f0d44b15000 "w\a\f", n=8192, offset=104448, err=err@entry=0x7f0d1729be7c)
          at /data/Server/10.11-MDEV-33894/storage/innobase/os/os0file.cc:2659
      #5  0x0000564264c14299 in os_file_write_func (type=..., name=<optimized out>, name@entry=0x56426540609f "ib_logfile0", file=7, buf=0x7f0d44b15000, offset=<optimized out>, offset@entry=104448, n=<optimized out>)
          at /data/Server/10.11-MDEV-33894/storage/innobase/os/os0file.cc:2688
      #6  0x0000564264bbed28 in log_file_t::write (this=this@entry=0x5642665238c0 <log_sys+128>, offset=offset@entry=104448, buf=...) at /data/Server/10.11-MDEV-33894/storage/innobase/log/log0log.cc:173
      #7  0x0000564264bbf1a7 in log_write_buf (buf=buf@entry=0x7f0d44b15000 "w\a\f", length=length@entry=8192, offset=offset@entry=104448) at /data/Server/10.11-MDEV-33894/storage/innobase/log/log0log.cc:624
      #8  0x0000564264bc1936 in log_t::write_buf<true> (this=this@entry=0x564266523840 <log_sys>) at /data/Server/10.11-MDEV-33894/storage/innobase/log/log0log.cc:944
      

      The InnoDB write-ahead log is not instrumented by PERFORMANCE_SCHEMA; we only have innodb_data_file_key and innodb_temp_file_key for data files and temporary files. Therefore, log_file_t::write could simply invoke IF_WIN(tpool::,)pwrite().

      Attachments

        Issue Links

          Activity

            Also the reads of ib_logfile0 (during early startup and crash recovery) are unnecessarily complex, including some ‘unreachable’ bookkeeping in os_file_pread():

            	const bool monitor = MONITOR_IS_ON(MONITOR_OS_PENDING_READS);
            	MONITOR_ATOMIC_INC_LOW(MONITOR_OS_PENDING_READS, monitor);
            	ssize_t	n_bytes = os_file_io(type, file, buf, n, offset, err);
            	MONITOR_ATOMIC_DEC_LOW(MONITOR_OS_PENDING_READS, monitor);
            

            #0  __libc_pread64 (fd=10, buf=0x7f3e4a400000, count=count@entry=2097152, offset=49152) at ../sysdeps/unix/sysv/linux/pread64.c:25
            #1  0x0000558ef654a6bd in SyncFileIO::execute_low (this=this@entry=0x7ffd94df1ff0, request=@0x7ffd94df1fd0: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, n=2097152)
                at /mariadb/10.11m/storage/innobase/os/os0file.cc:691
            #2  0x0000558ef654a811 in SyncFileIO::execute (this=this@entry=0x7ffd94df1ff0, request=@0x7ffd94df1fd0: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC})
                at /mariadb/10.11m/storage/innobase/os/os0file.cc:715
            #3  0x0000558ef654b70a in os_file_io (in_type=@0x558ef5891080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=file@entry=10, buf=buf@entry=0x7f3e4a400000, n=n@entry=2097152, 
                offset=offset@entry=49152, err=err@entry=0x7ffd94df222c) at /mariadb/10.11m/storage/innobase/os/os0file.cc:2584
            #4  0x0000558ef654b9c1 in os_file_pread (type=@0x558ef5891080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=file@entry=10, buf=buf@entry=0x7f3e4a400000, n=n@entry=2097152, 
                offset=offset@entry=49152, err=err@entry=0x7ffd94df222c) at /mariadb/10.11m/storage/innobase/os/os0file.cc:2742
            #5  0x0000558ef654d58e in os_file_read_func (type=@0x558ef5891080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=10, buf=0x7f3e4a400000, offset=offset@entry=49152, n=2097152, 
                o=o@entry=0x0) at /mariadb/10.11m/storage/innobase/os/os0file.cc:2772
            #6  0x0000558ef651cc53 in log_file_t::read (this=0x558ef72044c0 <log_sys+128>, offset=49152, buf={data_ = 0x7f3e4a400000 "", size_ = 2097152}) at /mariadb/10.11m/storage/innobase/log/log0log.cc:166
            

            We have similar accounting in os_file_pwrite() that currently covers also log writes:

            	++os_n_file_writes;
             
            	const bool monitor = MONITOR_IS_ON(MONITOR_OS_PENDING_WRITES);
            	MONITOR_ATOMIC_INC_LOW(MONITOR_OS_PENDING_WRITES, monitor);
            	ssize_t	n_bytes = os_file_io(type, file, const_cast<byte*>(buf),
            				     n, offset, err);
            	MONITOR_ATOMIC_DEC_LOW(MONITOR_OS_PENDING_WRITES, monitor);
             
            	return(n_bytes);
            

            I think that we can do without that and only count pending writes to data files, not ib_logfile0. In fact, we may remove or replace the entire MONITOR_OS_PENDING_WRITES, because os_aio_pending_writes() will cover all writes to data files. The os_n_file_writes is an atomic counter, because data page writes require that. We do not need to increment that counter for log writes, because they are additionally covered by the regular variable log_sys.write_to_log, which is protected by log_sys.latch.

            Currently, the status variable Innodb_data_writes includes the count of Innodb_log_writes. With my intended change, Innodb_data_writes would no longer cover writes to ib_logfile0.

            marko Marko Mäkelä added a comment - Also the reads of ib_logfile0 (during early startup and crash recovery) are unnecessarily complex, including some ‘unreachable’ bookkeeping in os_file_pread() : const bool monitor = MONITOR_IS_ON(MONITOR_OS_PENDING_READS); MONITOR_ATOMIC_INC_LOW(MONITOR_OS_PENDING_READS, monitor); ssize_t n_bytes = os_file_io(type, file, buf, n, offset, err); MONITOR_ATOMIC_DEC_LOW(MONITOR_OS_PENDING_READS, monitor); #0 __libc_pread64 (fd=10, buf=0x7f3e4a400000, count=count@entry=2097152, offset=49152) at ../sysdeps/unix/sysv/linux/pread64.c:25 #1 0x0000558ef654a6bd in SyncFileIO::execute_low (this=this@entry=0x7ffd94df1ff0, request=@0x7ffd94df1fd0: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, n=2097152) at /mariadb/10.11m/storage/innobase/os/os0file.cc:691 #2 0x0000558ef654a811 in SyncFileIO::execute (this=this@entry=0x7ffd94df1ff0, request=@0x7ffd94df1fd0: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}) at /mariadb/10.11m/storage/innobase/os/os0file.cc:715 #3 0x0000558ef654b70a in os_file_io (in_type=@0x558ef5891080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=file@entry=10, buf=buf@entry=0x7f3e4a400000, n=n@entry=2097152, offset=offset@entry=49152, err=err@entry=0x7ffd94df222c) at /mariadb/10.11m/storage/innobase/os/os0file.cc:2584 #4 0x0000558ef654b9c1 in os_file_pread (type=@0x558ef5891080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=file@entry=10, buf=buf@entry=0x7f3e4a400000, n=n@entry=2097152, offset=offset@entry=49152, err=err@entry=0x7ffd94df222c) at /mariadb/10.11m/storage/innobase/os/os0file.cc:2742 #5 0x0000558ef654d58e in os_file_read_func (type=@0x558ef5891080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::READ_SYNC}, file=10, buf=0x7f3e4a400000, offset=offset@entry=49152, n=2097152, o=o@entry=0x0) at /mariadb/10.11m/storage/innobase/os/os0file.cc:2772 #6 0x0000558ef651cc53 in log_file_t::read (this=0x558ef72044c0 <log_sys+128>, offset=49152, buf={data_ = 0x7f3e4a400000 "", size_ = 2097152}) at /mariadb/10.11m/storage/innobase/log/log0log.cc:166 We have similar accounting in os_file_pwrite() that currently covers also log writes: ++os_n_file_writes;   const bool monitor = MONITOR_IS_ON(MONITOR_OS_PENDING_WRITES); MONITOR_ATOMIC_INC_LOW(MONITOR_OS_PENDING_WRITES, monitor); ssize_t n_bytes = os_file_io(type, file, const_cast <byte*>(buf), n, offset, err); MONITOR_ATOMIC_DEC_LOW(MONITOR_OS_PENDING_WRITES, monitor);   return (n_bytes); I think that we can do without that and only count pending writes to data files, not ib_logfile0 . In fact, we may remove or replace the entire MONITOR_OS_PENDING_WRITES , because os_aio_pending_writes() will cover all writes to data files. The os_n_file_writes is an atomic counter, because data page writes require that. We do not need to increment that counter for log writes, because they are additionally covered by the regular variable log_sys.write_to_log , which is protected by log_sys.latch . Currently, the status variable Innodb_data_writes includes the count of Innodb_log_writes . With my intended change, Innodb_data_writes would no longer cover writes to ib_logfile0 .

            I’d also change log_file_t::read() to invoke pread() or tpool::pread() directly. That will facilitate slightly more accurate diagnostics in case of a failure.

            marko Marko Mäkelä added a comment - I’d also change log_file_t::read() to invoke pread() or tpool::pread() directly. That will facilitate slightly more accurate diagnostics in case of a failure.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.