MariaDB Server / MDEV-32067

InnoDB linear read ahead had better be logical

Details

    • Type: Bug
    • Status: Confirmed
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL)
    • Fix Version/s: 11.4

    Description

      MDEV-30986 was originally filed due to a change that was made in MDEV-24854: Starting with MariaDB Server 10.6, InnoDB disables the file system cache by default (innodb_flush_method=O_DIRECT).

      While disabling the file system cache does improve write performance, it can hurt the performance of read workloads that cannot be satisfied directly from the InnoDB buffer pool.

      In MDEV-30986, testing on several types of storage and operating system versions showed that when data needs to be loaded into the InnoDB buffer pool, it is faster to read it through the file system cache of the operating system. This is the case even when the cache is initially empty, which suggests that the InnoDB read-ahead mechanism could be better.

      In key range scans or table scans, it would seem to make sense to post read-ahead requests for the index leaf pages when the level right above the leaf is reached. At that point, we would know which pages will have to be accessed by the query. Possibly, it would help to issue a single larger read instead of several single-page requests (MDEV-11378).
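
      As a rough illustration only (this is not InnoDB code; the types and names below are hypothetical), a self-contained C++ sketch of the idea: once the non-leaf page right above the leaf level has been reached, the child page numbers that the key range will touch are known, and contiguous runs of them could be coalesced into a few larger reads instead of many single-page requests.

      // Sketch only: hypothetical structures, not InnoDB's.
      #include <algorithm>
      #include <cstddef>
      #include <cstdint>
      #include <iostream>
      #include <utility>
      #include <vector>

      // A child pointer in a hypothetical non-leaf B-tree page: the smallest
      // key covered by the child, and the child's page number.
      struct child_ptr { uint64_t min_key; uint32_t page_no; };

      // Collect the leaf page numbers that a scan of [start_key, end_key]
      // will visit, based on the child pointers of the level-1 page.
      std::vector<uint32_t> leaves_for_range(const std::vector<child_ptr>& children,
                                             uint64_t start_key, uint64_t end_key)
      {
        std::vector<uint32_t> pages;
        for (std::size_t i = 0; i < children.size(); i++) {
          // Child i covers keys in [children[i].min_key, next child's min_key).
          uint64_t lo = children[i].min_key;
          uint64_t hi = i + 1 < children.size() ? children[i + 1].min_key : UINT64_MAX;
          if (hi > start_key && lo <= end_key)
            pages.push_back(children[i].page_no);
        }
        return pages;
      }

      // Coalesce sorted page numbers into contiguous runs, so that one larger
      // read request can replace several single-page requests.
      std::vector<std::pair<uint32_t, uint32_t>> coalesce(std::vector<uint32_t> pages)
      {
        std::vector<std::pair<uint32_t, uint32_t>> runs; // {first_page, length}
        std::sort(pages.begin(), pages.end());
        for (uint32_t p : pages) {
          if (!runs.empty() && runs.back().first + runs.back().second == p)
            runs.back().second++;
          else
            runs.push_back({p, 1});
        }
        return runs;
      }

      int main()
      {
        // A made-up level-1 page with six children, mostly contiguous on disk.
        std::vector<child_ptr> children = {
          {0, 100}, {50, 101}, {100, 102}, {150, 200}, {200, 201}, {250, 202}};
        // A range scan over keys [40, 180] will visit pages 100, 101, 102 and 200.
        for (auto [first, len] : coalesce(leaves_for_range(children, 40, 180)))
          std::cout << "read " << len << " page(s) starting at page " << first << '\n';
      }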

    Attachments

    Issue Links

    Activity

            Marko Mäkelä added a comment (edited)

            I rebased the work-in-progress prototype. MDEV-32068 implements a subset of this: a smarter invocation of buf_read_ahead_linear().

            The main challenge for completing the logical read-ahead is to pass the end_key from handler::read_range_first() to btr_cur_t::search_leaf(), in order to determine which pages will have to be read ahead. The current patch always tries to read up to 16 siblings between the requested leaf page and the first or last child page of the current page, regardless of whether we are executing a point select (which will not access any pages other than the current one) or a table scan (which will access all child pages).
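
            To spell out that decision in isolation (a hedged sketch with made-up names, not the actual patch): if the position of end_key within the level-1 page were known, the number of sibling leaf pages worth prefetching could be computed exactly, instead of always reading up to 16 siblings even for a point select.

            // Sketch only; the names and the constant 16 mirror the comment, not the patch.
            #include <algorithm>
            #include <cstddef>
            #include <iostream>
            #include <optional>

            // How many sibling leaf pages to read ahead, given the index of the child
            // pointer we descended through (cur) and, if known, the index of the child
            // covering end_key (end) within the same level-1 page of n_children entries.
            std::size_t siblings_to_prefetch(std::size_t cur, std::size_t n_children,
                                             std::optional<std::size_t> end)
            {
              constexpr std::size_t max_prefetch = 16; // cap used by the prototype
              if (end) {
                // Range known: a point select (*end == cur) prefetches nothing; a range
                // scan prefetches exactly the siblings it will visit, up to the cap.
                std::size_t need = *end > cur ? *end - cur : 0;
                return std::min(need, max_prefetch);
              }
              // Range unknown (e.g. a scan that continues past this page): read ahead
              // up to the cap, but never past the last child of the current page.
              return std::min(max_prefetch, n_children - 1 - cur);
            }

            int main()
            {
              std::cout << siblings_to_prefetch(3, 100, 3) << '\n';            // point select: 0
              std::cout << siblings_to_prefetch(3, 100, 10) << '\n';           // short range scan: 7
              std::cout << siblings_to_prefetch(3, 100, std::nullopt) << '\n'; // end unknown: 16
            }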

            Valerii Kravchuk added a comment (edited)

            Consider a test case where a table with long rows is larger than half of the available RAM. See the attached tbig2.sql for a table over 2 GB in size with some 400K+ rows, which can be used to demonstrate the problem on a system with 4 GB of RAM or less and slow enough disks (such as an HDD). Find the combination of InnoDB buffer pool size, file system cache, and innodb_flush_method that gives the fastest execution of a full table scan query like the one below, for the first and then for repeated executions in a single thread, with nothing happening concurrently:

            MariaDB [test]> select @@innodb_buffer_pool_size;
            +---------------------------+
            | @@innodb_buffer_pool_size |
            +---------------------------+
            |                  33554432 |
            +---------------------------+
            1 row in set (0,001 sec)
             
            MariaDB [test]> pager grep rows
            PAGER set to 'grep rows'
             
            MariaDB [test]> select * from tbig2;
            402144 rows in set (49,894 sec)
             
            MariaDB [test]> select * from tbig2;
            402144 rows in set (49,480 sec)
             
            MariaDB [test]> select * from tbig2;
            402144 rows in set (49,463 sec)
             
            MariaDB [test]> select * from tbig2;
            402144 rows in set (49,718 sec)
             
            MariaDB [test]> select * from tbig2;
            402144 rows in set (49,630 sec)
            

            I do NOT see any clear positive impact of growing the buffer pool size above that 32M, while keeping it small and setting innodb_flush_method to fsync makes the query notably faster. Whatever improvement you make with read ahead, it should show a positive impact for a larger buffer pool size and innodb_flush_method=O_DIRECT; otherwise relying on the OS file system cache and using less than 1/2 of RAM for the buffer pool would still be a better alternative.


            Marko Mäkelä added a comment

            For io_uring, there is an interface change in the works that would allow writes to bypass the block cache in the kernel, on files that are not opened O_DIRECT. The magic words are RWF_UNCACHED and "uncached buffered I/O". That might be the best of both worlds.

            For what it is worth, while looking up the above (initially searching for "unbuffered" instead of "uncached"), I found a claim that the older libaio-based interface would only work on files that were opened in O_DIRECT mode. This does not fully match my impression. Maybe libaio would always perform uncached writes, and refuse to work if O_DIRECT is not supported on a file. This may be worth investigating: simply test a build with cmake -DCMAKE_DISABLE_FIND_PACKAGE_URING=1 and innodb_flush_method=fsync, and check with perf record whether the InnoDB data page writes bypass the Linux kernel’s block cache.
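
            For reference, a minimal user-space sketch (not MariaDB code) of what such an uncached buffered write could look like, assuming the kernel and libc headers define the proposed RWF_UNCACHED flag; if they do not, the #ifdef falls back to a plain buffered write.

            // Sketch of "uncached buffered I/O" from user space; not MariaDB code.
            // RWF_UNCACHED is the flag name from the proposed kernel patch series and
            // may not be defined by the headers on a given system, hence the #ifdef.
            #ifndef _GNU_SOURCE
            #define _GNU_SOURCE
            #endif
            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>
            #include <sys/uio.h>
            #include <unistd.h>

            int main()
            {
              int fd = open("uncached-demo.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);
              if (fd < 0) { perror("open"); return 1; }

              char page[4096];
              memset(page, 'x', sizeof page);
              struct iovec iov;
              iov.iov_base = page;
              iov.iov_len = sizeof page;

            #ifdef RWF_UNCACHED
              // Buffered write, but the kernel is asked to drop the pages from its
              // cache once they have been written back to storage.
              ssize_t n = pwritev2(fd, &iov, 1, 0, RWF_UNCACHED);
            #else
              // Flag not known to these headers: fall back to a plain buffered write.
              ssize_t n = pwritev2(fd, &iov, 1, 0, 0);
            #endif
              if (n < 0) perror("pwritev2");
              else printf("wrote %zd bytes\n", n);

              close(fd);
              return 0;
            }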


            People

              Assignee: Marko Mäkelä
              Reporter: Marko Mäkelä
              Votes: 3
              Watchers: 6
