MariaDB Server / MDEV-31227

innodb_flush_method=O_DIRECT causes 3x regression in workload

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not a Bug
    • Affects Version/s: 10.6.12, 10.6.13
    • Fix Version/s: N/A
    • Labels: None
    • Environment: CentOS 7, kernel 3.10

    Description

      The repro procedure is the same as in MDEV-30567; additional details are in the comments.

      In short, this is the performance discrepancy we're seeing in 10.6.13 (10.6.12 performs similarly, though slightly worse):

      # Default (O_DIRECT)
      # * Q0: real: 14.21  min: 14.21  max: 14.21
       
      # innodb-flush-method = fsync
      # * Q0: real: 4.94  min: 4.94  max: 4.94
       
      # innodb-flush-method = O_DSYNC
      # * Q0: real: 4.92  min: 4.92  max: 4.92
       
      # innodb-flush-method = littlesync
      # * Q0: real: 4.91  min: 4.91  max: 4.91
       
      # innodb-flush-method = nosync
      # * Q0: real: 4.95  min: 4.95  max: 4.95
       
      # innodb-flush-method = O_DIRECT
      # * Q0: real: 14.21  min: 14.21  max: 14.21
       
      # innodb-flush-method = O_DIRECT_NO_FSYNC
      # * Q0: real: 12.81  min: 12.81  max: 12.81

      The customer for whom we are concerned about this issue is currently using 10.4, where innodb_flush_method=fsync is the default. In 10.6, though, the default changes to O_DIRECT. For now, we're advising the customer to explicitly set innodb_flush_method=fsync on 10.6. However, we know that O_DIRECT was made the default in 10.6 because it generally improves on fsync, so we'd like to know whether the regression we're seeing here is due to a bug that, if fixed, would make O_DIRECT the all-around best choice it is intended to be.
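
      For reference, a minimal sketch of that workaround on 10.6 (the option is the standard innodb_flush_method server variable; the file path and [mariadb] group are placeholders for whatever the customer's configuration already uses):

      # /etc/my.cnf.d/server.cnf (or equivalent)
      [mariadb]
      # revert to the 10.4 default until the O_DIRECT regression is understood
      innodb_flush_method = fsync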

      Attachments

        Issue Links

          Activity

            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) created issue -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Priority: Major → Minor
            marko Marko Mäkelä added a comment -

            As per MDEV-29967, the linear read-ahead is broken already in 10.5.

            Also MDEV-30986 has been filed for something similar.
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s: 10.4, 10.5, 10.6
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s: 10.4, 10.5
            julien.fritsch Julien Fritsch made changes -
            Assignee: Marko Mäkelä

            marko Marko Mäkelä added a comment -

            rob.schwyzer@mariadb.com, can you please test the MDEV-30986 branch, which implements logical read-ahead?
            marko Marko Mäkelä made changes -
            Assignee: Marko Mäkelä → Rob Schwyzer
            Status: Open → Needs Feedback
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Assignee: Rob Schwyzer → Marko Mäkelä
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Status: Needs Feedback → Open
            marko Marko Mäkelä made changes -
            Assignee: Marko Mäkelä → Rob Schwyzer
            Status: Open → Needs Feedback

            marko Marko Mäkelä added a comment -

            If I understood correctly, the workload consists solely of queries like the following:

            SELECT c1 FROM "table" ORDER BY c1 LIMIT 1;

            This looks like random I/O to me. Would setting innodb_random_read_ahead=ON help here? It is disabled by default.
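
            For reference, innodb_random_read_ahead is a dynamic global variable, so it can be toggled for a quick test without a server restart; a minimal sketch (the status counter is the standard InnoDB one):

            SET GLOBAL innodb_random_read_ahead = ON;
            SHOW GLOBAL VARIABLES LIKE 'innodb_random_read_ahead';
            -- after the test run, check how much random read-ahead was actually triggered
            SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_ahead_rnd';
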
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Assignee: Rob Schwyzer → Aleksey Midenkov
            midenok Aleksey Midenkov made changes -
            Status: Needs Feedback → Open
            midenok Aleksey Midenkov made changes -
            Assignee: Aleksey Midenkov → Marko Mäkelä
            julien.fritsch Julien Fritsch made changes -
            Status: Open → Confirmed
            marko Marko Mäkelä made changes -
            Status: Confirmed → Open

            marko Marko Mäkelä added a comment -

            Would the performance of innodb_flush_method=fsync degrade to the level of innodb_flush_method=O_DIRECT if the Linux kernel’s file system cache is emptied before starting the workload? I read somewhere that the command for that should be sudo sysctl vm.drop_caches=3, but I did not try that.

            If the performance of innodb_flush_method=fsync remains better than that of innodb_flush_method=O_DIRECT even then, I would think that the file system cache is acting as a cache between the InnoDB buffer pool and the storage.

            How large would the file system cache be at the end of the benchmark run? To have a fair comparison, I think that for the innodb_flush_method=O_DIRECT case, innodb_buffer_pool_size must be increased by that amount, so that both cases have access to the same amount of memory.
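
            For reference, a minimal sketch of emptying the kernel cache before each run (sync first so dirty pages are written back and can actually be reclaimed; requires root):

            sync
            sudo sysctl vm.drop_caches=3   # drops the page cache plus dentries and inodes
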
            marko Marko Mäkelä made changes -
            Assignee: Marko Mäkelä → Rob Schwyzer
            Status: Open → Needs Feedback
            midenok Aleksey Midenkov added a comment (edited) -

            After dropping the caches, fsync is slightly worse (7.33 is the good number now):

            {10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3
             
            vm.drop_caches = 3
            {10.6} midenok@snsvr01:cmake$ free -h
                          total        used        free      shared  buff/cache   available
            Mem:           565G         10G        550G        4.0G        4.7G        549G
            Swap:          127G        115M        127G
             
            {10.6} midenok@snsvr01:cmake$ mp -q
            ! Started: 127393
            * Q0: real: 9.50  min: 9.50  max: 9.50
            ! Stopped: 127393
            * Server dirty stats:
            faults: 155  rfaults: 17424  fsi: 1279376  fso: 184  rss_max: 226332  cpu: 88  sys: 0.56  usr: 9.19  real: 11.04  ctx0: 8  ctx1: 15353
             
            {10.6} midenok@snsvr01:cmake$ free -h
                          total        used        free      shared  buff/cache   available
            Mem:           565G         10G        549G        4.0G        5.3G        549G
            Swap:          127G        115M        127G
             
            {10.6} midenok@snsvr01:cmake$ mp -q
            ! Started: 127686
            * Q0: real: 7.33  min: 7.33  max: 7.33
            ! Stopped: 127686
            * Server dirty stats:
            rfaults: 16757  fsi: 160  fso: 184  rss_max: 229176  cpu: 95  sys: 0.43  usr: 8.00  real: 8.84  ctx0: 19  ctx1: 7081
             
            {10.6} midenok@snsvr01:cmake$ free -h
                          total        used        free      shared  buff/cache   available
            Mem:           565G         10G        549G        4.0G        5.3G        549G
            Swap:          127G        115M        127G
            

            O_DIRECT is just as bad:

            {10.6} midenok@snsvr01:cmake$ mp -q
            ! Started: 128117
            * Q0: real: 14.46  min: 14.46  max: 14.46
            ! Stopped: 128117
            * Server dirty stats:
            rfaults: 20930  fsi: 1300064  fso: 184  rss_max: 245112  cpu: 68  sys: 0.48  usr: 10.52  real: 15.96  ctx0: 43  ctx1: 44265
             
            {10.6} midenok@snsvr01:cmake$ free -h
                          total        used        free      shared  buff/cache   available
            Mem:           565G         10G        549G        4.0G        5.3G        549G
            Swap:          127G        115M        127G
            

            Marko, I'd suggest paying attention to the fsi and ctx1 numbers. For O_DIRECT they are much bigger.

            fsi: Number of file system inputs by the process.
            ctx1: Number of times that the program was context-switched voluntarily, for instance while waiting for an I/O operation to complete.

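            For reference, these counters look like GNU time(1) statistics; a minimal sketch of collecting the same numbers directly, assuming /usr/bin/time is GNU time and that ctx0 corresponds to %c (involuntary context switches):

            # hypothetical demo command (sleep 1); substitute the real server/benchmark invocation
            /usr/bin/time -f 'rfaults: %R  fsi: %I  fso: %O  rss_max: %M  ctx0: %c  ctx1: %w  real: %e' sleep 1
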
            midenok Aleksey Midenkov made changes -
            Assignee: Rob Schwyzer → Marko Mäkelä
            Status: Needs Feedback → Open

            marko Marko Mäkelä added a comment -

            midenok, thank you for the update. If I interpreted the output correctly, the file system cache grew by 0.6 GiB (from 4.7GiB to 5.3GiB) during the test with innodb_flush_method=fsync (file system cache enabled). If that is the case, for the innodb_flush_method=O_DIRECT test to be fair, the innodb_buffer_pool_size should have been increased by the same amount (and the file system cache emptied before running the test).

            Which server configuration parameters had been specified? I do not see any server parameters mentioned either in this ticket or in the test script itself.

            In particular, had innodb_random_read_ahead=ON been set? Would the prototype of logical read-ahead (MDEV-30986) improve the O_DIRECT performance? I suppose that when the Linux kernel’s file system cache is being used, the kernel can submit some read-ahead requests on its own.

            The most important one would be innodb_buffer_pool_size. The output rss_max is from time(1):

            %M Maximum resident set size of the process during its
            lifetime, in Kbytes.

            Based on this and the amount of available memory, I would assume innodb_buffer_pool_size to be over 100 gigabytes, and hence the 0.6GiB increase of the file system cache might not make much difference. But I do not think that the kernel file system cache would be covered in this number. We would need some monitoring of the global system state.

            The ctx1 and fsi are from the following time(1) attributes:

            %w Number of waits: times that the program was context-
            switched voluntarily, for instance while waiting for an
            I/O operation to complete.
            %I Number of filesystem inputs by the process.

            Supposedly, copying data from the kernel’s file system cache to the memory of the process would not affect either metric.

            Read latency would be tracked better by the instrumentation that MDEV-31558 is adding.

            I would expect there to be a number of performance improvements in the Linux kernel, especially in 5.x and 6.x.

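            For reference, a minimal sketch of monitoring the global file system cache while the benchmark runs (sampling /proc/meminfo once per second):

            while sleep 1; do
                date '+%T'
                grep -E '^(Buffers|Cached|MemAvailable):' /proc/meminfo
            done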

            marko Marko Mäkelä added a comment -

            There are two more sources of nondeterminism that should be accounted for, to prevent any unexpected "warm-up" of the buffer pool (see the sketch after this list):

            • If the file ib_buffer_pool exists in the data directory, it must be deleted before starting the server under test.
            • Crash recovery, or purging the history of committed transactions: the server should previously have been shut down with innodb_fast_shutdown=0.
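
            For reference, a minimal sketch of preparing the data directory that way (the datadir path and client invocation are placeholders for whatever the test setup uses):

            # request a full (slow) shutdown so no recovery or purge work is left over
            mariadb -e 'SET GLOBAL innodb_fast_shutdown=0; SHUTDOWN;'
            # delete the saved buffer pool dump so the next start begins with a cold buffer pool
            rm -f /var/lib/mysql/ib_buffer_pool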

            midenok Aleksey Midenkov added a comment -

            innodb_buffer_pool_size did the trick.

            O_DIRECT

            * Q0: real: 7.52  min: 7.52  max: 7.52
            ! Stopped: 220861
            * Server dirty stats:
            rfaults: 18686  fsi: 1238336  fso: 464  rss_max: 748692  cpu: 110  sys: 1.17  usr: 8.78  real: 9.04  ctx0: 20  ctx1: 128360
            

            fsync

            * Q0: real: 7.55  min: 7.55  max: 7.55
            ! Stopped: 221140
            * Server dirty stats:
            rfaults: 19576  fsi: 2144  fso: 456  rss_max: 744456  cpu: 110  sys: 1.24  usr: 8.77  real: 9.06  ctx0: 43  ctx1: 11396
            

            InnoDB config:

            innodb_random_read_ahead=ON
            innodb_buffer_pool_size=1342177280
            innodb-flush-method = fsync
            # innodb-flush-method = O_DSYNC
            # innodb-flush-method = littlesync
            # innodb-flush-method = nosync
            # innodb-flush-method = O_DIRECT
            # innodb-flush-method = O_DIRECT_NO_FSYNC
            innodb_stats_persistent=OFF
            innodb_stats_auto_recalc=OFF
            


            midenok Aleksey Midenkov added a comment -

            After dropping the FS cache, with the above config:

            O_DIRECT

            {10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3
            vm.drop_caches = 3
            {10.6} midenok@snsvr01:cmake$ mp -q
            ! Started: 222519
            * Q0: real: 8.31  min: 8.31  max: 8.31
            ! Stopped: 222519
            * Server dirty stats:
            faults: 149  rfaults: 21529  fsi: 1275536  fso: 456  rss_max: 752284  cpu: 104  sys: 1.30  usr: 8.98  real: 9.80  ctx0: 50  ctx1: 117028
            

            fsync

            {10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3
            vm.drop_caches = 3
            {10.6} midenok@snsvr01:cmake$ mp -q
            ! Started: 223108
            * Q0: real: 7.86  min: 7.86  max: 7.86
            ! Stopped: 223108
            * Server dirty stats:
            faults: 162  rfaults: 23074  fsi: 1280512  fso: 456  rss_max: 761780  cpu: 107  sys: 1.17  usr: 8.80  real: 9.32  ctx0: 45  ctx1: 103667
            


            midenok Aleksey Midenkov added a comment -

            O_DIRECT innodb_random_read_ahead=OFF vs ON

            innodb_random_read_ahead=OFF

            innodb_random_read_ahead=OFF
            innodb_buffer_pool_size=1342177280
            innodb-flush-method = O_DIRECT
            innodb_stats_persistent=OFF
            innodb_stats_auto_recalc=OFF
            

            {10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3 && mp -q3
            vm.drop_caches = 3
            ! Started: 245416
            * Q0: real: 10.33  min: 10.33  max: 10.33
            * Q1: real: 7.23  min: 7.23  max: 10.33
            * Q2: real: 7.35  min: 7.23  max: 10.33
            * Average for 3 queries:
            real: 8.30  min: 7.23  max: 10.33
            ! Stopped: 245416
            * Server dirty stats:
            faults: 164  rfaults: 21545  fsi: 1278528  fso: 456  rss_max: 741840  cpu: 91  sys: 0.93  usr: 23.69  real: 26.91  ctx0: 60  ctx1: 87688
            

            innodb_random_read_ahead=ON

            innodb_random_read_ahead=ON
            innodb_buffer_pool_size=1342177280
            innodb-flush-method = O_DIRECT
            innodb_stats_persistent=OFF
            innodb_stats_auto_recalc=OFF
            

            {10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3 && mp -q3
            vm.drop_caches = 3
            ! Started: 245958
            * Q0: real: 8.08  min: 8.08  max: 8.08
            * Q1: real: 7.40  min: 7.40  max: 8.08
            * Q2: real: 7.31  min: 7.31  max: 8.08
            * Average for 3 queries:
            real: 7.59  min: 7.31  max: 8.08
            ! Stopped: 245958
            * Server dirty stats:
            faults: 164  rfaults: 21209  fsi: 1278112  fso: 472  rss_max: 745548  cpu: 100  sys: 1.24  usr: 23.55  real: 24.68  ctx0: 54  ctx1: 118957
            

            midenok Aleksey Midenkov added a comment (edited) -

            fsync innodb_random_read_ahead=OFF vs ON

            innodb_random_read_ahead=OFF

            innodb_random_read_ahead=OFF
            innodb_buffer_pool_size=1342177280
            innodb-flush-method = fsync
            innodb_stats_persistent=OFF
            innodb_stats_auto_recalc=OFF
            

            {10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3 && mp -q3
            vm.drop_caches = 3
            ! Started: 246520
            * Q0: real: 8.53  min: 8.53  max: 8.53
            * Q1: real: 7.19  min: 7.19  max: 8.53
            * Q2: real: 7.13  min: 7.13  max: 8.53
            * Average for 3 queries:
            real: 7.61  min: 7.13  max: 8.53
            ! Stopped: 246520
            * Server dirty stats:
            faults: 164  rfaults: 24188  fsi: 1280376  fso: 456  rss_max: 762416  cpu: 96  sys: 0.96  usr: 23.11  real: 24.84  ctx0: 25  ctx1: 65197
            

            innodb_random_read_ahead=ON

            innodb_random_read_ahead=ON
            innodb_buffer_pool_size=1342177280
            innodb-flush-method = fsync
            innodb_stats_persistent=OFF
            innodb_stats_auto_recalc=OFF
            

            {10.6} midenok@snsvr01:cmake$ sudo sysctl vm.drop_caches=3 && mp -q3
            vm.drop_caches = 3
            ! Started: 247049
            * Q0: real: 8.32  min: 8.32  max: 8.32
            * Q1: real: 7.59  min: 7.59  max: 8.32
            * Q2: real: 7.48  min: 7.48  max: 8.32
            * Average for 3 queries:
            real: 7.79  min: 7.48  max: 8.32
            ! Stopped: 247049
            * Server dirty stats:
            faults: 164  rfaults: 31906  fsi: 1281136  fso: 472  rss_max: 765080  cpu: 101  sys: 1.29  usr: 24.53  real: 25.34  ctx0: 44  ctx1: 103060
            


            marko Marko Mäkelä added a comment -

            Thank you! One of my suspicions was that this was a case of the working set exceeding innodb_buffer_pool_size, and then having all "excess" RAM acting as a cache between the durable storage and InnoDB, or not having any kernel file system cache at all.

            Here is one more resource that I just came across: the README.md file of the nocache tool explains how you might be able to limit the size of the Linux kernel’s file system cache while a process is running. It might be useful for a valid comparison when using a smaller InnoDB buffer pool.

            Is there anything that needs to be changed in the code for this particular ticket? Out of curiosity, what was the original innodb_buffer_pool_size? The default 128MiB (vs. 1280MiB for the updated results)?

            marko Marko Mäkelä made changes -
            Assignee: Marko Mäkelä → Rob Schwyzer
            Status: Open → Needs Feedback

            midenok Aleksey Midenkov added a comment -

            > The default 128MiB (vs. 1280MiB for the updated results)?

            Right, it was the default.

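            For reference, 1342177280 bytes is exactly 1280 MiB; a quick way to verify the running value:

            SELECT @@innodb_buffer_pool_size AS bytes,
                   @@innodb_buffer_pool_size / 1024 / 1024 AS mib;
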
            midenok Aleksey Midenkov made changes -
            Assignee: Rob Schwyzer → Aleksey Midenkov
            midenok Aleksey Midenkov made changes -
            Status: Needs Feedback → Open
            midenok Aleksey Midenkov made changes -
            Component/s: Storage Engine - InnoDB
            Fix Version/s: 10.6 → N/A
            Resolution: Not a Bug
            Status: Open → Closed

            marko Marko Mäkelä added a comment -

            In MDEV-30986 the following approach worked for me. The statement under test would access about 400MiB of data.

            mkdir /var/tmp/var && rm -fr var && ln -s /var/tmp/var var
            ./mtr --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 name_of_test
            ./mtr --mysqld=--innodb-buffer-pool-size=400m --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 name_of_test
            sudo sysctl vm.drop_caches=3
            systemd-run --scope --property=MemoryLimit=250M -- ./mtr --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 name_of_test
            sudo sysctl vm.drop_caches=3
            systemd-run --scope --property=MemoryLimit=550M -- ./mtr --mysqld=--innodb-flush-method=fsync --mysqld=--innodb-buffer-pool-size=400m --mysqld=--innodb-buffer-pool-dump-at-shutdown=0 name_of_test
            

            For that test, I got the best read performance (for either buffer pool size) by using the default innodb_flush_method=O_DIRECT and enabling innodb_random_read_ahead=ON.

            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets: 201658
            Zendesk active tickets: 201658

            People

              Assignee: midenok Aleksey Midenkov
              Reporter: rob.schwyzer@mariadb.com Rob Schwyzer (Inactive)
              Votes: 0
              Watchers: 7

