[MDEV-31227] innodb_flush_method=O_DIRECT causes 3x regression in workload Created: 2023-05-09 Updated: 2023-08-15 Resolved: 2023-07-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.12, 10.6.13 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Minor |
| Reporter: | Rob Schwyzer | Assignee: | Aleksey Midenkov |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7, kernel 3.10 |
||
| Issue Links: |
|
| Description |
|
Repro procedure is the same as in the linked issue. In short, this is the performance discrepancy we're seeing in 10.6.13 (10.6.12 performs similarly, but slightly worse):
The customer we are concerned about here is currently using 10.4, where innodb_flush_method=fsync is the default. In 10.6, the default changes to O_DIRECT. For now, we're advising the customer to explicitly set innodb_flush_method=fsync on 10.6. However, O_DIRECT was made the default in 10.6 because it generally improves on fsync, so we'd like to know whether the regression we're seeing here is due to a bug that, if fixed, would make O_DIRECT the all-around best choice it is intended to be. |
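The workaround described above amounts to a single server option. A minimal my.cnf sketch (the `[mariadb]` group is standard; everything else about the customer's actual configuration is unknown):

```ini
# Sketch: pin the 10.4-era flush method on a 10.6 server.
# innodb_flush_method is a real InnoDB option; this fragment is
# illustrative, not the customer's actual configuration.
[mariadb]
innodb_flush_method = fsync
```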
| Comments |
| Comment by Marko Mäkelä [ 2023-05-10 ] | ||||||||||||||||||||||||||||||||||||||||||
|
As per Also | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-05-24 ] | ||||||||||||||||||||||||||||||||||||||||||
|
rob.schwyzer@mariadb.com, can you please test the | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-06-07 ] | ||||||||||||||||||||||||||||||||||||||||||
|
If I understood correctly, the workload consists solely of queries like the following:
This looks like random I/O to me. Would setting innodb_random_read_ahead=ON help here? It is disabled by default. | ||||||||||||||||||||||||||||||||||||||||||
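Since innodb_random_read_ahead is a dynamic system variable, it can be toggled for a quick A/B comparison without a server restart (a sketch; persist the winning choice in my.cnf afterwards):

```sql
-- Enable random read-ahead on a running server (OFF by default):
SET GLOBAL innodb_random_read_ahead = ON;
-- Verify the setting took effect:
SHOW GLOBAL VARIABLES LIKE 'innodb_random_read_ahead';
-- Revert after the comparison:
SET GLOBAL innodb_random_read_ahead = OFF;
```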
| Comment by Marko Mäkelä [ 2023-06-29 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Would the performance of innodb_flush_method=fsync degrade to the level of innodb_flush_method=O_DIRECT if the Linux kernel’s file system cache is emptied before starting the workload? I read somewhere that the command for that should be sudo sysctl vm.drop_caches=3, but I did not try that. If the performance of innodb_flush_method=fsync will remain better than innodb_flush_method=O_DIRECT even with this, then I would think that the file system cache will act as a cache between the InnoDB buffer pool and the storage. How large would the file system cache be at the end of the benchmark run? To have a fair comparison, I think that for the innodb_flush_method=O_DIRECT case, the innodb_buffer_pool_size must be increased by that amount, so that both cases will have access to the same amount of memory. | ||||||||||||||||||||||||||||||||||||||||||
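Dropping the cache can be combined with measuring it: on Linux, the current page-cache size is the "Cached" field of /proc/meminfo, which is the amount one would add to innodb_buffer_pool_size for the fair comparison described above. A minimal sketch (Linux-only; the helper name is mine, not from the ticket):

```python
def fs_cache_bytes(path="/proc/meminfo"):
    """Return the size of the Linux page cache in bytes, read from the
    "Cached:" field of /proc/meminfo. This is roughly the amount that
    would be added to innodb_buffer_pool_size so the O_DIRECT run has
    access to the same total memory as the fsync run."""
    with open(path) as f:
        for line in f:
            if line.startswith("Cached:"):
                # Field format: "Cached:          4915200 kB"
                return int(line.split()[1]) * 1024
    raise RuntimeError("no Cached: line found in " + path)

if __name__ == "__main__":
    print(f"file system cache: {fs_cache_bytes() / 2**30:.2f} GiB")
```

Sampling this before and after a benchmark run shows how much the cache grew during the workload.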
| Comment by Aleksey Midenkov [ 2023-07-04 ] | ||||||||||||||||||||||||||||||||||||||||||
|
After dropping caches, fsync is slightly worse (7.33 is a good number now):
O_DIRECT is just as bad:
Marko, I'd suggest turning your attention to the fsi and ctx1 numbers. For O_DIRECT they are much bigger. fsi: Number of file system inputs by the process. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||
|
midenok, thank you for the update. If I interpreted the output correctly, the file system cache grew by 0.6 GiB (from 4.7 GiB to 5.3 GiB) during the test with innodb_flush_method=fsync (file system cache enabled). If that is the case, for the innodb_flush_method=O_DIRECT test to be fair, the innodb_buffer_pool_size should have been increased by the same amount (and the file system cache emptied before running the test). Which server configuration parameters were specified? I do not see any server parameters mentioned either in this ticket or in the test script itself. In particular, had innodb_random_read_ahead=ON been set? Would the prototype of logical read-ahead help? The most important parameter would be innodb_buffer_pool_size. The output rss_max is from time(1):
Based on this and the amount of available memory, I would assume innodb_buffer_pool_size to be over 100 gigabytes, and hence the 0.6 GiB increase of the file system cache might not make much difference. But I do not think that the kernel file system cache would be covered in this number. We would need some monitoring of the global system state. The ctx1 and fsi are from the following time(1) attributes:
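The counters that time(1) reports can also be read from inside a process via getrusage(2). A Python sketch, assuming fsi maps to ru_inblock (GNU time's %I, "file system inputs") and ctx1 to involuntary context switches, ru_nivcsw (GNU time's %c):

```python
import resource

def fsi_and_ctx():
    """Return (fsi, ctx1) analogues for the current process:
    ru_inblock - block input operations, i.e. reads that actually hit
                 the file system ("file system inputs", time's %I);
    ru_nivcsw  - involuntary context switches (time's %c), which grow
                 when the scheduler preempts the process.
    """
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_inblock, ru.ru_nivcsw

fsi, ctx1 = fsi_and_ctx()
print(f"fsi={fsi} ctx1={ctx1}")
```

The resource module is Unix-only; on Linux these fields come from the same kernel accounting that time(1) reads.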
Supposedly, copying data from the kernel’s file system cache to the memory of the process would not affect either metric. Read latency would be tracked better by dedicated instrumentation. Separately, I would expect there to be a number of performance improvements in the Linux kernel, especially in 5.x and 6.x. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||
|
There are two more sources of nondeterminism that should be accounted for, to prevent any unexpected "warm-up" of the buffer pool:
| ||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||
|
innodb_buffer_pool_size did the trick. O_DIRECT
fsync
InnoDB config:
| ||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||
|
After dropping FS cache, with the above config: O_DIRECT
fsync
| ||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||
O_DIRECT: innodb_random_read_ahead=OFF vs ON
innodb_random_read_ahead=OFF
innodb_random_read_ahead=ON
| ||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||
fsync: innodb_random_read_ahead=OFF vs ON
innodb_random_read_ahead=OFF
innodb_random_read_ahead=ON
| ||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Thank you! One of my suspicions was that this was a case of the working set exceeding innodb_buffer_pool_size, and then having all "excess" RAM acting as a cache between the durable storage and InnoDB, or not having any kernel file system cache at all. Here is one more resource that I just came across: the README.md file of the nocache tool explains how you might be able to limit the size of the Linux kernel’s file system cache while a process is running. It might be useful for a valid comparison when using a smaller InnoDB buffer pool. Is there anything that needs to be changed in the code for this particular ticket? Out of curiosity, what was the original innodb_buffer_pool_size? The default 128MiB (vs. 1280MiB for the updated results)? | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-07-06 ] | ||||||||||||||||||||||||||||||||||||||||||
|
> The default 128MiB (vs. 1280MiB for the updated results)? Right, it was the default. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-07-07 ] | ||||||||||||||||||||||||||||||||||||||||||
|
In
For that test, I got the best read performance (for either buffer pool size) by using the default innodb_flush_method=O_DIRECT and enabling innodb_random_read_ahead=ON. |
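The winning combination from that test can be sketched as a my.cnf fragment (the commented 1280M pool size mirrors the earlier test run in this ticket and is a placeholder, not a general recommendation):

```ini
[mariadb]
# O_DIRECT is already the 10.6 default; shown explicitly for clarity.
innodb_flush_method = O_DIRECT
# OFF by default; helped the random-read workload in this ticket.
innodb_random_read_ahead = ON
# Placeholder: size the pool to the working set.
# innodb_buffer_pool_size = 1280M
```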