MariaDB Server / MDEV-34043

Drastically slower query performance between CentOS (2sec) and Rocky (48sec)

Details

    Description

      There is a significant difference in query performance with MariaDB 10.6 running on CentOS (2 seconds) compared to Rocky (48 seconds). With split_materialized=off, this particular query runs faster on Rocky (7 seconds), but it is still slower than on CentOS.

      Query used to reproduce:

      set session optimizer_switch='split_materialized=on';
       
      use b_test;
       
      select this_.`RMD_CURRENT_STATUS_VALUE` as y0_, count(*) as y1_
      from `TEST_MAU_REQUEST_RECORD_VIEW` this_
      where (lower(this_.`UAR_ALL_EMAILS`) like '%AbeFrazer8@example.com%' escape '!'
      or lower(this_.`UAR_APPROVER_STLS`) like '%AbeFrazer8@example.com%' escape '!'
      or lower(this_.`REQUEST_BY`) like '%AbeFrazer8@example.com%' escape '!')
      group by this_.`RMD_CURRENT_STATUS_VALUE`
      order by y0_ asc
      limit 50;
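
      For reference, a quick way to compare both settings on one host from the shell (a sketch: /tmp/mdev34043.sql is just an assumed location for the SELECT above, and the mariadb client's -vvv output prints the "rows in set (N sec)" timing quoted in the comments below):

      for sw in on off; do
        echo "== split_materialized=$sw =="
        ( echo "set session optimizer_switch='split_materialized=$sw';"
          cat /tmp/mdev34043.sql ) | mariadb -vvv b_test | grep 'rows in set'
      done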

      Attachments

        1. b_test.sql.zip
          2.25 MB
        2. mdev-34043_analyze_centos7.json
          4 kB
        3. mdev-34043_analyze_rl9_ext4.json
          4 kB
        4. mdev-34043_analyze_rl9.json
          4 kB
        5. mdev-34043.tgz
          91 kB
        6. rl9_analyze_json_10.6.txt
          4 kB
        7. rl9_analyze_json_11.5.txt
          7 kB

        Activity

          martin.reinhardt@mariadb.com Martin Reinhardt added a comment - - edited

          Here are some of my findings while trying to reproduce the query performance issue:

          Rocky 8 - MariaDB 10.6.14
          Over 2 minutes, Status: Creating sort index | SELECT ... - split_materialized=on
          Over 2 minutes, Status: Creating sort index | SELECT ... - split_materialized=off

          Rocky 8 - MariaDB 10.6.16
          50 rows in set (51.734 sec) - split_materialized=on
          50 rows in set (6.746 sec) - split_materialized=off

          Rocky 8 - MariaDB 10.6.17
          Over 2 minutes, Status: Creating sort index | SELECT ... - split_materialized=on
          Over 2 minutes, Status: Creating sort index | SELECT ... - split_materialized=off

          Rocky 8 - MariaDB 10.11.7
          50 rows in set (40.106 sec) - split_materialized=on
          50 rows in set (5.911 sec) - split_materialized=off

          CentOS 8 - MariaDB 10.6.14
          50 rows in set (25.127 sec) - split_materialized=on
          50 rows in set (5.124 sec) - split_materialized=off

          CentOS 8 - MariaDB 10.6.16
          50 rows in set (25.443 sec) - split_materialized=on
          50 rows in set (6.290 sec) - split_materialized=off

          CentOS 8 - MariaDB 10.6.17
          50 rows in set (24.731 sec) - split_materialized=on
          50 rows in set (5.333 sec) - split_materialized=off

          kyle.hutchinson Kyle Hutchinson added a comment - - edited

          Here are the values I got when running the b_test.sql script on an AWS EC2 t3a.medium instance (2 vCPU, 4 GB RAM, 100 GB GP3 disk). At the bottom I've included the contents of my server.cnf file:

          10.6.16

          OS             split_materialized=ON   split_materialized=OFF
          CentOS 7       05.655 sec              5.438 sec
          Rocky 8.9      77.705 sec              5.307 sec
          Rocky 9.3      70.388 sec              6.005 sec
          AlmaLinux 8.9  76.443 sec              6.710 sec
          AlmaLinux 9.3  68.365 sec              5.000 sec

          10.11.7

          OS             split_materialized=ON   split_materialized=OFF
          CentOS 7       05.311 sec              5.562 sec
          Rocky 8.9      73.333 sec              5.541 sec
          Rocky 9.3      76.475 sec              6.140 sec
          AlmaLinux 8.9  79.478 sec              5.083 sec
          AlmaLinux 9.3  69.118 sec              5.346 sec

          11.3.2

          OS             split_materialized=ON   split_materialized=OFF
          CentOS 7       05.461 sec              5.564 sec
          Rocky 8.9      69.887 sec              4.834 sec
          Rocky 9.3      66.833 sec              5.373 sec
          AlmaLinux 8.9  75.674 sec              4.998 sec
          AlmaLinux 9.3  66.870 sec              4.981 sec

          [mariadb]
          server_id=1
          datadir=/var/lib/mysql
          default_storage_engine=InnoDB
          max_connections=1000
          bind-address=0.0.0.0
          binlog_format=row
          log_error=mdb1.err
          innodb_log_files_in_group=1
          log_bin=mdb1
          innodb_buffer_pool_size=2G
          optimizer_switch='derived_merge=off'
          

          For all OSes, when using 10.6.16, I always had to run ANALYZE TABLE b_test.TEST_MAU_USER_ACCESS_REQUESTS; because the stats weren't populated for that table. There were no issues with the other tables, and no issues at all with 10.11.7 or 11.3.2 in this respect. The optimizer_switch='derived_merge=off' setting is there to mimic the environment from the customer ticket.
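
          For reference, a sketch of the statements involved (the innodb_table_stats check is only an illustration, assuming InnoDB persistent statistics are in use):

          mariadb -e "analyze table b_test.TEST_MAU_USER_ACCESS_REQUESTS;"
          mariadb -e "select table_name, n_rows, last_update
                      from mysql.innodb_table_stats
                      where database_name = 'b_test';"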

          Gosselin Dave Gosselin added a comment - - edited

          I confirm that with 10.6.16 on either CentOS 7 or AlmaLinux 9.3 and split_materialized=off, the run times for the query in the description are very similar.
          When split_materialized=on, queries on AlmaLinux 9.3 take about 3x longer than on CentOS 7. Not quite as dramatic as what you're seeing, kyle.hutchinson, but regardless there's a measurable and repeatable difference.

          I should mention that I ran ANALYZE TABLE on each table in B_TEST (excluding the view of course) before I benchmarked the queries.
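
          For reference, a sketch of how that per-table ANALYZE can be scripted while skipping the view (the information_schema helper is an illustration, not something from the ticket):

          mariadb -N -e "select concat('analyze table \`', table_schema, '\`.\`', table_name, '\`;')
                         from information_schema.tables
                         where table_schema = 'b_test' and table_type = 'BASE TABLE';" \
            | mariadb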

          Gosselin Dave Gosselin added a comment - - edited

          I used perf to profile the mariadbd process while running the select query from the description in a tight loop. I did this both on CentOS 7 and AlmaLinux 9 (essentially Rocky Linux 9), with a release build of MariaDB built from the tagged 10.6.16 source. Interestingly, and to my surprise, there was no "smoking gun" difference between the runs (even when repeated).

          Taking a step back, I then ran the query in a tight loop while running perf top to get further insight. That showed that on AlmaLinux 9 the kernel spends much of its time in xfs_btree_lookup while the query is running. I'm running MariaDB on XFS filesystems under both CentOS 7 and AlmaLinux 9, but I don't see this function in the perf top output on CentOS 7. If I then set split_materialized=off and run the query in a loop once more, the xfs_btree_lookup function does not appear in the perf top output (instead _ma_get_pack_key and ha_key_cmp do). So it would seem there's a difference in XFS between CentOS 7 and AlmaLinux 9.

          To test this theory, I created a disk image file with an ext4 filesystem inside of it. To my surprise, running the MariaDB data directory from this file showed the same issue with xfs_btree_lookup as running from disk; but this probably makes sense, since the disk image file is itself hosted on XFS. I will try again with a separate ext4-formatted device and report back. I'm NOT suggesting that the customer use a filesystem other than XFS.
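
          For anyone repeating this, roughly the kind of perf invocation involved (a sketch; the exact options used here were not recorded):

          pid=$(pidof mariadbd)
          perf top -p "$pid"                    # live view; xfs_btree_lookup dominates on AlmaLinux 9
          perf record -g -p "$pid" -- sleep 60  # sample while the query runs in a loop
          perf report --stdio | head -50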

          Gosselin Dave Gosselin added a comment - - edited

          I'm unable to provision an ext4 volume at this time with my hardware setup. Suffice it to say, this seems to be firmly in the storage area, so I'm forwarding this to marko for further insights. marko, please see my immediately prior comment for details.


          serg Sergei Golubchik added a comment

          perf shows that the time is spent in the xfs_btree_lookup function. Rocky 8 XFS has finobt enabled, while CentOS 7 doesn't support it. It could be the origin of those xfs_btree_lookup calls.
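
          For reference, a quick way to check whether finobt is enabled on the filesystem backing the datadir (a sketch; df resolves the mount point of the data directory):

          xfs_info "$(df --output=target /var/lib/mysql | tail -1)" | grep -o 'finobt=[01]'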

          psergei Sergei Petrunia added a comment

          The query has ORDER BY ... LIMIT 50 and uses only basic aggregates, so Using filesort would operate with a priority queue; that is, it is unlikely to be a performance bottleneck.

          The Using temporary is there to do the grouping operation. It's not possible to tell whether it's a HEAP table or an on-disk Aria table; one needs to examine counters for this. At first glance, I don't see why this query would not be able to use a HEAP temp table.

          Do we have any idea how many GROUP BY groups the query generates?
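
          For reference, a sketch of how both points can be checked from the shell, re-using the query file assumed in the description (FLUSH STATUS needs the RELOAD privilege; a non-zero Created_tmp_disk_tables would mean the temporary table spilled to disk, and the DISTINCT count touches the heavy view, so it is a one-off check only):

          mariadb b_test -e "
            flush status;
            $(cat /tmp/mdev34043.sql)
            show session status like 'Created_tmp%tables';"
          mariadb b_test -e "
            select count(distinct RMD_CURRENT_STATUS_VALUE) as group_by_groups
            from TEST_MAU_REQUEST_RECORD_VIEW;"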

          marko Marko Mäkelä added a comment

          According to ANALYZE FORMAT=JSON, the bulk of the difference seems to be explained by r_other_time_ms. That metric is attached to extra_time_tracker or jbuf_extra_time_tracker; these are passed to a set_gap_tracker member function call. Because I found it hard to follow the control flow, I did the following:

          ./mtr --rr main.order_by
          rr replay var/log/mysqld.1.rr/latest-trace
          

          and then tried to see which part of the execution this counter covers. (Unfortunately, main.rowid_filter_innodb would require a debug-instrumented build.)
          I tracked it as follows:

          break Gap_time_tracker::get_time_ms
          continue
          up
          watch -l extra_time_tracker.cycles
          reverse-continue
          command 2
          backtrace
          continue
          end
          backtrace
          reverse-continue
          

          This seems to be a rather catch-all counter, incremented in various cases. Here is one example that could confirm my hypothesis:

          10.6 0076eb3d4eee5252e8373ee9ef6a55e6c93942d7

          #0  0x000055caf33719d6 in Gap_time_tracker::log_time (this=0x7f30bc014900, start=1719298146709028, end=<optimized out>) at /mariadb/10.6-MDEV-33613/sql/sql_analyze_stmt.h:145
          #1  process_gap_time_tracker (thd=thd@entry=0x7f30bc000c68, timeval=<optimized out>) at /mariadb/10.6-MDEV-33613/sql/sql_analyze_stmt.cc:117
          #2  0x000055caf2f7e6e0 in Exec_time_tracker::start_tracking (this=0x7f30bc0148e0, thd=0x7f30bc000c68) at /mariadb/10.6-MDEV-33613/sql/sql_analyze_stmt.h:102
          #3  handler::ha_rnd_next (this=this@entry=0x7f30bc117200, buf=0x7f30bc06c560 "\374\00267") at /mariadb/10.6-MDEV-33613/sql/handler.cc:3514
          #4  0x000055caf30657cb in find_all_keys (thd=0x7f30bc000c68, param=0x7f30d9c61090, select=0x7f30bc014348, fs_info=0x7f30bc11ef20, buffpek_pointers=0x7f30d9c61240, tempfile=0x7f30d9c61150, pq=0x0, found_rows=0x7f30bc11f090) at /mariadb/10.6-MDEV-33613/sql/filesort.cc:914
          #5  filesort (thd=thd@entry=0x7f30bc000c68, table=table@entry=0x7f30bc009178, filesort=filesort@entry=0x7f30bc014498, tracker=0x7f30bc014bd0, join=join@entry=0x7f30bc0120b0, first_table_bit=<optimized out>) at /mariadb/10.6-MDEV-33613/sql/filesort.cc:357
          #6  0x000055caf32623d3 in create_sort_index (thd=0x7f30bc000c68, join=0x7f30bc0120b0, tab=tab@entry=0x7f30bc013698, fsort=0x7f30bc014498, fsort@entry=0x0) at /mariadb/10.6-MDEV-33613/sql/sql_select.cc:25404
          #7  0x000055caf326270a in st_join_table::sort_table (this=this@entry=0x7f30bc013698) at /mariadb/10.6-MDEV-33613/sql/sql_select.cc:22956
          #8  0x000055caf32627d0 in join_init_read_record (tab=0x7f30bc013698) at /mariadb/10.6-MDEV-33613/sql/sql_select.cc:22895
          #9  0x000055caf324d57c in sub_select (join=0x7f30bc0120b0, join_tab=0x7f30bc013698, end_of_records=<optimized out>) at /mariadb/10.6-MDEV-33613/sql/sql_select.cc:21903
          

          I checked all occurrences of this watchpoint, and all of them were like this, with create_sort_index in the stack trace.

          Not knowing this code, I can’t claim that the filesort would be the only operation that is covered by r_other_time_ms. But, I believe that the time consumed by InnoDB is tracked separately, mostly by r_engine_stats.pages_read_time_ms (waiting for pages to be loaded into the buffer pool); you can see MDEV-31227 for some discussion related to that.

          My common sense would say that a copy-on-write file system such as XFS is not a good match for short-lived temporary files. How much would the performance be improved if a local ext4 file system was created for the temporary files? What would happen if the temporary directory was pointed to /dev/shm (RAM disk) in both environments? The latter I would of course not recommend in a production system, but it would simulate ‘extremely fast storage’ for a small test workload.

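          For reference, a sketch of the /dev/shm experiment suggested above (test systems only, as noted; the directory name is just an example):

          mkdir -p /dev/shm/mariadb-tmp
          chown mysql:mysql /dev/shm/mariadb-tmp
          # then, in server.cnf under [mariadb]:
          #   tmpdir=/dev/shm/mariadb-tmp
          systemctl restart mariadb
          mariadb -e "select @@tmpdir;"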
          monty Michael Widenius added a comment - - edited

          As far as I know, XFS has copy-on-write as an option. I do not know whether it is the default or not.

          Anyway, I doubt that filesort, which reads things mostly sequentially and does not do any syncs, could cause this big a difference. Another point is that the query uses ORDER BY ... LIMIT 50, which means that filesort should use a priority queue and not touch the disk at all.

          I talked with Petrunia and he is sure that filesort is not the problem, as it is reported separately and is not part of r_other_time_ms.

          In later MariaDB versions (I tested with a simple query in 11.5) we have even more information about the time spent in filesort:
          "filesort": {
          "sort_key": "count(0)",
          "r_loops": 1,
          "r_total_time_ms": 254.3999844,
          "r_limit": 10,
          "r_used_priority_queue": true,
          "r_output_rows": 11,
          "r_sort_mode": "sort_key,rowid",

          Any chance someone could do a test with 11.5 on CentOS and Rocky to see whether filesort is the problem or not?

          What could cause a difference is the I/O scheduler used, thread switches, and the underlying hardware.
          Do we know if the storage hardware for CentOS and Rocky is the same?

          Another issue is server configuration. Enterprise has totally different my.cnf files than the community server. To get a reliable test we should preferably use the exact same config files on both servers. For example, if the sort buffer is sized wrongly, it could easily explain this issue.
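
          For reference, a sketch of how the ANALYZE output with the filesort timing could be captured on both operating systems, re-using the query file assumed in the description (the output file name is arbitrary):

          for sw in on off; do
            mariadb --batch --raw b_test -e "
              set session optimizer_switch='split_materialized=$sw';
              analyze format=json $(cat /tmp/mdev34043.sql)" > analyze_$sw.json
            grep -A7 '"filesort"' analyze_$sw.json
          done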


          monty Michael Widenius added a comment - edited

          Things I would consider adding to ANALYZE FORMAT=JSON, from getrusage(RUSAGE_THREAD):

          ru_utime
              The total amount of time spent executing in user mode, expressed in a
              timeval structure (seconds plus microseconds).
          ru_nvcsw (since Linux 2.6)
              The number of times a context switch resulted due to a process
              voluntarily giving up the processor before its time slice was completed
              (usually to await availability of a resource).
          ru_nivcsw (since Linux 2.6)
              The number of times a context switch resulted due to a higher-priority
              process becoming runnable or because the current process exceeded its
              time slice.
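
          Until something like that is available, the kernel's per-thread counters can be read directly while the query runs (a sketch; this lists every mariadbd thread rather than just the connection's thread):

          grep -H ctxt_switches /proc/$(pidof mariadbd)/task/*/status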


          monty Michael Widenius added a comment

          One cause of the slowdown is that the ftruncate() call can be much slower on
          some systems. ftruncate() is called by Aria for internal temporary tables
          (tables created by the optimizer) when the upper level asks Aria to delete
          the previous result set. This is needed when some content from previous
          tables changes.

          I have now changed Aria so that for internal temporary tables we don't
          call ftruncate anymore.

          I also had to update the Aria repair code to use the logical datafile
          size and not the on-disk datafile size, which may contain data from a
          previous result set. The repair code is called to create indexes for
          the internal temporary table after it is filled.
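
          For reference, one way to observe the ftruncate() rate on a live server while the query from the description runs (a sketch; attach before issuing the query, stop strace with Ctrl-C afterwards, and note that strace adds overhead of its own):

          strace -f -e trace=ftruncate -c -p "$(pidof mariadbd)"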


          monty Michael Widenius added a comment

          Pushed to bb-10.5-monty for testing.


          monty Michael Widenius added a comment

          I have worked with Gosselin to test my patch.
          The patch makes split_materialized ON and OFF have the same performance (about 6 seconds) on his machine.
          The given test called maria_delete_all_rows() 678199 times, and each call called ftruncate() 3 times.
          I assume that on some OSes ftruncate() does something similar to a sync, which is causing the performance problem.
          After the patch, ftruncate() is not called at all for internal temporary tables, which fixes the problem.


          psergei Sergei Petrunia added a comment

          Ok to push.


          monty Michael Widenius added a comment

          bb-10.5-monty pushed for final testing by buildbot.


          monty Michael Widenius added a comment

          Pushed to 10.5


          People

            Assignee:
            monty Michael Widenius
            Reporter:
            martin.reinhardt@mariadb.com Martin Reinhardt
            Votes:
            1
            Watchers:
            12

            Dates

              Created:
              Updated:
              Resolved:
