MariaDB Server / MDEV-34043

Drastically slower query performance between CentOS (2sec) and Rocky (48sec)

Details

    Description

      There is a significant difference in query performance with MariaDB 10.6 running on CentOS (2 seconds) compared to Rocky (48 seconds). With split_materialized=off, this particular query runs faster on Rocky (7 seconds), but it is still slower than on CentOS.

      Query used to reproduce:

      set session optimizer_switch='split_materialized=on';
       
      use b_test;
       
      select this_.`RMD_CURRENT_STATUS_VALUE` as y0_, count(*) as y1_
      from `TEST_MAU_REQUEST_RECORD_VIEW` this_
      where (lower(this_.`UAR_ALL_EMAILS`) like '%AbeFrazer8@example.com%' escape '!'
      or lower(this_.`UAR_APPROVER_STLS`) like '%AbeFrazer8@example.com%' escape '!'
      or lower(this_.`REQUEST_BY`) like '%AbeFrazer8@example.com%' escape '!')
      group by this_.`RMD_CURRENT_STATUS_VALUE`
      order by y0_ asc
      limit 50;
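
      For reference, a quick way to compare both settings on one host from the shell (a sketch: /tmp/mdev34043.sql is just an assumed location for the SELECT above, and the mariadb client's -vvv output prints the "rows in set (N sec)" timing quoted in the comments below):

      for sw in on off; do
        echo "== split_materialized=$sw =="
        ( echo "set session optimizer_switch='split_materialized=$sw';"
          cat /tmp/mdev34043.sql ) | mariadb -vvv b_test | grep 'rows in set'
      done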

      Attachments

        1. b_test.sql.zip
          2.25 MB
        2. mdev-34043_analyze_centos7.json
          4 kB
        3. mdev-34043_analyze_rl9_ext4.json
          4 kB
        4. mdev-34043_analyze_rl9.json
          4 kB
        5. mdev-34043.tgz
          91 kB
        6. rl9_analyze_json_10.6.txt
          4 kB
        7. rl9_analyze_json_11.5.txt
          7 kB

        Activity

          martin.reinhardt@mariadb.com Martin Reinhardt added a comment - - edited

          Here are some of my findings while trying to reproduce the query performance issue:

          Rocky 8 - MariaDB 10.6.14
          Over 2 minutes, Status: Creating sort index | SELECT ... - split_materialized=on
          Over 2 minutes, Status: Creating sort index | SELECT ... - split_materialized=off

          Rocky 8 - MariaDB 10.6.16
          50 rows in set (51.734 sec) - split_materialized=on
          50 rows in set (6.746 sec) - split_materialized=off

          Rocky 8 - MariaDB 10.6.17
          Over 2 minutes, Status: Creating sort index | SELECT ... - split_materialized=on
          Over 2 minutes, Status: Creating sort index | SELECT ... - split_materialized=off

          Rocky 8 - MariaDB 10.11.7
          50 rows in set (40.106 sec) - split_materialized=on
          50 rows in set (5.911 sec) - split_materialized=off

          CentOS 8 - MariaDB 10.6.14
          50 rows in set (25.127 sec) - split_materialized=on
          50 rows in set (5.124 sec) - split_materialized=off

          CentOS 8 - MariaDB 10.6.16
          50 rows in set (25.443 sec) - split_materialized=on
          50 rows in set (6.290 sec) - split_materialized=off

          CentOS 8 - MariaDB 10.6.17
          50 rows in set (24.731 sec) - split_materialized=on
          50 rows in set (5.333 sec) - split_materialized=off

          kyle.hutchinson Kyle Hutchinson added a comment - - edited

          Here are the values I got when running the b_test.sql script on an AWS EC2 t3a.medium instance (2 vCPU, 4 GB RAM, 100 GB GP3 disk). At the bottom I've included the contents of my server.cnf file:

          10.6.16

          OS             split_materialized=ON   split_materialized=OFF
          CentOS 7       05.655 sec              5.438 sec
          Rocky 8.9      77.705 sec              5.307 sec
          Rocky 9.3      70.388 sec              6.005 sec
          AlmaLinux 8.9  76.443 sec              6.710 sec
          AlmaLinux 9.3  68.365 sec              5.000 sec

          10.11.7

          OS             split_materialized=ON   split_materialized=OFF
          CentOS 7       05.311 sec              5.562 sec
          Rocky 8.9      73.333 sec              5.541 sec
          Rocky 9.3      76.475 sec              6.140 sec
          AlmaLinux 8.9  79.478 sec              5.083 sec
          AlmaLinux 9.3  69.118 sec              5.346 sec

          11.3.2

          OS             split_materialized=ON   split_materialized=OFF
          CentOS 7       05.461 sec              5.564 sec
          Rocky 8.9      69.887 sec              4.834 sec
          Rocky 9.3      66.833 sec              5.373 sec
          AlmaLinux 8.9  75.674 sec              4.998 sec
          AlmaLinux 9.3  66.870 sec              4.981 sec

          [mariadb]
          server_id=1
          datadir=/var/lib/mysql
          default_storage_engine=InnoDB
          max_connections=1000
          bind-address=0.0.0.0
          binlog_format=row
          log_error=mdb1.err
          innodb_log_files_in_group=1
          log_bin=mdb1
          innodb_buffer_pool_size=2G
          optimizer_switch='derived_merge=off'
          

          For all OSes, when using 10.6.16, I always had to run ANALYZE TABLE b_test.TEST_MAU_USER_ACCESS_REQUESTS; because the stats weren't populated for that table. There were no issues with the other tables, and no issues at all with 10.11.7 or 11.3.2 in this respect. The optimizer_switch='derived_merge=off' setting is there to mimic the environment from the customer ticket.
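
          For reference, a sketch of the statements involved (the innodb_table_stats check is only an illustration, assuming InnoDB persistent statistics are in use):

          mariadb -e "analyze table b_test.TEST_MAU_USER_ACCESS_REQUESTS;"
          mariadb -e "select table_name, n_rows, last_update
                      from mysql.innodb_table_stats
                      where database_name = 'b_test';"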

          Gosselin Dave Gosselin added a comment - - edited

          I confirm that with 10.6.16 on either CentOS 7 or AlmaLinux 9.3 and split_materialized=off, the run times for the query in the description are very similar.
          When split_materialized=on, queries on AlmaLinux 9.3 take about 3x longer than on CentOS 7. Not quite as dramatic as what you're seeing, kyle.hutchinson, but regardless there's a measurable and repeatable difference.

          I should mention that I ran ANALYZE TABLE on each table in B_TEST (excluding the view of course) before I benchmarked the queries.
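
          For reference, a sketch of how that per-table ANALYZE can be scripted while skipping the view (the information_schema helper is an illustration, not something from the ticket):

          mariadb -N -e "select concat('analyze table \`', table_schema, '\`.\`', table_name, '\`;')
                         from information_schema.tables
                         where table_schema = 'b_test' and table_type = 'BASE TABLE';" \
            | mariadb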

          Gosselin Dave Gosselin added a comment - - edited

          I used perf to profile the mariadbd process while running the select query from the description in a tight loop. I did this both on CentOS 7 and AlmaLinux 9 (essentially Rocky Linux 9), with a release build of MariaDB built from the tagged 10.6.16 source. Interestingly, and to my surprise, there was no "smoking gun" difference between the runs (even when repeated).

          Taking a step back, I then ran the query in a tight loop while running perf top to get further insight. That showed that on AlmaLinux 9 the kernel spends much of its time in xfs_btree_lookup while the query is running. I'm running MariaDB on XFS filesystems under both CentOS 7 and AlmaLinux 9, but I don't see this function in the perf top output on CentOS 7. If I then set split_materialized=off and run the query in a loop once more, the xfs_btree_lookup function does not appear in the perf top output (instead _ma_get_pack_key and ha_key_cmp do). So it would seem there's a difference in XFS between CentOS 7 and AlmaLinux 9.

          To test this theory, I created a disk image file with an ext4 filesystem inside of it. To my surprise, running the MariaDB data directory from this file showed the same issue with xfs_btree_lookup as running from disk; but this probably makes sense, since the disk image file is itself hosted on XFS. I will try again with a separate ext4-formatted device and report back. I'm NOT suggesting that the customer use a filesystem other than XFS.
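
          For anyone repeating this, roughly the kind of perf invocation involved (a sketch; the exact options used here were not recorded):

          pid=$(pidof mariadbd)
          perf top -p "$pid"                    # live view; xfs_btree_lookup dominates on AlmaLinux 9
          perf record -g -p "$pid" -- sleep 60  # sample while the query runs in a loop
          perf report --stdio | head -50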

          Gosselin Dave Gosselin added a comment - - edited

          I'm unable to provision an ext4 volume at this time with my hardware setup. Suffice it to say, this seems to be firmly in the storage area, so I'm forwarding this to marko for further insights. marko, please see my immediately prior comment for details.


          serg Sergei Golubchik added a comment

          perf shows that the time is spent in the xfs_btree_lookup function. Rocky 8 XFS has finobt enabled, while CentOS 7 doesn't support it. It could be the origin of those xfs_btree_lookup calls.
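
          For reference, a quick way to check whether finobt is enabled on the filesystem backing the datadir (a sketch; df resolves the mount point of the data directory):

          xfs_info "$(df --output=target /var/lib/mysql | tail -1)" | grep -o 'finobt=[01]'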

          psergei Sergei Petrunia added a comment

          The query has ORDER BY ... LIMIT 50 and uses only basic aggregates, so Using filesort would operate with a priority queue; that is, it is unlikely to be a performance bottleneck.

          The Using temporary is there to do the grouping operation. It's not possible to tell whether it's a HEAP table or an on-disk Aria table; one needs to examine counters for this. At first glance, I don't see why this query would not be able to use a HEAP temp table.

          Do we have any idea how many GROUP BY groups the query generates?
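
          For reference, a sketch of how both points can be checked from the shell, re-using the query file assumed in the description (FLUSH STATUS needs the RELOAD privilege; a non-zero Created_tmp_disk_tables would mean the temporary table spilled to disk, and the DISTINCT count touches the heavy view, so it is a one-off check only):

          mariadb b_test -e "
            flush status;
            $(cat /tmp/mdev34043.sql)
            show session status like 'Created_tmp%tables';"
          mariadb b_test -e "
            select count(distinct RMD_CURRENT_STATUS_VALUE) as group_by_groups
            from TEST_MAU_REQUEST_RECORD_VIEW;"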

          marko Marko Mäkelä added a comment

          According to ANALYZE FORMAT=JSON, the bulk of the difference seems to be explained by r_other_time_ms. That metric is attached to extra_time_tracker or jbuf_extra_time_tracker; these are passed to a set_gap_tracker member function call. Because I found it hard to follow the control flow, I did the following:

          ./mtr --rr main.order_by
          rr replay var/log/mysqld.1.rr/latest-trace
          

          and then tried to see which part of the execution this counter covers. (Unfortunately, main.rowid_filter_innodb would require a debug-instrumented build.)
          I tracked it as follows:

          break Gap_time_tracker::get_time_ms
          continue
          up
          watch -l extra_time_tracker.cycles
          reverse-continue
          command 2
          backtrace
          continue
          end
          backtrace
          reverse-continue
          

          This seems to be a rather catch-all counter, incremented in various cases. Here is one example that could confirm my hypothesis:

          10.6 0076eb3d4eee5252e8373ee9ef6a55e6c93942d7

          #0  0x000055caf33719d6 in Gap_time_tracker::log_time (this=0x7f30bc014900, start=1719298146709028, end=<optimized out>) at /mariadb/10.6-MDEV-33613/sql/sql_analyze_stmt.h:145
          #1  process_gap_time_tracker (thd=thd@entry=0x7f30bc000c68, timeval=<optimized out>) at /mariadb/10.6-MDEV-33613/sql/sql_analyze_stmt.cc:117
          #2  0x000055caf2f7e6e0 in Exec_time_tracker::start_tracking (this=0x7f30bc0148e0, thd=0x7f30bc000c68) at /mariadb/10.6-MDEV-33613/sql/sql_analyze_stmt.h:102
          #3  handler::ha_rnd_next (this=this@entry=0x7f30bc117200, buf=0x7f30bc06c560 "\374\00267") at /mariadb/10.6-MDEV-33613/sql/handler.cc:3514
          #4  0x000055caf30657cb in find_all_keys (thd=0x7f30bc000c68, param=0x7f30d9c61090, select=0x7f30bc014348, fs_info=0x7f30bc11ef20, buffpek_pointers=0x7f30d9c61240, tempfile=0x7f30d9c61150, pq=0x0, found_rows=0x7f30bc11f090) at /mariadb/10.6-MDEV-33613/sql/filesort.cc:914
          #5  filesort (thd=thd@entry=0x7f30bc000c68, table=table@entry=0x7f30bc009178, filesort=filesort@entry=0x7f30bc014498, tracker=0x7f30bc014bd0, join=join@entry=0x7f30bc0120b0, first_table_bit=<optimized out>) at /mariadb/10.6-MDEV-33613/sql/filesort.cc:357
          #6  0x000055caf32623d3 in create_sort_index (thd=0x7f30bc000c68, join=0x7f30bc0120b0, tab=tab@entry=0x7f30bc013698, fsort=0x7f30bc014498, fsort@entry=0x0) at /mariadb/10.6-MDEV-33613/sql/sql_select.cc:25404
          #7  0x000055caf326270a in st_join_table::sort_table (this=this@entry=0x7f30bc013698) at /mariadb/10.6-MDEV-33613/sql/sql_select.cc:22956
          #8  0x000055caf32627d0 in join_init_read_record (tab=0x7f30bc013698) at /mariadb/10.6-MDEV-33613/sql/sql_select.cc:22895
          #9  0x000055caf324d57c in sub_select (join=0x7f30bc0120b0, join_tab=0x7f30bc013698, end_of_records=<optimized out>) at /mariadb/10.6-MDEV-33613/sql/sql_select.cc:21903
          

          I checked all occurrences of this watchpoint, and all of them were like this, with create_sort_index in the stack trace.

          Not knowing this code, I can’t claim that the filesort would be the only operation that is covered by r_other_time_ms. But, I believe that the time consumed by InnoDB is tracked separately, mostly by r_engine_stats.pages_read_time_ms (waiting for pages to be loaded into the buffer pool); you can see MDEV-31227 for some discussion related to that.

          My common sense would say that a copy-on-write file system such as XFS is not a good match for short-lived temporary files. How much would the performance be improved if a local ext4 file system was created for the temporary files? What would happen if the temporary directory was pointed to /dev/shm (RAM disk) in both environments? The latter I would of course not recommend in a production system, but it would simulate ‘extremely fast storage’ for a small test workload.

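          For reference, a sketch of the /dev/shm experiment suggested above (test systems only, as noted; the directory name is just an example):

          mkdir -p /dev/shm/mariadb-tmp
          chown mysql:mysql /dev/shm/mariadb-tmp
          # then, in server.cnf under [mariadb]:
          #   tmpdir=/dev/shm/mariadb-tmp
          systemctl restart mariadb
          mariadb -e "select @@tmpdir;"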
          monty Michael Widenius added a comment - - edited

          As far as I know, XFS has copy-on-write as an option. I do not know whether it is the default or not.

          Anyway, I doubt that filesort, which reads things mostly sequentially and does not do any syncs, could cause this big a difference. Another point is that the query uses ORDER BY ... LIMIT 50, which means that filesort should use a priority queue and not touch the disk at all.

          I talked with Petrunia and he is sure that filesort is not the problem, as it is reported separately and is not part of r_other_time_ms.

          In later MariaDB versions (I tested with a simple query in 11.5) we have even more information about the time spent in filesort:
          "filesort": {
          "sort_key": "count(0)",
          "r_loops": 1,
          "r_total_time_ms": 254.3999844,
          "r_limit": 10,
          "r_used_priority_queue": true,
          "r_output_rows": 11,
          "r_sort_mode": "sort_key,rowid",

          Any chance someone could do a test with 11.5 on CentOS and Rocky to see whether filesort is the problem or not?

          What could cause a difference is the I/O scheduler used, thread switches, and the underlying hardware.
          Do we know if the storage hardware for CentOS and Rocky is the same?

          Another issue is server configuration. Enterprise has totally different my.cnf files than the community server. To get a reliable test we should preferably use the exact same config files on both servers. For example, if the sort buffer is sized wrongly, it could easily explain this issue.
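
          For reference, a sketch of how the ANALYZE output with the filesort timing could be captured on both operating systems, re-using the query file assumed in the description (the output file name is arbitrary):

          for sw in on off; do
            mariadb --batch --raw b_test -e "
              set session optimizer_switch='split_materialized=$sw';
              analyze format=json $(cat /tmp/mdev34043.sql)" > analyze_$sw.json
            grep -A7 '"filesort"' analyze_$sw.json
          done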


          monty Michael Widenius added a comment - edited

          Things I would consider adding to ANALYZE FORMAT=JSON, from getrusage(RUSAGE_THREAD):

          ru_utime
              The total amount of time spent executing in user mode, expressed in a
              timeval structure (seconds plus microseconds).
          ru_nvcsw (since Linux 2.6)
              The number of times a context switch resulted due to a process
              voluntarily giving up the processor before its time slice was completed
              (usually to await availability of a resource).
          ru_nivcsw (since Linux 2.6)
              The number of times a context switch resulted due to a higher-priority
              process becoming runnable or because the current process exceeded its
              time slice.
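
          Until something like that is available, the kernel's per-thread counters can be read directly while the query runs (a sketch; this lists every mariadbd thread rather than just the connection's thread):

          grep -H ctxt_switches /proc/$(pidof mariadbd)/task/*/status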


          monty Michael Widenius added a comment

          One cause of the slowdown is that the ftruncate() call can be much slower on
          some systems. ftruncate() is called by Aria for internal temporary tables
          (tables created by the optimizer) when the upper level asks Aria to delete
          the previous result set. This is needed when some content from previous
          tables changes.

          I have now changed Aria so that for internal temporary tables we don't
          call ftruncate anymore.

          I also had to update the Aria repair code to use the logical datafile
          size and not the on-disk datafile size, which may contain data from a
          previous result set. The repair code is called to create indexes for
          the internal temporary table after it is filled.
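
          For reference, one way to observe the ftruncate() rate on a live server while the query from the description runs (a sketch; attach before issuing the query, stop strace with Ctrl-C afterwards, and note that strace adds overhead of its own):

          strace -f -e trace=ftruncate -c -p "$(pidof mariadbd)"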


          monty Michael Widenius added a comment

          Pushed to bb-10.5-monty for testing.


          monty Michael Widenius added a comment

          I have worked with Gosselin to test my patch.
          The patch makes split_materialized ON and OFF have the same performance (about 6 seconds) on his machine.
          The given test called maria_delete_all_rows() 678199 times, and each call called ftruncate() 3 times.
          I assume that on some OSes ftruncate() does something similar to a sync, which is causing the performance problem.
          After the patch, ftruncate() is not called at all for internal temporary tables, which fixes the problem.


          psergei Sergei Petrunia added a comment

          Ok to push.


          monty Michael Widenius added a comment

          bb-10.5-monty pushed for final testing by buildbot.


          monty Michael Widenius added a comment

          Pushed to 10.5


          People

            Assignee:
            monty Michael Widenius
            Reporter:
            martin.reinhardt@mariadb.com Martin Reinhardt
            Votes:
            1
            Watchers:
            12

            Dates

              Created:
              Updated:
              Resolved:
