MariaDB Server / MDEV-34515

Contention between secondary index UPDATE and purge due to large innodb_purge_batch_size

Details

    Description

      When I run sysbench oltp_update_index on current 10.6-11.6 using a very small amount of data (1 row in 1 table), performance regresses compared to 10.5.

      The easiest way for me to reproduce it is to run the server with --innodb-flush-log-at-trx-commit=2 or 0 and a single benchmark user. I also see the same effect with --innodb-flush-log-at-trx-commit=1 and multiple users.

      Benchmark setup

      server:

      mysqld --innodb-flush-log-at-trx-commit=2 #all defaults otherwise
      

      sysbench prepare

      sysbench oltp_update_index --table-size=1 --tables=1 --threads=1 --mysql-user=root --mysql-db=mysql --report-interval=1 --time=120  --mysql-socket=/tmp/mysql.sock  --histogram prepare
      

      sysbench oltp_update_index --table-size=1 --tables=1 --threads=1 --mysql-user=root --mysql-db=mysql --report-interval=1 --time=120  --mysql-socket=/tmp/mysql.sock  --histogram run
      

      Results 10.5 vs 10.6

      version   tps        latency ms 95%   max history length   flamegraph
      10.5      16501.81   0.07             95                   10_5_flame.svg
      10.6      592.19     5.28             70332                10_6_flame.svg

      As one can see from the attached 10_6_flame.svg, in 10.6 the purge is abnormally busy, with 85% of all CPU samples in purge. However, no matter how busy it is, purge does not achieve its goal: Innodb_history_list_length grows continuously, up to about 70K within 2 minutes, and I suspect no record is actually purged. Additionally, foreground threads run into some sort of buffer pool contention, with ssux_lock_impl<true>::wr_wait called from buf_page_get_low being the most prominent place, and ssux_lock_impl<true>::wr_wait from another call site the second most prominent.
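      (As a side note, the history list growth can be watched during the run with a small loop like the one below; this is just a sketch that assumes the same socket and root user as in the benchmark setup.)

      while true; do
        mariadb --socket=/tmp/mysql.sock -u root -B -N -e \
          "SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length'"
        sleep 1
      done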

      In contrast, as one can see from 10_5_flame.svg, in 10.5 the CPU time is spent roughly evenly between purge and foreground processing in do_command, and purge keeps up, keeping the history length tiny during the benchmark, with overall 25x better performance (although purge seems to have relatively high overhead waking and waiting for workers that do tiny tasks, but that is a topic for another day).

      What causes it, according to "git bisect"

      I ran a manual bisect, and it points to aa719b5010c MDEV-32050: Do not copy undo records in purge, from 10.6.16.
      I had to run the benchmark 2 or 3 times for this commit, from a cold start; on the first run it started slow but inexplicably showed better performance at about the middle of the benchmark, while on the second run it was slow all the way.
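      For reference, an equivalent git bisect session would look roughly like this (the placeholders stand for the known good and first bad revisions):

      git bisect start <first-bad-commit> <last-good-commit>
      # build mariadbd, run the sysbench commands above, check the tps, then mark:
      git bisect good    # or: git bisect bad
      # repeat until git bisect names the first bad commit, then:
      git bisect reset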

      So, the results for that commit vs. the previous commit 88733282fb15c80f0bd722df0041d06ad90c26b0 are below.

      commit                tps        latency ms 95%   history length max   flamegraph
      aa719b5010c (bad)     592.96     5.37             69069                aa719b5010c9_flame2.svg
      88733282fb15 (good)   19003.11   0.07             217                  88733282fb15_flame.svg

      In fact, the "good" commit before regression was 20% better than 10.5, and purging uses only 7% CPU while managing to keep history length low.

      The "bad" flamegraph is pretty much the same as current 10.6, purge using up most of the CPU.

      Can it be reproduced differently?

      I could also reproduce it by setting innodb_flush_log_at_trx_commit=0.
      I can also reproduce it by increasing concurrency (--threads=10 in sysbench) and leaving innodb_flush_log_at_trx_commit at its default (1).
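      For example, the higher-concurrency variant is just the run command from above with --threads=10 against a server started with all defaults:

      mysqld   #all defaults, including innodb-flush-log-at-trx-commit=1
      sysbench oltp_update_index --table-size=1 --tables=1 --threads=10 --mysql-user=root --mysql-db=mysql --report-interval=1 --time=120  --mysql-socket=/tmp/mysql.sock  --histogram run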

      The bad effect seems to vanish once more rows are updated, though perhaps I simply did not find a way to reproduce it in that case.

      Attachments

        Issue Links

          Activity

            A quick look at 10_6_flame.svg matches an observation that I made during the development of MDEV-32050: depending on the PAGE_MAX_TRX_ID, some operations on secondary index records may be extremely slow until MDEV-17598 would introduce a per-record transaction identifier. This was known already back in 2014.

            That said, there may be a way to alleviate this without changing any file format, by making purge or DML smarter. For example, if we are going to modify a secondary index leaf page, what if we tried to remove all garbage from the page? Okay, some garbage records in the page may point to PRIMARY KEY values that reside in multiple different pages, and it would be even more expensive to visit all those keys and traverse their history. But, perhaps we could prune other records in the secondary index page that point to the same PRIMARY KEY value as the record that we are accessing. In that way, the garbage left by several UPDATE…SET indexed_column=indexed_column+1 WHERE pk=1 might be pruned more efficiently. Provided that the MVCC rules allow it, all entries (indexed_column,pk) of (1,1),(2,1),(3,1),…,(n-1,1) could be removed in a single mini-transaction.

            marko Marko Mäkelä added a comment

            The benchmark setting with just 1 table and 1 row in the PRIMARY KEY index is an extreme case for the InnoDB secondary index MVCC. The PRIMARY KEY root/leaf page would contain just one record, the top of the stack of versions, in a singly-linked list of undo log records by the DB_ROLL_PTR column. In the secondary indexes, we would have N records, one for each version that has not been purged yet. My above idea of making some operations smarter would address exactly this scenario, but it could lead to a regression in practical settings where we might have a large number of small records in the secondary index page, and none of them are for the same PRIMARY KEY value.

            The good commit (the parent of the bad commit) removed a serious bottleneck, dict_sys and MDL contention between purge threads. The bad commit was expected to improve things further by removing unnecessary memory heap allocation and letting all purge threads access the undo log records directly in the buffer pool. In the flame graph aa719b5010c9_flame2.svg I notice that buf_page_get_gen() for undo log pages is taking a lot of time in row_vers_old_has_index_entry(). I might try to implement a purge-specific version of that function that would look up the page in purge_sys.pages (which luckily is constant during this part of execution) before attempting to look up in the buffer pool directly. I think that there is a good chance that many pages that are involved in the history check are already cached in purge_sys.pages. We would not want to extend purge_sys.pages further during the execution; that would require additional contention-prone latches and involve an obvious risk that we would run out of buffer pool.

            I do not have an exact idea why the bad commit is bad. A hypothesis might be that the malloc() and memcpy() act as some kind of a useful throttle, to reduce the chances of a situation where multiple threads compete for the same buf_page_t::lock or buf_page_t::buf_fix_count concurrently.

            marko Marko Mäkelä added a comment

            https://github.com/MariaDB/server/pull/3448 is the start of implementing this. There seems to be some potential to remove further memory heap allocation, as well as operations on purge_sys.latch or purge_sys.end_latch.

            marko Marko Mäkelä added a comment

            With my tentative fix, the throughput is significantly better in my tests. I think that this is thanks to having far fewer operations on purge_sys.latch as well as buf_page_t::lock and the io/buffer-fix word. That is, if the purge worker tasks find the undo page in purge_sys.pages, they will just use a pointer to it as is. Otherwise, they will buffer-fix a page in the buffer pool. Only if the page is read-fixed will we temporarily acquire and release a shared buf_page_t::lock. Lastly, we will never copy the undo log record from the buffer pool page, but just refer to it directly. All this should reduce cache pollution.

            For me, running the test on Linux /dev/shm, the throughput would significantly improve some 35 to 55 seconds into the 120-second benchmark and then remain rather constant until the end. This improvement occurred twice with the baseline 10.6 branch, and so far reliably every time I tested the tentative fix. The initial throughput would be between 700 and 1500 tps, and the improved throughput would be above 10000 or 11000 tps. I must run more tests to find out what exactly happens during the slow phase.

            marko Marko Mäkelä added a comment

            The slow throughput appears to occur while purge is active. I took two samples with

            sudo perf record -p $(pgrep mariadbd) -- sleep 10
            

            while the server was running, during the initial slow throughput and during the fast throughput.

            The slow one:

            Samples: 57K of event 'cycles:P', Event count (approx.): 27906590288
            Overhead  Command          Shared Object        Symbol
               8,80%  mariadbd         mariadbd             [.] trx_undo_prev_version_build(unsigned char const*, dict_index_t*, unsigned short*, mem_block_info_t*, unsigned char**, mem_block_info_t*, dtuple_t**, unsigned long)
               6,14%  mariadbd         libc.so.6            [.] _int_malloc
               6,07%  mariadbd         mariadbd             [.] row_build_low(unsigned long, dict_index_t const*, unsigned char const*, unsigned short const*, dict_table_t const*, dtuple_t const*, dict_add_v_col_t const*, unsigned long const*, row_ext_t**, mem_block_info_t*)
               3,12%  mariadbd         mariadbd             [.] trx_undo_update_rec_get_update(unsigned char const*, dict_index_t*, unsigned long, unsigned long, unsigned long, unsigned char, mem_block_info_t*, upd_t**)
               2,81%  mariadbd         mariadbd             [.] ssux_lock_impl<true>::rd_wait()
               2,33%  mariadbd         mariadbd             [.] void rec_init_offsets_comp_ordinary<false, false>(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, dict_col_t::def_t const*, rec_leaf_format)
               2,24%  mariadbd         mariadbd             [.] row_build_index_entry_low(dtuple_t const*, row_ext_t const*, dict_index_t const*, mem_block_info_t*, unsigned long)
               2,12%  mariadbd         mariadbd             [.] dict_table_copy_types(dtuple_t*, dict_table_t const*)
               1,87%  mariadbd         mariadbd             [.] ssux_lock_impl<true>::wr_wait(unsigned int)
               1,74%  mariadbd         mariadbd             [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, unsigned long, mem_block_info_t**)
            

            The fast one:

            Samples: 49K of event 'cycles:P', Event count (approx.): 24998893905
            Overhead  Command          Shared Object        Symbol
               2,65%  mariadbd         mariadbd             [.] buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool, bool*)
               2,04%  mariadbd         libc.so.6            [.] _int_malloc
               1,91%  mariadbd         libc.so.6            [.] __GI___pthread_mutex_unlock_usercnt
               1,66%  mariadbd         mariadbd             [.] trx_undo_prev_version_build(unsigned char const*, dict_index_t*, unsigned short*, mem_block_info_t*, unsigned char**, mem_block_info_t*, dtuple_t**, unsigned long)
               1,65%  mariadbd         mariadbd             [.] row_build_low(unsigned long, dict_index_t const*, unsigned char const*, unsigned short const*, dict_table_t const*, dtuple_t const*, dict_add_v_col_t const*, unsigned long const*, row_ext_t**, mem_block_info_t*)
               1,49%  mariadbd         mariadbd             [.] ssux_lock_impl<true>::rd_wait()
            

            This gave me the idea to try innodb_purge_threads=1. The throughput would be significantly lower and not recover until the 100-second mark! I went on with the other extreme, innodb_purge_threads=32, and it seems to be even worse, not recovering at all. Out of the 91,593 transactions that were executed, 83,515 were to be purged on shutdown, after the 120-second test run.
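            A sketch of how such a run can be started (reusing the server options from the Description):

            mysqld --innodb-flush-log-at-trx-commit=2 --innodb-purge-threads=1   # or --innodb-purge-threads=32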

            marko Marko Mäkelä added a comment

            I got surprisingly good flame graphs this time and implemented a tentative fix of MDEV-34520 in an attempt to improve it.

            sudo offcputime-bpfcc --stack-storage-size=1048576 -df -p $(pgrep -nx mariadbd) 10 > out.stacks
            flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
            

            I guess that Debian Sid moved ahead and compiled all libraries with -fno-omit-frame-pointer, or I am just lucky with respect to https://github.com/iovisor/bcc/issues/1234, but there were no "unknown" stack frames.

            I think that innodb_purge_threads=1 should actually result in acceptable purge performance. It is still a mystery to me why or when exactly the throughput would switch from low to high. When the throughput is low, more than 70% of the CPU reported by perf record would be spent in a purge thread and less than 15% in the single SQL user thread. When the throughput is high, these numbers will be reversed.

            marko Marko Mäkelä added a comment
            wlad Vladislav Vaintroub added a comment (edited)

            It makes sense to compare 10.6 to 10.5.
            In 10.5, I get 20,000 tps on that benchmark. In 10.6, ~500 tps, or even 700 tps when the sleep is removed. It is 30-40x slower, and there is no trick by which I can make it faster.


            It also makes sense to vary --table-size. It is bad until 5-6 rows, then it starts to improve, and with 8 rows it is much, much better. This may give an idea of what is happening in the "bad" phase.
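            A possible sweep over the table size, reusing the sysbench parameters from the Description (the shorter run time is just an assumption to make the comparison quicker):

            for n in 1 2 4 6 8; do
              for cmd in prepare run cleanup; do
                sysbench oltp_update_index --table-size=$n --tables=1 --threads=1 \
                  --mysql-user=root --mysql-db=mysql --report-interval=1 --time=30 \
                  --mysql-socket=/tmp/mysql.sock $cmd
              done
            done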

            wlad Vladislav Vaintroub added a comment

            According to offcputime, the main conflicts are on B-tree page latches, mostly in btr_cur_t::search_leaf(). Today, so far the throughput is not switching from the slow to the fast mode. I assume that when it would switch, it would be the purge task that would be blocked on the page latches, instead of the thread that runs SQL statements for the client connection.

            I removed two sources of waits in background threads by disabling the body of the function row_purge_reset_trx_id() and by setting innodb_stats_auto_recalc=OFF. Neither of these changes brought any measurable benefit.

            I also tried lowering the priority of the background threads, but that does not have any impact on the throughput. This is the entire patch that I tried on top of my current development branch:

            diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
            index fd306f46b73..65ddd6fa2fe 100644
            --- a/storage/innobase/row/row0purge.cc
            +++ b/storage/innobase/row/row0purge.cc
            @@ -1072,6 +1072,7 @@ whose old history can no longer be observed.
             @param[in,out]	mtr	mini-transaction (will be started and committed) */
             static void row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
             {
            +#if 0
             	/* Reset DB_TRX_ID, DB_ROLL_PTR for old records. */
             	mtr->start();
             
            @@ -1136,6 +1137,7 @@ static void row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
             	}
             
             	mtr->commit();
            +#endif
             }
             
             /***********************************************************//**
            diff --git a/tpool/tpool_generic.cc b/tpool/tpool_generic.cc
            index 9ed832dd1ed..729c232f888 100644
            --- a/tpool/tpool_generic.cc
            +++ b/tpool/tpool_generic.cc
            @@ -34,6 +34,7 @@ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111 - 1301 USA*/
             #include <thr_timer.h>
             #include <stdlib.h>
             #include "aligned.h"
            +#include <sys/resource.h>
             
             namespace tpool
             {
            @@ -577,6 +578,7 @@ void thread_pool_generic::worker_main(worker_data *thread_var)
             
               tls_worker_data = thread_var;
               m_thread_creation_pending.clear();
            +  setpriority(PRIO_PROCESS, gettid(), 20);
             
               while (get_task(thread_var, &task) && task)
               {
            

            marko Marko Mäkelä added a comment

            I see that the rather expensive check row_purge_poss_sec() may be executed twice: once in the optimistic (leaf-page-only) operation row_purge_remove_sec_if_poss_leaf() and again in the pessimistic operation row_purge_remove_sec_if_poss_tree(). While we are executing this check, we are holding the index leaf page latch, acquiring the clustered index leaf page latch (always the root-leaf page in this case) and then traversing the history of undo log pages to see if the secondary index record can be safely purged. Had MDEV-17598 been implemented, we could check the safety by simply looking at the transaction identifier in the secondary index record and the time-consuming part of looking up the clustered index record and traversing its history could be avoided.

            A theoretical possibility might be to try to upgrade the index leaf-page latch into an exclusive index latch and further to exclusive page latches. But, I am rather sure that such an attempt would consistently fail in this workload. We’d have to use a non-blocking latch acquisition because this would obviously violate the lock order rules that were set in WL#6326.

            Yet another possibility would be to skip the row_purge_remove_sec_if_poss_leaf() step and go straight for the index tree latch. This was easily tested, and indeed it helps the throughput. But I think that we should apply some condition here, to only skip the operation when it is expected to be futile.

            diff a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
            --- a/storage/innobase/row/row0purge.cc
            +++ b/storage/innobase/row/row0purge.cc
            @@ -999,10 +999,6 @@ row_purge_remove_sec_if_poss(
             		return;
             	}
             
            -	if (row_purge_remove_sec_if_poss_leaf(node, index, entry)) {
            -
            -		return;
            -	}
             retry:
             	success = row_purge_remove_sec_if_poss_tree(node, index, entry);
             	/* The delete operation may fail if we have little
            

            After applying this hack, offcputime does not report any waits in a purge task. There would be a large futex_wait with an unknown call stack, which I would assume to be for a dict_index_t::lock of a secondary index tree, but I do not see any evidence of that in the perf record data.

            I think that it should be possible to reuse the result of the check from row_purge_remove_sec_if_poss_leaf() in the case that the PAGE_MAX_TRX_ID is not changing, that is, no DML operation was modifying the secondary index leaf page while we were not holding a page latch. This is what I am going to attempt next.

            marko Marko Mäkelä added a comment

            Needless to say, innodb_force_recovery=2 (which would disable background tasks such as the purge of committed history) would yield rather consistent throughput (16.5ktps on my system). Is that faster than 10.5 on wlad’s system? On my system yesterday when the "fast mode" could kick in, the throughput was at best around 11ktps while the purge was enabled.
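            A sketch of that configuration (innodb_force_recovery=2 merely disables background tasks such as purge; the other option is from the Description):

            mysqld --innodb-flush-log-at-trx-commit=2 --innodb-force-recovery=2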

            marko Marko Mäkelä added a comment

            I implemented a few more performance tweaks:

            • Reuse the same mtr_t object for accessing the secondary index record, the clustered index record, and the undo log records.
            • Cache the purgeability result of row_purge_remove_sec_if_poss_leaf(). This makes row_purge_remove_sec_if_poss_tree() almost disappear from the profile.

            An important contention point seems to be the clustered index root/leaf page latch. (When the number of rows is small, there will be only one clustered index page.) One thing that could be done to alleviate this would be to copy the record and release the clustered index page latch before invoking the time-consuming row_purge_poss_sec() check. In that way, while this costly check is being executed, only a latch on one of the many secondary index pages would be held.

            A further improvement in row_purge_remove_sec_if_poss_leaf() might be to initially acquire a shared latch on the secondary index page and to upgrade it to exclusive right before we determine that the record can be purged. Before MDEV-24142, such upgrades were prone to deadlock and therefore prohibited. With the refactored latches, an upgrade S→U→X should be deadlock-free. Upgrading U→X was always possible.

            marko Marko Mäkelä added a comment

            I attempted to release the clustered index page latch in the refactored function row_purge_is_unsafe() right after constructing the second newest version based on an undo log record. Unfortunately, this would lead into an infinite loop in that function; it would never return to row_purge_poss_sec(). I must debug it further with a shorter workload in order to understand it.

            While the purge thread was stuck in the infinite loop, the SQL throughput would consistently remain great (16ktps). This seems to confirm that the contention point is the clustered index leaf page latch.

            I did not try to implement the secondary index page latch relaxation in row_purge_remove_sec_if_poss_leaf() yet. It would bring very little in this workload, because we’d have a large number of secondary index leaf pages pointing to the single clustered index page and therefore contention on a secondary index page latch should be unlikely.

            marko Marko Mäkelä added a comment

            If I release the clustered index page latch early in the purgeability check and run the benchmark for 2 seconds, the shutdown will complete:

            2024-08-08  8:56:23 0 [Note] /dev/shm/10.6/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2024-08-08  8:56:23 0 [Note] InnoDB: FTS optimize thread exiting.
            2024-08-08  8:56:23 0 [Note] InnoDB: to purge 32556 transactions
            2024-08-08  8:56:40 0 [Note] InnoDB: to purge 32556 transactions
            2024-08-08  8:56:40 0 [Note] InnoDB: Starting shutdown...
            

            After a workload of 1 second, it would be even faster:

            2024-08-08  8:57:10 0 [Note] InnoDB: to purge 13743 transactions
            2024-08-08  8:57:16 0 [Note] InnoDB: Starting shutdown...
            

            After 3 seconds of workload, it takes longer:

            2024-08-08  8:59:10 0 [Note] InnoDB: to purge 46161 transactions
            2024-08-08  8:59:47 0 [Note] InnoDB: to purge 46161 transactions
            2024-08-08  8:59:47 0 [Note] InnoDB: Starting shutdown...
            

            benchmark time/s   transactions to purge   slow shutdown time/s   purge speed/tps
            1                  13,743                  6                      2290.5
            2                  32,556                  17                     1915.1
            3                  46,161                  37                     1247.6
            4                  62,951                  47                     1339.4
            5                  78,570                  25                     3142.8
            6                  95,735                  103                    929.46

            Yes, it is very extraordinary that the cleanup (purge of committed transaction history) takes so much longer than the actual workload. When there are only a few rows in a table, one should not define any secondary indexes. Because there is only 1 row in 1 table and only 1 clustered index root page, the purging can only be executed by 1 thread, and the checks are expensive until MDEV-17598 has been implemented.

            We can see that the purge speed can be a tiny fraction of the original workload speed (around 16,000 tps) and that it may vary a lot. The above numbers are from a single run. Yesterday I thought that the innodb_fast_shutdown=0 shutdown after a 5-second benchmark was stuck in an infinite loop. Today I am testing a similar (not exactly the same) patch to get the above numbers. It looks like there might have been no issue with releasing the clustered index page latch while traversing older history; maybe I just did not wait long enough yesterday. We may implement something similar for ordinary MVCC and implicit locking checks.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment (edited)

            A couple more data points, after 12 seconds of benchmark load:

            2024-08-08 11:43:27 0 [Note] InnoDB: to purge 196482 transactions
            2024-08-08 11:43:56 0 [Note] InnoDB: to purge 196482 transactions
            2024-08-08 11:43:57 0 [Note] InnoDB: Starting shutdown...
            

            Impressively fast, 196482/30s = 6549.4 tps. The progress messages are intended to be output every 15 seconds. Apparently a single purge batch was taking more than that.

            After a 60-second workload, the shutdown with innodb_fast_shutdown=0 is taking extremely long again, which would seem to match what I observed yesterday after a 5-second workload:

            2024-08-08 11:57:20 0 [Note] InnoDB: to purge 951966 transactions
            

            After more than 20 minutes, I attached a debugger and invoked trx_sys.history_size_approx(). It would return the same 951966. So, we do seem to have an infinite loop problem somehow. I killed it, restarted and re-executed slow shutdown:

            2024-08-08 12:22:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=41484,41484
            2024-08-08 12:22:48 0 [Note] InnoDB: To recover: 18416 pages
            2024-08-08 12:22:50 0 [Note] InnoDB: FTS optimize thread exiting.
            2024-08-08 12:22:50 0 [Note] InnoDB: to purge 737706 transactions
            2024-08-08 12:22:54 0 [Note] InnoDB: Starting shutdown...
            2024-08-08 12:22:54 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/sbtest/ib_buffer_pool
            2024-08-08 12:22:54 0 [Note] InnoDB: Buffer pool(s) dump completed at 240808 12:22:54
            2024-08-08 12:22:54 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
            2024-08-08 12:22:54 0 [Note] InnoDB: Shutdown completed; log sequence number 213758006; transaction id 1903951
            2024-08-08 12:22:54 0 [Note] 10.6/sql/mariadbd: Shutdown complete
            

            Note: the checkpoint had not been advanced since the bootstrap, because only 213 MiB of log had been written and the log file size was 5 GiB. This workload appears to involve only 18,416 pages. The hang remains a mystery for now. It is also a mystery why it only took 4 seconds to purge the remaining 737,706 transactions after the crash recovery. The table sbtest1 had not been dropped, and the size of the sbtest1.ibd file is 20 MiB. Had the table been dropped, it would be clear that the purge would complete fast.


            The purge hang seems to recover when the server is killed and restarted, and another shutdown with innodb_fast_shutdown=0 is executed. I have not been able to reproduce it under rr record. The hang can occur both with release and debug builds. No assertions are failing in the debug build.

            I am yet to run the workload against a 10.5 server and to analyze what happens there and how the purge and user threads are being kept in balance.

            marko Marko Mäkelä added a comment

            I ran the workload against 10.5. The average throughput is 14.9 ktps, and the purge lag seems minimal, because the size of the file sbtest1.ibd is only 96 KiB at the end of the test. offcputime shows that the majority of waits are for some futex, but the call stacks are obscured. We would want to identify the mutex or latch that is responsible for the balance between the purge and the transaction workload. To get useful stack traces, we must compile everything with -fno-omit-frame-pointer. My first attempt was with a novel approach that would enable static linking:

            cmake -DCMAKE_{EXE,MODULE,SHARED}_LINKER_FLAGS:STRING=-static -DWITH_SSL=bundled ..
            

            This would fail due to something that would seem to be addressed by implementing MDEV-22083. It affects not only the client programs; sql/mariadbd would also fail to build. So, it looks like I will have to build libc.so and libstdc++.so with -fno-omit-frame-pointer, like I did in MDEV-32050.

            marko Marko Mäkelä added a comment

            It turns out that in order to get good stack traces, I only had to compile libc with -fno-omit-frame-pointer, using the directions that I had written in MDEV-32050:

            Added to glibc-2.39/debian/sysdeps/amd64.mk the line

            extra_cflags = -fno-omit-frame-pointer
            

            and built with dpkg-buildpackage, copied the libc.so.6* somewhere from the build directory while the tests were running

            For the benchmark, I simply pointed the environment variable LD_LIBRARY_PATH to the directory that contains the special built library.
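            That is, roughly like this, pointing at a directory that contains the rebuilt libc.so.6 (the directory name is only an example, matching the one used later in this thread):

            LD_LIBRARY_PATH=~/lib-framepointer mysqld --innodb-flush-log-at-trx-commit=2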

            It looks like there might be an issue with the clang-specific libc++. I did not bother to compile it myself yet.

            The GCC libstdc++ produced a fairly nice flame graph 10.5-libstdc++.svg, with only 9% of an unknown futex wait source on the very left. In perf report, I see that the SQL thread that executes mysql_update spends 68% of the cycles of one CPU, and the purge_coordinator_callback in another thread consumes 84%. Apparently there are no waits for buffer page latches at all, or they are part of the mystery 9%. I will dig deeper, maybe recompile also one of the C++ libraries.

            I also realized that my attempted optimization to release the clustered index leaf page latch during some MVCC-like checks may be simply incorrect and this might explain the observed infinite loops in the purge. The scenario would be that an indexed column is being updated back and forth between a few values. If we release the clustered index page latch, another thread could meanwhile update the record to match the secondary index record that we are interested in. This is only a hypothesis that I did not validate yet.

            marko Marko Mäkelä added a comment

            In the GCC 14 debian/changelog there is the following entry:

            • Build target libraries with -fno-omit-frame-pointer, when enabled in dpkg-buildflags.

            Encouraged by this, I started the following:

            sudo apt-get build-dep libstdc++-14-dev
            apt source libstdc++-14-dev
            cd gcc-14-14.2.0
            DEB_CFLAGS_APPEND=-fno-omit-frame-pointer DEB_CXXFLAGS_APPEND=-fno-omit-frame-pointer dpkg-buildpackage
            

            I see that some targets would use these flags, while others wouldn’t. I assume that this will build a large number of GCC related packages, while I would only need libstdc++.

            marko Marko Mäkelä added a comment

            I executed

            mkdir ~/lib-framepointer
            cp -a gcc-14-14.2.0/build/prev-x86_64-linux-gnu/libstdc++-v3/src/.libs/libstdc++.so ~/lib-framepointer
            

            while the dpkg-buildpackage was still running. With LD_LIBRARY_PATH=~/lib-framepointer (also containing the self-built libc.so), all stacks for the waits in the 10.5 experiment will be fully resolved, and there are no waits inside InnoDB. The reported waits were as follows:

            • tpool::thread_pool_generic::wait_for_tasks (the thread pool waiting to invoke a task callback function)
            • vio_io_wait (a connection handler thread waiting for input from a client connection)
            • timer_handler waiting in pthread_cond_timedwait

            The “bad” commit of MDEV-32050 removes some memory allocation and copying. It does not directly affect any synchronization, other than by removing some malloc() calls, which would also be used by SQL execution every now and then. Other changes in MDEV-32050 significantly reduce the amount of table lookups and MDL acquisition, which in this 10.5 benchmark would seem to use about a third of the CPU time of the purge_coordinator_callback. While offcputime on 10.5 does not indicate any significant waiting in the operating system, it could be that there are waits due to cache contention. On 10.5, perf record -e cache-misses shows some cache misses on undo log and B-tree pages, as expected.
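            (The cache-miss sampling mirrors the earlier perf invocation and would look roughly like this:)

            sudo perf record -e cache-misses -p $(pgrep -nx mariadbd) -- sleep 10
            sudo perf report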

            I think that I must next compare a 10.6 based branch with and without that “bad” commit.

            marko Marko Mäkelä added a comment

            wlad Vladislav Vaintroub added a comment -

            For the commit immediately prior to the "bad" one, performance was excellent, better than 10.5 by a large margin, IIRC.


            marko Marko Mäkelä added a comment -

            I conducted some analysis with https://rr-project.org. While offcputime would indicate a buffer page latch wait in the UPDATE execution, I got a single wait for a secondary index tree latch. The conflicting latch would be held by the purge thread, which was growing the size of the .ibd file inside row_purge_remove_sec_if_poss_tree(). This operation seems to be futile, because we were trying to shrink the secondary index from 3 pages to 1. Why would it have to allocate more space? The reason why this operation was attempted at all was because the default MERGE_THRESHOLD=50 caused btr_cur_can_delete_without_compress() to return false. I experimented with a change to the global default:

            diff --git a/storage/innobase/include/dict0mem.h b/storage/innobase/include/dict0mem.h
            index 751d7f1a9e4..beb002f86e9 100644
            --- a/storage/innobase/include/dict0mem.h
            +++ b/storage/innobase/include/dict0mem.h
            @@ -998,11 +998,11 @@ struct dict_index_t {
             	from storage (DISCARD TABLESPACE or similar),
             	or 1 if the index is in table->freed_indexes */
             	unsigned	page:32;
            -	unsigned	merge_threshold:6;
            +	unsigned	merge_threshold:7;
             				/*!< In the pessimistic delete, if the page
             				data size drops below this limit in percent,
             				merging it to a neighbor is tried */
            -# define DICT_INDEX_MERGE_THRESHOLD_DEFAULT 50
            +# define DICT_INDEX_MERGE_THRESHOLD_DEFAULT 100
             	unsigned	type:DICT_IT_BITS;
             				/*!< index type (DICT_CLUSTERED, DICT_UNIQUE,
             				DICT_IBUF, DICT_CORRUPT) */
            

            With this, the SQL throughput would more than double to 1.6ktps, which is about 10% of what it should be. Still, a small fraction of the waits (0.43%) would be in row_purge_remove_sec_if_poss_tree(). However, I think that we should leave this as is and concentrate on the .ibd file extension during record deletion.

            As far as I understand, in btr_cur_pessimistic_delete() the call to fsp_reserve_free_extents() is only needed in the rare case that the key of the tree is variable length. When a previous child page pointer in a non-leaf page is replaced with a longer record, the page could overflow and a page split may be necessary. We do not even make any effort to shrink these records to a minimum key length that would distinguish the child pages. So, the index tree may grow during btr_cur_pessimistic_delete(), and at the end it may shrink again due to btr_compress().
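
            As an aside to the MERGE_THRESHOLD experiment above: the threshold does not have to be patched in the source to be changed for a given table. It can be set through a table or index comment, which innobase_parse_hint_from_comment() parses. A minimal sketch, assuming a throwaway table t and index k_idx (supported values are 1 to 50):

            CREATE TABLE t (
              id INT PRIMARY KEY,
              k INT,
              KEY k_idx (k) COMMENT 'MERGE_THRESHOLD=10'
            ) ENGINE=InnoDB;

            -- the same hint, applied to all indexes of the table:
            ALTER TABLE t COMMENT 'MERGE_THRESHOLD=10';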


            marko Marko Mäkelä added a comment -

            I made some mistakes. The 1.6ktps was under rr record, which runs the trace in one thread at a time and in this case doubles the throughput. The merge_threshold would actually become 36 and not 100 if I did not adjust the -Wconversion silencing bitmasks in innobase_parse_hint_from_comment(). But, we’d actually want to set MERGE_THRESHOLD=0 to prevent the merging from being attempted. After I fixed that and temporarily disabled the file extension:

            diff --git a/storage/innobase/btr/btr0cur.cc b/storage/innobase/btr/btr0cur.cc
            index cfbc6532c41..1bece9ea5de 100644
            --- a/storage/innobase/btr/btr0cur.cc
            +++ b/storage/innobase/btr/btr0cur.cc
            @@ -4768,6 +4768,7 @@ btr_cur_pessimistic_delete(
             	ut_ad(!index->is_dummy);
             	ut_ad(block->page.id().space() == index->table->space->id);
             
            +#if 0
             	if (!has_reserved_extents) {
             		/* First reserve enough free space for the file segments
             		of the index tree, so that the node pointer updates will
            @@ -4783,6 +4784,7 @@ btr_cur_pessimistic_delete(
             			return(FALSE);
             		}
             	}
            +#endif
             
             	heap = mem_heap_create(1024);
             	rec = btr_cur_get_rec(cursor);
            

            the throughput did not improve much. I produced an rr replay trace where the UPDATE statement is waiting for an exclusive latch on a clustered index page, not a secondary index page. The conflicting latch would have been acquired by row_purge_poss_sec(). After several waits for a clustered index page latch, I observed a wait for an exclusive secondary index page latch, which would be held by the purge thread while it is searching for a record in the page:

            #6  row_search_index_entry (entry=entry@entry=0x7f9e0c00a238, mode=<optimized out>, pcur=pcur@entry=0x7f9e2afff020, mtr=mtr@entry=0x7f9e2afff270) at /mariadb/10.6/storage/innobase/row/row0row.cc:1281
            #7  0x00005561455630fe in row_purge_remove_sec_if_poss_leaf (node=node@entry=0x55616f817008, index=index@entry=0x7f9e0404ba28, entry=entry@entry=0x7f9e0c00a238) at /mariadb/10.6/storage/innobase/row/row0purge.cc:879
            

            There are no real surprises here. I ran 10.5 once more with innodb_sync_spin_loops=0 so that I would better catch similar conflicts. There were none according to offcputime. But that is not the full story. With rr record I can observe several invocations of rw_lock_x_lock_wait_func() for an undo log page, in trx_undo_report_row_operation() or at transaction commit. At this time, the purge coordinator would be doing nothing!

            10.5 8b51d34462370b96af420587fdc07934abcea8ba

            #0  0x00007f685474639b in sched_yield () at ./src/preload/overrides.c:247
            #1  0x0000563cef2b69c1 in trx_purge_get_next_rec (n_pages_handled=n_pages_handled@entry=0x7f6827dffc58, heap=heap@entry=0x563d0c9df100) at /mariadb/10.5/storage/innobase/trx/trx0purge.cc:1007
            #2  0x0000563cef2b6ced in trx_purge_fetch_next_rec (roll_ptr=<synthetic pointer>, n_pages_handled=0x7f6827dffc58, heap=0x563d0c9df100) at /mariadb/10.5/storage/innobase/trx/trx0purge.cc:1074
            #3  trx_purge_attach_undo_recs (n_purge_threads=1) at /mariadb/10.5/storage/innobase/trx/trx0purge.cc:1147
            #4  trx_purge (n_tasks=<optimized out>) at /mariadb/10.5/storage/innobase/trx/trx0purge.cc:1267
            #5  0x0000563cef2a318c in srv_do_purge (n_total_purged=<synthetic pointer>) at /mariadb/10.5/storage/innobase/srv/srv0srv.cc:1840
            #6  purge_coordinator_callback_low () at /mariadb/10.5/storage/innobase/srv/srv0srv.cc:1945
            #7  purge_coordinator_callback () at /mariadb/10.5/storage/innobase/srv/srv0srv.cc:1974
            

            Such throttling was removed in this commit of MDEV-32050, which also reduced the number of operations on trx_rseg_t::latch. That could have contributed to a better working balance between DML and purge. I tried to augment the buffer-fixing of the undo log pages in 10.6 with shared latches in order to cause higher-level conflicts with concurrent SQL threads, but that would seem to lead to deadlocks with trx_rseg_t::latch.

            It is not trivial to revert the “bad” commit for testing purposes, because it would conflict with some subsequent changes. I think that I must compare exactly the “bad” commit and its predecessor to find a conclusive explanation for the regression.


            marko Marko Mäkelä added a comment -

            According to the offcputime flame graphs 10.6-good-88733282fb15c80f0bd722df0041d06ad90c26b0.svg and 10.6-bad-aa719b5010c929132b4460b78113fbd07497d9c8.svg there are buffer page waits for both. The “good” commit delivered 9114 tps during the 30-second workload (including 10 seconds of sampling), while the “bad” one was 640 tps. In these flame graphs, we can see that my innodb_buffer_pool_size=50m was too small for this workload; there was page I/O going on. I made another run with innodb_buffer_pool_size=150m. The throughput got a little better for both. The “good” version still does quite a bit of page I/O, while the “bad” one does not but is waiting for buffer page latches, as you can see in 10.6-good-88733282fb15c80f0bd722df0041d06ad90c26b0-large.svg and 10.6-bad-aa719b5010c929132b4460b78113fbd07497d9c8-large.svg.

            It appears that the page I/O acted as some kind of useful throttle between the purge and the workload. I thought that it would be a no-brainer to ensure that a single purge batch can process all undo log records that were already parsed. It seems that we should experiment with an even smaller innodb_purge_batch_size.
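
            A smaller batch size can be tried without rebuilding the server; a minimal sketch of an option file entry, using 30 purely as an example value:

            [mariadb]
            innodb_purge_batch_size = 30

            The same value can also be passed on the command line as --innodb-purge-batch-size=30.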


            marko Marko Mäkelä added a comment -

            It looks like I may finally have nailed it down. If I set a much lower innodb_purge_batch_size, the throughput comes back up (15ktps to 17ktps) and there is no excessive purge lag. For innodb_purge_batch_size=30 and an ample buffer pool size, I observed some page I/O and some (I would not say excessive) purge lag after a 120-second run:

            refs/pull/3448/merge d67a8cdf3ad2d9c559b43f0f3c4ce84d541de9e2

            2024-08-09 16:14:43 0 [Note] InnoDB: to purge 2379 transactions
            2024-08-09 16:14:43 0 [Note] InnoDB: Starting shutdown...
            2024-08-09 16:14:43 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/sbtest/ib_buffer_pool
            2024-08-09 16:14:43 0 [Note] InnoDB: Buffer pool(s) dump completed at 240809 16:14:43
            2024-08-09 16:14:43 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
            2024-08-09 16:14:43 0 [Note] InnoDB: Shutdown completed; log sequence number 426839858; transaction id 3792719
            2024-08-09 16:14:43 0 [Note] /dev/shm/10.6/sql/mariadbd: Shutdown complete
            

            The corresponding offcputime flame graph (10 seconds sampled during the 120-second workload) is 10.6-innodb-purge-batch-size-30.svg. The innodb_log_file_size=5368709120 (5GiB) is an order of magnitude larger than the final LSN, so there should have been no checkpoint flushing. I didn’t check what the asynchronous I/O (for InnoDB data pages) could be for. I had innodb_buffer_pool_size=1512M, which is some 10× the data volume.

            With innodb_purge_batch_size=10 I got slightly better throughput (peaking above 17ktps), but I did not monitor the purge lag. The sbtest1.ibd file was slightly larger at the end. The 10-second offcputime sample is in 10.6-innodb-purge-batch-size-10.svg.

            I think that we should test the improvements that I made so far, and try to find a smaller default value of innodb_purge_batch_size if applicable. The improvements that I came up with might make an even smaller value than the old default innodb_purge_batch_size=300 work with many workloads. If not, for a special workload like this one, you could change it with SET GLOBAL without restarting the server.
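
            A minimal sketch of such a runtime change (127 here is just the value being discussed; any supported value works the same way):

            SET GLOBAL innodb_purge_batch_size = 127;
            SELECT @@GLOBAL.innodb_purge_batch_size;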


            wlad Vladislav Vaintroub added a comment -

            It is a bit unfortunate that this situation is possible at all - you can run with the default, or set some innodb_purge_batch_size, and then no purge is made. Can this be monitored? Can this be automatically adjusted if no purge progress is made?
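
            Purge progress can at least be observed from SQL by polling the history list length; a rough sketch (it does not answer the automatic-adjustment question):

            SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length';
            -- the same figure also appears in the TRANSACTIONS section of:
            SHOW ENGINE INNODB STATUS\G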


            marko Marko Mäkelä added a comment -

            On my system, in the build as of today, innodb_purge_batch_size=128 was the largest for which I got good throughput for this benchmark. With innodb_purge_batch_size=129 the throughput dropped to less than 10%. This was with innodb_purge_threads=1 and innodb_stats_auto_recalc=off.

            I think that we should lower the default value, and compensate for that by optimizing the purge subsystem a little further. For example, wlad pointed out that we are unnecessarily spawning purge worker tasks and waiting for them. We can’t employ more purge tasks than the number of tables in the purge batch. During my analysis, I worked around this by explicitly setting innodb_purge_threads=1, which is specific to this single-table workload.
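
            Those workaround settings would look roughly like this in an option file (a sketch; shown as startup settings, since innodb_purge_threads may not be changeable at runtime in all versions):

            [mariadb]
            innodb_purge_threads = 1
            innodb_stats_auto_recalc = OFF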


            marko Marko Mäkelä added a comment -

            It might not be a coincidence that there are 128 rollback segments (MDEV-12289), or 127 if innodb_undo_tablespaces>1.

            Thinking aloud: It could make sense to implement an additional limit that will refuse to purge multiple transactions from the same rollback segment when an active transaction has been registered in the rollback segment. With large transactions, a purge batch could comprise up to innodb_purge_batch_size undo log pages, and with small transactions we would limit it to as few as 128 pages.
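
            To relate a configured batch size to the rollback segment layout on a particular server, the relevant settings can be inspected directly; a small sketch:

            SHOW GLOBAL VARIABLES LIKE 'innodb_undo_tablespaces';
            SHOW GLOBAL VARIABLES LIKE 'innodb_purge_batch_size';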

            wlad Vladislav Vaintroub added a comment - - edited

            There is a strange relation between `row_purge_is_unsafe` execution speed and innodb_purge_batch_size. When innodb_purge_batch_size is 128 or smaller, the function is very fast; with 129 it is very slow. In this test, the function always returns `false`.
            Below are the execution stats from some added instrumentation on my machine, from a 60-second benchmark run:

            innodb_purge_batch_size Number of calls Total execution time (sec) single call avg time (microseconds)
            128 1990921 4.586788 2.30
            129 189422 52.163513 275.38

            That is, this function and its descendants are ~140 times slower.

            The tps in this test is 33120.84 (batch_size=128) vs 3156.60 (batch_size=129), a similarly large difference of roughly one order of magnitude.


            marko Marko Mäkelä added a comment -

            I confirmed that for this workload with only 1 table, 1 row, and a secondary index being updated, the maximum good value of innodb_purge_batch_size is equal to the number of available rollback segments for transactions. With multiple innodb_undo_tablespaces the good maximum would be innodb_purge_batch_size=127.

            If I change the workload from oltp_update_index to oltp_update_non_index, then the impact of a “too large” purge batch size is noticeable but much less dramatic. It’s similar when I run an oltp_update_index workload with 8 tables of 1 row each.

            I experimented with a patch that would limit the number of undo pages traversed per table in a purge batch:

            diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
            index cad3319864c..9e4d5a8d684 100644
            --- a/storage/innobase/trx/trx0purge.cc
            +++ b/storage/innobase/trx/trx0purge.cc
            @@ -1311,6 +1311,21 @@ trx_purge_attach_undo_recs(ulint n_purge_threads, THD *thd)
             		} else if (table_node->tables[table_id].first) {
             enqueue:
             			table_node->undo_recs.push(purge_rec);
            +			const byte type = purge_rec.undo_rec[2];
            +			if ((type & TRX_UNDO_UPD_EXTERN)
            +			    || ((type & (TRX_UNDO_CMPL_INFO_MULT - 1))
            +				!= TRX_UNDO_INSERT_REC
            +				&& (~type & ((UPD_NODE_NO_ORD_CHANGE
            +					      | UPD_NODE_NO_SIZE_CHANGE)
            +					     * TRX_UNDO_CMPL_INFO_MULT))
            +				&& table_node->undo_recs.size()
            +				>= 127 + !srv_undo_tablespaces
            +				&& srv_shutdown_state == SRV_SHUTDOWN_NONE)) {
            +				/* Limit the number of complex operations that
            +				affect secondary index records in a table. */
            +sql_print_information("purge limit (%zu pages)", purge_sys.n_pages_handled());
            +				break;
            +			}
             		}
             
             		if (purge_sys.n_pages_handled() >= max_pages) {
            

            The diagnostic output could be displayed 20 to 140 times per second, depending on what I tested. Unfortunately, this limit also kicks in during an oltp_update_non_index workload, which I had intended to avoid. We could refine this and parse more of the undo log record, to limit exactly the number of row_purge_poss_sec() calls per table and batch. But that could lead to a performance regression.

            I think that we should consider lowering the default to innodb_purge_batch_size=127 if that would avoid any significant performance regression in a wider selection of workloads. Due to the already implemented and partly planned improvements in this ticket, the purge should become faster, which should compensate for a smaller default batch size.


            marko Marko Mäkelä added a comment -

            https://github.com/MariaDB/server/pull/3448 now reduces the default to innodb_purge_batch_size=127. It also includes an improvement from wlad that avoids the unnecessary creation of worker tasks when there is not enough work for all of them. Without that improvement, I observed a throughput of 12ktps when using the maximum innodb_purge_threads=32 during the single-table workload. That improvement would yield 16ktps, on par with innodb_purge_threads=1.

            mdcallag Mark Callaghan added a comment - - edited

            From a medium server (Google Cloud c2d high-compute with 16 cores, 64G RAM, Ubuntu 22.04) and cached sysbench I see some good results. My usage of sysbench is explained here.

            Tests are run with 12 clients.

            I compared two builds with 10.6.19, using the names from the gists linked below

            • ma1006_bbperfpkgtest -> the bb-10.6-performance-pkgtest branch
            • ma1006_mdev34759 -> the latest from remotes/origin/10.6MDEV-34759

            I name things as Version.Config where Version is one of:

            • ma100619_rel_withdbg - MariaDB 10.6.19 using -DBUILD_TYPE=RelWithDebInfo
            • ma1006_bbperfpkgtest - explained above
            • ma1006_mdev34759 - explained above

            And config is one of:

            • z11a_c24r64 - has innodb_change_buffering=none
            • z11a2_c24r64 - has innodb_change_buffering=all

            Below I focus on the results for ma1006_bbperfpkgtest compared to 10.6.19

            With 8 tables, 10M rows/table

            • innodb_change_buffering=none - update-index is ~1.2X faster, most write-heavy tests are 1% to 6% faster (see here)
            • innodb_change_buffering=all - update-index is ~1.2X faster, most write-heavy tests are 1% to 4% faster (see here)

            With 1 table, 80M rows

            • innodb_change_buffering=none - update-index is ~2.2X faster, most write-heavy tests are 2% to 5% faster (see here)
            • innodb_change_buffering=all - update-index is ~2.2X faster, most write-heavy tests are 3% to 6% faster (see here)

            mdcallag Mark Callaghan added a comment -

            Results from my small server are good and similar to the results above from the medium server.

            The small server has 8 cores and 32G RAM. I have results for:

            • 1 thread, 1 table - change buffering disabled and enabled
            • 6 threads, 6 tables (thread per table) - change buffering disabled and enabled
            • 6 threads, 1 table - change buffering disabled and enabled

            For the "1 thread, 1 table" and "6 threads, 6 tables" tests the best result is for update-index, which is ~1.2X faster.
            For "6 threads, 1 table" the best result is also for update-index, which is ~1.4X faster; I assume there is more benefit because there is more contention.


            marko Marko Mäkelä added a comment -

            I came up with a possible explanation why limiting innodb_purge_batch_size to 127+(innodb_undo_tablespaces<2) might work. In each purge batch, there will be a call to trx_sys.history_size(), which will for a short time acquire all trx_sys.rseg_array latches in shared mode. (Before MDEV-25062, there was a global atomic counter trx_sys.rseg_history_len, and a single lock xadd instruction to update it would use some 2% of all CPU time in a benchmark.)

            Holding all rollback segment latches would block any transaction from being started or committed. This frequent “stop the world” by the purge subsystem could act as a useful throttle for the workload in the Description, where each transaction is small, updating the single row of the single table. Thanks to this pausing of the workload threads, the purge coordinator could achieve a lot during its time slice, before the SQL threads get a chance to conflict on the index tree pages. Thanks to the lower-level changes, the purge tasks will perform less “busy work”, such as repeated buf_pool.page_hash lookups for the same pages, or redundant “is this purgeable” checks.


            monty Michael Widenius added a comment -

            After this change, will users that use innodb_purge_batch_size=1000 face any new performance issues?


            marko Marko Mäkelä added a comment -

            monty, depending on the type of workload, configuring a larger innodb_purge_batch_size may really hurt things. With the workload given in the Description, wlad and I observed a difference of one to two orders of magnitude between using a value that corresponds to the number of rollback segments and a value that is 1 larger. This is an extreme workload, and nobody should really be defining any secondary indexes on a tiny table. With more general workloads, the impact is smaller but it does exist.


            mdcallag Mark Callaghan added a comment -

            These results are from a medium server (Google Cloud c2d high-compute with 16 cores, 64G RAM, Ubuntu 22.04) and cached sysbench. My usage of sysbench is explained here.

            Tests are run with 12 clients and the --secondary option for sysbench so that "id" has a secondary index instead of a PK index.

            I compared two builds with 10.6.19, using the names from the gists linked below

            • ma1006_bbperfpkgtest -> the bb-10.6-performance-pkgtest branch
            • ma1006_mdev34759 -> the latest from remotes/origin/10.6MDEV-34759, version_source_revision is this

            I name things as Version.Config where Version is one of:

            • ma100619_rel_withdbg - MariaDB 10.6.19 using -DBUILD_TYPE=RelWithDebInfo
            • ma1006_bbperfpkgtest - explained above
            • ma1006_mdev34759 - explained above

            And config is one of:

            • z11a_c24r64 - has innodb_change_buffering=none
            • z11a2_c24r64 - has innodb_change_buffering=all

            For all cases, the "hot-point" benchmark is much faster with both ma1006_bbperfpkgtest and ma1006_mdev34759 when compared to unchanged 10.6.19. The improvement is larger in the 1 table case where both are almost 6X faster than 10.6.19. One reason for the improvement is a ~30X reduction in the context switch rate (see here)

            Below I focus on the results for ma1006_mdev34759 compared to 10.6.19 which is "col 1" in the results:

            With 8 tables, 10M rows/table

            • innodb_change_buffering=none - results are similar to 10.6.19 (see here)
            • innodb_change_buffering=all - results are similar to 10.6.19 (see here)

            With 1 table, 80M rows

            • innodb_change_buffering=none - results are similar to 10.6.19 (see here) except for update-index which is 2.19X faster and write-only which is 1.75X faster in ma1006_bbperfpkgtest (see here)
            • innodb_change_buffering=all - results are similar to 10.6.19 (see here) except for update-index which is 2.24X faster and write-only which is 1.80X faster in ma1006_bbperfpkgtest (see here)

            marko Marko Mäkelä added a comment -

            This blog post about Huawei GaussDB (a MySQL fork) mentions a new parameter innodb_rds_purge_subbatch_size=150. I think that MySQL 5.7 onwards would use 95 or 96 persistent rollback segments; MDEV-12289 in MariaDB prevented that regression. I assume that they have other contention points elsewhere so that the index page latch contention between purge and DML did not become a problem. Or maybe they did not test with few tiny tables.

            The idea of a “purge subbatch” would seem to be incompatible with purge_sys_t::close_and_reopen() as well as the latch-free purge_sys.pages lookup in purge_sys_t::view_guard::get().

            ralf.gebhardt Ralf Gebhardt added a comment -

            This fix includes a change of the default for innodb_purge_batch_size from 1000 to 127
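
            The value that is actually in effect on a given build can be checked with a one-liner (a generic example):

            SELECT @@GLOBAL.innodb_purge_batch_size;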


            People

              marko Marko Mäkelä
              wlad Vladislav Vaintroub
