Details

    Description

      Since MariaDB 10.2.2, InnoDB never holds any mutexes or RW-locks across handler API calls. (Until that version, btr_search_latch for the adaptive hash index could be held, and there was a special call handlerton::release_temporary_latches.)

      During UPDATE operations, and also possibly during reads that perform range scans, it could help a lot to reuse the same InnoDB mini-transaction and to protect the current page with the page latch (buf_block_t::lock) across calls:

      1. Introduce row_prebuilt_t::mtr and keep it open.
      2. Avoid mtr_t::commit() between row reads.
      3. Avoid storing & restoring the btr_pcur_t position.
      4. If there is any possibility of a delay (such as waiting for a row read from another table, or waiting for client connection I/O), then btr_pcur_store_position(); mtr.commit(); will have to be called before the wait, and mtr.start(); btr_pcur_restore_position(); after it, as sketched below.
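
      A minimal sketch of the intended pattern, using simplified stand-in types instead of the real InnoDB declarations (mtr_t, btr_pcur_t and their member functions are only modeled here; the actual signatures differ):

      // Illustrative stand-ins; not the real InnoDB types or signatures.
      struct mtr_t { void start() {} void commit() {} };
      struct btr_pcur_t {
        void store_position(mtr_t*) {}              // models btr_pcur_store_position()
        void restore_position(mtr_t*) {}            // models btr_pcur_restore_position()
        bool move_to_next(mtr_t*) { return false; } // models advancing the cursor
      };
      struct row_prebuilt_t { mtr_t mtr; btr_pcur_t pcur; }; // proposal: keep mtr open here

      // Read rows under one mini-transaction; pause it only around a possible wait.
      void scan(row_prebuilt_t& p, bool (*may_wait)(), void (*do_wait)())
      {
        p.mtr.start();                        // one mtr for the whole batch of reads
        do {
          // ... copy the current row to the caller ...
          if (may_wait()) {
            p.pcur.store_position(&p.mtr);    // step 4: save the cursor position
            p.mtr.commit();                   // release page latches before waiting
            do_wait();
            p.mtr.start();
            p.pcur.restore_position(&p.mtr);  // resume where we left off
          }
        } while (p.pcur.move_to_next(&p.mtr));
        p.mtr.commit();
      }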

      This change could remove any benefit of the row_prebuilt_t::fetch_cache (after 4 consecutive row reads, it would start prefetching 8 rows at a time). Removing this cache would greatly reduce the InnoDB memory usage for partitioned tables.
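
      For reference, the heuristic of that cache can be modeled as follows (a simplified sketch; the constant names and the real logic in row0sel.cc differ):

      #include <cstddef>

      // Simplified model of the row prefetch cache heuristic; not the real code.
      constexpr size_t FETCH_CACHE_THRESHOLD = 4; // consecutive fetches before prefetching
      constexpr size_t FETCH_CACHE_SIZE = 8;      // rows prefetched per batch

      struct fetch_state {
        size_t n_rows_fetched = 0; // consecutive single-row fetches so far
        size_t n_cached = 0;       // rows currently buffered (fetch_cache in row_prebuilt_t)
      };

      // After FETCH_CACHE_THRESHOLD consecutive fetches, the next fetch fills the
      // cache with up to FETCH_CACHE_SIZE rows instead of returning a single row.
      bool should_prefetch(const fetch_state& s)
      {
        return s.n_cached == 0 && s.n_rows_fetched >= FETCH_CACHE_THRESHOLD;
      }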

      Mini-transactions for single-row UPDATE/DELETE

      1. Search and S-latch the PRIMARY KEY leaf page (acquire an explicit transactional lock)
      2. X-latch the PRIMARY KEY leaf page, update the transaction directory page (the rollback segment header page), allocate & initialize the first undo log page of the transaction
      3. Write undo log record
      4. Modify the PRIMARY KEY index
      5. (For each off-page column, use 1 mini-transaction per page written.)
      6. (For each secondary index, modify the index.)
      7. Commit the user transaction

      That is 1 read-only mini-transaction and 4 read-write mini-transactions for a single-row user transaction! (With MariaDB 10.3.5, only 3 read-write mini-transactions, because the first two writes were merged.)

      We can actually use a single mini-transaction for all of this. Multiple mini-transactions will only be needed if there are secondary indexes or off-page columns:

      1. Search and X-latch the PRIMARY KEY leaf page, update the transaction directory page, allocate & initialize the first undo log page, write the undo log record, modify the PRIMARY KEY index (with implicit transactional locking)
      2. (For each off-page column, use 1 mini-transaction per page written.)
      3. (For each secondary index, modify the index.)
      4. Commit the user transaction

      If there are no off-page columns or secondary indexes, the user transaction commit can be merged to the same mini-transaction. (This is a special case for a single-row user transaction.)
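
      Sketched as code, the fully merged single-row case (no off-page columns, no secondary indexes) would be a single mini-transaction; the helper names below are placeholders, not the real InnoDB functions:

      struct mtr_t { void start() {} void commit() {} };

      // Placeholders for work done by btr_cur_*, trx_undo_* and trx_* functions.
      void search_and_x_latch_pk_leaf(mtr_t*) {} // implicit transactional locking
      void write_undo_log_record(mtr_t*) {}      // includes undo page allocation on first use
      void modify_pk_record(mtr_t*) {}
      void write_user_trx_commit(mtr_t*) {}      // only in this special single-row case

      // Single-row UPDATE/DELETE with one read-write mini-transaction.
      void single_row_update_merged()
      {
        mtr_t mtr;
        mtr.start();
        search_and_x_latch_pk_leaf(&mtr);
        write_undo_log_record(&mtr);
        modify_pk_record(&mtr);
        write_user_trx_commit(&mtr);
        mtr.commit();                     // one mini-transaction instead of four
      }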

      The merging of the 'read' and 'write' steps under a single page lock would implement implicit locking for UPDATE and DELETE. When there is no locking conflict, this should greatly reduce the contention on lock_sys.mutex.

      Using fewer mini-transactions for writes also means less communication with the redo log buffer, which should reduce contention in log_sys.mutex or whatever MDEV-14425 will be replacing it with.

      Note: For any record modifications, we must always commit and restart the mini-transaction between rows, because we cannot move to another B-tree page after acquiring an undo page lock. Reads can reuse the same mini-transaction.

          Activity

            I created a branch bb-10.5-MDEV-16232 to let a user confirm whether their observation of increased memory usage is caused by the row prefetch cache. The cache would be removed as part of this task.

            marko Marko Mäkelä added a comment

            (a note so that this question is not forgotten):
            Why does this patch:

            commit eb63f8b9a56fe673e5bf8add20f6b457c7032e69
            Author: Thirunarayanan Balathandayuthapani <thiru@mariadb.com>
            Date:   Thu May 26 16:26:23 2022 +0530
             
                MDEV-16232      Use fewer mini-transactions
                
                - For update operation, InnoDB uses separate mtr for the following:
                ...
            

            add implementation of start_operations_batch() and end_operations_batch() for every engine? Won't it be easier if there were a default implementation, handler::start_operations_batch(), which did nothing?
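
            For illustration, such a default might look like this (a sketch only; the actual signatures in the patch may differ):

            /* Hypothetical default in class handler (sql/handler.h); a sketch, not the
               patch's actual declarations. */
            class handler
            {
            public:
              virtual ~handler() = default;
              /* Engines that batch mini-transactions (InnoDB) override these;
                 every other engine silently inherits the empty default. */
              virtual void start_operations_batch() {}
              virtual void end_operations_batch() {}
            };

            class ha_innobase : public handler
            {
            public:
              void start_operations_batch() override { /* open row_prebuilt_t::mtr */ }
              void end_operations_batch() override   { /* commit the batch mtr */ }
            };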

            psergei Sergei Petrunia added a comment

            Would terminology like opening or closing a page cursor be acceptable?

            I have the impression that system versioning is mostly performing its special handling of UPDATE and DELETE inside InnoDB. I do not foresee a problem with that.

            For a read that is executed as part of INSERT…SELECT, CREATE…SELECT or even ALTER TABLE…ALGORITHM=COPY (in copy_data_between_tables()), it would be good if the SQL layer fetched multiple rows in a batch, before ending the batch and inserting the rows.
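
            For example, the copy could be shaped roughly like this (illustrative only; the handler methods below are stand-ins and the batch hooks are the ones discussed in this task):

            #include <cstddef>
            #include <string>
            #include <vector>

            // Stand-ins for the handler interface; names and behavior are illustrative.
            struct handler_stub {
              bool fetch_next(std::string&) { return false; } // models ha_rnd_next()
              void write_row(const std::string&) {}           // models ha_write_row()
              void start_operations_batch() {}                // proposed batch hooks
              void end_operations_batch() {}
            };

            // Fetch a batch of rows while the source's mini-transaction stays open,
            // then insert them only after the batch (and its page latches) has ended.
            void copy_batch(handler_stub& from, handler_stub& to, size_t batch_size)
            {
              std::vector<std::string> rows;
              std::string row;
              from.start_operations_batch();
              while (rows.size() < batch_size && from.fetch_next(row))
                rows.push_back(row);
              from.end_operations_batch();
              for (const auto& r : rows)
                to.write_row(r);
            }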

            An infinite loop in INSERT INTO t1 SELECT FROM t1 might be avoided by the fact that it is using a locking read. (Related to that, see also MDEV-24813.) Perhaps locking reads will ignore records that were inserted by the current transaction. This optimization should not affect that.

            marko Marko Mäkelä added a comment

            --source include/have_innodb.inc
            create table t1 (v3 int primary key, v2 text(100) unique not null) engine=innodb;
            insert into t1 values ( -32768 , -128) , (-1 , 44);
            update t1 set v2 = 1;
            drop table t1;
            

            The above test case fails in the 10.11-MDEV-16232 branch.

            In update_stmt(), InnoDB starts a batch mini-transaction and holds the locks on the clustered
            index record and the secondary index records. Since this table has a long unique blob
            (table->s->long_unique_table), the current handler is cloned, which leads
            to a different prebuilt.

            (rr) where
            #0  0x0000563635d95fc5 in ha_innobase::open (this=0x7f66240163a0, name=0x7f66240919a8 "test/t1") at ../storage/innobase/handler/ha_innodb.cc:5959
            #1  0x000056363595c6ca in handler::ha_open (this=0x7f66240163a0, table_arg=0x7f66241e0238, name=0x7f66240919a8 "test/t1", mode=33, test_if_locked=2, mem_root=0x7f6624006d78, partitions_to_open=0x0) at ../sql/handler.cc:3349
            #2  0x000056363595bcb1 in handler::clone (this=0x7f6624238ef0, name=0x7f66240919a8 "test/t1", mem_root=0x7f6624006d78) at ../sql/handler.cc:3209
            #3  0x0000563635d96ce2 in ha_innobase::clone (this=0x7f6624238ef0, name=0x7f66241dfff0 "./test/t1", mem_root=0x7f6624006d78) at ../storage/innobase/handler/ha_innodb.cc:6237
            #4  0x000056363595bd4d in handler::create_lookup_handler (this=0x7f6624238ef0) at ../sql/handler.cc:3232
            #5  0x000056363596af02 in handler::prepare_for_insert (this=0x7f6624238ef0, do_create=true) at ../sql/handler.cc:7543
            #6  0x00005636356dc24a in mysql_update (thd=0x7f6624000db8, table_list=0x7f6624015580, fields=@0x7f6624005cc8: {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x7f6624015e50, last = 0x7f6624015e50, elements = 1}, <No data fields>}, values=@0x7f6624006100: {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x7f6624015e60, last = 0x7f6624015e60, elements = 1}, <No data fields>}, conds=0x0, order_num=0, order=0x0, limit=18446744073709551615, ignore=false, found_return=0x7f6684058de0, updated_return=0x7f6684058ed0) at ../sql/sql_update.cc:1033
            #7  0x00005636355833b1 in mysql_execute_command (thd=0x7f6624000db8, is_called_from_prepared_stmt=false) at ../sql/sql_parse.cc:4410
            #8  0x000056363558f60a in mysql_parse (thd=0x7f6624000db8, rawbuf=0x7f66240154b0 "update t1 set v2 = 1", length=20, parser_state=0x7f6684059390) at ../sql/sql_parse.cc:8000
            #9  0x000056363557ba9d in dispatch_command (command=COM_QUERY, thd=0x7f6624000db8, packet=0x7f662400ba39 "update t1 set v2 = 1", packet_length=20, blocking=true) at ../sql/sql_parse.cc:1894
            #10 0x000056363557a45c in do_command (thd=0x7f6624000db8, blocking=true) at ../sql/sql_parse.cc:1407
            #11 0x000056363575b9b5 in do_handle_one_connection (connect=0x5636390913d8, put_in_cache=true) at ../sql/sql_connect.cc:1416
            #12 0x000056363575b716 in handle_one_connection (arg=0x563639175b18) at ../sql/sql_connect.cc:1318
            #13 0x0000563635cb72b5 in pfs_spawn_thread (arg=0x56363913d9e8) at ../storage/perfschema/pfs.cc:2201
            #14 0x00007f669bceb609 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #15 0x00007f669b8bc163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            After creating the new record, we look up the unique index with the cloned prebuilt:

            (gdb) where
            #0  row_search_mvcc (buf=0x555556c08bee <_my_thread_var+24> "]\303\363\017\036\372UH\211\345H\203\354\020H\211}\370H\215\005i9u\001\213", mode=32767, prebuilt=0x1000056c08d2a, match_mode=140736959523632, direction=0) at ../storage/innobase/row/row0sel.cc:4347
            #1  0x00005555567667a2 in ha_innobase::index_read (this=0x7fffb0016348, buf=0x7fffb02292e8 "", key_ptr=0x7fffe07ab63f "\364\001", key_len=8, find_flag=HA_READ_KEY_EXACT) at ../storage/innobase/handler/ha_innodb.cc:9004
            #2  0x000055555624da34 in handler::index_read_map (this=0x7fffb0016348, buf=0x7fffb02292e8 "", key=0x7fffe07ab63f "\364\001", keypart_map=18446744073709551615, find_flag=HA_READ_KEY_EXACT) at ../sql/handler.h:3861
            #3  0x000055555632b23a in handler::ha_index_read_map (this=0x7fffb0016348, buf=0x7fffb02292e8 "", key=0x7fffe07ab63f "\364\001", keypart_map=18446744073709551615, find_flag=HA_READ_KEY_EXACT) at ../sql/handler.cc:3488
            #4  0x000055555633731d in handler::check_duplicate_long_entry_key (this=0x7fffb023a560, new_rec=0x7fffb02292e8 "", key_no=1) at ../sql/handler.cc:7248
            #5  0x0000555556337c8a in handler::check_duplicate_long_entries_update (this=0x7fffb023a560, new_rec=0x7fffb02292e8 "") at ../sql/handler.cc:7381
            #6  0x0000555556339033 in handler::ha_update_row (this=0x7fffb023a560, old_data=0x7fffb0229300 "", new_data=0x7fffb02292e8 "") at ../sql/handler.cc:7664
            #7  0x00005555560b100a in mysql_update (thd=0x7fffb0000db8, table_list=0x7fffb0015560, fields=@0x7fffb0005ca8: {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x7fffb0015e20, last = 0x7fffb0015e20, elements = 1}, <No data fields>}, values=@0x7fffb0006100: {<base_list> = {<Sql_alloc> = {<No data fields>}, first = 0x7fffb0015e30, last = 0x7fffb0015e30, elements = 1}, <No data fields>}, conds=0x0, order_num=0, order=0x0, limit=18446744073709551615, ignore=false, found_return=0x7fffe07abe30, updated_return=0x7fffe07abf20) at ../sql/sql_update.cc:1092
            #8  0x0000555555f5ae92 in mysql_execute_command (thd=0x7fffb0000db8, is_called_from_prepared_stmt=false) at ../sql/sql_parse.cc:4411
            #9  0x0000555555f67119 in mysql_parse (thd=0x7fffb0000db8, rawbuf=0x7fffb0015490 "update t1 set v2 = 1", length=20, parser_state=0x7fffe07ac3a0) at ../sql/sql_parse.cc:8019
            #10 0x0000555555f5359a in dispatch_command (command=COM_QUERY, thd=0x7fffb0000db8, packet=0x7fffb000b9f9 "update t1 set v2 = 1", packet_length=20, blocking=true) at ../sql/sql_parse.cc:1894
            #11 0x0000555555f51f59 in do_command (thd=0x7fffb0000db8, blocking=true) at ../sql/sql_parse.cc:1407
            #12 0x000055555612ef32 in do_handle_one_connection (connect=0x555558b4c218, put_in_cache=true) at ../sql/sql_connect.cc:1416
            #13 0x000055555612ec93 in handle_one_connection (arg=0x555558be79f8) at ../sql/sql_connect.cc:1318
            #14 0x0000555556681cef in pfs_spawn_thread (arg=0x555558b4bd88) at ../storage/perfschema/pfs.cc:2201
            #15 0x00007ffff7f3e609 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #16 0x00007ffff7b0f163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            So the update creates one mini-transaction (prebuilt), while check_duplicate_long_entry_key() works on a different prebuilt. This leads to a block lock assertion failure.

            thiru Thirunarayanan Balathandayuthapani added a comment

            I conducted a small performance test, with Sysbench oltp_read_only and oltp_update_non_index, 4×10000 rows, 96 threads, 30 seconds on /dev/shm, on a dual socket system with 40 threads (2×10 without hyperthreading).

            revision                               oltp_read_only/qps  oltp_update_non_index/tps
            10.11 (baseline)                       324540.53           212804.57
            Thiru’s prototype                      324430.26           211264.99
            removing row_prebuilt_t::fetch_cache   297972.85           211938.88

            To my surprise, removing the row_prebuilt_t::fetch_cache, which was one of my motivations for this task, is reducing performance. We will need to find out why.

            Before running any further UPDATE or DELETE benchmarks, I think that thiru needs to revise some conditions around the sel_set_rec_lock() calls in row_search_mvcc(), so that DELETE and UPDATE will rely more on implicit locking, like INSERT does. For secondary indexes, some gap locks will be necessary due to reasons explained in MDEV-16406.

            marko Marko Mäkelä added a comment - edited
