MariaDB Server
MDEV-24621

In bulk insert, pre-sort and build indexes one page at a time

Details

    Description

      In MDEV-515, the original intention was to do two things to speed up INSERT into an empty table or partition:

      1. Stop writing undo log records for each inserted record, and instead write just one record saying that the table was empty.
      2. When the table was empty, pre-sort the records for each index, and build the indexes one page at a time, instead of inserting one unsorted record at a time as is done currently.

      The undo logging change turned out to be rather challenging on its own, because it affects MVCC and locking as well. Preventing the useless undo logging should speed up ROLLBACK and recovery, as well as the purge of history, compensating for MDEV-12288. Hence, it makes sense to complete MDEV-515 in this limited form.

      The purpose of this task is to make INSERT into an empty table or partition even more efficient by making use of sorting. We expect data loads into huge tables with many secondary indexes to become faster. The change buffer (MDEV-11634) should no longer come into play in this case. (Huge loads into nonempty tables will be unaffected by this enhancement.)

      Post-implementation notes:
      The following queries will use the new optimization (see the sketch below):

      • CREATE ... SELECT
      • Inserting into an empty table together with SET STATEMENT unique_checks=0,foreign_key_checks=0 FOR ...
        • INSERT statement
        • INSERT ... SELECT
        • LOAD DATA INFILE

      Insert into an empty table without unique_checks=0,foreign_key_checks=0 will NOT be optimized.
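
      For illustration, here is a minimal sketch of qualifying and non-qualifying statements (the table and the row count are made up for this example; the seq_1_to_N tables assume the SEQUENCE engine):

      CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;

      -- Optimized: INSERT into an empty table, with both checks disabled for the statement
      SET STATEMENT unique_checks=0, foreign_key_checks=0 FOR
      INSERT INTO t1 SELECT seq, seq FROM seq_1_to_1000000;

      -- Optimized: CREATE ... SELECT
      CREATE TABLE t2 ENGINE=InnoDB SELECT * FROM t1;

      -- NOT optimized: unique_checks and foreign_key_checks remain enabled
      CREATE TABLE t3 LIKE t1;
      INSERT INTO t3 SELECT * FROM t1;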

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä created issue -
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority: Major [ 3 ] -> Critical [ 2 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Status: Open [ 1 ] -> In Progress [ 3 ]

            Before MDEV-25036 has been implemented, I think that we must disable this optimization for subsequent INSERT statements into an initially empty table. That is, for a multi-INSERT transaction into an initially empty table, we will retain the table-level undo logging as per MDEV-515, but we will apply this optimization only to the first INSERT: at the end of the first statement that inserts into an empty table, we will complete the sort and report any duplicate key errors (see the sketch below).

            After MDEV-25036, we could improve things in subsequent task(s) as follows:

            1. We would keep buffering rows from potentially multiple INSERT statements until we receive a statement that forces us to apply the changes.
              • Examples would include XA PREPARE, COMMIT, and anything that reads from the table (including UPDATE or DELETE).
              • Duplicate key errors or other constraint violations would be reported at this point only.
            2. If the special START TRANSACTION statement indicates so, we may assume that the data is already sorted by PRIMARY KEY and skip the sorting altogether. If we find that the order is violated, we would report an error and roll back the transaction.
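
            A minimal sketch of this interim behaviour (the table and the row counts are illustrative):

            SET unique_checks=0, foreign_key_checks=0;
            CREATE TABLE t (a INT PRIMARY KEY) ENGINE=InnoDB;
            BEGIN;
            -- First statement to insert into the empty table: rows are buffered and
            -- sorted, and any duplicate key error is reported when the statement ends.
            INSERT INTO t SELECT seq FROM seq_1_to_1000;
            -- Subsequent statements in the same transaction fall back to ordinary
            -- row-by-row inserts; only the table-level undo logging of MDEV-515 remains.
            INSERT INTO t SELECT seq FROM seq_1001_to_2000;
            COMMIT;
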
            marko Marko Mäkelä added a comment -

            Patch is in bb-10.7-MDEV-24621

            thiru Thirunarayanan Balathandayuthapani added a comment - edited
            thiru Thirunarayanan Balathandayuthapani made changes -
            Assignee: Thirunarayanan Balathandayuthapani [ thiru ] -> Marko Mäkelä [ marko ]
            Status: In Progress [ 3 ] -> In Review [ 10002 ]

            I posted the first batch of review comments.

            I think that the interfaces are roughly correct, but we must be careful about overhead during DML operations. Error handling needs to be improved too.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee: Marko Mäkelä [ marko ] -> Thirunarayanan Balathandayuthapani [ thiru ]
            Status: In Review [ 10002 ] -> Stalled [ 10000 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Status: Stalled [ 10000 ] -> In Progress [ 3 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Assignee: Thirunarayanan Balathandayuthapani [ thiru ] -> Marko Mäkelä [ marko ]
            Status: In Progress [ 3 ] -> In Review [ 10002 ]

            Does it really have to add a new handlerton method and extend the server's commit code? It seems that InnoDB could do this internally at commit time.

            serg Sergei Golubchik added a comment -

            serg, until MDEV-25036 has been implemented, we must finalize bulk insert at the end of each statement that was the first one to insert into an empty table.

            Even after MDEV-25036 is implemented, the finalization must take place before either handlerton::prepare_commit_versioned or handlerton::prepare are invoked. It would be cleaner to have a "pre-prepare" step. That could be handy not only for handlerton::prepare_commit_versioned calls, but also for a new one that we might need in MDEV-24608.
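
            To illustrate with SQL (a sketch only; the finalization itself is internal to InnoDB):

            SET unique_checks=0, foreign_key_checks=0;
            CREATE TABLE t (a INT PRIMARY KEY) ENGINE=InnoDB;
            XA START 'x1';
            INSERT INTO t SELECT seq FROM seq_1_to_1000;
            XA END 'x1';
            -- The buffered bulk-insert rows must be applied (and any constraint
            -- violations reported) before this statement reaches handlerton::prepare.
            XA PREPARE 'x1';
            XA COMMIT 'x1';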

            marko Marko Mäkelä added a comment -
            serg Sergei Golubchik made changes -
            Priority: Critical [ 2 ] -> Major [ 3 ]

            An incomplete workaround for MDEV-26623 was included in this branch. I plan to merge the MDEV-26623 fix and revert the workaround as soon as possible.

            marko Marko Mäkelä added a comment -

            The MDEV-26623 fix has now been merged to the preview branch.

            marko Marko Mäkelä added a comment -

            I wanted to perform a small performance test, similar to what I conducted for MDEV-515 earlier. Unfortunately, this test will crash the preview branch if the first SET statement is present:

            --source include/have_innodb.inc
            --source include/have_sequence.inc
            --source include/not_debug.inc
             
            SET unique_checks=0, foreign_key_checks=0;
             
            SET @freq=@@GLOBAL.innodb_purge_rseg_truncate_frequency;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
             
            CREATE TABLE t (a INT PRIMARY KEY, b CHAR(255) CHARACTER SET utf8)
            ENGINE=InnoDB ROW_FORMAT=REDUNDANT;
             
            let $n=1000000;
             
            --source include/wait_all_purged.inc
            SET profiling_history_size=100;
            SET profiling = 1;
            SELECT variable_value INTO @old_lsn FROM INFORMATION_SCHEMA.GLOBAL_STATUS
            WHERE variable_name = 'INNODB_LSN_CURRENT';
             
            BEGIN;
            eval INSERT INTO t(a) SELECT * FROM seq_1_to_$n;
            SELECT (@lsn:= variable_value)-@old_lsn
            FROM INFORMATION_SCHEMA.GLOBAL_STATUS
            WHERE variable_name = 'INNODB_LSN_CURRENT';
            ROLLBACK;
            SET @old_lsn=@lsn;
            SELECT (@lsn:= variable_value)-@old_lsn
            FROM INFORMATION_SCHEMA.GLOBAL_STATUS
            WHERE variable_name = 'INNODB_LSN_CURRENT';
             
            SELECT * FROM t;
             
            eval INSERT INTO t(a) SELECT * FROM seq_1_to_$n;
            --source include/wait_all_purged.inc
            SET @old_lsn=@lsn;
            SELECT (@lsn:= variable_value)-@old_lsn
            FROM INFORMATION_SCHEMA.GLOBAL_STATUS
            WHERE variable_name = 'INNODB_LSN_CURRENT';
             
            TRUNCATE TABLE t;
            --source include/wait_all_purged.inc
            SET @old_lsn=@lsn;
            SELECT (@lsn:= variable_value)-@old_lsn
            FROM INFORMATION_SCHEMA.GLOBAL_STATUS
            WHERE variable_name = 'INNODB_LSN_CURRENT';
             
            BEGIN;
            eval INSERT INTO t SET a=$n;
            --error ER_DUP_ENTRY
            eval INSERT INTO t(a) SELECT * FROM seq_1_to_$n;
            SELECT * FROM t;
            ROLLBACK;
            SELECT * FROM t;
            SET @old_lsn=@lsn;
            SELECT (@lsn:= variable_value)-@old_lsn
            FROM INFORMATION_SCHEMA.GLOBAL_STATUS
            WHERE variable_name = 'INNODB_LSN_CURRENT';
             
            SHOW PROFILES;
            SET profiling = 0;
            DROP TABLE t;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=@freq;
            

            Without that first SET statement, we will be using a normal row-logged INSERT. For me, that consumed roughly the same amount of time and produced the same amount of redo log as the 10.7 base revision that was last merged into the branch.

            preview-10.7-MDEV-24621-innodb-bulk-insert e62107a1adbea2af5b514bcc43a1c2e54da687fa

            #2  0x0000555555b9d26d in ib::fatal::~fatal (this=<optimized out>, 
                this@entry=0x7ffff04d0360, __in_chrg=<optimized out>)
                at /mariadb/10.7/storage/innobase/ut/ut0ut.cc:519
            #3  0x0000555555b8fc66 in row_mysql_handle_errors (new_err=0x7ffff04d057c, 
                trx=0x7ffff1785130, thr=<optimized out>, savept=0x7ffff04d0580)
                at /mariadb/10.7/storage/innobase/row/row0mysql.cc:727
            #4  0x00005555562191a5 in row_insert_for_mysql (mysql_rec=<optimized out>, 
                prebuilt=0x7fff8c0286a0, ins_mode=ROW_INS_NORMAL)
                at /mariadb/10.7/storage/innobase/row/row0mysql.cc:1329
            …
            (gdb) f 3
            #3  0x0000555555b8fc66 in row_mysql_handle_errors (new_err=0x7ffff04d057c, 
                trx=0x7ffff1785130, thr=<optimized out>, savept=0x7ffff04d0580)
                at /mariadb/10.7/storage/innobase/row/row0mysql.cc:727
            727			ib::fatal() << "Unknown error " << err;
            (gdb) p err
            $1 = DB_IO_ERROR
            

            I repeated the same with rr, and I determined that the DB_IO_ERROR was set here:

            preview-10.7-MDEV-24621-innodb-bulk-insert e62107a1adbea2af5b514bcc43a1c2e54da687fa

            #0  0x0000563fadc90434 in os_file_io (in_type=<optimized out>, 
                file=<optimized out>, buf=<optimized out>, n=<optimized out>, offset=0, 
                err=0x7f723c9a4074) at /mariadb/10.7/storage/innobase/os/os0file.cc:2896
            #1  0x0000563fadc905c5 in os_file_pwrite (err=0x7f723c9a4074, 
                offset=<optimized out>, n=<optimized out>, buf=0x0, file=52, type=
                  @0x563fae06eb10: {bpage = 0x0, node = 0x0, type = IORequest::WRITE_SYNC})
                at /mariadb/10.7/storage/innobase/os/os0file.cc:2932
            #2  os_file_write_func (type=
                  @0x563fae06eb10: {bpage = 0x0, node = 0x0, type = IORequest::WRITE_SYNC}, name=<optimized out>, name@entry=0x563fae06e154 "(bulk insert)", 
                file=file@entry=52, buf=buf@entry=0x0, offset=<optimized out>, 
                offset@entry=0, n=<optimized out>, n@entry=4294967295)
                at /mariadb/10.7/storage/innobase/os/os0file.cc:2963
            #3  0x0000563fadcc9827 in pfs_os_file_write_func (type=
                  @0x563fae06eb10: {bpage = 0x0, node = 0x0, type = IORequest::WRITE_SYNC}, src_file=0x563fae06dbf0 "/mariadb/10.7/storage/innobase/row/row0merge.cc", 
                src_line=1081, n=4294967295, offset=0, buf=<optimized out>, 
                file=<optimized out>, name=0x563fae06e154 "(bulk insert)")
                at /mariadb/10.7/storage/innobase/include/os0file.ic:317
            #4  row_merge_buf_blob (blob_file=0x7f721c939b38, heap=<synthetic pointer>, 
                n_fields=4, entry=0x7f721c939c18)
                at /mariadb/10.7/storage/innobase/row/row0merge.cc:1081
            #5  row_merge_buf_write (buf=0x7f721c939b78, of=<optimized out>, 
                5c000 "", blob_file=0x7f721c939b38) at /mariadb/10.7/storage/innobase/row/row0merge.cc:1126
            #6  0x0000563fadcc9d5e in row_merge_bulk_t::write_to_tmp_file (index_no=0, this=0x7f721c939af0) at /mariadb/10.7/storage/innobase/row/row0merge.cc:5075
            #7  row_merge_bulk_t::write_to_tmp_file (this=0x7f721c939af0, index_no=0) at /mariadb/10.7/storage/innobase/row/row0merge.cc:5066
            #8  0x0000563fadcca154 in row_merge_bulk_t::bulk_insert_buffered (this=0x7f721c939af0, row=row@entry=0x7f721c026b40, ind=<optimized out>, trx=<optimized out>)
                at /mariadb/10.7/storage/innobase/row/row0merge.cc:5123
            #9  0x0000563fadcc4975 in trx_mod_table_time_t::bulk_insert_buffered (trx=<optimized out>, index=<optimized out>, entry=<optimized out>, this=<optimized out>)
                at /mariadb/10.7/storage/innobase/include/trx0trx.h:488
            #10 row_ins_index_entry (thr=0x7f721c02eb08, entry=<optimized out>, index=<optimized out>) at /mariadb/10.7/storage/innobase/row/row0ins.cc:3373
            #11 row_ins_index_entry_step (thr=0x7f721c02eb08, node=<optimized out>) at /mariadb/10.7/storage/innobase/row/row0ins.cc:3546
            #12 row_ins (thr=0x7f721c02eb08, node=<optimized out>) at /mariadb/10.7/storage/innobase/row/row0ins.cc:3692
            #13 row_ins_step (thr=thr@entry=0x7f721c02eb08) at /mariadb/10.7/storage/innobase/row/row0ins.cc:3838
            

            I did not debug this further. One idea would be to use normal file I/O instead of the InnoDB page I/O wrapper. We do not want O_DIRECT for these writes, since we will be writing arbitrary-sized blocks without any alignment guarantees.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee: Marko Mäkelä [ marko ] -> Thirunarayanan Balathandayuthapani [ thiru ]
            Status: In Review [ 10002 ] -> Stalled [ 10000 ]

            The issue was that we were incorrectly attempting to write NULL values to the BLOB file. With that fixed, I am seeing the following amounts of redo log written, as reported by the test. (Note that these numbers are not entirely stable.)

            operation                         | baseline    | baseline-nopurge | patched     | patched-nopurge
            bulk INSERT 1M rows               | 823,539,085 | 823,536,600      | 842,825,019 | 842,823,916
            ROLLBACK                          | 71,347      | 71,353           | 82,611      | 82,617
            bulk INSERT 1M rows               | 823,489,560 | 823,487,932      | 842,768,033 | 842,771,515
            TRUNCATE                          | 2,880       | 2,616            | 2,973       | 2,632
            normal INSERT 1+1M rows, rollback | 871,863,611 | 871,860,836      | 871,864,475 | 871,861,868

            For the last step, the numbers should be identical, and they are close to identical. Somewhat surprisingly, for ROLLBACK and TRUNCATE we see more log written, even though those were supposed to be roughly identical as well. To remove randomness caused by the purge of transaction history, I repeated the test with innodb_force_recovery=2 and disabled the wait_all_purged.inc lines. Every second column is from that run.
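
            A sketch of one way to set up such a nopurge run in an mtr test (the restart mechanism shown is an assumption; the comment above only specifies the innodb_force_recovery=2 setting):

            # Restart with purge effectively disabled; also comment out the
            # wait_all_purged.inc lines for the nopurge runs.
            --let $restart_parameters=--innodb-force-recovery=2
            --source include/restart_mysqld.inc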

            As we can see, the nopurge columns for TRUNCATE are virtually identical, but for ROLLBACK they are not. I think that the reason is a different index tree layout. After the first INSERT of the 1M rows, I am seeing the following t.ibd file sizes:

            baseline    | patched
            884,998,144 | 1,006,632,960

            thiru, I think that such a huge change of file size is something that we must investigate. With the same MERGE_THRESHOLD and other parameters that affect the B-tree page fill factor, we would want to have a similar file layout. I would expect the file size change to explain the different amount of redo log written for ROLLBACK as well.

            Note: Performance-wise we seem to be doing fine. Because I had some concurrent load running on the machine, I cannot quote reliable numbers. The test completed in about 31 seconds on the baseline 10.7 revision and in 25 seconds with the patch.

            marko Marko Mäkelä added a comment -

            I must say that it is encouraging to see such time savings for a simple benchmark that did not even include secondary indexes. With secondary indexes, I would expect the performance difference to be even bigger, between row-by-row inserts into each index, and the merge sort and bulk insert of one index at a time.
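
            For example, with a hypothetical load like the following (sizes illustrative), the baseline inserts every row into each of the three indexes in arrival order, while the patched code merge-sorts the rows and bulk-inserts them into one index at a time:

            CREATE TABLE t (a INT PRIMARY KEY, b INT, c INT, KEY(b), KEY(c)) ENGINE=InnoDB;
            SET STATEMENT unique_checks=0, foreign_key_checks=0 FOR
            INSERT INTO t SELECT seq, seq*2, seq*3 FROM seq_1_to_1000000;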

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee: Thirunarayanan Balathandayuthapani [ thiru ] -> Axel Schwenke [ axel ]
            Priority: Major [ 3 ] -> Critical [ 2 ]

            --source include/have_innodb.inc
            --source include/have_sequence.inc
            create table t1(f1 int not null primary key, b char(255) CHARACTER SET utf8) engine=innodb;
            INSERT INTO t1(f1) SELECT * FROM seq_1_to_1000000;
            --source include/restart_mysqld.inc
            # Here the t1 file size is 32 MB
            alter table t1 force, algorithm=inplace;
            --source include/restart_mysqld.inc
            # Here the t1 file size is 36 MB
            drop table t1;
            

            The inplace ALTER TABLE code increases the file size of the table by 4 MB. The bulk insert code uses the same code path as the inplace ALTER code, so this file size increase should affect 10.2 and later as well.

            thiru Thirunarayanan Balathandayuthapani added a comment -

            thiru, I got some data from axel regarding a performance regression, but it did not have good stack traces. I reread the code changes, and I suspect that the performance regression is due to the SQL layer change that also serg asked about. Could we replace innodb_bulk_insert_write() and its caller with a small addition to ha_innobase::reset()?

            marko Marko Mäkelä added a comment -

            This is still causing some performance regression for normal DML workloads in cases where bulk loading is not used at all.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee: Axel Schwenke [ axel ] -> Thirunarayanan Balathandayuthapani [ thiru ]

            Performance seems to be acceptable now.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Fix Version/s: 10.7.1 [ 26120 ]
            Fix Version/s: 10.7 [ 24805 ]
            Resolution: Fixed [ 1 ]
            Status: Stalled [ 10000 ] -> Closed [ 6 ]

            People

              Assignee: thiru Thirunarayanan Balathandayuthapani
              Reporter: marko Marko Mäkelä
              Votes: 0
              Watchers: 12
