[MDEV-24621] In bulk insert, pre-sort and build indexes one page at a time Created: 2021-01-19  Updated: 2024-01-15  Resolved: 2021-10-26

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.7.1

Type: Task Priority: Critical
Reporter: Marko Mäkelä Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 0
Labels: Preview_10.7, performance

Issue Links:
Blocks
blocks MDEV-26155 Mangled data after IMPORT tablespace ... Open
blocks MDEV-33087 ALTER TABLE...ALGORITHM=COPY should b... Confirmed
is blocked by MDEV-515 innodb bulk insert Closed
is blocked by MDEV-26623 Possible race condition between stati... Closed
Problem/Incident
causes MDEV-27220 Assertion `!(index)->is_spatial()' fa... Closed
causes MDEV-27312 LeakSanitizer error in trx_mod_table_... Closed
causes MDEV-27316 Assertion `!(index)->is_spatial()' fa... Closed
causes MDEV-27318 SIGSEGV in row_merge_tuple_sort and A... Closed
causes MDEV-27858 Assertion `page_dir_get_n_heap(new_pa... Closed
causes MDEV-27890 SIGSEGV in innobase_get_mysql_key_num... Closed
causes MDEV-27953 Assertion `!check_unique_secondary' f... Closed
causes MDEV-28138 MariaDB Assertion Failed in mtr_buf_t... Closed
causes MDEV-28242 Assertion `!check_foreigns' failed in... Closed
causes MDEV-28865 Assertion `!check_foreigns' failed in... Closed
causes MDEV-29570 LeakSanitizer error in trx_mod_table_... Closed
causes MDEV-30426 Assertion !rec_offs_nth_extern(offset... Closed
causes MDEV-30796 Auto_increment values changed after r... Closed
causes MDEV-31025 Redundant table alter fails when fixe... Closed
Relates
relates to MDEV-16281 Implement parallel CREATE INDEX, ALTE... Open
relates to MDEV-25036 use bulk insert optimization for mult... Open
relates to MDEV-26947 UNIQUE column checks fail in InnoDB r... Closed
relates to MDEV-27751 InnoDB: Failing assertion: !cursor->i... Closed
relates to MDEV-27971 SIGSEGV in trx_undo_build_roll_ptr on... Closed
relates to MDEV-28237 Assertion `0' failed in row_upd_sec_i... Closed
relates to MDEV-28400 LeakSanitizer error in trx_mod_table_... Closed
relates to MDEV-29761 Bulk insert fails to rollback during... Closed
relates to MDEV-29989 binlog_do_db option breaks importing ... Closed
relates to MDEV-30063 InnoDB: Failing assertion: ib_table->... Open
relates to MDEV-30489 Assertion `trx->bulk_insert' failed i... Confirmed
relates to MDEV-31027 Assertion `!is_set() || (m_status == ... Open
relates to MDEV-31298 Assertion `!check_foreigns' failed in... Confirmed
relates to MDEV-31537 Bulk insert operation aborts the serv... Closed
relates to MDEV-32453 Bulk insert fails to apply when trigg... Stalled
relates to MDEV-33243 Assertion `!check_foreigns' failed in... Open
relates to MDEV-5171 Add support for --innodb-optimize-ke... Closed
relates to MDEV-16226 TRX_ID-based System Versioning refact... Stalled
relates to MDEV-26740 Inplace alter rebuild increases file ... Closed
relates to MDEV-27214 Import with disabled keys corrupts me... Closed
relates to MDEV-28679 After upgrade to 10.7.3-1 with enable... Closed
relates to MDEV-31667 SEGV in trx_undo_report_row_operation... Stalled
relates to MDEV-31686 InnoDB: Failing assertion: heap_no < ... Confirmed

 Description   

In MDEV-515, the original intention was to do 2 things to speed up INSERT to an empty table or partition:

  1. Stop writing undo log records for each inserted record, and instead write just one saying that the table was empty.
  2. When the table was empty, pre-sort the records for each index, and build the indexes one page at a time, instead of the current one unsorted record at a time.

The undo logging change turned out to be rather challenging on its own, because it affects MVCC and locking as well. Preventing the useless undo logging should speed up ROLLBACK and recovery, and also the purge of history, compensating for MDEV-12288. Hence, it makes sense to complete MDEV-515 in the limited form.

The purpose of this task is to make INSERT into an empty table or partition even more efficient by making use of sorting. We expect data loads into huge tables with many secondary indexes to become faster. The change buffer (MDEV-11634) should no longer come into play in this case. (Huge loads into nonempty tables will be unaffected by this enhancement.)

After implementation comments:
The following queries will use the new optimization:

  • CREATE ... SELECT
  • Inserting into an empty table together with SET STATEMENT unique_checks=0,foreign_key_checks=0 FOR ...
    • INSERT statement
    • INSERT ... SELECT
    • LOAD DATA INFILE

Insert into an empty table without unique_checks=0,foreign_key_checks=0 will NOT be optimized.



 Comments   
Comment by Marko Mäkelä [ 2021-08-17 ]

Before MDEV-25036 has been implemented, I think that we must disable this optimization for a subsequent INSERT statements into an initially empty table. That is, for a multi-INSERT transaction into an initially empty table, we will retain the table-level undo logging as per MDEV-515, but we will apply this optimization only to the first INSERT: At the end of the first statement that inserts into an empty table, we will complete the sort, and report any duplicate key errors.

After MDEV-25036, we could improve things in subsequent task(s) as follows:

  1. We would keep buffering rows from potentially multiple INSERT statements until we receive a statement that forces us to apply the changes.
    • Examples would include XA PREPARE, COMMIT, and anything that reads from the table (including UPDATE or DELETE).
    • Duplicate key errors or other constraint violations would be reported at this point only.
  2. If the special START TRANSACTION statement indicates so, we may assume that the data is already sorted by PRIMARY KEY and skip the sorting altogether. If we find that the order is violated, we would report an error and roll back the transaction.
Comment by Thirunarayanan Balathandayuthapani [ 2021-08-30 ]

Patch is in bb-10.7-MDEV-24621

Comment by Marko Mäkelä [ 2021-09-01 ]

I posted the first batch of review comments.

I think that the interfaces are roughly correct. I think that we must be careful about overhead during DML operation. Error handling needs to be improved too.

Comment by Sergei Golubchik [ 2021-09-07 ]

Does it really have to add a new handlerton method and extend the server's commit code ?
It seems that InnoDB can do it on commit internally.

Comment by Marko Mäkelä [ 2021-09-08 ]

serg, until MDEV-25036 has been implemented, we must finalize bulk insert at the end of each statement that was the first one to insert into an empty table.

Even after MDEV-25036 is implemented, the finalization must take place before either handlerton::prepare_commit_versioned or handlerton::prepare are invoked. It would be cleaner to have a "pre-prepare" step. That could be handy not only for handlerton::prepare_commit_versioned calls, but also for a new one that we might need in MDEV-24608.

Comment by Marko Mäkelä [ 2021-09-16 ]

An incomplete work-around for MDEV-26623 was included in this branch. I plan to merge MDEV-26623 and revert the workaround as soon as possible.

Comment by Marko Mäkelä [ 2021-09-17 ]

The MDEV-26623 fix has now been merged to the preview branch.

Comment by Marko Mäkelä [ 2021-09-20 ]

I wanted to perform a small performance test, similar to what I conducted for MDEV-515 earlier. Unfortunately, this test will crash the preview branch if the first SET statement is present:

--source include/have_innodb.inc
--source include/have_sequence.inc
--source include/not_debug.inc
 
SET unique_checks=0, foreign_key_checks=0;
 
SET @freq=@@GLOBAL.innodb_purge_rseg_truncate_frequency;
SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
 
CREATE TABLE t (a INT PRIMARY KEY, b CHAR(255) CHARACTER SET utf8)
ENGINE=InnoDB ROW_FORMAT=REDUNDANT;
 
let $n=1000000;
 
--source include/wait_all_purged.inc
SET profiling_history_size=100;
SET profiling = 1;
SELECT variable_value INTO @old_lsn FROM INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE variable_name = 'INNODB_LSN_CURRENT';
 
BEGIN;
eval INSERT INTO t(a) SELECT * FROM seq_1_to_$n;
SELECT (@lsn:= variable_value)-@old_lsn
FROM INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE variable_name = 'INNODB_LSN_CURRENT';
ROLLBACK;
SET @old_lsn=@lsn;
SELECT (@lsn:= variable_value)-@old_lsn
FROM INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE variable_name = 'INNODB_LSN_CURRENT';
 
SELECT * FROM t;
 
eval INSERT INTO t(a) SELECT * FROM seq_1_to_$n;
--source include/wait_all_purged.inc
SET @old_lsn=@lsn;
SELECT (@lsn:= variable_value)-@old_lsn
FROM INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE variable_name = 'INNODB_LSN_CURRENT';
 
TRUNCATE TABLE t;
--source include/wait_all_purged.inc
SET @old_lsn=@lsn;
SELECT (@lsn:= variable_value)-@old_lsn
FROM INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE variable_name = 'INNODB_LSN_CURRENT';
 
BEGIN;
eval INSERT INTO t SET a=$n;
--error ER_DUP_ENTRY
eval INSERT INTO t(a) SELECT * FROM seq_1_to_$n;
SELECT * FROM t;
ROLLBACK;
SELECT * FROM t;
SET @old_lsn=@lsn;
SELECT (@lsn:= variable_value)-@old_lsn
FROM INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE variable_name = 'INNODB_LSN_CURRENT';
 
SHOW PROFILES;
SET profiling = 0;
DROP TABLE t;
SET GLOBAL innodb_purge_rseg_truncate_frequency=@freq;

Without that first SET statement, we will be using normal row-logged INSERT. For me, that did consume roughly the same amount of time and produced the same amount of redo log, when comparing to the 10.7 base revision that was last merged with the branch.

preview-10.7-MDEV-24621-innodb-bulk-insert e62107a1adbea2af5b514bcc43a1c2e54da687fa

#2  0x0000555555b9d26d in ib::fatal::~fatal (this=<optimized out>, 
    this@entry=0x7ffff04d0360, __in_chrg=<optimized out>)
    at /mariadb/10.7/storage/innobase/ut/ut0ut.cc:519
#3  0x0000555555b8fc66 in row_mysql_handle_errors (new_err=0x7ffff04d057c, 
    trx=0x7ffff1785130, thr=<optimized out>, savept=0x7ffff04d0580)
    at /mariadb/10.7/storage/innobase/row/row0mysql.cc:727
#4  0x00005555562191a5 in row_insert_for_mysql (mysql_rec=<optimized out>, 
    prebuilt=0x7fff8c0286a0, ins_mode=ROW_INS_NORMAL)
    at /mariadb/10.7/storage/innobase/row/row0mysql.cc:1329
(gdb) f 3
#3  0x0000555555b8fc66 in row_mysql_handle_errors (new_err=0x7ffff04d057c, 
    trx=0x7ffff1785130, thr=<optimized out>, savept=0x7ffff04d0580)
    at /mariadb/10.7/storage/innobase/row/row0mysql.cc:727
727			ib::fatal() << "Unknown error " << err;
(gdb) p err
$1 = DB_IO_ERROR

I repeated the same with rr, and I determined that the DB_IO_ERROR was set here:

preview-10.7-MDEV-24621-innodb-bulk-insert e62107a1adbea2af5b514bcc43a1c2e54da687fa

#0  0x0000563fadc90434 in os_file_io (in_type=<optimized out>, 
    file=<optimized out>, buf=<optimized out>, n=<optimized out>, offset=0, 
    err=0x7f723c9a4074) at /mariadb/10.7/storage/innobase/os/os0file.cc:2896
#1  0x0000563fadc905c5 in os_file_pwrite (err=0x7f723c9a4074, 
    offset=<optimized out>, n=<optimized out>, buf=0x0, file=52, type=
      @0x563fae06eb10: {bpage = 0x0, node = 0x0, type = IORequest::WRITE_SYNC})
    at /mariadb/10.7/storage/innobase/os/os0file.cc:2932
#2  os_file_write_func (type=
      @0x563fae06eb10: {bpage = 0x0, node = 0x0, type = IORequest::WRITE_SYNC}, name=<optimized out>, name@entry=0x563fae06e154 "(bulk insert)", 
    file=file@entry=52, buf=buf@entry=0x0, offset=<optimized out>, 
    offset@entry=0, n=<optimized out>, n@entry=4294967295)
    at /mariadb/10.7/storage/innobase/os/os0file.cc:2963
#3  0x0000563fadcc9827 in pfs_os_file_write_func (type=
      @0x563fae06eb10: {bpage = 0x0, node = 0x0, type = IORequest::WRITE_SYNC}, src_file=0x563fae06dbf0 "/mariadb/10.7/storage/innobase/row/row0merge.cc", 
    src_line=1081, n=4294967295, offset=0, buf=<optimized out>, 
    file=<optimized out>, name=0x563fae06e154 "(bulk insert)")
    at /mariadb/10.7/storage/innobase/include/os0file.ic:317
#4  row_merge_buf_blob (blob_file=0x7f721c939b38, heap=<synthetic pointer>, 
    n_fields=4, entry=0x7f721c939c18)
    at /mariadb/10.7/storage/innobase/row/row0merge.cc:1081
#5  row_merge_buf_write (buf=0x7f721c939b78, of=<optimized out>, 
    5c000 "", blob_file=0x7f721c939b38) at /mariadb/10.7/storage/innobase/row/row0merge.cc:1126
#6  0x0000563fadcc9d5e in row_merge_bulk_t::write_to_tmp_file (index_no=0, this=0x7f721c939af0) at /mariadb/10.7/storage/innobase/row/row0merge.cc:5075
#7  row_merge_bulk_t::write_to_tmp_file (this=0x7f721c939af0, index_no=0) at /mariadb/10.7/storage/innobase/row/row0merge.cc:5066
#8  0x0000563fadcca154 in row_merge_bulk_t::bulk_insert_buffered (this=0x7f721c939af0, row=row@entry=0x7f721c026b40, ind=<optimized out>, trx=<optimized out>)
    at /mariadb/10.7/storage/innobase/row/row0merge.cc:5123
#9  0x0000563fadcc4975 in trx_mod_table_time_t::bulk_insert_buffered (trx=<optimized out>, index=<optimized out>, entry=<optimized out>, this=<optimized out>)
    at /mariadb/10.7/storage/innobase/include/trx0trx.h:488
#10 row_ins_index_entry (thr=0x7f721c02eb08, entry=<optimized out>, index=<optimized out>) at /mariadb/10.7/storage/innobase/row/row0ins.cc:3373
#11 row_ins_index_entry_step (thr=0x7f721c02eb08, node=<optimized out>) at /mariadb/10.7/storage/innobase/row/row0ins.cc:3546
#12 row_ins (thr=0x7f721c02eb08, node=<optimized out>) at /mariadb/10.7/storage/innobase/row/row0ins.cc:3692
#13 row_ins_step (thr=thr@entry=0x7f721c02eb08) at /mariadb/10.7/storage/innobase/row/row0ins.cc:3838

I did not debug this deeper. One idea would be to use normal file I/O instead of the InnoDB page I/O wrapper. We do not want O_DIRECT for these writes, since we will be writing arbitrary-sized blocks without any alignment guarantees.

Comment by Marko Mäkelä [ 2021-09-20 ]

The issue was that we were incorrectly attempting to write NULL values to the BLOB file. With that fixed, I am seeing the following amount of redo log writes reported by the test. (Note that these numbers are not entirely stable.)

operation baseline baseline-nopurge patched patched-nopurge
bulk INSERT 1M rows 823,539,085 823,536,600 842,825,019 842,823,916
ROLLBACK 71,347 71,353 82,611 82,617
bulk INSERT 1M rows 823,489,560 823,487,932 842,768,033 842,771,515
TRUNCATE 2,880 2,616 2,973 2,632
normal INSERT 1+1M rows,rollback 871,863,611 871,860,836 871,864,475 871,861,868

For the last step, the numbers should be identical, and they are close to identical. Somewhat surprising is that for the ROLLBACK and TRUNCATE we see more log written, even though those were supposed to be roughly identical as well. To remove randomness caused by the purge of transaction history, I repeated the test with innodb_force_recovery=2 and disabled the wait_all_purge.inc lines. Every second column is from that run.

As we can see, the nopurge columns for TRUNCATE are virtually identical, but for ROLLBACK they are not. I think that the reason is different index tree layout. After the first INSERT of the 1M rows, I am seeing the following t.ibd file size:

baseline patched
884,998,144 1,006,632,960

thiru, I think that such a huge change of file size is something that we must investigate. With the same MERGE_THRESHOLD and other parameters that affect the B-tree page fill factor, we would want to have a similar file layout. I would expect the file size change to explain the different amount of redo log written for ROLLBACK as well.

Note: Performance-wise we seem to be doing fine. Because I had some concurrent load running on the machine, I cannot quote reliable numbers. The test completed in about 31 seconds on the baseline 10.7 revision and in 25 seconds with the patch.

Comment by Marko Mäkelä [ 2021-09-20 ]

I must say that it is encouraging to see such time savings for a simple benchmark that did not even include secondary indexes. With secondary indexes, I would expect the performance difference to be even bigger, between row-by-row inserts into each index, and the merge sort and bulk insert of one index at a time.

Comment by Thirunarayanan Balathandayuthapani [ 2021-10-01 ]

--source include/have_innodb.inc
--source include/have_sequence.inc
create table t1(f1 int not null primary key, b char(255) CHARACTER SET utf8)engine=innodb;
INSERT INTO t1(f1) SELECT * FROM seq_1_to_1000000;
--source include/restart_mysqld.inc - Here t1 size is 32 MB
alter table t1 force, algorithm=inplace;
--source include/restart_mysqld.inc - Here t1 size is 36 MB
drop table t1;

Inplace alter table code increases the file size of the table by 4MB. Bulk insert code uses the same code path as inplace alter code. This file size
increases issue should affect 10.2+ onwards.

Comment by Marko Mäkelä [ 2021-10-08 ]

thiru, I got some data from axel regarding a performance regression, but it did not have good stack traces. I reread the code changes, and I suspect that the performance regression is due to the SQL layer change that also serg asked about. Could we replace innodb_bulk_insert_write() and its caller with a small addition to ha_innobase::reset()?

Comment by Marko Mäkelä [ 2021-10-15 ]

This is still causing some performance regression for normal DML workload in a case when bulk loading was not used at all.

Comment by Marko Mäkelä [ 2021-10-22 ]

Performance seems to be acceptable now.

Generated at Thu Feb 08 09:31:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.