[MDEV-31835] LOAD DATA into InnoDB w/partitions: huge performance loss, affected 10.6+ Created: 2023-08-03  Updated: 2024-01-14  Resolved: 2023-08-26

Status: Closed
Project: MariaDB Server
Component/s: Partitioning, Storage Engine - InnoDB
Affects Version/s: 10.6.14, 10.11.4
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2

Type: Bug Priority: Critical
Reporter: Karol Roslaniec Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 3
Labels: performance
Environment:

x86_64, Rocky Linux release 8.8 (Green Obsidian)


Attachments: File tables.sql    
Issue Links:
Blocks
Problem/Incident
causes MDEV-32291 memory leak in innodb.insert_into_emp... Closed
is caused by MDEV-515 innodb bulk insert Closed
Relates
relates to MDEV-24818 Concurrent use of InnoDB table is imp... Closed
relates to MDEV-31985 IGNORE in INSERT and LOAD are being i... Confirmed

 Description   

Hi,
Starting from MariaDB ver 10.6, I experience huge performance loss (like x10)
when loading (via LOAD DATA INFILE) data into table with partitions.
Time of loading depends on number of partitions in table.
This bug appeared on ver. 10.6.
10.11.4 is also affected, although I don't include results here.
All previous versions up to 10.5.21 are not affected.

Let me share results first:
tbl0 - simple table with 3 columns, without partitions
tbl1 - as above, with 1 partition
tbl2 - as above, with 800 partitions

Below are times spent on LOAD DATA INFILE (each test repeated 3 times):
10.6.14 tbl0 = 1.406s 1.384s 1.338s
10.6.14 tbl1 = 1.553s 1.432s 1.429s
10.6.14 tbl2 = 11.692s 10.929s 11.019s
10.5.21 tbl0 = 1.457s 1.332s 1.228s
10.5.21 tbl1 = 1.511s 1.496s 1.249s
10.5.21 tbl2 = 1.500s 1.543s 1.499s

More info about my configuration:
OS: Linux Rocky 8.8, x86_64

my.cnf:

[mysqld]
innodb_buffer_pool_size = 10G
tmpdir = /opt/tmp
log_error
log_warnings
innodb_log_file_size = 6G
innodb_file_per_table = 1

How I build MariaDB:

VER=10.6.14
tar xzf mariadb-$VER.tar.gz
mkdir build-$VER
cd build-$VER
NP=/opt/db.$VER
cmake3 ../mariadb-$VER \
-DCMAKE_INSTALL_PREFIX=$NP \
-DINSTALL_MYSQLDATADIR=$NP/data \
-DINSTALL_SYSCONFDIR=$NP/etc \
-DINSTALL_UNIX_ADDRDIR=$NP/tmp/mysql.sock \
-DCMAKE_CXX_FLAGS=-march=sandybridge \
-DCMAKE_C_FLAGS=-march=sandybridge \
-DPLUGIN_MROONGA=NO \
-DWITHOUT_TOKUDB=1 \
-DCMAKE_BUILD_TYPE=Release
make -j12
make install

Tables - see attached tables.sql for details, but in principle simple table like:

CREATE TABLE `tbl0` (
`id` int(11) NOT NULL,
`tm` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`v` varchar(15) DEFAULT NULL,
PRIMARY KEY (`tm`,`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin
;

Test scenario:

cat ~/tables.sql | mysql -uroot test
FILE=~/test.txt
perl -e 'print "$_\t2023-07-25\t1234567890ABCD\n" for (1..2e5)' > $FILE
time mysql -uroot -e "LOAD DATA INFILE '$FILE' IGNORE INTO TABLE tbl0" test
time mysql -uroot -e "LOAD DATA INFILE '$FILE' IGNORE INTO TABLE tbl1" test
time mysql -uroot -e "LOAD DATA INFILE '$FILE' IGNORE INTO TABLE tbl2" test
mysql -uroot -e "truncate tbl0; truncate tbl1; truncate tbl2" test

Best Regards,
Karol



 Comments   
Comment by Marko Mäkelä [ 2023-08-03 ]

I can confirm this as described, on /dev/shm (RAM disk). I executed perf record -p $(pgrep mariadbd) while the statement on tbl2 was executing. Here is what perf report told me:

10.6 691e964d2357b579f64e08398f4b66f7d2c81566

  33,75%  mariadbd       mariadbd             [.] bitmap_get_next_set
  18,20%  mariadbd       mariadbd             [.] ha_innobase::extra
   9,70%  mariadbd       mariadbd             [.] ha_partition::extra
   7,50%  mariadbd       mariadbd             [.] innobase_trx_init
   3,61%  mariadbd       mariadbd             [.] mysql_sys_var_int
   2,72%  mariadbd       libc.so.6            [.] __GI___pthread_mutex_unlock_u
   1,87%  mariadbd       mariadbd             [.] thd_get_ha_data
   1,50%  mariadbd       mariadbd             [.] thd_test_options
   1,24%  mariadbd       mariadbd             [.] wsrep_on
   1,23%  mariadbd       mariadbd             [.] buf_page_get_low
   1,20%  mariadbd       mariadbd             [.] handler::ha_thd
   1,15%  mariadbd       mariadbd             [.] rec_get_offsets_func
   0,85%  mariadbd       mariadbd             [.] page_cur_search_with_match
   0,76%  mariadbd       mariadbd             [.] cmp_data
   0,68%  mariadbd       mariadbd             [.] cmp_dtuple_rec_with_match_low
   0,46%  mariadbd       mariadbd             [.] READ_INFO::read_field
   0,43%  mariadbd       mariadbd             [.] btr_cur_t::search_leaf
   0,41%  mariadbd       mariadbd             [.] rec_init_offsets_comp_ordinary
   0,41%  mariadbd       libc.so.6            [.] __memcmp_avx2_movbe_rtm
   0,39%  mariadbd       mariadbd             [.] mysql_load
   0,31%  mariadbd       libc.so.6            [.] __tzfile_compute

Based on this, there is an excessive amount of calls to ha_innobase::extra() as well as innobase_trx_init(). It is as if each row were inserted in a separate transaction.

I set breakpoints on these two functions to see what is going on. I see that ha_partition::open_read_partitions() is invoking handler::ha_open() on each partition, apparently with a different THD, because the parameter that is being passed to check_trx_exists() is changing. This forces InnoDB to constantly allocate new transaction objects.

Comment by Yuchen Pei [ 2023-08-09 ]

mtr case (thanks to marko for helping with the reprod)

--source include/have_innodb.inc
--source include/have_partition.inc
DROP TABLE IF EXISTS `tbl0`;
CREATE TABLE `tbl0` (
  `id` int(11) NOT NULL,
  `tm` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `v` varchar(15) DEFAULT NULL,
  PRIMARY KEY (`tm`,`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin
;
 
DROP TABLE IF EXISTS `tbl1`;
CREATE TABLE `tbl1` (
  `id` int(11) NOT NULL,
  `tm` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `v` varchar(15) DEFAULT NULL,
  PRIMARY KEY (`tm`,`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin
 PARTITION BY RANGE (unix_timestamp(`tm`))
(PARTITION `p2023080123` VALUES LESS THAN (1690927200) ENGINE = InnoDB);
 
DROP TABLE IF EXISTS `tbl2`;
CREATE TABLE `tbl2` (
  `id` int(11) NOT NULL,
  `tm` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `v` varchar(15) DEFAULT NULL,
  PRIMARY KEY (`tm`,`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin
 PARTITION BY RANGE (unix_timestamp(`tm`))
(PARTITION `p2023062900` VALUES LESS THAN (1687993200) ENGINE = InnoDB,
... # omit ~800 lines from tables.sql
 PARTITION `p2023080123` VALUES LESS THAN (1690927200) ENGINE = InnoDB);
 
--let $TMP_DIR=`select @@tmpdir`
--let inf=$TMP_DIR/inf
 
perl;
my $file="$ENV{inf}";
open(FH, ">", $file) or die $!;
print FH "$_\t2023-07-25\t1234567890ABCD\n" for (1..2e5);
close FH or die "close";
EOF
 
SET profiling = 1;
eval load data infile '$inf' ignore into table tbl0;
eval load data infile '$inf' ignore into table tbl1;
eval load data infile '$inf' ignore into table tbl2;
show profiles;
drop table tbl0, tbl1, tbl2;

Running this case, at 10.5.21 bed70468ea08c2820647f5e3ac006a9ff88144ac
I get

Query_ID	Duration	Query
1	0.91881544	load data infile '/home/ycp/source/mariadb-server/10.5/build/mysql-test/var/tmp/mysqld.1/inf' ignore into table tbl0
2	0.86368869	load data infile '/home/ycp/source/mariadb-server/10.5/build/mysql-test/var/tmp/mysqld.1/inf' ignore into table tbl1
3	0.86235132	load data infile '/home/ycp/source/mariadb-server/10.5/build/mysql-test/var/tmp/mysqld.1/inf' ignore into table tbl2

at 10.6 d28d636f57bb8c197c7bd9f35ecfe2a972bf31ab I get

Query_ID	Duration	Query
1	0.76890261	load data infile '/home/ycp/source/mariadb-server/10.6/build/mysql-test/var/tmp/mysqld.1/inf' ignore into table tbl0
2	0.96452144	load data infile '/home/ycp/source/mariadb-server/10.6/build/mysql-test/var/tmp/mysqld.1/inf' ignore into table tbl1
3	2.72212813	load data infile '/home/ycp/source/mariadb-server/10.6/build/mysql-test/var/tmp/mysqld.1/inf' ignore into table tbl2

I had -march=native and CMAKE_BUILD_TYPE=Release
(Debug may not work in showing the correct scale of regression).

Running outside mtr with the time command produced similar
results.

Comment by Yuchen Pei [ 2023-08-10 ]

Git bisect result is in:

3cef4f8f0fc88ae5bfae4603d8d600ec84cc70a9 is the first bad commit
commit 3cef4f8f0fc88ae5bfae4603d8d600ec84cc70a9
Author: Marko Mäkelä <marko.makela@mariadb.com>
Date:   Mon Jan 25 18:41:27 2021 +0200
    MDEV-515 Reduce InnoDB undo logging for insert into empty table

Undoing the change to ha_partition.cc in this commit fixes the test:

Hmm, I don't know why the test passed locally earlier on the following
commit Undoing the change to ha_partition.cc in the above commit.
Obviously it should fall through the case and fail an assertion, as it does
now:

(based on 11.0)
https://github.com/MariaDB/server/commit/5bfe3e24785

Here's a better one (hopefully):
https://github.com/MariaDB/server/commit/a744a9ce2fb

Comment by Yuchen Pei [ 2023-08-10 ]

Initial debugging shows that in the load data query for tbl1 and tbl2,
each row causes one call to ha_partition::extra with
operation=HA_EXTRA_IGNORE_INSERT (introduced in the bad commit),
and each such call makes a number calls to ha_innobase::extra
equal to the number of partitions, i.e. 1 call for tbl1 (200k calls in
total) and 816 calls for tbl2 (163.2 mil calls in total), the high
number of calls in the query for tbl2 could be the cause of the
performance issue.

trace for tbl0:

ha_innobase::extra > write_record > read_sep_field > mysql_load > mysql_execute_command > mysql_parse > dispatch_command > do_command > do_handle_one_connection > handle_one_connection > pfs_spawn_thread

trace for tbl1 and tbl2:

ha_innobase::extra > extra_cb > ha_partition::loop_partitions > ha_partition::extra > write_record > read_sep_field > mysql_load > mysql_execute_command > mysql_parse > dispatch_command > do_command > do_handle_one_connection > handle_one_connection > pfs_spawn_thread

Comment by Marko Mäkelä [ 2023-08-10 ]

Thank you for the analysis. I think that we must remove the call table->file->extra(HA_EXTRA_IGNORE_INSERT) from write_row() and move it to each (grand)caller of write_row() where IGNORE semantics may be needed, right before the loop that would insert the records. It seems to me that the LOAD statement is implemented by mysql_load(). Is it being invoked with the parameter bool ignore=true in the test case?

Comment by Yuchen Pei [ 2023-08-11 ]

[1] https://github.com/MariaDB/server/commit/bb924273f3b8700d85827e65663e0b6fbca99f92

Comment by Rick James [ 2023-08-11 ]

How did you generate the LESS THAN clauses? I wonder whether all the data went into one partition or was evenly spread across partitions. Please rerun and display the sizes of each partition (or at least the first and last partition).

(Meanwhile, thanks for the test case demonstrating that lots of partitions can be slow. I have long suspected such.)

Comment by Yuchen Pei [ 2023-08-11 ]

> How did you generate the LESS THAN clauses?

It is in the attached tables.sql file.

> I wonder whether all the data went into one partition or was evenly spread across partitions.

They are all in one partition, because tm has the same value (2023-07-25) for all rows:

perl -e 'print "$_\t2023-07-25\t1234567890ABCD\n" for (1..2e5)' > $FILE

See also the select output in the result file[1] in the commit:

select count(*) from tbl2 where unix_timestamp(`tm`) = 1690207200;
count(*)
200000

https://github.com/MariaDB/server/commit/bb924273f3b8700d85827e65663e0b6fbca99f92#diff-84653f5b4fbf870463ee396603fac266ac3dd4f451da4bf4a3f8676a27ace339R860-R862

Comment by Yuchen Pei [ 2023-08-11 ]

Given each call to ha_innobase::write_row() resets
trx->bulk_insert to true which could cause the problem, one way to fix
it is to re-reset bulk_insert to false before or after each call. The
issue is that by the time a call is made to ha_innobase::write_row(),
the info is not necessary available, because the call could be made
from ha_partition. So one idea is to do something in a back channel,
maybe in ha_innobase::extra(), to mark a flag so that
ha::innobase::write_row() will reset trx->bulk_insert. This is
what I did in <https://github.com/MariaDB/server/commit/144e91f1dc5>,
which passes all existing tests. The failure in buildbot is the new
test I added (main.mdev_31835), which passes locally, so it could be
something to do with the filesystem or other inconsequential issues.

Anyway, following up on discussions on #spider, I have transferred
this ticket to the innodb team.

Comment by Marko Mäkelä [ 2023-08-14 ]

I think that we should try to avoid introducing conditional branches in frequently invoked low-level code, such as write_record() or handler::write_row(). Conditional branches may cause pipeline stalls and can incur a huge overhead on superscalar processors. It would be better to inform the storage engine only once per statement (rather than once per written row) that the bulk insert interface cannot be used due to an IGNORE keyword.

Comment by Matthias Leich [ 2023-08-17 ]

origin/bb-10.10-MDEV-31835 8fa040f25aeffb9c2ab62ff48a5948551a9a2bb3 2023-08-16T20:54:10+05:30   and
origin/bb-10.6-MDEV-31835 88e09ec78d6e6d0fec4b734b85919090d7dcbbf0 2023-08-16T20:41:39+05:30
behaved well in RQG testing. No new bad effects.

Comment by Marko Mäkelä [ 2023-08-18 ]

Thank you both. It would be really great if we can simply remove the HA_EXTRA_IGNORE_INSERT without any bad effects. thiru, can you please find out which later change in 10.6 made the flag unnecessary? Was it perhaps MDEV-24818? I think that we need to fully understand this. For 10.10 (the earliest currently supported major version which contains MDEV-24621), I see that some additional changes are necessary, to apply any earlier bulk INSERT before the first INSERT IGNORE statement will cause a switch to row-level undo logging.

mleich, I think that we need extra comparison runs to ensure that LOADIGNORE and INSERT IGNORE behave exactly the same way as they would before this change. Ideally, we should compare both the 10.6 and the 10.10 versions of the fix to unfixed 10.6 and 10.10 servers.

Comment by Thirunarayanan Balathandayuthapani [ 2023-08-21 ]

How insert operation works under bulk insert operation:

begin;
insert into t1 values(1);
insert into t1 values(2);
insert into t1 values(2);
ERROR 23000: Duplicate entry '2' for key 'PRIMARY'
commit;
select * from t1;
f1

Since we have only one table level undo log for the table, we clear the table itself incase of partial rollback or complete rollback.

We can follow the same for insert..ignore under bulk insert operation too. There is no need to end bulk insert when we encounter insert..ignore stmt.
Given patch does remove the extra call of HA_EXTRA_IGNORE_INSERT

Comment by Matthias Leich [ 2023-08-21 ]

Results of RQG testing
============================
Scenario:
There are two DB servers which run with different MariaDB versions.
One session runs data modifying SQL (DDL/DML). Other sessions observe but do not change data.
 Any SQL which gets executed with success by the data modifier session on the first server gets sent to the the second server too (RQG builtin statement based replication).
After that session disconnects
- the data integrity on both servers is checked by running
   CHECK TABLE ... EXTENDED, SELECTs with FORCE INDEX and similar, ... ALTER TABLE ... FORCE
- the data on both servers is dumped and compared
Test setup:
     --grammar=conf/mariadb/oltp-transactional.yy --gendata=conf/mariadb/oltp.zz --redefine=conf/mariadb/bulk_insert.yy --rpl_mode=rqg2 
     --grammar=conf/mariadb/innodb_compression_encryption.yy --gendata=conf/mariadb/innodb_compression_encryption.zz --redefine=conf/mariadb/bulk_insert.yy --rpl_mode=rqg2 
     + variation of server startup parameters.
 
First server: origin/bb-10.6-MDEV-31835 88e09ec78d6e6d0fec4b734b85919090d7dcbbf0 2023-08-16T20:41:39+05:30.
Second server: origin/10.6 0be4781428a4044b13b085965820a995652bb0e9 2023-08-09T03:14:35+02:00
775 RQG tests were executed and only two failed.
Both failing tests used innodb_change_buffering != "none" which is known to be error prone.
 
First server: origin/bb-10.10-MDEV-31835 8fa040f25aeffb9c2ab62ff48a5948551a9a2bb3 2023-08-16T20:54:10+05:30
Second server: origin/10.10 afc64eacc992dd9736a0284926a34d9d5124aabf 2023-08-18T09:42:14+04:00
~ 3530 RQG tests were executed and only two failed
  - one with innodb_change_buffering != "none"
  - the other with false alarm

Comment by Marko Mäkelä [ 2023-08-21 ]

thiru, as far as I understand, you are trying to say that SET foreign_key_checks=0, unique_checks=0 is already changing the way how a multi-statement INSERT transactions work. Before MDEV-515 and MDEV-24818 were implemented, the outcome of your example

begin;
insert into t1 values(1);
insert into t1 values(2);
--error ER_DUP_ENTRY
insert into t1 values(2);
commit;

would always be that we would have the records (1),(2) in the table. With MDEV-515 and MDEV-24818, if the table is empty and the bulk insert is enabled, we would end up with an empty table.

I now see that even without this fix, we will ignore the IGNORE in the following example, and end up with an empty table in MariaDB Server 10.6.15:

--source include/have_innodb.inc
SET foreign_key_checks=0, unique_checks=0;
CREATE TABLE t1 (a INT UNIQUE) ENGINE=InnoDB;
BEGIN;
INSERT INTO t1 VALUES(1),(2);
INSERT IGNORE INTO t1 VALUES(2);
COMMIT;
SELECT * FROM t1;
DROP TABLE t1;

Only if the COMMIT is moved before the INSERT IGNORE, then we will end up with the result that I would expect: (1),(2).

I think that it would be helpful issue a warning "ignoring IGNORE" on the SQL layer if LOAD…IGNORE or INSERT…IGNORE will be ignored if it is being executed on an InnoDB table while unique_checks=0,foreign_key_checks=0 are in effect. serg, what do you think?

Comment by Marko Mäkelä [ 2023-08-22 ]

I tracked down the change that causes the above test to return an empty table. MDEV-24818 in MariaDB Server 10.6.0 would make the HA_EXTRA_IGNORE_INSERT redundant and basically cause the IGNORE clause to be ignored in bulk insert mode.

10.6 8ea923f55b7666a359ac2c54f6c10e8609d16846

SET foreign_key_checks=0, unique_checks=0;
CREATE TABLE t (a INT UNIQUE) ENGINE=InnoDB;
BEGIN;
INSERT INTO t VALUES(1),(2);
INSERT IGNORE INTO t VALUES(2);
Warnings:
Warning	1062	Duplicate entry '2' for key 'a'
COMMIT;
SELECT * FROM t;
a
DROP TABLE t;

10.6 8dd35a2507f8d63ca8df9335d2c6072d5c0e3b86

SET foreign_key_checks=0, unique_checks=0;
CREATE TABLE t (a INT UNIQUE) ENGINE=InnoDB;
BEGIN;
INSERT INTO t VALUES(1),(2);
INSERT IGNORE INTO t VALUES(2);
Warnings:
Warning	1062	Duplicate entry '2' for key 'a'
COMMIT;
SELECT * FROM t;
a
DROP TABLE t;

In other words, the HA_EXTRA_IGNORE_INSERT can be safely removed. It could be worth filing a separate bug for issuing a warning that the IGNORE clause of INSERT or LOAD statements will be ignored in bulk insert mode.

Comment by Marko Mäkelä [ 2023-08-22 ]

Please mention in the commit message that the second commit of MDEV-24818 made the HA_EXTRA_IGNORE_INSERT redundant.

Comment by Thirunarayanan Balathandayuthapani [ 2023-08-22 ]

commit 8ea923f55b7666a359ac2c54f6c10e8609d16846
Author: Marko Mäkelä <marko.makela@mariadb.com>
Date:   Tue Mar 16 15:21:34 2021 +0200
 
    MDEV-24818: Optimize multi-statement INSERT into an empty table
    
    If the user "opts in" (as in the parent
    commit 92b2a911e516ded986945d5d3dc6326f5575a21a),
    we can optimize multiple INSERT statements to use table-level locking
    and undo logging.
    
    There will be a change of behavior:
    
        CREATE TABLE t(a PRIMARY KEY) ENGINE=InnoDB;
        SET foreign_key_checks=0, unique_checks=0;
        BEGIN; INSERT INTO t SET a=1; INSERT INTO t SET a=1; COMMIT;
    
    will end up with an empty table, because in case of an error,
    the entire transaction will be rolled back, instead of rolling
    back the failing statement. Previously, the second INSERT statement
    would have been logged row by row, and only that second statement
    would have been rolled back, leaving the first INSERT intact.
    
    lock_table_x_unlock(), trx_mod_table_time_t::WAS_BULK: Remove.
    Because we cannot really support statement rollback in this
    optimized mode, we will not optimize the locking. The exclusive
    table lock will be held until the end of the transaction.

Above commit has made the HA_EXTRA_IGNORE_INSERT redundant.

Comment by Sergei Golubchik [ 2023-08-22 ]

marko

I think that it would be helpful issue a warning "ignoring IGNORE" on the SQL layer if LOAD…IGNORE or INSERT…IGNORE will be ignored if it is being executed on an InnoDB table while unique_checks=0,foreign_key_checks=0 are in effect.

No, I don't think it should be a warning. You don't execute the command that user wanted, but do something different, this produces incorrect (according to the semantics of the user issued command) result. So I am sure this should be an error — if you cannot do what user wants, you should not proceed doing something else instead.

Come to think of it, IGNORE doesn't make much sense when unique_checks=0, perhaps the server can error out early and don't even pass it to the engine?

Comment by Marko Mäkelä [ 2023-08-22 ]

serg, thank you. I have filed MDEV-31985 for the missing error handling.

Comment by Michael Widenius [ 2023-10-31 ]

unique_checks may only disable checking of UNIQUE .. HASH.
Primary keys in all engines I know of are always checked!
This means that IGNORE is still relevant even if unique_checks=0

Generated at Thu Feb 08 10:26:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.