[MDEV-26782] InnoDB temporary tablespace: reclaiming of free space does not work Created: 2021-10-07  Updated: 2023-10-25  Resolved: 2023-04-21

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: None
Fix Version/s: 11.1.1, 10.11.3, 11.0.2, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Critical
Reporter: Jan Volf Assignee: Marko Mäkelä
Resolution: Fixed Votes: 8
Labels: None

Attachments: PNG File Screenshot_20220218_160215.png    
Issue Links:
Blocks
is blocked by MDEV-13013 InnoDB unnecessarily extends data files Closed
Relates
relates to MDEV-26790 InnoDB read-ahead may cause page writes Closed
relates to MDEV-28038 Why is there a big gap in the actual ... Open
relates to MDEV-28240 InnoDB Temporary Tablespace (ibtmp1) ... Closed
relates to MDEV-29593 Purge misses a chance to free not-yet... Closed
relates to MDEV-12227 Defer writes to the InnoDB temporary ... Closed
relates to MDEV-27886 innodb temporay tablespace files are ... Closed
relates to MDEV-28699 Shrink temporary tablespaces without ... Closed

 Description   

When using InnoDB engine for temporary tables, reclaiming of free table space does not seem to work.

Periodically performing statements that force the server to write into the InnoDB temporary tablespace eventually results in reaching the maximum size of the temporary tablespace, or even a full disk.

I was able to reproduce this behaviour by executing the following statements multiple times, even when the connection is dropped between executions:

create temporary table temp_random (
    id bigint primary key,
    random decimal not null
);
insert into temp_random(id, random) select seq, rand() from seq_0_to_2000000;
drop temporary table temp_random;

After each execution, the file /var/lib/mysql/ibtmp1 grows in size.

Expected behaviour:
The file /var/lib/mysql/ibtmp1 should grow in size only after the first execution. After the second and further executions, the file size should eventually stay unchanged, as the space claimed by the first execution should be reused by the subsequent executions.

Last version where the expected behaviour was observed: 10.5.11

Workaround: explicitly setting the engine for temporary tables to MyISAM



 Comments   
Comment by Marko Mäkelä [ 2021-10-07 ]

I tested the following in mtr (just one round of your test):

--source include/have_innodb.inc
--source include/have_sequence.inc
create temporary table temp_random (
    id bigint primary key,
    random decimal not null
);
insert into temp_random(id, random) select seq, rand() from seq_0_to_2000000;
drop temporary table temp_random;
 
sleep 3600;

While the test was sleeping, I executed the following:

od -Ax -t x1 mysql-test/var/mysqld.1/data/ibtmp1

The file was filled with NUL bytes, so thanks to MDEV-12227, it is not too bad.

After attaching GDB to the server I noticed that the pages belonging to the temporary table are still marked as dirty in the buffer pool, and they could unnecessarily be written to ibtmp1 if we ran out of buffer pool:

set $b=buf_pool.LRU.start
while ($b)
  if ($b.oldest_modification_)
    print/x $b.id_
  end
  set $b=$b.LRU.next
end

The buffer pool contained 131 dirty-marked blocks belonging to the temporary tablespace (0xfffffffe) and 7 dirty blocks belonging to persistent tablespaces.

  1. When marking pages that belong to temporary tables as free, we must mark the pages as clean, because they cannot possibly be needed.
  2. Furthermore, we must ensure that at the start of each round of this test, the allocation data structures of the temporary tablespace are identical (so that we reuse the same page numbers over and over again). This is somewhat related to MDEV-13013.
Comment by Marko Mäkelä [ 2021-10-07 ]

I did not verify the growth of the ibtmp1 file, but I can easily believe it.
For testing, it should be better to just insert a constant; it will be much faster than evaluating rand() for every inserted row.

Comment by Jan Volf [ 2021-10-08 ]

I don't know if it helps, but as I already stated in the description of the problem, version 10.5.11 does not seem to be affected. We first observed this problem after upgrading to version 10.6.4, when we suddenly ran out of disk space due to the ibtmp1 file growing beyond all expectations, effectively filling the free disk space.

Comment by Jon Wagoner [ 2022-02-03 ]

I think I'm hitting this issue in production as well. The ibtmp1 file continues to grow until it runs out of space. If I set a max size in innodb_temp_data_file_path I start getting Error 135: "No more room in record file" once the tmp file is full.

Comment by Oli Sennhauser [ 2022-02-18 ]

We can easily reproduce Jan's example on 10.7.1. See the attached graph.

aria_used_for_temp_tables ON
innodb_temp_data_file_path ibtmp1:12M:autoextend
default_tmp_storage_engine  

root@localhost [(none)]> show global status like '%tmp%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 0     |
| Created_tmp_files       | 5     |
| Created_tmp_tables      | 3     |
| Handler_tmp_write       | 14    |
| Rows_tmp_read           | 14    |
+-------------------------+-------+

root@localhost [(none)]> show global status like '%temp%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Com_create_temporary_table | 8     |
| Com_drop_temporary_table   | 8     |
+----------------------------+-------+

Comment by Marko Mäkelä [ 2022-02-18 ]

jwagoner, can you reproduce this on a newer 10.5 server than 10.5.11? Can anyone provide something easily executable for reproducing this?

Is this only repeatable with the Docker image? If yes, danblack is our resident Docker expert and could have some ideas.

I can’t think of anything that should significantly differ between 10.5 and 10.6 in this respect, except maybe io_uring (MDEV-24883) if your server is linked with liburing instead of libaio. If that is the case, I would like to know the Linux kernel version.

Would this be helped if you set innodb_immediate_scrub_data_uncompressed=ON to enable the MDEV-15528 hole-punching? In that case, you could use ls -ls ibtmp1 to see both the logical and physical sizes of the file.
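The logical vs. physical size comparison that ls -ls shows can also be sketched with stat (a helper of my own; the watch_size name and the use of GNU coreutils stat -c are assumptions, not from the thread):

```shell
#!/bin/sh
# Sketch: compare a file's logical (apparent) size with the bytes actually
# allocated on disk, like 'ls -ls' does. With hole-punching enabled, ibtmp1's
# physical size may be much smaller than its logical size.
# Assumes GNU coreutils 'stat'; the function name is hypothetical.
watch_size() {
  f=$1
  logical=$(stat -c %s "$f")               # apparent size in bytes
  physical=$(( $(stat -c %b "$f") * 512 )) # %b counts 512-byte blocks
  echo "$f logical=$logical physical=$physical"
}
```

Invoked as, e.g., watch_size /var/lib/mysql/ibtmp1, this prints both numbers on one line.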

Comment by Jon Wagoner [ 2022-02-18 ]

I am seeing this with 10.6.5 on a Gentoo system. My kernel version is 5.10.76. I don't see my mariadb binary linked against liburing or libaio, although I do have the latter on my system.

I ended up setting default_tmp_storage_engine to MyISAM to work around it for now.

Comment by Oli Sennhauser [ 2022-02-20 ]

@Marko: We do NOT see it on 10.5, only on 10.6 and newer, and with NO Docker, just a plain multi-instance set-up.
This sequence is enough to simulate the problem:
create temporary table temp_random (
id bigint primary key,
random decimal not null
);
insert into temp_random(id, random) select seq, rand() from seq_0_to_2000000;
drop temporary table temp_random;
The funny thing is that it should use an Aria table and not touch ibtmp1 at all, because aria_used_for_temp_tables is set to ON...

Comment by Marko Mäkelä [ 2022-02-21 ]

oli, thank you. With the following test case, I end up with an ibtmp1 file whose size is 140MiB. The setting innodb_immediate_scrub_data_uncompressed=1 has no impact on this.

--source include/have_innodb.inc
--source include/have_sequence.inc
create temporary table temp_random (
id bigint primary key, random decimal not null) engine=innodb;
insert into temp_random select seq, 0 from seq_0_to_2000000;
drop temporary table temp_random;

Note: Assigning random values to a non-indexed fixed-length column should not make any difference.

MDEV-12227 could play a role in this, at least when it comes to innodb_immediate_scrub_data_uncompressed=1. Ever since that change, for temporary tables we might only ever write pages that are evicted from the buffer pool, but never punch holes when pages in the temporary tablespace are freed.

In my test, I used a very small buffer pool size, so that buf_pool.LRU eviction would take place, causing data to be written to the temporary tablespace. I checked what the pages are being used for:

innochecksum -a 100000000000 -S var/mysqld.1/data/ibtmp1 

The nonzero part of the summary said the following:

page count  page type
      3898  Index page
      2046  Undo log page
         1  Inode page
      2886  Freshly allocated page
       128  System page

My attempt at disabling row-level undo logging with SET unique_checks=0, foreign_key_checks=0; (MDEV-515) had no effect.

If I split the insert into two transactions:

insert into temp_random select seq, 0 from seq_0_to_1000000;
insert into temp_random select seq, 0 from seq_1000001_to_2000000;

then the summary will look different, but the file size is still the same:

page count  page type
      3877  Index page
       931  Undo log page
         1  Inode page
      4022  Freshly allocated page
       128  System page

I tried to reduce the transaction sizes even further:

insert into temp_random select seq, 0 from seq_0_to_500000;
insert into temp_random select seq, 0 from seq_500001_to_1000000;
insert into temp_random select seq, 0 from seq_1000001_to_1500000;
insert into temp_random select seq, 0 from seq_1500001_to_2000000;

I monitored the file size growth. I think that the file grew from 77MiB to 140MiB during the execution of the last statement. The summary from innochecksum was as follows:

page count  page type
      3867  Index page
       404  Undo log page
         1  Inode page
      4559  Freshly allocated page
       128  System page
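When comparing these runs, the in-use pages in such a summary can be tallied mechanically (a small awk sketch of my own; the parsing and the used_pages name are assumptions based on the output quoted above):

```shell
#!/bin/sh
# Sketch: tally the pages that are actually in use (everything except
# "Freshly allocated page") from an innochecksum-style summary on stdin.
# The line format assumed is "<count> <page type>"; 'used_pages' is my
# own hypothetical helper name.
used_pages() {
  awk '/^[ \t]*[0-9]+[ \t]/ && !/Freshly allocated/ { used += $1 }
       END { print used + 0 }'
}
```

For example, piping the last summary above through used_pages would add up the Index, Undo log, Inode and System page counts.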

Note: Because the temporary tablespace pages are being written out from the 8MiB buffer pool when we are running out of space (LRU eviction), there is some fluctuation in the page type statistics. It should not matter much, because 8MiB is much smaller than the total file size 140MiB.

As it was already noted in MDEV-13013, we seem to be way too eager to enlarge the temporary tablespace file.

Also, it should be possible to shrink the temporary tablespace once all tables have been dropped. For persistent tablespaces, this is more tricky due to recovery considerations.

Comment by Oli Sennhauser [ 2022-02-22 ]

Customer just gave feedback: the problem does NOT happen in 10.5.15 and started occurring in 10.6.5 and 10.7.x. He is running on Windows.

Comment by Thirunarayanan Balathandayuthapani [ 2022-07-11 ]

In version 10.6.9:

I ran Marko's test case with --innodb_buffer_pool_size=1G. No LRU eviction or page flushing of the temporary tablespace happens when the buffer pool is big.

[~/mariarepo/10.6/10.6-latest/bld_latest-debug/mysql-test]$innochecksum -a 100000000000 -S var/mysqld.1/data/ibtmp1 
 
File::var/mysqld.1/data/ibtmp1
================PAGE TYPE SUMMARY==============
#PAGE_COUNT	PAGE_TYPE
===============================================
       0	Index page
       0	Undo log page
       0	Inode page
       0	Insert buffer free list page
    8960	Freshly allocated page
       0	Insert buffer bitmap
       0	System page
       0	Transaction system page
       0	File Space Header
       0	Extent descriptor page
       0	BLOB page
       0	Compressed BLOB page
       0	Other type of page
===============================================
Additional information:
Undo page type: 0 insert, 0 update, 0 other
Undo page state: 0 active, 0 cached, 0 to_free, 0 to_purge, 0 prepared, 0 other

The file size is:

[~/mariarepo/10.6/10.6-latest/bld_latest-debug/mysql-test]$ls -ls var/mysqld.1/data/ibtmp1 
143360 -rw-rw---- 1 thiru thiru 146800640 Jul 11 18:27 var/mysqld.1/data/ibtmp1

fil_space_extend_must_retry() was called twice, and the file was extended to 8960 pages.
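As a sanity check (my own arithmetic, assuming the default 16KiB innodb_page_size), 8960 pages account exactly for the 146800640 bytes shown by ls above:

```shell
# 8960 pages * 16384 bytes/page = 146800640 bytes = 140 MiB,
# matching the file size reported by 'ls -ls' above.
echo $(( 8960 * 16384 ))
```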

Trying the same test case in 10.5.15:

10.5.15 also behaves like 10.6.9. No changes.

Comment by Marko Mäkelä [ 2022-08-01 ]

I think that this needs to be tested and reevaluated after MDEV-13013 has been fixed.

Comment by Thirunarayanan Balathandayuthapani [ 2022-08-24 ]

Testing after MDEV-13013 patch in 10.6:

[~/mariarepo/10.6/10.6-latest/bld_latest-debug/mysql-test]$ls -ls var/mysqld.1/data/ibtmp1
143360 -rw-rw---- 1 thiru thiru 146800640 Aug 24 13:37 var/mysqld.1/data/ibtmp1
 
 
[~/mariarepo/10.6/10.6-latest/bld_latest-debug/extra]$./innochecksum -S ../mysql-test/var/mysqld.1/data/ibtmp1
 
File::../mysql-test/var/mysqld.1/data/ibtmp1
================PAGE TYPE SUMMARY==============
#PAGE_COUNT	PAGE_TYPE
===============================================
    3840	Index page
    2046	Undo log page
       2	Inode page
       0	Insert buffer free list page
    2943	Freshly allocated page
       0	Insert buffer bitmap
     128	System page
       0	Transaction system page
       1	File Space Header
       0	Extent descriptor page
       0	BLOB page
       0	Compressed BLOB page
       0	Page compressed page
       0	Page compressed encrypted page
       0	Other type of page
 
===============================================
Additional information:
Undo page type: 2046
Undo page state: 0 active, 0 cached, 0 to_purge, 0 prepared, 2046 other
index_id	#pages		#leaf_pages	#recs_per_page	#bytes_per_page
23		3840		3838		483		14969
 
index_id	page_data_bytes_histgram(empty,...,oversized)
23		0	0	0	0	0	2	0	00	0	3838	0

I don't see any improvement after the MDEV-13013 patch. We reduced the unnecessary extent reservation during fseg_create().
There are a few more optimizations that could be done; for example, instead of allocating 4 extents at a time, we could allocate 1 extent for the temporary tablespace.
We could also allocate rollback segments dynamically (that could reduce the size by 2 MiB at most).
But IIUC, the user is asking for a way to truncate the temporary tablespace.

Comment by Marko Mäkelä [ 2022-09-21 ]

I filed MDEV-29593 for part of this problem.

Comment by Hartmut Holzgraefe [ 2022-12-22 ]

Looks as if this is no longer reproducible in >= 10.6.10

Comment by Marko Mäkelä [ 2022-12-30 ]

hholzgra, could it be that you are confusing this bug with the 10.6 regression MDEV-28240 that was fixed in 10.6.10?

Comment by Marko Mäkelä [ 2023-01-10 ]

I believe that the most pressing issue was the bug MDEV-28240. For truncating the temporary tablespace, the ticket MDEV-28699 has been filed.

Comment by Marko Mäkelä [ 2023-03-24 ]

In MDEV-29593, I wrote the following:

The rseg.undo_cached is also being used for temporary undo logs (see trx_undo_assign_low() and trx_undo_reuse_cached()), but there is no purge of history or MVCC for temporary tables. For temporary tables, we could modify trx_undo_commit_cleanup() so that it will never add anything to rseg.undo_cached and instead just free the undo page after commit.

Comment by Marko Mäkelä [ 2023-03-27 ]

I ran a simple benchmark before and after this fix:

mariadbd --innodb-log-file-size=5m --innodb-buffer-pool-size=10m&
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13
do
  time echo 'create temporary table t(a int primary key, b int not null) engine=innodb; insert into t select seq,0 from seq_1_to_10000000; update t set b=a; drop table t;'|
  mariadb -uroot test&
done

The patch allows the space that was occupied by the row-level undo log from the INSERT…SELECT to be reused by the UPDATE operation. The real time reported for each connection was as follows:

revision  min(time)/s  avg(time)/s  max(time)/s
before    94.368       95.923       97.017
after     74.879       76.823       78.385

The storage was in /dev/shm and nproc reports 40 on this dual-socket system. During the test, top reported close to 1300% CPU usage for the mariadbd process, corresponding to the number of concurrent client connections.
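From the average times in the table above, the relative improvement can be computed (my own arithmetic; the figures are taken from the benchmark results quoted there):

```shell
# (95.923 - 76.823) / 95.923 * 100 = relative reduction in average
# wall-clock time brought by the fix, about 19.9%.
awk 'BEGIN { printf "%.1f%%\n", (95.923 - 76.823) / 95.923 * 100 }'
```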

Generated at Thu Feb 08 09:47:54 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.