[MDEV-32268] GNU libc posix_fallocate() may be extremely slow Created: 2023-09-27  Updated: 2024-01-18  Resolved: 2024-01-18

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: None
Fix Version/s: 10.4.33, 10.5.24, 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3, 11.3.2, 11.4.1

Type: Bug Priority: Critical
Reporter: Valerii Kravchuk Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
Relates
relates to MDEV-11687 innodb_use_fallocate has no effect Closed

 Description   

After the "fix" for MDEV-11687 we always use posix_fallocate() in the os_file_set_size (see https://github.com/MariaDB/server/blob/b0763f509a3f120e882cee2810495d169ff344a4/storage/innobase/os/os0file.cc#L4926):

...
# ifdef HAVE_POSIX_FALLOCATE
	int err;
	do {
		if (fstat(file, &statbuf)) {
			err = errno;
		} else {
			os_offset_t current_size = statbuf.st_size;
			if (current_size >= size) {
				return true;
			}
			current_size &= ~4095ULL;
			err = posix_fallocate(file, current_size,
					      size - current_size);
		}
	} while (err == EINTR
		 && srv_shutdown_state <= SRV_SHUTDOWN_INITIATED);
...

while in some cases (like ALTER TABLE ... ALGORITHM=COPY for COMPRESSED table and datadir on a NAS/NFS mount) using alternative allocation method provides performance benefits.

Consider the following test case on 10.1 that proves the point to some extent on a local NFS mount on current Ubuntu 22.04:

openxs@ao756:~/dbs/maria10.1/mysql-test$ ./mtr --mysqld="--innodb_file_format=Barracuda" --mysqld="--innodb_buffer_pool_size=32M" --mysqld="--innodb_use_fallocate=1" innodb.slow_alter
Logging: ./mtr  --mysqld=--innodb_file_format=Barracuda --mysqld=--innodb_buffer_pool_size=32M --mysqld=--innodb_use_fallocate=1 innodb.slow_alter
vardir: /home/openxs/dbs/maria10.1/mysql-test/var
Checking leftover processes...
Removing old var directory...
 - WARNING: Using the 'mysql-test/var' symlink
Creating var directory '/home/openxs/dbs/maria10.1/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.49-MariaDB
 - SSL connections supported
Sphinx 'indexer' binary not found, sphinx suite will be skipped
Collecting tests...
Installing system database...
 
==============================================================================
 
TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
 
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
set default_storage_engine=innodb;
show global variables like '%fallocate'||
Variable_name	Value
innodb_use_fallocate	ON
create table tbig(id int auto_increment primary key, c1 char(100)) ROW_FORMAT=COMPRESSED;
insert into tbig(c1) values (repeat('a', 100));
select @now := now(6)||
@now := now(6)
2023-09-27 13:31:58.256905
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
select timestampdiff(microsecond, @now, now(6))||
timestampdiff(microsecond, @now, now(6))
86978103
show table status like 'tbig'||
Name	Engine	Version	Row_format	Rows	Avg_row_length	Data_length	Max_data_length	Index_length	Data_free	Auto_increment	Create_time	Update_time	Check_time	Collation	Checksum	Create_options	Comment
tbig	InnoDB	10	Compressed	1048731	60	63258624	0	0	3145728	1376221	2023-09-27 13:31:57	NULL	NULL	latin1_swedish_ci	NULL	row_format=COMPRESSED	
set session profiling = ON||
select count(*) from tbig||
count(*)
1048576
alter table tbig add column c2 char(200) default 'b'||
set session profiling = OFF||
show profiles||
Query_ID	Duration	Query
1	1.69141316	select count(*) from tbig
2	385.22975249	alter table tbig add column c2 char(200) default 'b'
drop table tbig||
innodb.slow_alter 'xtradb'               [ pass ]  474460
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 474.460 of 562 seconds executing testcases
 
Completed: All 1 tests were successful.
 
openxs@ao756:~/dbs/maria10.1/mysql-test$ ./mtr --mysqld="--innodb_file_format=Barracuda" --mysqld="--innodb_buffer_pool_size=32M" --mysqld="--innodb_use_fallocate=0" innodb.slow_alter
Logging: ./mtr  --mysqld=--innodb_file_format=Barracuda --mysqld=--innodb_buffer_pool_size=32M --mysqld=--innodb_use_fallocate=0 innodb.slow_alter
vardir: /home/openxs/dbs/maria10.1/mysql-test/var
Checking leftover processes...
Removing old var directory...
 - WARNING: Using the 'mysql-test/var' symlink
Creating var directory '/home/openxs/dbs/maria10.1/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.49-MariaDB
 - SSL connections supported
Sphinx 'indexer' binary not found, sphinx suite will be skipped
Collecting tests...
Installing system database...
 
==============================================================================
 
TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
 
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
set default_storage_engine=innodb;
show global variables like '%fallocate'||
Variable_name	Value
innodb_use_fallocate	OFF
create table tbig(id int auto_increment primary key, c1 char(100)) ROW_FORMAT=COMPRESSED;
insert into tbig(c1) values (repeat('a', 100));
select @now := now(6)||
@now := now(6)
2023-09-27 13:46:34.280800
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
select timestampdiff(microsecond, @now, now(6))||
timestampdiff(microsecond, @now, now(6))
86275891
show table status like 'tbig'||
Name	Engine	Version	Row_format	Rows	Avg_row_length	Data_length	Max_data_length	Index_length	Data_free	Auto_increment	Create_time	Update_time	Check_time	Collation	Checksum	Create_options	Comment
tbig	InnoDB	10	Compressed	1048681	59	62734336	0	0	3145728	1376221	2023-09-27 13:46:33	NULL	NULL	latin1_swedish_ci	NULL	row_format=COMPRESSED	
set session profiling = ON||
select count(*) from tbig||
count(*)
1048576
alter table tbig add column c2 char(200) default 'b'||
set session profiling = OFF||
show profiles||
Query_ID	Duration	Query
1	1.91823956	select count(*) from tbig
2	377.63216720	alter table tbig add column c2 char(200) default 'b'
drop table tbig||
innodb.slow_alter 'xtradb'               [ pass ]  466519
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 466.519 of 555 seconds executing testcases
 
Completed: All 1 tests were successful.
 
openxs@ao756:~/dbs/maria10.1/mysql-test$ ls -l | grep var
drwxrwxr-x  8 openxs openxs   4096 вер 27 10:56 oldvar
lrwxrwxrwx  1 openxs openxs      4 вер 27 12:09 var -> /mnt
openxs@ao756:~/dbs/maria10.1/mysql-test$ mount | grep /mnt
127.0.0.1:/home/openxs/share on /mnt type nfs4 (rw,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)

Here we have 378 seconds to ALTER the table that is twice as large as the buffer pool with innodb_use_fallocate = OFF vs 385 where it is ON. I expect bigger difference when table size increases, but wanted to keep the test fast enough to be pracical on my slow HDD.

Users reported much more serious impact (that prevents the use of 10.4+ for table terabytes in size) and upgrade from pre-10.2.4 versions.



 Comments   
Comment by Marko Mäkelä [ 2023-09-28 ]

I think that the option name innodb_use_fallocate would be inappropriate and misleading here.

There are 3 ways of setting the file size:

  1. Logical: "sparsely" extend the file so that the last bytes beyond the physical end will read as 0 (ASCII NUL):
    This is POSIX ftruncate(), and this is what would be beneficial on NFS but questionable anywhere else, because:
    • We cannot ensure that the space is actually available, and a future pwrite or similar could fail with ENOSPC
    • future pwrite could cause the file to be fragmented
    • We allow this for page_compressed tables, because those are designed to be heavily fragmented anyway.
  2. Physical with pwrite() or write(): write actual 0 bytes to the file:
    • The slowest.
  3. Physical with posix_fallocate(), fallocate() or similar:
    This is preferable on any file system that supports it, such as Linux ext4fs.
    • The maintenance debt or "price to pay later" would be that future writes will have to update metadata to say that the page has been initialized, and therefore fdatasync() after an O_DIRECT write would not be a no-op; see MDEV-24854.
    • GNU libc implements a fallback, others don’t (see MDEV-16015).

Basically, starting with MariaDB Server 10.2, we always behave as if innodb_use_fallocate=ON, that is, we use fallocate() whenever we think it makes sense.

I think that new parameter name should be something like innodb_sparse_allocate or innodb_logical_allocate. In that way, it would also be OS agnostic; the system call interface names are something completely different on Microsoft Windows.

I am not against adding a new parameter. This should be rather straightforward to implement: just add something like

if (innodb_sparse_allocate)
	is_sparse = true;

to the start of os_file_set_size(), and add a MYSQL_SYSVAR_BOOL binding to ha_innodb.cc.

Comment by Marko Mäkelä [ 2023-12-05 ]

I took a closer look at the code. In a 10.1 after-merge fix of MDEV-11520, a condition in fil_space_extend_must_retry() was revised:

	if (srv_use_posix_fallocate && !is_sparse) {
// … invoke posix_fallocate() to physically extend the file
	}
	else
	if (is_sparse) {
// … invoke ftruncate() to logically extend the file
        } else {
// … invoke pwrite() to physically extend the file
	}

It appears to me that only files that are associated with page_compressed tables could be logically extended by ftruncate(). The setting innodb_use_fallocate=FALSE would disallow the use of physical allocation via posix_fallocate(), and always cause the original pwrite() code path to be used. The code was later refactored to invoke the function os_file_set_size(), but the logic remained the same, as far as I can tell.

I do not see how reintroducing the parameter innodb_use_fallocate would help here. We would be allocating pages to the file anyway. If fallocate() returns EOPNOTSUPP on some file system (MDEV-16015), then the implementation of posix_fallocate() or some code in InnoDB will fall back to extending the file by pwrite() or write() or similar system calls. I can’t imagine how invoking the fallocate() system call would cause a performance regression.

The support ticket specifically mentions page_compressed tables and not ROW_FORMAT=COMPRESSED, which is what the test case in the Description is using. As far as I can tell, files of anything else than page_compressed tables were always extended by physical writes or allocation of pages, while files of page_compressed tables should be extended logically by ftruncate(), because those files will be sparse by design.

valerii, can you provide some strace or similar output as a proof what is actually going on?

Comment by Valerii Kravchuk [ 2023-12-05 ]

It's clear from my test case we speak about COMPRESSED tables here:

...
create table tbig(id int auto_increment primary key, c1 char(100)) ROW_FORMAT=COMPRESSED;
...

and this is what customer cares about. We discussed page_compressed maybe in the issue or related issues, but it does not provide enough compression benefit for the customer.

So, what statement(s) and in what version (10.1?) do you want me to strace, in a hope to make what exact point?

Comment by Marko Mäkelä [ 2023-12-19 ]

I had to jump through some hoops to compile MariaDB Server 10.1 on my system. With GCC 13 it was doomed, because there is a class byte defined somewhere in the libstdc++ headers. With GCC 10 I succeeded, after making some small modifications to the latest 10.1 branch. I used the following test:

--source include/have_innodb.inc
--source include/have_sequence.inc
create table tbig(id int auto_increment primary key, c1 char(100))
ROW_FORMAT=COMPRESSED ENGINE=InnoDB;
SET STATEMENT unique_checks=0,foreign_key_checks=0 FOR
insert into tbig select seq, repeat('a', 100) from seq_1_to_1048576;
SET profiling=ON;
select count(*) from tbig;
alter table tbig add column c2 char(200) default 'b';
SET profiling=OFF;
show profiles;
drop table tbig;

I invoked this as

./mtr --mysqld=--innodb-{file-format=barracuda,use-fallocate=0}

Between MariaDB Server 5.5 and 10.1, there were 2 implementations of InnoDB included: a dynamically linked innodb_plugin (storage/innobase) and a statically linked xtradb. The former is much faster on my NVMe on Linux 6.5.13 and ext4 file system:

engine innodb_use_fallocate ALTER/s total time/s
innodb_plugin 0 15.634 24.108
xtradb 0 57.568 93.766
innodb_plugin 1 15.423 23.700
xtradb 1 58.082 97.253
10.6 with unique_checks N/A 7.062 15.402
10.6 N/A 5.926 17.955

I can confirm that innodb_use_fallocate=1 leads to a slightly faster result on the 10.1 XtraDB.

I was surprised to see XtraDB being so much slower, but that’s not the point of this investigation. MariaDB Server 10.6 is faster than both. (Note: MDEV-11369 instant ADD COLUMN is not available for ROW_FORMAT=COMPRESSED tables.)

There is some variance in the numbers, which I noticed when running the 10.6 test another time, with the MDEV-515 optimization disabled. The ALTER TABLE was faster as expected, but for some reason the total execution time was a little slower.

To better highlight the difference between innodb_use_fallocate=0 and innodb_use_fallocate=1 I revised the test so that it would create and populate an uncompressed table first:

--source include/have_innodb.inc
--source include/have_sequence.inc
create table tbig(id int auto_increment primary key, c1 char(100))
ROW_FORMAT=DYNAMIC ENGINE=InnoDB;
SET STATEMENT unique_checks=0,foreign_key_checks=0 FOR
insert into tbig select seq, repeat('a', 100) from seq_1_to_262144;
SET profiling=ON;
alter table tbig add column c2 char(200) default 'b', row_format=compressed;
SET profiling=OFF;
show profiles;
drop table tbig;

I also specified innodb_compression_level=0 so that less time would be spent on compressing data and more on extending the file. The time consumed by ALTER TABLE was about 1 second longer with innodb_use_fallocate=1. I also tried applying the following patch, and it did not change this:

diff --git a/storage/xtradb/os/os0file.cc b/storage/xtradb/os/os0file.cc
index 66af4a39f7c..cb269859f50 100644
--- a/storage/xtradb/os/os0file.cc
+++ b/storage/xtradb/os/os0file.cc
@@ -2698,7 +2698,7 @@ os_file_set_size(
 		do {
 			os_offset_t current_size = os_file_get_size(file);
 			err = current_size >= size
-				? 0 : posix_fallocate(file, current_size,
+				? 0 : fallocate(file, 0, current_size,
 						      size - current_size);
 		} while (err == EINTR
 			 && srv_shutdown_state == SRV_SHUTDOWN_NONE);

Note: This patch is incorrect; fallocate() on Linux returns 0 or -1, and errno would be set to what the surrounding code expects.

I went on to run the faster test variant on 10.6, with and without a patch to replace posix_fallocate() with fallocate(). The ALTER TABLE time would be cut to about 10% (1.3 seconds instead of 13 seconds). Setting innodb_compression_level=0 had little effect. I do not see any difference between applying or not applying the patch to 10.6:

diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
index 7628939f2dd..9e21978f410 100644
--- a/storage/innobase/os/os0file.cc
+++ b/storage/innobase/os/os0file.cc
@@ -3169,7 +3169,7 @@ os_file_set_size(
 		return(success);
 	}
 
-# ifdef HAVE_POSIX_FALLOCATE
+# if defined HAVE_POSIX_FALLOCATE || defined __linux__
 	int err;
 	do {
 		if (fstat(file, &statbuf)) {
@@ -3181,8 +3181,16 @@ os_file_set_size(
 				return true;
 			}
 			current_size &= ~4095ULL;
+#  ifdef __linux__
+			if (!fallocate(file, 0, current_size,
+				       size - current_size)) {
+				return true;
+			}
+			err = errno;
+#  else
 			err = posix_fallocate(file, current_size,
 					      size - current_size);
+#  endif
 		}
 	} while (err == EINTR
 		 && srv_shutdown_state <= SRV_SHUTDOWN_INITIATED);

I think that this needs to be tested in a NFS or NAS environment. I have not set up one. valerii, can you help? In the above 10.6 patch, you could also try replacing the first #if with #if 0 so that fallocate() will not be invoked in any form.

Comment by Marko Mäkelä [ 2023-12-19 ]

Because the performance regression was reported for 10.4, I tested the current 10.4 branch as well, using similar build options as 10.1 (only GCC 13 instead of GCC 10). The ALTER TABLE statement in the original test took 6.09 seconds, which is not that much worse from 10.6. With the smaller test and innodb_compression_level=0, it was 1.8 to 2.2 seconds, again, not much worse than the 1.3 or 1.4 seconds on 10.6.

As noted already, results from a fast local NVMe drive cannot be generalized to an environment that uses NFS.

Comment by Valerii Kravchuk [ 2023-12-19 ]

Baseline test (using Marko's recent rest case, that is IMHO oversimplified, the problem was to ALTER already COMPRESSED table originally) on 10.1 with NFS-mounted datadir:

openxs@ao756:~/dbs/maria10.1/mysql-test$ cat suite/innodb/t/MDEV-32268.test
--source include/have_innodb.inc
--source include/have_sequence.inc
create table tbig(id int auto_increment primary key, c1 char(100))
ROW_FORMAT=DYNAMIC ENGINE=InnoDB;
SET STATEMENT unique_checks=0,foreign_key_checks=0 FOR
insert into tbig select seq, repeat('a', 100) from seq_1_to_262144;
SET profiling=ON;
alter table tbig add column c2 char(200) default 'b', row_format=compressed;
SET profiling=OFF;
show profiles;
drop table tbig;
openxs@ao756:~/dbs/maria10.1/mysql-test$ ls -l | grep var
drwxrwxr-x  8 openxs openxs   4096 вер 27 10:56 oldvar
lrwxrwxrwx  1 openxs openxs      4 вер 27 12:09 var -> /mnt
openxs@ao756:~/dbs/maria10.1/mysql-test$ mount | grep /mnt
127.0.0.1:/home/openxs/share on /mnt type nfs4 (rw,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1)

Now the results with --innodb_use_fallocate=1 (rest of settings are the same as for my original test, as I've got no suggestions to the contrary):

openxs@ao756:~/dbs/maria10.1/mysql-test$ ./mtr --mysqld="--innodb_file_format=Barracuda" --mysqld="--innodb_buffer_pool_size=32M" --mysqld="--innodb_use_fallocate=1" innodb.MDEV-32268
Logging: ./mtr  --mysqld=--innodb_file_format=Barracuda --mysqld=--innodb_buffer_pool_size=32M --mysqld=--innodb_use_fallocate=1 innodb.MDEV-32268
vardir: /home/openxs/dbs/maria10.1/mysql-test/var
Checking leftover processes...
Removing old var directory...
 - WARNING: Using the 'mysql-test/var' symlink
Creating var directory '/home/openxs/dbs/maria10.1/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.49-MariaDB
 - SSL connections supported
Sphinx 'indexer' binary not found, sphinx suite will be skipped
Collecting tests...
Installing system database...
 
==============================================================================
 
TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
 
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
create table tbig(id int auto_increment primary key, c1 char(100))
ROW_FORMAT=DYNAMIC ENGINE=InnoDB;
SET STATEMENT unique_checks=0,foreign_key_checks=0 FOR
insert into tbig select seq, repeat('a', 100) from seq_1_to_262144;
SET profiling=ON;
alter table tbig add column c2 char(200) default 'b', row_format=compressed;
SET profiling=OFF;
show profiles;
Query_ID	Duration	Query
1	75.78709445	alter table tbig add column c2 char(200) default 'b', row_format=compressed
drop table tbig;
innodb.MDEV-32268 'xtradb'               [ pass ]  92725
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 92.725 of 181 seconds executing testcases
 
Completed: All 1 tests were successful.

With --innodb_use_fallocate=0 we have SLOWER execution for this test:

openxs@ao756:~/dbs/maria10.1/mysql-test$ ./mtr --mysqld="--innodb_file_format=Barracuda" --mysqld="--innodb_buffer_pool_size=32M" --mysqld="--innodb_use_fallocate=0" innodb.MDEV-32268
Logging: ./mtr  --mysqld=--innodb_file_format=Barracuda --mysqld=--innodb_buffer_pool_size=32M --mysqld=--innodb_use_fallocate=0 innodb.MDEV-32268
vardir: /home/openxs/dbs/maria10.1/mysql-test/var
Checking leftover processes...
Removing old var directory...
 - WARNING: Using the 'mysql-test/var' symlink
Creating var directory '/home/openxs/dbs/maria10.1/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.49-MariaDB
 - SSL connections supported
Sphinx 'indexer' binary not found, sphinx suite will be skipped
Collecting tests...
Installing system database...
 
==============================================================================
 
TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
 
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
create table tbig(id int auto_increment primary key, c1 char(100))
ROW_FORMAT=DYNAMIC ENGINE=InnoDB;
SET STATEMENT unique_checks=0,foreign_key_checks=0 FOR
insert into tbig select seq, repeat('a', 100) from seq_1_to_262144;
SET profiling=ON;
alter table tbig add column c2 char(200) default 'b', row_format=compressed;
SET profiling=OFF;
show profiles;
Query_ID	Duration	Query
1	98.01038599	alter table tbig add column c2 char(200) default 'b', row_format=compressed
drop table tbig;
innodb.MDEV-32268 'xtradb'               [ pass ]  117305
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 117.305 of 201 seconds executing testcases
 
Completed: All 1 tests were successful.

Now my ORIGINAL test case:

openxs@ao756:~/dbs/maria10.1/mysql-test$ cat suite/innodb/t/slow_alter.test 
source include/have_innodb.inc;
set default_storage_engine=innodb;
delimiter ||;
show global variables like '%fallocate'||
create table tbig(id int auto_increment primary key, c1 char(100)) ROW_FORMAT=COMPRESSED;
insert into tbig(c1) values (repeat('a', 100));
select @now := now(6)||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
select timestampdiff(microsecond, @now, now(6))||
 
show table status like 'tbig'||
 
set session profiling = ON||
select count(*) from tbig||
alter table tbig add column c2 char(200) default 'b'||
set session profiling = OFF||
show profiles||
 
drop table tbig||

still shows some positive impact of NOT using posix_fallocate when ALTER is performed for COMPRESSED table of a proper size vs buffer pool (more carefully scaled down based on the real life use case):

openxs@ao756:~/dbs/maria10.1/mysql-test$ ./mtr --mysqld="--innodb_file_format=Barracuda" --mysqld="--innodb_buffer_pool_size=32M" --mysqld="--innodb_use_fallocate=0" innodb.slow_alter
Logging: ./mtr  --mysqld=--innodb_file_format=Barracuda --mysqld=--innodb_buffer_pool_size=32M --mysqld=--innodb_use_fallocate=0 innodb.slow_alter
vardir: /home/openxs/dbs/maria10.1/mysql-test/var
Checking leftover processes...
Removing old var directory...
 - WARNING: Using the 'mysql-test/var' symlink
Creating var directory '/home/openxs/dbs/maria10.1/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.49-MariaDB
 - SSL connections supported
Sphinx 'indexer' binary not found, sphinx suite will be skipped
Collecting tests...
Installing system database...
 
==============================================================================
 
TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
 
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
set default_storage_engine=innodb;
show global variables like '%fallocate'||
Variable_name	Value
innodb_use_fallocate	OFF
create table tbig(id int auto_increment primary key, c1 char(100)) ROW_FORMAT=COMPRESSED;
insert into tbig(c1) values (repeat('a', 100));
select @now := now(6)||
@now := now(6)
2023-12-19 18:02:05.992570
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
select timestampdiff(microsecond, @now, now(6))||
timestampdiff(microsecond, @now, now(6))
84892724
show table status like 'tbig'||
Name	Engine	Version	Row_format	Rows	Avg_row_length	Data_length	Max_data_length	Index_length	Data_free	Auto_increment	Create_time	Update_time	Check_time	Collation	Checksum	Create_options	Comment
tbig	InnoDB	10	Compressed	1048777	60	63782912	0	0	3145728	1376221	2023-12-19 18:02:05	NULL	NULL	latin1_swedish_ci	NULL	row_format=COMPRESSED	
set session profiling = ON||
select count(*) from tbig||
count(*)
1048576
alter table tbig add column c2 char(200) default 'b'||
set session profiling = OFF||
show profiles||
Query_ID	Duration	Query
1	1.65288880	select count(*) from tbig
2	368.30020891	alter table tbig add column c2 char(200) default 'b'
drop table tbig||
innodb.slow_alter 'xtradb'               [ pass ]  455411
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 455.411 of 538 seconds executing testcases
 
Completed: All 1 tests were successful.
 
openxs@ao756:~/dbs/maria10.1/mysql-test$ ./mtr --mysqld="--innodb_file_format=Barracuda" --mysqld="--innodb_buffer_pool_size=32M" --mysqld="--innodb_use_fallocate=1" innodb.slow_alter
Logging: ./mtr  --mysqld=--innodb_file_format=Barracuda --mysqld=--innodb_buffer_pool_size=32M --mysqld=--innodb_use_fallocate=1 innodb.slow_alter
vardir: /home/openxs/dbs/maria10.1/mysql-test/var
Checking leftover processes...
Removing old var directory...
 - WARNING: Using the 'mysql-test/var' symlink
Creating var directory '/home/openxs/dbs/maria10.1/mysql-test/var'...
Checking supported features...
MariaDB Version 10.1.49-MariaDB
 - SSL connections supported
Sphinx 'indexer' binary not found, sphinx suite will be skipped
Collecting tests...
Installing system database...
 
==============================================================================
 
TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
 
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
set default_storage_engine=innodb;
show global variables like '%fallocate'||
Variable_name	Value
innodb_use_fallocate	ON
create table tbig(id int auto_increment primary key, c1 char(100)) ROW_FORMAT=COMPRESSED;
insert into tbig(c1) values (repeat('a', 100));
select @now := now(6)||
@now := now(6)
2023-12-19 18:17:26.729541
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
insert into tbig(c1) select c1 from tbig||
select timestampdiff(microsecond, @now, now(6))||
timestampdiff(microsecond, @now, now(6))
79631144
show table status like 'tbig'||
Name	Engine	Version	Row_format	Rows	Avg_row_length	Data_length	Max_data_length	Index_length	Data_free	Auto_increment	Create_time	Update_time	Check_time	Collation	Checksum	Create_options	Comment
tbig	InnoDB	10	Compressed	1048749	61	64831488	0	0	3145728	1376221	2023-12-19 18:17:26	NULL	NULL	latin1_swedish_ci	NULL	row_format=COMPRESSED	
set session profiling = ON||
select count(*) from tbig||
count(*)
1048576
alter table tbig add column c2 char(200) default 'b'||
set session profiling = OFF||
show profiles||
Query_ID	Duration	Query
1	1.65457088	select count(*) from tbig
2	381.66136909	alter table tbig add column c2 char(200) default 'b'
drop table tbig||
innodb.slow_alter 'xtradb'               [ pass ]  463771
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 463.771 of 541 seconds executing testcases
 
Completed: All 1 tests were successful.

368 vs 381 is some difference.

So, question remains: should I use my original or Marko's test case on 10.4, 10.6 and 10.6 patched? I see no real point to use Marko's as it does NOT show any positive imact of innodb_use_fallocate = 0 on 10.1.

Comment by Marko Mäkelä [ 2023-12-19 ]

valerii, I believe that my test case should be equivalent to yours if you change the number of rows from 262144 to 1048576. I thought that the interesting part is the file extension in ALTER TABLE, not the compression of data pages before that, or the allocation of index record structures in the InnoDB buffer pool for your variant of INSERT…SELECT. One parameter that I completely forgot about is the size of the buffer pool.

On 10.6 my variant would be even faster (and a somewhat unfair comparison to 10.4) because it enables the MDEV-515 bulk insert and avoids excessive undo logging. If you add a slow shutdown and restart before the ALTER TABLE, then it should be a more fair comparison.

Can you run any version under strace -e fallocate,pwrite64 or similar, to see if the fallocate system call is reporting EOPNOTSUPP all the time, like https://www.linux-nfs.org/wiki/index.php/Fallocate suggests it could? Can you run 10.6 with the HAVE_POSIX_FALLOCATE logic disabled? If it is faster, then that would be a strong reason to bring back the setting innodb_use_fallocate=OFF so that this ‘shortcut’ system call could be brought back. An alternative would be that we keep track of file systems that support fallocate(). We already keep track of non-rotational devices (SSD) so that innodb_flush_neighbours=ON will be ignored on them.

Comment by Marko Mäkelä [ 2023-12-21 ]

valerii, could you execute finish in GDB when the breakpoint on fallocate64 is hit, so that we will see if it succeeds or fails? I assume that it fails on this setup, but I would like to be sure about that. You could also have checked the backtrace to ensure that the caller is posix_fallocate() inside InnoDB.

Comment by Marko Mäkelä [ 2024-01-12 ]

The customer replied that with a custom build that disables the posix_fallocate() calls, their ALTER TABLE data rate would be improved by almost 4×. The patch was as follows:

diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
index 71e48da0dec..8b62b391afe 100644
--- a/storage/innobase/os/os0file.cc
+++ b/storage/innobase/os/os0file.cc
@@ -4923,7 +4923,7 @@ os_file_set_size(
 		return(success);
 	}
 
-# ifdef HAVE_POSIX_FALLOCATE
+# if 0
 	int err;
 	do {
 		if (fstat(file, &statbuf)) {

Based on this, it makes sense to add back the Boolean parameter innodb_use_fallocate, and default it to ON. This customer would set that parameter to OFF in their environment.

ralf.gebhardt and serg, I wonder if we could handle this rather trivial change a bug fix. After all, it is fixing a regression that had been introduced in MariaDB Server 10.2. Basically, we would introduce a Boolean parameter and add a runtime check for it in the infrequently called function os_file_set_size().

Comment by Marko Mäkelä [ 2024-01-12 ]

There is an alternative that on Linux, we invoke fallocate(2) instead of posix_fallocate(). We do not know yet if the performance problem is caused by the fallocate() system call itself or by the EOPNOTSUPP fallback in GNU libc posix_fallocate(). I hope that the customer can test another build that implements this. Fewer configuration parameters would be better.

Comment by Marko Mäkelä [ 2024-01-12 ]

For the record, in GNU libc there is a generic implementation sysdeps/posix/posix_fallocate64.c that invokes pwrite() of 1 byte at a time, as well as a Linux specific sysdeps/unix/sysv/linux/posix_fallocate64.c that first tries fallocate() and then falls back to the generic implementation, which a #define renames to internal_fallocate64. There also exist variants without the 64 suffix.

In case the second custom build (of invoking fallocate() instead of posix_fallocate()) turns out to fix the performance regression, I have a hypothesis that could explain it. Quoting sysdeps/posix/posix_fallocate64.c:

  unsigned increment;
  {
    struct statfs64 f;
 
    if (__fstatfs64 (fd, &f) != 0)
      return errno;
    if (f.f_bsize == 0)
      increment = 512;
    else if (f.f_bsize < 4096)
      increment = f.f_bsize;
    else
      /* NFS clients do not propagate the block size of the underlying
         storage and may report a much larger value which would still
         leave holes after the loop below, so we cap the increment at
         4096.  */
      increment = 4096;
  }

Let us assume we write 1 byte every 4096 bytes. If the file is extended to 4 megabytes, it would be done by 1024 writes of 1 byte, every 4096 bytes. In case the NFS server uses 512-byte allocation block size, the physical size of the file would be 1024*512 = 0.5 megabytes instead of 4 megabytes, and the file would consist of 1024 fragments. When the file is being written with actual data later, it would have to be "defragmented" or "unsparsed" by the file system, which can be very expensive and depending on the file system implementation, block any concurrent I/O on that file system.

Comment by Sergei Golubchik [ 2024-01-12 ]

Just for the record, I think that preallocating space only ever makes sense for local storage, we don't know anything about physical layout of the network storage and cannot assume that preallocating there helps or even that it does anything at all.

That is, I think InnoDB should do fallocate() and if that fails — do not fallback to pwrite()

Comment by Marko Mäkelä [ 2024-01-18 ]

We got feedback from the customer. With the second custom build, they are back to 4× the write rate of the normal build that uses posix_fallocate(). So, my hypothesis about the sparse files seems to be right. The patch was as follows:

diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
index 71e48da0dec..075f838ebb1 100644
--- a/storage/innobase/os/os0file.cc
+++ b/storage/innobase/os/os0file.cc
@@ -4934,8 +4934,18 @@ os_file_set_size(
 				return true;
 			}
 			current_size &= ~4095ULL;
+#  ifdef __linux__
+			if (!fallocate(file, 0, current_size,
+				       size - current_size)) {
+				err = 0;
+				break;
+			}
+
+			err = errno;
+#  else
 			err = posix_fallocate(file, current_size,
 					      size - current_size);
+#  endif
 		}
 	} while (err == EINTR
 		 && srv_shutdown_state <= SRV_SHUTDOWN_INITIATED);

I think that we’d better apply the fix to all our releases. We would not need any new configuration option at this point. The fallback code at the end of os_file_set_size(), which is writing 1-megabyte blocks of NUL bytes, has been demonstrated to work on Alpine Linux (which uses musl libc) and on some version of Docker for Microsoft Windows, if you look at MDEV-16015.

serg, sure, in a separate feature request we could introduce a configuration option to logically extend files (ftruncate(), which we currently use for extending files of page_compressed tables). On an update-in-place file system such as Linux ext4, that could lead to more likely breakage the a design assumption of InnoDB that writes cannot fail due to running out of space.

Generated at Thu Feb 08 10:30:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.