[MDEV-11556] InnoDB redo log apply fails to adjust data file sizes Created: 2016-12-13  Updated: 2022-11-22  Resolved: 2016-12-30

Status: Closed
Project: MariaDB Server
Component/s: Backup, Storage Engine - InnoDB
Affects Version/s: 10.1.20
Fix Version/s: 10.1.21, 10.2.4

Type: Bug Priority: Major
Reporter: Andrii Nikitin (Inactive) Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None

Attachments: File my.cnf     File partial_uncommitted.test     File problem.tar.gz     File test_partial_uncommitted.log     File test_partial_uncommitted.sh    
Issue Links:
Problem/Incident
causes MDEV-12027 posix_fallocate returned OS error 201. Closed
causes MDEV-12075 innodb_use_fallocate does not work in... Closed
causes MDEV-22501 Various issues when using --innodb-da... Closed
Relates
relates to MDEV-11690 Remove UNIV_HOTBACKUP Closed
relates to MDEV-11915 Detect InnoDB system tablespace size ... Closed
relates to MDEV-12022 InnoDB wrongly ignores the end of an ... Closed
relates to MDEV-13822 Mariabackup --prepare --incremental s... Closed
relates to MDEV-14082 Enforcing innodb_open_files leads to ... Closed
relates to MDEV-18194 Incremental prepare tries to access p... Closed
relates to MDEV-30069 InnoDB: Trying to write ... bytes at ... Closed
relates to MDEV-11968 With innodb_page_size=8K crash with '... Closed
relates to MDEV-12893 innodb.log_data_file_size failed in b... Closed
relates to MDEV-14447 mariabackup --incremental --prepare f... Closed
relates to MDEV-23190 Assertion `id.page_no() < space.size'... Closed
relates to MDEV-26068 Server crashes when innodb-data-file-... Closed
Sprint: 10.2.4-5

 Description   

Prepare of a partial backup (taken with the --tables option) sporadically fails with the following error:

InnoDB: Error: tablespace size stored in header is 4864 pages, but
InnoDB: the sum of data file sizes is only 768 pages
InnoDB: Cannot start InnoDB. The tail of the system tablespace is
InnoDB: missing.

Attached are:
test_partial_uncommitted.sh - test which reliably demonstrates the problem in my environment
test_partial_uncommitted.log - corresponding log
my.cnf - config used
partial_uncommitted.test - an 'identical' mtr test which DOES NOT demonstrate the problem, for unknown reasons (the generated xtrabackup_info and backup-my.cnf are identical to those of the .sh test)

An interesting observation is that all tables are backed up; if this is not fixed as part of this issue, it should eventually be filed as a separate bug.



 Comments   
Comment by Vladislav Vaintroub [ 2016-12-13 ]

Could you also attach the compressed data directory and backup directory from the test? Thanks!

Comment by Marko Mäkelä [ 2016-12-15 ]

I looked at this.

tar xzf problem.tar.gz -C /dev/shm
xtrabackup --prepare --target-dir=/dev/shm/w1/bkup

The immediate problem is that backup-my.cnf contains the following line, which is handled incorrectly:

innodb_data_file_path=ibdata1:12M:autoextend

This is interpreted as a hard limit of 768 pages, while we should notice that there is no :max:12M or similar appended.

Another problem is that the XtraDB startup is done in the wrong order. By the time we wrongly decide to report the size mismatch, we have already applied the redo log and started the rollback of incomplete transactions.

Comment by Marko Mäkelä [ 2016-12-15 ]

Actually, the reason why this does not fail for normal XtraDB or InnoDB crash recovery is that files cannot be extended during redo log apply. In xtrabackup, this is possible, and the problem is that open_or_create_data_files() reads the file sizes as they were before the files were extended.

We should either extend the files before this point, by calling fil_extend_space_to_desired_size(), or we should relax the check. For the relaxation, I was thinking about the following:

diff --git a/storage/xtradb/srv/srv0start.cc b/storage/xtradb/srv/srv0start.cc
index b65366a4863..a6d38263ffd 100644
--- a/storage/xtradb/srv/srv0start.cc
+++ b/storage/xtradb/srv/srv0start.cc
@@ -2958,7 +2958,16 @@ innobase_start_or_create_for_mysql(void)
 
 	if (!srv_read_only_mode
 	    && srv_auto_extend_last_data_file
-	    && sum_of_data_file_sizes < tablespace_size_in_header) {
+	    && sum_of_data_file_sizes < tablespace_size_in_header
+	    /* In xtrabackup, open_or_create_data_files() initialized the
+	    srv_data_file_sizes[] before the redo log was applied.
+	    Because data files can be extended during xtrabackup
+	    redo log apply, we must relax the check. The files will
+	    be extended later in xtrabackup_prepare_func(), which
+	    invokes fil_extend_space_to_desired_size(). */
+	    && (!IS_XTRABACKUP()
+		|| (srv_last_file_size_max != 0/* :max: was specified */
+		    && srv_last_file_size_max < tablespace_size_in_header))) {
 
 		ut_print_timestamp(stderr);
 		fprintf(stderr,

However, this does not look correct to me. I think that we should first extend the files, then apply all redo log to those pages (at which point it becomes possible that data is actually written back to the files), and only after that start generating more redo log (start the rollback of incomplete transactions etc.)

In problem.tar.gz the last nonzero page of ibdata1 is 523, and the last page number for which recv_add_to_hash_table() is invoked on space=0 is page_no=522. (It is a bit strange that ibdata1 was extended from 768 to 4864 pages in the first place, with so many empty pages at the end. Maybe this is still a bug in the InnoDB page allocation?)

I think that we need a test case with innodb_file_per_table=0 that would cause ibdata1 to be extended such that there are redo log records to be applied for the extended area.

I fear that we may be losing redo log records in the current implementation, because by the time fil_extend_space_to_desired_size() is called, recv_apply_hashed_log_recs(TRUE) would already have applied all redo log records from recv_sys->addr_hash, including any for the ‘missing’ pages that would be created by fil_extend_space_to_desired_size().

We can find out the total size of the system tablespace already when scanning the redo log, like this:

diff --git a/storage/xtradb/log/log0recv.cc b/storage/xtradb/log/log0recv.cc
index 10dbfdfae6b..cdf166fbdbc 100644
--- a/storage/xtradb/log/log0recv.cc
+++ b/storage/xtradb/log/log0recv.cc
@@ -2253,6 +2253,7 @@ recv_parse_log_rec(
 	}
 #endif /* UNIV_LOG_LSN_DEBUG */
 
+	byte*	old_ptr = new_ptr;
 	new_ptr = recv_parse_or_apply_log_rec_body(*type, new_ptr, end_ptr,
 						   NULL, NULL, *space);
 	if (UNIV_UNLIKELY(new_ptr == NULL)) {
@@ -2260,6 +2261,13 @@ recv_parse_log_rec(
 		return(0);
 	}
 
+	if (*space == 0 && *page_no == 0 && *type == MLOG_4BYTES
+	    && mach_read_from_2(old_ptr) == FSP_HEADER_OFFSET + FSP_SIZE) {
+		ulint	size;
+		mach_parse_compressed(old_ptr + 2, end_ptr, &size);
+		fprintf(stderr, "size: %lu\n", (ulong) size);
+	}
+
 	if (*page_no > recv_max_parsed_page_no) {
 		recv_max_parsed_page_no = *page_no;
 	}

I think that we should extend the last system tablespace file before actual redo log application starts (recv_read_in_area() is called).

Comment by Marko Mäkelä [ 2016-12-20 ]

I got an idea how to write a test case for this. It would also make the InnoDB/XtraDB crash recovery properly support file extension.

SET GLOBAL innodb_file_per_table=0;
--source include/no_checkpoint_start.inc
CREATE TABLE t(a INT)ENGINE=InnoDB;
let $page= `SELECT page FROM INFORMATION_SCHEMA.INNODB_SYS_…`;
--source include/kill_mysqld.inc

After this, skip the test if the redo log checkpoint is after the checkpoint that we recorded before the CREATE TABLE, or if the ibdata1 file starting from $page is not zero-initialized.

If the preconditions are met, truncate the ibdata1 file at $page * innodb_page_size bytes and start the server to execute the following:

DROP TABLE t;

The DROP TABLE will access the B-tree root page which should have been in the truncated portion of the ibdata1 file. Without any patch, InnoDB should crash in some way when trying to access a non-existing page in the system tablespace file.

Comment by Marko Mäkelä [ 2016-12-20 ]

Actually the same test should also be applicable for innodb_file_per_table=1 when we truncate the file to 3 pages so that the clustered index root page (page number 3) will be missing. Something like this:

SET GLOBAL innodb_file_per_table=0;
--source include/no_checkpoint_start.inc
CREATE TABLE t(a INT)ENGINE=InnoDB;
let $page= `SELECT page FROM INFORMATION_SCHEMA.INNODB_SYS_…`;
SET GLOBAL innodb_file_per_table=1;
CREATE TABLE ibd(a INT)ENGINE=InnoDB;
--source include/kill_mysqld.inc

Then, check the LSN. If it is OK, truncate the ibd.ibd file. If also the page number is OK, truncate ibdata1. Finally start up the server and continue:

DROP TABLE t,ibd;

If an unwanted log checkpoint occurred during the two CREATE TABLE statements (that is, the checkpoint LSN in the redo log is too new), we will report the test as skipped after the above restart and DROP TABLE.

Comment by Marko Mäkelä [ 2016-12-23 ]

We should truncate .ibd files to 0, 1, 2, or 3 pages, and also to a fractional number of pages (an incomplete last page).
This could be done by creating a partitioned table and then truncating the ibd*.ibd files in different ways.

Comment by Marko Mäkelä [ 2016-12-27 ]

I developed a mysql-test case that truncates the data files as suggested above.
For the InnoDB system tablespace, there is the problem that during database initialization the ibdata1 file gets heavily extended in fseg_create() and btr_create(), even when plenty of free pages already exist in the tablespace. It would be very difficult to create a scenario where the CREATE TABLE itself extends the system tablespace; we would first have to fill the tablespace with enough junk pages. If the first available root page number is 328 and the file size is 768 pages, we would have to create some ‘padding’ table(s) to ensure that the CREATE TABLE statement uses page number 768 and extends the file.

If we simply truncate the ibdata1 file which had already been enlarged to the desired size, we cannot expect the revised recovery to work, because there would not be any redo log record for updating the file size in the tablespace header.

With single-table tablespaces, it is easier to construct the necessary scenario. Currently, startup would actually trigger MDEV-9282 because the .ibd files would be shorter than expected.

Comment by Marko Mäkelä [ 2016-12-28 ]

Please see bb-10.1-mdev-11556
The second commit removes a debug option that was introduced in the first commit and did not work correctly. I am keeping the two commits only for the record, in case someone wants to develop an mtr test case for triggering system tablespace extension and then killing the server.

With this patch applied, bb-10.1-wlad-xtrabackup was able to recover from the problematic dataset (it did extend the system tablespace).

Comment by Jan Lindström (Inactive) [ 2016-12-29 ]

Based on your comment "Currently, it is theoretically possible that some file pages are read from a tablespace file whose size has not been adjusted yet. The transaction system state is being restored concurrently with redo log apply. It could theoretically mean that the logic for extending the system tablespace or the undo log tablespaces is broken under some scenario, such as when undo log pages were written to the end of a file just after resizing. I will try to see if the calls to fil_recv_set_sizes() can be placed better.", there is at least some investigation needed; all other changes are OK to push.

Comment by Marko Mäkelä [ 2016-12-29 ]

I am going to try a revised fix that would introduce a field fil_space_t::recv_size. This field would be normally 0, and it would be set by recv_parse_log_rec(). When this field is nonzero for a tablespace that fil_io() is about to access, fil_io() would extend the tablespace on the spot, before posting the read or write request. This mechanism should guarantee that each file be resized before any redo log is applied.

Comment by Marko Mäkelä [ 2016-12-29 ]

Please see the updated bb-10.1-mdev-11556 branch.

The problem when extending the system tablespace with innodb_page_size=4k or 8k remains. It looks unrelated to my changes, but it must definitely be fixed before this patch is committed.

Comment by Marko Mäkelä [ 2016-12-29 ]

The remaining problem was solved by removing some inappropriate rounding of the InnoDB system tablespace size:

diff --git a/storage/xtradb/srv/srv0start.cc b/storage/xtradb/srv/srv0start.cc
index 8b04b51ce0b..34fb1f87bdf 100644
--- a/storage/xtradb/srv/srv0start.cc
+++ b/storage/xtradb/srv/srv0start.cc
@@ -192,9 +192,6 @@ static const ulint SRV_UNDO_TABLESPACE_SIZE_IN_PAGES =
 #define SRV_N_PENDING_IOS_PER_THREAD	OS_AIO_N_PENDING_IOS_PER_THREAD
 #define SRV_MAX_N_PENDING_SYNC_IOS	100
 
-/** The round off to MB is similar as done in srv_parse_megabytes() */
-#define CALC_NUMBER_OF_PAGES(size)  ((size) / (1024 * 1024)) * \
-				  ((1024 * 1024) / (UNIV_PAGE_SIZE))
 #ifdef UNIV_PFS_THREAD
 /* Keys to register InnoDB threads with performance schema */
 UNIV_INTERN mysql_pfs_key_t	io_handler_thread_key;
@@ -1029,15 +1026,12 @@ open_or_create_data_files(
 			size = os_file_get_size(files[i]);
 			ut_a(size != (os_offset_t) -1);
 
-			/* Under some error conditions like disk full
-			scenarios or file size reaching filesystem
-			limit the data file could contain an incomplete
-			extent at the end. When we extend a data file
-			and if some failure happens, then also the data
-			file could contain an incomplete extent.  So we
-			need to round the size downward to a megabyte.*/
+			/* If InnoDB encountered an error or was killed
+			while extending the data file, the last page
+			could be incomplete. */
 
-			rounded_size_pages = (ulint) CALC_NUMBER_OF_PAGES(size);
+			rounded_size_pages = static_cast<ulint>(
+				size >> UNIV_PAGE_SIZE_SHIFT);
 
 			if (i == srv_n_data_files - 1
 			    && srv_auto_extend_last_data_file) {

Comment by Jan Lindström (Inactive) [ 2016-12-29 ]

ok to push.

Comment by Marko Mäkelä [ 2016-12-29 ]

I repushed once more to bb-10.1-mdev-11556. There was a bug in my test file that caused it to fail when run against a non-debug server.

Comment by Marko Mäkelä [ 2016-12-30 ]

I pushed this to 10.1 and am now working on the merge to 10.2.

Comment by Marko Mäkelä [ 2019-01-24 ]

I did not merge this correctly to MariaDB 10.2. A follow-up adjustment will fix MDEV-18194.

Generated at Thu Feb 08 07:50:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.