[MDEV-18611] mariabackup silently ended during xtrabackup_copy_logfile() Created: 2019-02-18  Updated: 2021-09-30  Resolved: 2020-09-21

Status: Closed
Project: MariaDB Server
Component/s: mariabackup
Affects Version/s: 10.3.11, 10.2, 10.4
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Richard Stracke Assignee: Vladislav Lesin
Resolution: Won't Fix Votes: 2
Labels: None

Issue Links:
Duplicate
is duplicated by MDEV-18612 mariabackup: Error: xtrabackup_copy_l... Closed
is duplicated by MDEV-19492 Mariabackup hangs if table populated ... Stalled
Relates
relates to MDEV-17029 mariabackup fails under load with sch... Closed
relates to MDEV-23711 make mariabackup innodb redo log read... Closed

 Description   

Mariabackup silently stopped in some cases.

Regarding an analysis from vlad.lesin , it will be happen, if

log_sys.log.read_log_seg() returns 0, even if there was something wring during reading.

Variant 1:
log_sys.log.read_log_seg() returned true, but lsn == start_lsn.

) log_sys.log.read_log_seg() returned true, but lsn == start_lsn. If we look into log_t::files::read_log_seg(), we will see that the function can silently return true and do not advance lsn on the following condition:
=== log_t::files::read_log_seg()
if (block_number != log_block_convert_lsn_to_no(*start_lsn)) {
/* Garbage or an incompletely written log block.
We will not report any error, because this can
happen when InnoDB was killed while it was
writing redo log. We simply treat this as an
abrupt end of the redo log. */
end_lsn = *start_lsn;
break;
}

Variant 2:

Downloaded buffer is unequal to expected datasize.

 
=== log_t::files::read_log_seg()
if (dl < LOG_BLOCK_HDR_SIZE
|| (dl > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE
&& dl != OS_FILE_LOG_BLOCK_SIZE)) {
recv_sys->found_corrupt_log = true;
end_lsn = *start_lsn;
break;
}
=== log_t::files::read_log_seg()

But in this case the flag about corrupted log is set.

Suggested fix from vlad.lesin

 
===
--- a/extra/mariabackup/xtrabackup.cc
+++ b/extra/mariabackup/xtrabackup.cc
@@ -2728,13 +2728,19 @@ static bool xtrabackup_copy_logfile(bool last = false)
lsn_t lsn= start_lsn;
for (int retries= 0; retries < 100; retries++) {
if (log_sys.log.read_log_seg(&lsn, end_lsn)
- || lsn != start_lsn) {
+ && lsn != start_lsn) {
break;
}
msg("Retrying read of log at LSN=" LSN_PF "\n", lsn);
my_sleep(1000);
}
 
+ if (recv_sys->found_corrupt_log) {
+ msg("mariabackup: Error: xtrabackup_copy_logfile()"
+ " found corrupted log.\n");
+ exit(EXIT_FAILURE);
+ }
+
start_lsn = (lsn == start_lsn)
? 0 : xtrabackup_copy_log(start_lsn, lsn, last);
 
===



 Comments   
Comment by Marko Mäkelä [ 2019-02-19 ]

I would suggest the following patch against Mariabackup 10.2:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index bcd6254ebd6..5036939f4a5 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -807,7 +807,9 @@ log_group_read_log_seg(
 			happen when InnoDB was killed while it was
 			writing redo log. We simply treat this as an
 			abrupt end of the redo log. */
+fail:
 			end_lsn = *start_lsn;
+			success = false;
 			break;
 		}
 
@@ -829,9 +831,7 @@ log_group_read_log_seg(
 					    << log_block_get_checkpoint_no(buf)
 					    << " expected: " << crc
 					    << " found: " << cksum;
-				end_lsn = *start_lsn;
-				success = false;
-				break;
+				goto fail;
 			}
 
 			if (group->is_encrypted()) {
@@ -845,8 +845,7 @@ log_group_read_log_seg(
 		    || (dl > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE
 			&& dl != OS_FILE_LOG_BLOCK_SIZE)) {
 			recv_sys->found_corrupt_log = true;
-			end_lsn = *start_lsn;
-			break;
+			goto fail;
 		}
 	}
 

Comment by Marko Mäkelä [ 2019-02-19 ]

I pushed my version of the fix. It turns out that only mariabackup was affected; InnoDB crash recovery would ignore the return value of the function.

Comment by Vladislav Lesin [ 2019-03-12 ]

The customer still has the issue despite using mariabackup 10.3.13 with the fix for this task.

Comment by Vladislav Lesin [ 2019-09-26 ]

Support sent me the information about the size of innodb log files:

-rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile1
-rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile0

I have prepared mariabackup version with some extra logging for support team, and the following is the part of mariabackup log:

2019-09-26 0:00:11 0 [Note] InnoDB: Number of pools: 1
?func: ib_log: checkpoint 2926242 at 2875890640474 found, checkpoint offset 187681370
?func: ib_log: checkpoint 2926241 at 2875890640161 found, checkpoint offset 187681057
[00] 2019-09-26 00:00:11 mariabackup: Generating a list of tablespaces
2019-09-26 0:00:11 0 [Warning] InnoDB: Allocated tablespace ID 13715 for ---/---, old maximum was 0
[00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: start throttling
[00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: stop throttling
?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455
?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788
2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384
[00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384
?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455
?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788
2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384
[00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384
?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455
?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788
2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384
[00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384
?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455
?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788

Consider the following line:

?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788

(248264788-248789068)*512 = 268431360 - the difference between read and calculated log block number.

The redo log size is 268435456. Log header size is 2k. There are two files, i.e. the total header size is 4k. So 268435456 - 4096 = 268431360. This is exactly the amount of bytes necessary to make full loop in circular innodb log buffer. And this is the difference between read and calculated LSN we found above.

What we see is that mariabackup started at 00:00:11. The delay between the moment it read checkpoint lsn and the moment when it started to read redo log records from redo log(00:00:22) is 11s. And during this time server wrote greater or equal to 268431360 bytes in redo log, new checkpoint is started and redo log blocks were overwritten. But mariabackup did not refresh it's checkpoint info and that is why log block number read from redo log does not correspond to calculated redo log block number.

Comment by Vladislav Lesin [ 2019-10-02 ]

Test case: https://github.com/vlad-lesin/mariadb-server/tree/10.2-MDEV-18611-test-case .

Comment by Vladislav Lesin [ 2020-09-21 ]

We are not going to spend much effort to fix this edge case, we have improved the descriptiveness of the error message instead, please see MDEV-23711.

Generated at Thu Feb 08 08:45:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.