[MDEV-18611] mariabackup silently ended during xtrabackup_copy_logfile() - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Won't Fix
Affects Version/s: 10.3.11, 10.2(EOL), 10.4(EOL)
Fix Version/s: N/A
Component/s: mariabackup
Labels:
None

Description

Mariabackup silently stopped in some cases.

Regarding an analysis from vlad.lesin , it will be happen, if

log_sys.log.read_log_seg() returns 0, even if there was something wring during reading.

Variant 1:
log_sys.log.read_log_seg() returned true, but lsn == start_lsn.

) log_sys.log.read_log_seg() returned true, but lsn == start_lsn. If we look into log_t::files::read_log_seg(), we will see that the function can silently return true and do not advance lsn on the following condition:

=== log_t::files::read_log_seg()

if (block_number != log_block_convert_lsn_to_no(*start_lsn)) {

/* Garbage or an incompletely written log block.

We will not report any error, because this can

happen when InnoDB was killed while it was

writing redo log. We simply treat this as an

abrupt end of the redo log. */

end_lsn = *start_lsn;

break;

Variant 2:

Downloaded buffer is unequal to expected datasize.

=== log_t::files::read_log_seg()

if (dl < LOG_BLOCK_HDR_SIZE

|| (dl > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE

&& dl != OS_FILE_LOG_BLOCK_SIZE)) {

recv_sys->found_corrupt_log = true;

end_lsn = *start_lsn;

break;

=== log_t::files::read_log_seg()

But in this case the flag about corrupted log is set.

Suggested fix from vlad.lesin

===

--- a/extra/mariabackup/xtrabackup.cc

+++ b/extra/mariabackup/xtrabackup.cc

@@ -2728,13 +2728,19 @@ static bool xtrabackup_copy_logfile(bool last = false)

lsn_t lsn= start_lsn;

for (int retries= 0; retries < 100; retries++) {

if (log_sys.log.read_log_seg(&lsn, end_lsn)

- || lsn != start_lsn) {

+ && lsn != start_lsn) {

break;

msg("Retrying read of log at LSN=" LSN_PF "\n", lsn);

my_sleep(1000);

+ if (recv_sys->found_corrupt_log) {

+ msg("mariabackup: Error: xtrabackup_copy_logfile()"

+ " found corrupted log.\n");

+ exit(EXIT_FAILURE);

+ }

start_lsn = (lsn == start_lsn)

? 0 : xtrabackup_copy_log(start_lsn, lsn, last);

===

Attachments

Issue Links

is duplicated by

MDEV-18612 mariabackup: Error: xtrabackup_copy_logfile() failed.

Closed

MDEV-19492 Mariabackup hangs if table populated with INSERT... SELECT while it runs

Stalled

relates to

MDEV-17029 mariabackup fails under load with schema changes

Closed

MDEV-23711 make mariabackup innodb redo log read error message more clear

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2019-02-19 08:44

I would suggest the following patch against Mariabackup 10.2:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc

index bcd6254ebd6..5036939f4a5 100644

--- a/storage/innobase/log/log0recv.cc

+++ b/storage/innobase/log/log0recv.cc

@@ -807,7 +807,9 @@ log_group_read_log_seg(

 			happen when InnoDB was killed while it was

 			writing redo log. We simply treat this as an

 			abrupt end of the redo log. */

+fail:

 			end_lsn = *start_lsn;

+			success = false;

 			break;

@@ -829,9 +831,7 @@ log_group_read_log_seg(

 					    << log_block_get_checkpoint_no(buf)

 					    << " expected: " << crc

 					    << " found: " << cksum;

-				end_lsn = *start_lsn;

-				success = false;

-				break;

+				goto fail;

 			if (group->is_encrypted()) {

@@ -845,8 +845,7 @@ log_group_read_log_seg(

 		    || (dl > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE

 			&& dl != OS_FILE_LOG_BLOCK_SIZE)) {

 			recv_sys->found_corrupt_log = true;

-			end_lsn = *start_lsn;

-			break;

+			goto fail;

Marko Mäkelä added a comment - 2019-02-19 08:44 I would suggest the following patch against Mariabackup 10.2: diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc index bcd6254ebd6..5036939f4a5 100644 --- a/storage/innobase/log/log0recv.cc +++ b/storage/innobase/log/log0recv.cc @@ -807,7 +807,9 @@ log_group_read_log_seg( happen when InnoDB was killed while it was writing redo log. We simply treat this as an abrupt end of the redo log. */ +fail: end_lsn = *start_lsn; + success = false; break; } @@ -829,9 +831,7 @@ log_group_read_log_seg( << log_block_get_checkpoint_no(buf) << " expected: " << crc << " found: " << cksum; - end_lsn = *start_lsn; - success = false; - break; + goto fail; } if (group->is_encrypted()) { @@ -845,8 +845,7 @@ log_group_read_log_seg( || (dl > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE && dl != OS_FILE_LOG_BLOCK_SIZE)) { recv_sys->found_corrupt_log = true; - end_lsn = *start_lsn; - break; + goto fail; } }

Marko Mäkelä added a comment - 2019-02-19 09:21

I pushed my version of the fix. It turns out that only mariabackup was affected; InnoDB crash recovery would ignore the return value of the function.

Marko Mäkelä added a comment - 2019-02-19 09:21 I pushed my version of the fix. It turns out that only mariabackup was affected; InnoDB crash recovery would ignore the return value of the function.

Vladislav Lesin added a comment - 2019-03-12 13:44

The customer still has the issue despite using mariabackup 10.3.13 with the fix for this task.

Vladislav Lesin added a comment - 2019-03-12 13:44 The customer still has the issue despite using mariabackup 10.3.13 with the fix for this task.

Vladislav Lesin added a comment - 2019-09-26 13:40 - edited

Support sent me the information about the size of innodb log files:

-rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile1

-rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile0

I have prepared mariabackup version with some extra logging for support team, and the following is the part of mariabackup log:

2019-09-26 0:00:11 0 [Note] InnoDB: Number of pools: 1

?func: ib_log: checkpoint 2926242 at 2875890640474 found, checkpoint offset 187681370

?func: ib_log: checkpoint 2926241 at 2875890640161 found, checkpoint offset 187681057

[00] 2019-09-26 00:00:11 mariabackup: Generating a list of tablespaces

2019-09-26 0:00:11 0 [Warning] InnoDB: Allocated tablespace ID 13715 for ---/---, old maximum was 0

[00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: start throttling

[00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: stop throttling

?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455

?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788

2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384

[00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384

?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455

?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788

2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384

[00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384

?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455

?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788

2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384

[00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384

?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455

?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788

Consider the following line:

?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788

(248264788-248789068)*512 = 268431360 - the difference between read and calculated log block number.

The redo log size is 268435456. Log header size is 2k. There are two files, i.e. the total header size is 4k. So 268435456 - 4096 = 268431360. This is exactly the amount of bytes necessary to make full loop in circular innodb log buffer. And this is the difference between read and calculated LSN we found above.

What we see is that mariabackup started at 00:00:11. The delay between the moment it read checkpoint lsn and the moment when it started to read redo log records from redo log(00:00:22) is 11s. And during this time server wrote greater or equal to 268431360 bytes in redo log, new checkpoint is started and redo log blocks were overwritten. But mariabackup did not refresh it's checkpoint info and that is why log block number read from redo log does not correspond to calculated redo log block number.

Vladislav Lesin added a comment - 2019-09-26 13:40 - edited Support sent me the information about the size of innodb log files: -rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile1 -rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile0 I have prepared mariabackup version with some extra logging for support team, and the following is the part of mariabackup log: 2019-09-26 0:00:11 0 [Note] InnoDB: Number of pools: 1 ?func: ib_log: checkpoint 2926242 at 2875890640474 found, checkpoint offset 187681370 ?func: ib_log: checkpoint 2926241 at 2875890640161 found, checkpoint offset 187681057 [00] 2019-09-26 00:00:11 mariabackup: Generating a list of tablespaces 2019-09-26 0:00:11 0 [Warning] InnoDB: Allocated tablespace ID 13715 for ---/---, old maximum was 0 [00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: start throttling [00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: stop throttling ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455 ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384 [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384 ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455 ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384 [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384 ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455 ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384 [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384 ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455 ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 Consider the following line: ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 (248264788-248789068)*512 = 268431360 - the difference between read and calculated log block number. The redo log size is 268435456. Log header size is 2k. There are two files, i.e. the total header size is 4k. So 268435456 - 4096 = 268431360. This is exactly the amount of bytes necessary to make full loop in circular innodb log buffer. And this is the difference between read and calculated LSN we found above. What we see is that mariabackup started at 00:00:11. The delay between the moment it read checkpoint lsn and the moment when it started to read redo log records from redo log(00:00:22) is 11s. And during this time server wrote greater or equal to 268431360 bytes in redo log, new checkpoint is started and redo log blocks were overwritten. But mariabackup did not refresh it's checkpoint info and that is why log block number read from redo log does not correspond to calculated redo log block number.

Vladislav Lesin added a comment - 2019-10-02 18:49

Test case: https://github.com/vlad-lesin/mariadb-server/tree/10.2-MDEV-18611-test-case .

Vladislav Lesin added a comment - 2019-10-02 18:49 Test case: https://github.com/vlad-lesin/mariadb-server/tree/10.2-MDEV-18611-test-case .

Vladislav Lesin added a comment - 2020-09-21 10:05

We are not going to spend much effort to fix this edge case, we have improved the descriptiveness of the error message instead, please see ~~MDEV-23711~~.

Vladislav Lesin added a comment - 2020-09-21 10:05 We are not going to spend much effort to fix this edge case, we have improved the descriptiveness of the error message instead, please see MDEV-23711 .

People

Assignee:: Vladislav Lesin

Reporter:: Richard Stracke

Votes:: 2 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2019-02-18 06:44

Updated:: 2024-09-06 10:00

Resolved:: 2020-09-21 10:05

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server