Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-18611

mariabackup silently ended during xtrabackup_copy_logfile()

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Won't Fix
    • 10.3.11, 10.2(EOL), 10.4(EOL)
    • N/A
    • mariabackup
    • None

    Description

      Mariabackup silently stopped in some cases.

      Regarding an analysis from vlad.lesin , it will be happen, if

      log_sys.log.read_log_seg() returns 0, even if there was something wring during reading.

      Variant 1:
      log_sys.log.read_log_seg() returned true, but lsn == start_lsn.

      ) log_sys.log.read_log_seg() returned true, but lsn == start_lsn. If we look into log_t::files::read_log_seg(), we will see that the function can silently return true and do not advance lsn on the following condition:
      === log_t::files::read_log_seg()
      if (block_number != log_block_convert_lsn_to_no(*start_lsn)) {
      /* Garbage or an incompletely written log block.
      We will not report any error, because this can
      happen when InnoDB was killed while it was
      writing redo log. We simply treat this as an
      abrupt end of the redo log. */
      end_lsn = *start_lsn;
      break;
      }
      

      Variant 2:

      Downloaded buffer is unequal to expected datasize.

       
      === log_t::files::read_log_seg()
      if (dl < LOG_BLOCK_HDR_SIZE
      || (dl > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE
      && dl != OS_FILE_LOG_BLOCK_SIZE)) {
      recv_sys->found_corrupt_log = true;
      end_lsn = *start_lsn;
      break;
      }
      === log_t::files::read_log_seg()
      

      But in this case the flag about corrupted log is set.

      Suggested fix from vlad.lesin

       
      ===
      --- a/extra/mariabackup/xtrabackup.cc
      +++ b/extra/mariabackup/xtrabackup.cc
      @@ -2728,13 +2728,19 @@ static bool xtrabackup_copy_logfile(bool last = false)
      lsn_t lsn= start_lsn;
      for (int retries= 0; retries < 100; retries++) {
      if (log_sys.log.read_log_seg(&lsn, end_lsn)
      - || lsn != start_lsn) {
      + && lsn != start_lsn) {
      break;
      }
      msg("Retrying read of log at LSN=" LSN_PF "\n", lsn);
      my_sleep(1000);
      }
       
      + if (recv_sys->found_corrupt_log) {
      + msg("mariabackup: Error: xtrabackup_copy_logfile()"
      + " found corrupted log.\n");
      + exit(EXIT_FAILURE);
      + }
      +
      start_lsn = (lsn == start_lsn)
      ? 0 : xtrabackup_copy_log(start_lsn, lsn, last);
       
      ===
      

      Attachments

        Issue Links

          Activity

            I would suggest the following patch against Mariabackup 10.2:

            diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
            index bcd6254ebd6..5036939f4a5 100644
            --- a/storage/innobase/log/log0recv.cc
            +++ b/storage/innobase/log/log0recv.cc
            @@ -807,7 +807,9 @@ log_group_read_log_seg(
             			happen when InnoDB was killed while it was
             			writing redo log. We simply treat this as an
             			abrupt end of the redo log. */
            +fail:
             			end_lsn = *start_lsn;
            +			success = false;
             			break;
             		}
             
            @@ -829,9 +831,7 @@ log_group_read_log_seg(
             					    << log_block_get_checkpoint_no(buf)
             					    << " expected: " << crc
             					    << " found: " << cksum;
            -				end_lsn = *start_lsn;
            -				success = false;
            -				break;
            +				goto fail;
             			}
             
             			if (group->is_encrypted()) {
            @@ -845,8 +845,7 @@ log_group_read_log_seg(
             		    || (dl > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE
             			&& dl != OS_FILE_LOG_BLOCK_SIZE)) {
             			recv_sys->found_corrupt_log = true;
            -			end_lsn = *start_lsn;
            -			break;
            +			goto fail;
             		}
             	}
             
            

            marko Marko Mäkelä added a comment - I would suggest the following patch against Mariabackup 10.2: diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc index bcd6254ebd6..5036939f4a5 100644 --- a/storage/innobase/log/log0recv.cc +++ b/storage/innobase/log/log0recv.cc @@ -807,7 +807,9 @@ log_group_read_log_seg( happen when InnoDB was killed while it was writing redo log. We simply treat this as an abrupt end of the redo log. */ +fail: end_lsn = *start_lsn; + success = false; break; } @@ -829,9 +831,7 @@ log_group_read_log_seg( << log_block_get_checkpoint_no(buf) << " expected: " << crc << " found: " << cksum; - end_lsn = *start_lsn; - success = false; - break; + goto fail; } if (group->is_encrypted()) { @@ -845,8 +845,7 @@ log_group_read_log_seg( || (dl > OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE && dl != OS_FILE_LOG_BLOCK_SIZE)) { recv_sys->found_corrupt_log = true; - end_lsn = *start_lsn; - break; + goto fail; } }

            I pushed my version of the fix. It turns out that only mariabackup was affected; InnoDB crash recovery would ignore the return value of the function.

            marko Marko Mäkelä added a comment - I pushed my version of the fix. It turns out that only mariabackup was affected; InnoDB crash recovery would ignore the return value of the function.

            The customer still has the issue despite using mariabackup 10.3.13 with the fix for this task.

            vlad.lesin Vladislav Lesin added a comment - The customer still has the issue despite using mariabackup 10.3.13 with the fix for this task.
            vlad.lesin Vladislav Lesin added a comment - - edited

            Support sent me the information about the size of innodb log files:

            -rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile1
            -rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile0
            

            I have prepared mariabackup version with some extra logging for support team, and the following is the part of mariabackup log:

            2019-09-26 0:00:11 0 [Note] InnoDB: Number of pools: 1
            ?func: ib_log: checkpoint 2926242 at 2875890640474 found, checkpoint offset 187681370
            ?func: ib_log: checkpoint 2926241 at 2875890640161 found, checkpoint offset 187681057
            [00] 2019-09-26 00:00:11 mariabackup: Generating a list of tablespaces
            2019-09-26 0:00:11 0 [Warning] InnoDB: Allocated tablespace ID 13715 for ---/---, old maximum was 0
            [00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: start throttling
            [00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: stop throttling
            ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455
            ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788
            2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384
            [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384
            ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455
            ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788
            2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384
            [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384
            ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455
            ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788
            2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384
            [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384
            ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455
            ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788
            

            Consider the following line:

            ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788
            

            (248264788-248789068)*512 = 268431360 - the difference between read and calculated log block number.

            The redo log size is 268435456. Log header size is 2k. There are two files, i.e. the total header size is 4k. So 268435456 - 4096 = 268431360. This is exactly the amount of bytes necessary to make full loop in circular innodb log buffer. And this is the difference between read and calculated LSN we found above.

            What we see is that mariabackup started at 00:00:11. The delay between the moment it read checkpoint lsn and the moment when it started to read redo log records from redo log(00:00:22) is 11s. And during this time server wrote greater or equal to 268431360 bytes in redo log, new checkpoint is started and redo log blocks were overwritten. But mariabackup did not refresh it's checkpoint info and that is why log block number read from redo log does not correspond to calculated redo log block number.

            vlad.lesin Vladislav Lesin added a comment - - edited Support sent me the information about the size of innodb log files: -rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile1 -rw-rw----. 1 mysql mysql 134217728 Sep 12 00:00 ib_logfile0 I have prepared mariabackup version with some extra logging for support team, and the following is the part of mariabackup log: 2019-09-26 0:00:11 0 [Note] InnoDB: Number of pools: 1 ?func: ib_log: checkpoint 2926242 at 2875890640474 found, checkpoint offset 187681370 ?func: ib_log: checkpoint 2926241 at 2875890640161 found, checkpoint offset 187681057 [00] 2019-09-26 00:00:11 mariabackup: Generating a list of tablespaces 2019-09-26 0:00:11 0 [Warning] InnoDB: Allocated tablespace ID 13715 for ---/---, old maximum was 0 [00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: start throttling [00] 2019-09-26 00:00:22 xtrabackup_copy_logfile: stop throttling ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455 ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384 [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384 ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455 ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384 [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384 ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455 ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 2019-09-26 0:00:22 0 [Note] InnoDB: Read redo log up to LSN=2875890640384 [00] 2019-09-26 00:00:22 Retrying read of log at LSN=2875890640384 ?func: ib_log: read_log_seg: start_lsn: 2875890640384, source_offset: 187681280, at_eof: 0, len: 65536, page_no: 11455 ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 Consider the following line: ?func: ib_log: read_log_seg: iterate 0, block_number: 248789068, calculated_block_number: 248264788 (248264788-248789068)*512 = 268431360 - the difference between read and calculated log block number. The redo log size is 268435456. Log header size is 2k. There are two files, i.e. the total header size is 4k. So 268435456 - 4096 = 268431360. This is exactly the amount of bytes necessary to make full loop in circular innodb log buffer. And this is the difference between read and calculated LSN we found above. What we see is that mariabackup started at 00:00:11. The delay between the moment it read checkpoint lsn and the moment when it started to read redo log records from redo log(00:00:22) is 11s. And during this time server wrote greater or equal to 268431360 bytes in redo log, new checkpoint is started and redo log blocks were overwritten. But mariabackup did not refresh it's checkpoint info and that is why log block number read from redo log does not correspond to calculated redo log block number.
            vlad.lesin Vladislav Lesin added a comment - Test case: https://github.com/vlad-lesin/mariadb-server/tree/10.2-MDEV-18611-test-case .

            We are not going to spend much effort to fix this edge case, we have improved the descriptiveness of the error message instead, please see MDEV-23711.

            vlad.lesin Vladislav Lesin added a comment - We are not going to spend much effort to fix this edge case, we have improved the descriptiveness of the error message instead, please see MDEV-23711 .

            People

              vlad.lesin Vladislav Lesin
              Richard Richard Stracke
              Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.