[MDEV-16367] mariabackup: error: failed to copy enough redo log Created: 2018-06-01  Updated: 2020-08-25  Resolved: 2018-06-07

Status: Closed
Project: MariaDB Server
Component/s: mariabackup
Affects Version/s: 10.2.15
Fix Version/s: 10.2.16, 10.3.8

Type: Bug Priority: Critical
Reporter: Pavel Znamensky Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: None
Environment:

CentOS Linux release 7.5.1804
x86_64


Issue Links:
Problem/Incident
causes MDEV-17149 mariabackup hangs if innodb is not st... Closed
Relates
relates to MDEV-16568 mariabackup.log_checksum_mismatch fai... Closed
relates to MDEV-16569 mariabackup.xb_file_key_management fa... Closed

 Description   

We've switched from xtrabackup to mariabackup in order to get ability enable compression for some tables. But mariabackup failing with:

error: failed to copy enough redo log

We're running backup with this command:

mariabackup --backup --user=root --use-memory=4G --target-dir /backup

 
180601 02:01:01 Connecting to MySQL server host: localhost, user: root, password: not set, port: 3306, socket: not set
Using server version 10.2.15-MariaDB-log
/usr/bin/mariabackup based on MariaDB server 10.2.15-MariaDB Linux (x86_64) 
mariabackup: uses posix_fadvise().
mariabackup: cd to /var/lib/mysql/
mariabackup: open files limit requested 0, set to 262144
mariabackup: using the following InnoDB configuration:
mariabackup:   innodb_data_home_dir = 
mariabackup:   innodb_data_file_path = ibdata1:100M:autoextend
mariabackup:   innodb_log_group_home_dir = ./
mariabackup: using O_DIRECT
2018-06-01  2:01:02 140206076119232 [Note] InnoDB: Number of pools: 1
mariabackup: Generating a list of tablespaces
2018-06-01  2:01:02 140206076119232 [Warning] InnoDB: Allocated tablespace ID 2318 for deploy_labels/labels, old maximum was 0
180601 02:01:11 >> log scanned up to (9527405406246)
180601 02:01:11 [01] Copying ibdata1 to /backup/ibdata1
180601 02:01:12 >> log scanned up to (9527405502351)
180601 02:01:52 [01]        ...done
180601 02:01:52 [01] Copying ./deploy_labels/labels.ibd to /backup/deploy_labels/labels.ibd
180601 02:01:52 [01]        ...done
180601 02:01:52 [01] Copying ./deploy_labels/push_events.ibd to /backup/deploy_labels/push_events.ibd
180601 02:01:52 [01]        ...done
180601 02:01:52 [01] Copying ./deploy_labels/images.ibd to /backup/deploy_labels/images.ibd
180601 02:01:52 [01]        ...done
180601 02:01:52 [01] Copying ./deploy_labels/tags.ibd to /backup/deploy_labels/tags.ibd
180601 02:01:52 [01]        ...done
180601 02:01:52 [01] Copying ./deploy_labels/__diesel_schema_migrations.ibd to /backup/deploy_labels/__diesel_schema_migrations.ibd
180601 02:01:52 [01]        ...done
 
...
 
180601 03:41:47 Finished backing up non-InnoDB tables and files
180601 03:41:47 [01] Copying aria_log.00000001 to /backup/aria_log.00000001
180601 03:41:47 [01]        ...done
180601 03:41:47 [01] Copying aria_log_control to /backup/aria_log_control
180601 03:41:47 [01]        ...done
180601 03:41:47 [00] Writing xtrabackup_binlog_info
180601 03:41:47 [00]        ...done
180601 03:41:47 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
mariabackup: The latest check point (for incremental): '9528096474361'
mariabackup: Stopping log copying thread.
 
2018-06-01  3:41:47 140206076119232 [Note] InnoDB: Read redo log up to LSN=9527405501952
180601 03:41:47 >> log scanned up to (9527405501952)
180601 03:41:47 Executing UNLOCK TABLES
180601 03:41:47 All tables unlocked
180601 03:41:47 [00] Copying ib_buffer_pool to /backup/ib_buffer_pool
180601 03:41:47 [00]        ...done
180601 03:41:47 Backup created in directory '/backup/'
MySQL binlog position: filename 'mysql-bin.000004', position '328', GTID of the last change '0-2-6715969130'
180601 03:41:47 [00] Writing backup-my.cnf
180601 03:41:47 [00]        ...done
180601 03:41:47 [00] Writing xtrabackup_info
180601 03:41:47 [00]        ...done
mariabackup: Redo log (from LSN 9527342444268 to 9527405501952) was copied.
mariabackup: error: failed to copy enough redo log (LSN=9527405501952; checkpoint LSN=9528096474361).

The same problem from someone at stackoverflow:
https://stackoverflow.com/questions/50115613/mariadb-backup-tool-mariabackup-failing-with-error



 Comments   
Comment by Marko Mäkelä [ 2018-06-04 ]

Looking at the output:

mariabackup: Redo log (from LSN 9527342444268 to 9527405501952) was copied.
mariabackup: error: failed to copy enough redo log (LSN=9527405501952; checkpoint LSN=9528096474361).

We copied 63,057,684 bytes of redo log. That does not sound much. And the required amount 754,030,093 does not feel unreasonably large. kompastver, do you have a rough idea how much the InnoDB LSN could have grown during the mariabackup execution? Can you execute

SHOW ENGINE INNODB STATUS;

before and after the backup? All these numbers could be interesting:

part of SHOW ENGINE INNODB STATUS OUTPUT

---
LOG
---
Log sequence number …
Log flushed up to   …
Pages flushed up to …
Last checkpoint at  …

The following code looks a bit suspicious. I think that we should re-invoke log_group_header_read() if the first checkpoint page of the two blocks contained the latest checkpoint:

		if (recv_find_max_checkpoint(&max_cp_field) == DB_SUCCESS
		    && log_sys->log.format != 0) {
			metadata_to_lsn = mach_read_from_8(
				log_sys->checkpoint_buf + LOG_CHECKPOINT_LSN);
			msg("mariabackup: The latest check point"
			    " (for incremental): '" LSN_PF "'\n",
			    metadata_to_lsn);

But I would find it somewhat hard to believe that there were several hundred megabytes of log records between the two checkpoints. So, we probably need to fix something more.

Comment by Pavel Znamensky [ 2018-06-05 ]

It's strange. But the problem occurs on our backup slave though on a production one it works fine.
As you requested:
Right before the backup:

---
LOG
---
Log sequence number 9568006076714
Log flushed up to   9568006022277
Pages flushed up to 9568005165495
Last checkpoint at  9568004961759
0 pending log flushes, 0 pending chkp writes
50678742 log i/o's done, 60.44 log i/o's/second

mariabackup output:

180605 21:28:09 Connecting to MySQL server host: localhost, user: root, password: not set, port: 3306, socket: not set
Using server version 10.2.15-MariaDB-log
/usr/bin/mariabackup based on MariaDB server 10.2.15-MariaDB Linux (x86_64) 
mariabackup: uses posix_fadvise().
mariabackup: cd to /var/lib/mysql/
mariabackup: open files limit requested 0, set to 262144
mariabackup: using the following InnoDB configuration:
mariabackup:   innodb_data_home_dir = 
mariabackup:   innodb_data_file_path = ibdata1:100M:autoextend
mariabackup:   innodb_log_group_home_dir = ./
mariabackup: using O_DIRECT
2018-06-05 21:28:10 139888253319360 [Note] InnoDB: Number of pools: 1
mariabackup: Generating a list of tablespaces
2018-06-05 21:28:10 139888253319360 [Warning] InnoDB: Allocated tablespace ID 2318 for deploy_labels/labels, old maximum was 0
180605 21:28:17 >> log scanned up to (9568006785062)
180605 21:28:17 [01] Copying ibdata1 to /backup/ibdata1
180605 21:28:18 >> log scanned up to (9568006926105)
180605 21:28:59 [01]        ...done
180605 21:28:59 [01] Copying ./deploy_labels/labels.ibd to /backup/deploy_labels/labels.ibd
180605 21:28:59 [01]        ...done
180605 21:28:59 [01] Copying ./deploy_labels/push_events.ibd to /backup/deploy_labels/push_events.ibd
180605 21:28:59 [01]        ...done
 
...
 
180605 23:14:31 Finished backing up non-InnoDB tables and files
180605 23:14:31 [01] Copying aria_log.00000001 to /backup/aria_log.00000001
180605 23:14:31 [01]        ...done
180605 23:14:31 [01] Copying aria_log_control to /backup/aria_log_control
180605 23:14:31 [01]        ...done
180605 23:14:31 [00] Writing xtrabackup_binlog_info
180605 23:14:31 [00]        ...done
180605 23:14:31 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
mariabackup: The latest check point (for incremental): '9568584074795'
mariabackup: Stopping log copying thread.
 
2018-06-05 23:14:31 139888253319360 [Note] InnoDB: Read redo log up to LSN=9568006925824
180605 23:14:31 >> log scanned up to (9568006925824)
180605 23:14:31 Executing UNLOCK TABLES
180605 23:14:31 All tables unlocked
180605 23:14:31 [00] Copying ib_buffer_pool to /backup/ib_buffer_pool
180605 23:14:31 [00]        ...done
180605 23:14:31 Backup created in directory '/backup/'
MySQL binlog position: filename 'mysql-bin.000004', position '328', GTID of the last change '0-2-6740508642'
180605 23:14:31 [00] Writing backup-my.cnf
180605 23:14:31 [00]        ...done
180605 23:14:31 [00] Writing xtrabackup_info
180605 23:14:31 [00]        ...done
mariabackup: Redo log (from LSN 9568004961759 to 9568006925824) was copied.
mariabackup: error: failed to copy enough redo log (LSN=9568006925824; checkpoint LSN=9568584074795).

Immediately after the backup:

---
LOG
---
Log sequence number 9568638040617
Log flushed up to   9568638040617
Pages flushed up to 9568585610728
Last checkpoint at  9568585018623
0 pending log flushes, 0 pending chkp writes
51116187 log i/o's done, 68.21 log i/o's/second

Comment by Marko Mäkelä [ 2018-06-06 ]

kompastver, thank you for the additional details. As expected, the backup starts copying log from the latest checkpoint LSN as it was at that time. For some reason, copying the log terminates prematurely. The backup should have copied about 633,078,858 bytes of redo log, but it stopped after 11,964,065 bytes. The missing log_group_header_read() call that I mentioned earlier could explain the slight discrepancy of the end checkpoint LSN (943,828 bytes of redo log between the 9568584074795 and 9568585018623) in these messages:

mariabackup: error: failed to copy enough redo log (LSN=9568006925824; checkpoint LSN=9568584074795).

Last checkpoint at 9568585018623

I wonder if this problem goes away if you set innodb_log_files_in_group=1. The default and originally the minimum value of this parameter is 2. You may want to multiply innodb_log_file_size by the previous value of innodb_log_files_in_group, so that the redo log size remains unchanged.

How big is your innodb_log_file_size, and how big are the non-InnoDB tables? What was the timestamp of copying the last .ibd file before this one?

180605 23:14:31 Finished backing up non-InnoDB tables and files

I identified two potential reasons for the early termination. Can you please apply the following patch and check if either message will occur in the output?

diff --git a/extra/mariabackup/xtrabackup.cc b/extra/mariabackup/xtrabackup.cc
index 91c7b5faf0e..7468f70f05d 100644
--- a/extra/mariabackup/xtrabackup.cc
+++ b/extra/mariabackup/xtrabackup.cc
@@ -2485,6 +2485,9 @@ xtrabackup_copy_log(copy_logfile copy, lsn_t start_lsn, lsn_t end_lsn)
 		    && scanned_checkpoint - checkpoint >= 0x80000000UL) {
 			/* Garbage from a log buffer flush which was made
 			before the most recent database recovery */
+			msg("mariabackup: checkpoint wrap: "
+			    LSN_PF ",%zx,%zx\n",
+			    scanned_lsn, scanned_checkpoint, checkpoint);
 			break;
 		}
 
@@ -2505,6 +2508,8 @@ xtrabackup_copy_log(copy_logfile copy, lsn_t start_lsn, lsn_t end_lsn)
 			   >= OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE
 			   || data_len <= LOG_BLOCK_HDR_SIZE) {
 			/* We got a garbage block (abrupt end of the log). */
+			msg("mariabackup: garbage block: " LSN_PF ",%zu\n",
+			    scanned_lsn, data_len);
 			break;
 		} else {
 			/* We got a partial block (abrupt end of the log). */

Termination due to the first one indicates that the log copying is having trouble to keep up with the writes by the server. Perhaps the circular redo log should be bigger, so that the server will not overwrite the portion of the log that mariabackup has not yet read. Given that you say that xtrabackup does not suffer from this, it is possible that xtrabackup is copying the log faster, possibly because it does not parse the individual log records in that phase, like mariabackup does since version 10.2.14 (MDEV-14545).

Comment by Pavel Znamensky [ 2018-06-06 ]

Bad news.

  • innodb_log_files_in_group=1 didn't help
  • messages from the patch didn't occur

= 1 =

innodb_log_file_size=268435456
innodb_log_files_in_group=2
non-InnoDB tables and files size:
* all *.opt files: 64K
* all *.frm files: 2.7M

Before the backup:

---
LOG
---
Log sequence number 9573651009045
Log flushed up to   9573650994783
Pages flushed up to 9573650354165
Last checkpoint at  9573650052505
0 pending log flushes, 0 pending chkp writes
54247027 log i/o's done, 58.39 log i/o's/second

mariabackup output:

180606 13:09:19 Connecting to MySQL server host: localhost, user: root, password: not set, port: 3306, socket: not set
Using server version 10.2.15-MariaDB-log
/usr/bin/mariabackup based on MariaDB server 10.2.15-MariaDB Linux (x86_64) 
mariabackup: uses posix_fadvise().
mariabackup: cd to /var/lib/mysql/
mariabackup: open files limit requested 0, set to 262144
mariabackup: using the following InnoDB configuration:
mariabackup:   innodb_data_home_dir = 
mariabackup:   innodb_data_file_path = ibdata1:100M:autoextend
mariabackup:   innodb_log_group_home_dir = ./
mariabackup: using O_DIRECT
2018-06-06 13:09:19 140155183319232 [Note] InnoDB: Number of pools: 1
mariabackup: Generating a list of tablespaces
2018-06-06 13:09:19 140155183319232 [Warning] InnoDB: Allocated tablespace ID 2318 for deploy_labels/labels, old maximum was 0
180606 13:09:26 >> log scanned up to (9573651401380)
180606 13:09:26 [01] Copying ibdata1 to /backup/ibdata1
180606 13:09:27 >> log scanned up to (9573651417815)
180606 13:10:04 [01]        ...done
180606 13:10:04 [01] Copying ./deploy_labels/labels.ibd to /backup/deploy_labels/labels.ibd
180606 13:10:04 [01]        ...done
 
 
...
 
 
180606 14:45:52 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...
180606 14:45:52 Executing FLUSH TABLES WITH READ LOCK...
180606 14:45:52 Starting to backup non-InnoDB tables and files
180606 14:45:52 [01] Copying ./deploy_labels/__diesel_schema_migrations.frm to /backup/deploy_labels/__diesel_schema_migrations.frm
180606 14:45:52 [01]        ...done
 
 
...
 
 
180606 14:45:54 [01] Copying ./y2/links.frm to /backup/y2/links.frm
180606 14:45:54 [01]        ...done
180606 14:45:54 Finished backing up non-InnoDB tables and files
180606 14:45:54 [01] Copying aria_log.00000001 to /backup/aria_log.00000001
180606 14:45:54 [01]        ...done
180606 14:45:54 [01] Copying aria_log_control to /backup/aria_log_control
180606 14:45:54 [01]        ...done
180606 14:45:54 [00] Writing xtrabackup_binlog_info
180606 14:45:54 [00]        ...done
180606 14:45:54 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
mariabackup: The latest check point (for incremental): '9574225023573'
mariabackup: Stopping log copying thread.
 
2018-06-06 14:45:54 140155183319232 [Note] InnoDB: Read redo log up to LSN=9573651417600
180606 14:45:54 >> log scanned up to (9573651417600)
180606 14:45:54 Executing UNLOCK TABLES
180606 14:45:54 All tables unlocked
180606 14:45:54 [00] Copying ib_buffer_pool to /backup/ib_buffer_pool
180606 14:45:54 [00]        ...done
180606 14:45:54 Backup created in directory '/backup/'
MySQL binlog position: filename 'mysql-bin.000004', position '328', GTID of the last change '0-2-6743927672'
180606 14:45:54 [00] Writing backup-my.cnf
180606 14:45:54 [00]        ...done
180606 14:45:54 [00] Writing xtrabackup_info
180606 14:45:54 [00]        ...done
mariabackup: Redo log (from LSN 9573650052505 to 9573651417600) was copied.
mariabackup: error: failed to copy enough redo log (LSN=9573651417600; checkpoint LSN=9574225023573).

After the backup:

---
LOG
---
Log sequence number 9574290337338
Log flushed up to   9574290337199
Pages flushed up to 9574226570564
Last checkpoint at  9574226137899
0 pending log flushes, 0 pending chkp writes
54668717 log i/o's done, 54.64 log i/o's/second

= 2 =

innodb_log_files_in_group=1

Before the backup:

---
LOG
---
Log sequence number 9574352773790
Log flushed up to   9574352756467
Pages flushed up to 9574349130575
Last checkpoint at  9574348774318
0 pending log flushes, 0 pending chkp writes
5724 log i/o's done, 86.73 log i/o's/second

mariabackup output:

 
180606 14:55:08 Connecting to MySQL server host: localhost, user: root, password: not set, port: 3306, socket: not set
Using server version 10.2.15-MariaDB-log
/usr/bin/mariabackup based on MariaDB server 10.2.15-MariaDB Linux (x86_64) 
mariabackup: uses posix_fadvise().
mariabackup: cd to /var/lib/mysql/
mariabackup: open files limit requested 0, set to 262144
mariabackup: using the following InnoDB configuration:
mariabackup:   innodb_data_home_dir = 
mariabackup:   innodb_data_file_path = ibdata1:100M:autoextend
mariabackup:   innodb_log_group_home_dir = ./
mariabackup: using O_DIRECT
2018-06-06 14:55:08 139978597673152 [Note] InnoDB: Number of pools: 1
mariabackup: Generating a list of tablespaces
2018-06-06 14:55:08 139978597673152 [Warning] InnoDB: Allocated tablespace ID 2318 for deploy_labels/labels, old maximum was 0
180606 14:55:13 >> log scanned up to (9574353063845)
180606 14:55:13 [01] Copying ibdata1 to /backup/ibdata1
180606 14:55:14 >> log scanned up to (9574353077929)
180606 14:55:52 [01]        ...done
180606 14:55:52 [01] Copying ./deploy_labels/labels.ibd to /backup/deploy_labels/labels.ibd
180606 14:55:52 [01]        ...done
 
 
...
 
 
180606 16:35:02 Executing FLUSH NO_WRITE_TO_BINLOG TABLES...
180606 16:35:02 Executing FLUSH TABLES WITH READ LOCK...
180606 16:35:02 Starting to backup non-InnoDB tables and files
180606 16:35:02 [01] Copying ./deploy_labels/__diesel_schema_migrations.frm to /backup/deploy_labels/__diesel_schema_migrations.frm
180606 16:35:02 [01]        ...done
 
 
...
 
 
180606 16:35:03 [01] Copying ./y2/links.frm to /backup/y2/links.frm
180606 16:35:03 [01]        ...done
180606 16:35:03 Finished backing up non-InnoDB tables and files
180606 16:35:03 [01] Copying aria_log.00000001 to /backup/aria_log.00000001
180606 16:35:03 [01]        ...done
180606 16:35:03 [01] Copying aria_log_control to /backup/aria_log_control
180606 16:35:03 [01]        ...done
180606 16:35:03 [00] Writing xtrabackup_binlog_info
180606 16:35:03 [00]        ...done
180606 16:35:03 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
mariabackup: The latest check point (for incremental): '9575207519305'
mariabackup: Stopping log copying thread.
 
2018-06-06 16:35:03 139978597673152 [Note] InnoDB: Read redo log up to LSN=9574353077760
180606 16:35:03 >> log scanned up to (9574353077760)
180606 16:35:03 Executing UNLOCK TABLES
180606 16:35:03 All tables unlocked
180606 16:35:03 [00] Copying ib_buffer_pool to /backup/ib_buffer_pool
180606 16:35:03 [00]        ...done
180606 16:35:03 Backup created in directory '/backup/'
MySQL binlog position: filename 'mysql-bin.000005', position '328', GTID of the last change '0-2-6744784034'
180606 16:35:03 [00] Writing backup-my.cnf
180606 16:35:03 [00]        ...done
180606 16:35:03 [00] Writing xtrabackup_info
180606 16:35:03 [00]        ...done
mariabackup: Redo log (from LSN 9574348774318 to 9574353077760) was copied.
mariabackup: error: failed to copy enough redo log (LSN=9574353077760; checkpoint LSN=9575207519305).

After the backup:

---
LOG
---
Log sequence number 9575242077809
Log flushed up to   9575242077346
Pages flushed up to 9575209836529
Last checkpoint at  9575209726147
0 pending log flushes, 0 pending chkp writes
823117 log i/o's done, 75.96 log i/o's/second

Comment by Marko Mäkelä [ 2018-06-07 ]

I got a test case that repeats this:

--echo exec $MYSQL test < huge_dump.sql
sleep 1;
let $targetdir=$MYSQLTEST_VARDIR/tmp/backup;
--disable_result_log
exec $XTRABACKUP --defaults-file=$MYSQLTEST_VARDIR/my.cnf --core-file --backup --target-dir=$targetdir;
--enable_result_log

./mysql-test-run mariabackup.huge

When the test spits out the exec line, invoke the mysql command manually from another shell. (I did not know how to start a concurrent background process in mysql-test-run.) The dump that I used was 118,938,281 bytes of SQL.

While debugging this, I fixed multiple small bugs. The 'final nail in the coffin' was that the log_copying_thread() would only keep invoking xtrabackup_copy_logfile() if it returned a (fatal) error every time. So, it would effectively only copy an integer multiple of RECV_SCAN_SIZE bytes (4·innodb_page_size) of log, until the end of the log is momentarily reached. Instead, it should be copying the log until it runs out.

It turns out that we did not test Mariabackup with large enough datasets.

With the bugs fixed, but using a too small innodb_log_file_size, the above test would still fail like this:

mariabackup: Redo log (from LSN 1633766 to 243996709) was copied.
mariabackup: error: failed to copy enough redo log (LSN=243996709; checkpoint LSN=277109361).

Over 240 megabytes of log was read; we stopped some 30 megabytes short. With a larger innodb_log_file_size the backup would finish successfully.

I will test my fix a little more before pushing it to the 10.2 branch.

Comment by Pavel Znamensky [ 2018-06-07 ]

marko, thank you!

Generated at Thu Feb 08 08:28:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.