[MDEV-19492] Mariabackup hangs if table populated with INSERT... SELECT while it runs Created: 2019-05-16  Updated: 2023-10-12

Status: Stalled
Project: MariaDB Server
Component/s: mariabackup
Affects Version/s: 10.2.24, 10.4.14
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Rick Pizzi Assignee: Vladislav Lesin
Resolution: Unresolved Votes: 3
Labels: None

Attachments: Text File mariabackup_logs.log    
Issue Links:
Duplicate
duplicates MDEV-18611 mariabackup silently ended during xt... Closed
is duplicated by MDEV-31399 mariabackup fails with large innodb r... Open
is duplicated by MDEV-31410 mariadb-backup prepare crash with Inn... Closed
Relates
relates to MDEV-14992 BACKUP: in-server backup Open
relates to MDEV-23711 make mariabackup innodb redo log read... Closed
relates to MDEV-17029 mariabackup fails under load with sch... Closed
relates to MDEV-18612 mariabackup: Error: xtrabackup_copy_l... Closed

 Description   

Mariabackup consistently hangs when you load data into a table from another table while a full backup is running.

It hangs in the following phase:

[00] 2019-05-15 01:23:27 Waiting for log copy thread to read lsn 5607063606354

Before that, the following errors appear:

[00] 2019-05-15 00:25:44 Retrying read of log at LSN=5593738149376
[00] 2019-05-15 00:25:44 Retrying read of log at LSN=5593738149376
[00] 2019-05-15 00:25:44 Retrying read of log at LSN=5593738149376
[00] 2019-05-15 00:25:44 Retrying read of log at LSN=5593738149376
[00] 2019-05-15 00:25:44 Retrying read of log at LSN=5593738149376
[00] 2019-05-15 00:25:44 xtrabackup_copy_logfile() failed.

This is regularly happening at a customer of ours.
I was able to reproduce in my lab as follows:

1. start a full backup
2. run insert into empytable select * from sbtest1;

Command line used for backup:

mariabackup --backup --user=root --password=xxx --parallel=4 --extra-lsndir=/tmp --stream=xbstream --target-dir=/tmp/2019-05-16/mariabackup

Isolation level: default (REPEATABLE-READ)



 Comments   
Comment by suresh ramagiri [ 2020-07-27 ]

One of the customers reported this issue.

Locally on the MariaDB 1O.3.22, I can reproduce the issue.
Here are the steps, followed.

1. Created two partition tables as below and had some data in the sbtest4 table d1 partition.
2. Tried insert-select on sbtest3 from the sbtest4 d1 partition data
INSERT INTO sbtest3 SELECT * FROM sbtest4 PARTITION (d1);
3. While taking the incremental backup, it failed with the noted error.
mariabackup --defaults-file --backup --user=root --compress --parallel=4 --incremental-basedir=/tmp/back1/ --target-dir=/tmp/inc4

full backup logs attached - https://jira.mariadb.org/secure/attachment/52967/mariabackup_logs.log

CREATE TABLE `sbtest3` (
`id` bigint(20) unsigned NOT NULL,
`created` date NOT NULL DEFAULT '2020-06-27',
KEY `a` (`created`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY LIST (dayofmonth(`created`))
(PARTITION `d27` VALUES IN (27) ENGINE = InnoDB,
PARTITION `d28` VALUES IN (28) ENGINE = InnoDB,
PARTITION `d31` VALUES IN (31) ENGINE = InnoDB,
PARTITION `d4` VALUES IN (4) ENGINE = InnoDB,
PARTITION `d5` VALUES IN (5) ENGINE = InnoDB,
PARTITION `d11` VALUES IN (11) ENGINE = InnoDB,
PARTITION `d12` VALUES IN (12) ENGINE = InnoDB,
PARTITION `d18` VALUES IN (18) ENGINE = InnoDB,
PARTITION `d19` VALUES IN (19) ENGINE = InnoDB,
PARTITION `d25` VALUES IN (25) ENGINE = InnoDB,
PARTITION `d26` VALUES IN (26) ENGINE = InnoDB,
PARTITION `d29` VALUES IN (29) ENGINE = InnoDB,
PARTITION `d30` VALUES IN (30) ENGINE = InnoDB,
PARTITION `d1` VALUES IN (1) ENGINE = InnoDB,
PARTITION `d2` VALUES IN (2) ENGINE = InnoDB,
PARTITION `d3` VALUES IN (3) ENGINE = InnoDB,
PARTITION `d6` VALUES IN (6) ENGINE = InnoDB,
PARTITION `d7` VALUES IN (7) ENGINE = InnoDB,
PARTITION `d8` VALUES IN (8) ENGINE = InnoDB,
PARTITION `d9` VALUES IN (9) ENGINE = InnoDB,
PARTITION `d10` VALUES IN (10) ENGINE = InnoDB,
PARTITION `d13` VALUES IN (13) ENGINE = InnoDB,
PARTITION `d14` VALUES IN (14) ENGINE = InnoDB,
PARTITION `d15` VALUES IN (15) ENGINE = InnoDB,
PARTITION `d16` VALUES IN (16) ENGINE = InnoDB,
PARTITION `d17` VALUES IN (17) ENGINE = InnoDB,
PARTITION `d20` VALUES IN (20) ENGINE = InnoDB,
PARTITION `d21` VALUES IN (21) ENGINE = InnoDB,
PARTITION `d22` VALUES IN (22) ENGINE = InnoDB,
PARTITION `d24` VALUES IN (24) ENGINE = InnoDB,
PARTITION `d23` VALUES IN (23) ENGINE = InnoDB)

CREATE TABLE `sbtest4` (
`id` bigint(20) unsigned NOT NULL,
`created` date NOT NULL DEFAULT '2020-06-27',
KEY `a` (`created`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY LIST (dayofmonth(`created`))
(PARTITION `d27` VALUES IN (27) ENGINE = InnoDB,
PARTITION `d28` VALUES IN (28) ENGINE = InnoDB,
PARTITION `d31` VALUES IN (31) ENGINE = InnoDB,
PARTITION `d4` VALUES IN (4) ENGINE = InnoDB,
PARTITION `d5` VALUES IN (5) ENGINE = InnoDB,
PARTITION `d11` VALUES IN (11) ENGINE = InnoDB,
PARTITION `d12` VALUES IN (12) ENGINE = InnoDB,
PARTITION `d18` VALUES IN (18) ENGINE = InnoDB,
PARTITION `d19` VALUES IN (19) ENGINE = InnoDB,
PARTITION `d25` VALUES IN (25) ENGINE = InnoDB,
PARTITION `d26` VALUES IN (26) ENGINE = InnoDB,
PARTITION `d29` VALUES IN (29) ENGINE = InnoDB,
PARTITION `d30` VALUES IN (30) ENGINE = InnoDB,
PARTITION `d1` VALUES IN (1) ENGINE = InnoDB,
PARTITION `d2` VALUES IN (2) ENGINE = InnoDB,
PARTITION `d3` VALUES IN (3) ENGINE = InnoDB,
PARTITION `d6` VALUES IN (6) ENGINE = InnoDB,
PARTITION `d7` VALUES IN (7) ENGINE = InnoDB,
PARTITION `d8` VALUES IN (8) ENGINE = InnoDB,
PARTITION `d9` VALUES IN (9) ENGINE = InnoDB,
PARTITION `d10` VALUES IN (10) ENGINE = InnoDB,
PARTITION `d13` VALUES IN (13) ENGINE = InnoDB,
PARTITION `d14` VALUES IN (14) ENGINE = InnoDB,
PARTITION `d15` VALUES IN (15) ENGINE = InnoDB,
PARTITION `d16` VALUES IN (16) ENGINE = InnoDB,
PARTITION `d17` VALUES IN (17) ENGINE = InnoDB,
PARTITION `d20` VALUES IN (20) ENGINE = InnoDB,
PARTITION `d21` VALUES IN (21) ENGINE = InnoDB,
PARTITION `d22` VALUES IN (22) ENGINE = InnoDB,
PARTITION `d24` VALUES IN (24) ENGINE = InnoDB,
PARTITION `d23` VALUES IN (23) ENGINE = InnoDB)

backup logs

[00] 2020-07-27 21:50:04 Retrying read of log at LSN=30622709760
[00] 2020-07-27 21:50:04 Retrying read of log at LSN=30622709760
[00] 2020-07-27 21:50:04 Retrying read of log at LSN=30622709760
[00] 2020-07-27 21:50:04 Retrying read of log at LSN=30622709760
[00] 2020-07-27 21:50:04 Retrying read of log at LSN=30622709760
[00] 2020-07-27 21:50:04 Retrying read of log at LSN=30622709760
[00] FATAL ERROR: 2020-07-27 21:50:04 xtrabackup_copy_logfile() failed.

Comment by Vladislav Lesin [ 2020-08-17 ]

rpizzi, this looks like a duplicate of MDEV-18611. Heavy DML's with small innodb redo log size leads to the case, when inndo redo record is rewritten before mariabackup copy it. The workaround is to increase innodb redo log size. Try double it until the issue go away.

Comment by Rick Pizzi [ 2020-08-19 ]

I don't recall which customer was being hit by this bug - this was over one year ago!
But no one of our customers run with a small redo log file. So that may have been the case with my lab, certainly not with the customer.
Anyways, we need a solution here and not a workaround, i.e. mariabackup should be able to detect it and exit with a proper error if that was the case....

Comment by Rick Pizzi [ 2020-09-18 ]

vlad.lesin I am not convinced that this bug is a duplicate of MDEV-18611.
We have a customer who is running 10.4.14 and is hitting this on a replica consistently, in other words, they cannot take a backup. They currently have 2G redo log, but with single threaded replication I have an hard time believing they can fill that while mariabackup is copying. In any case, we will move that to 8G and try again. If it still fails, you will have to reconsider your idea that this is a duplicate. Yes, they do a lot of (small) insert into... select in their workload.

Comment by Rick Pizzi [ 2020-09-18 ]

vlad.lesin we retried with 16 GB of redo logs, on this replica with single SQL thread . It still failed
while trying to back up their largest table (2.5T).

So this has become a critical bug as the customer is unable to perform a full backup of its dataset.

Comment by Vladislav Lesin [ 2020-09-21 ]

rpizzi yes it will go in 10.4 too. I hope it will be in the next 10.4 upcoming release. And yes, we can create special build for the customer. We need to know customer's platform and the version of MariaDB they use. But for 10.4 the steps to detect the problem, I mentioned above, must work without MDEV-23711 fix.

I.e. if their backup log:

(does not contain "Invalid log block checksum." or "xtrabackup_copy_logfile() failed: corrupt log.") and
(innodb_log_checksums == 1 or redo log is encrypted),

then this is indeed the duplicate of MDEV-18611.

Comment by Julien Fritsch [ 2020-09-24 ]

rpizzi please follow the instructions to create the needed Jira ticket and assign Vlad. https://docs.google.com/document/d/1flTIbF8vJQ7vB2FGHLiECo2DrzhUdgq4UgiKpqEXTjY/edit#heading=h.l24n0d50js1p

Comment by Julien Fritsch [ 2020-09-29 ]

rpizzi?

Comment by Marko Mäkelä [ 2022-05-04 ]

In MDEV-14992 I wrote:

MDEV-27621 is a case where backup cannot keep up with the server that is writing log. If the server process itself streamed the log like proposed in this task, it could trivially slow down the write workload as necessary.

Making the server stream its ib_logfile0 to the backup should be much easier after the format change MDEV-14425 (MariaDB Server 10.8). That would implement only a subset of MDEV-14992; the full proposal there is that the server would generate an entire physical backup, which the InnoDB write-ahead log is only a part of.

Generated at Thu Feb 08 08:52:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.