[MDEV-31399] mariabackup fails with large innodb redo log - block is overwritten Created: 2023-06-05  Updated: 2023-12-14

Status: Open
Project: MariaDB Server
Component/s: mariabackup
Affects Version/s: 10.5.17
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephan Vos Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: Backup

Issue Links:
Duplicate
duplicates MDEV-19492 Mariabackup hangs if table populated ... Stalled

 Description   

We had a recent failure where mariabackup aborted shortly after starting a full backup of a 1.7 TB database.

[01] 2023-06-04 02:16:20 Copying ./accounting/das_accounting_stats#P#p_202206.ibd to /mnt/local_db_backups/galera_cluster/20230604_full/accounting/das_accounting_stats#P#p_202206.ibd
FILE_CHECKPOINT(18207486069808) ignored at 18208436535773
FILE_CHECKPOINT(18207490734604) ignored at 18208443311562
[00] 2023-06-04 02:16:27 Retrying read of log at LSN=18208449177088
[00] 2023-06-04 02:16:27 Retrying read of log at LSN=18208449177088
[... the "Retrying read of log at LSN=18208449177088" message repeats roughly 100 more times over the next second ...]
[00] FATAL ERROR: 2023-06-04 02:16:28 xtrabackup_copy_logfile() failed: redo log block is overwritten, please increase redo log size with innodb_log_file_size parameter.
Jun 04 02:16:28 Backup return status: 1

I saw some older issues describing something similar, and the suggestion there was that the redo log was too small. However, our redo log is quite large (2 GB), and the backup is taken on a quiet node during off-peak hours with little or no traffic expected, so I cannot imagine this is due to an influx of transactions that mariabackup could not keep up with.



 Comments   
Comment by Stephan Vos [ 2023-06-05 ]

Apologies, this should be of type bug but I cannot change it.

Comment by Marko Mäkelä [ 2023-06-05 ]

This report seems to duplicate MDEV-19492.

Because no affected version was stated in this report, I can’t tell for sure, but this might also duplicate MDEV-28772, which was fixed in MariaDB Server 10.9.4.

I wouldn’t call innodb_log_file_size=2G large nowadays. For write-heavy systems, I would recommend setting the log file size to at least the size of the buffer pool. Recently, MDEV-29911 improved the memory usage when restoring a backup or recovering from a crash.
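To illustrate the suggestion above, a hypothetical my.cnf fragment (the values are placeholders for illustration, not tuning advice for this particular system):

```ini
[mariadb]
# Redo log capacity; on write-heavy nodes, sized closer to the buffer pool
# rather than the traditional small default. Placeholder value.
innodb_log_file_size = 32G

# Buffer pool size as reported later in this issue for the affected node.
innodb_buffer_pool_size = 464G
```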

Comment by Stephan Vos [ 2023-06-05 ]

Version is 10.5.17

Node 2 is part of a 3-node cluster and is not a write-heavy node, especially at 2 am.
I would be surprised if it's an INSERT...SELECT, as the node is only used for reporting and Jira Service Management.
Our buffer pool is 464 GB, mostly to be consistent with our more write-heavy nodes 1 & 3.
I can look at increasing innodb_log_file_size, but I would be hesitant to set it to 464 GB as startup could then be quite slow?

Comment by Marko Mäkelä [ 2023-06-05 ]

Write spikes could also be caused by some internal processing, such as the purge of committed transaction history. Any long-running transaction or long-open read view could cause the history list length (the number of committed but not-yet-purged transactions) to grow whenever there are concurrent write transactions.
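As a sketch of how one might check for this on a MariaDB 10.5 server (these are standard status variables and information_schema tables, but verify against your version):

```sql
-- Committed-but-not-purged transaction history (the "history list length")
SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length';

-- Oldest currently open transactions, which can pin the purge
SELECT trx_id, trx_started, trx_state, trx_rows_modified
FROM information_schema.innodb_trx
ORDER BY trx_started
LIMIT 5;
```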

Comment by Stephan Vos [ 2023-06-05 ]

Thanks Marko.
I will look at increasing the buffer pool as the first step.

Comment by Stephan Vos [ 2023-06-14 ]

It still fails consistently once a week during the full backup, but I think I have figured out that it is caused by a Jira Service Management re-index operation which runs at the same time and seems to be EXTREMELY intensive. Even so, I would not expect it to overwrite a 2 GB redo log (innodb_log_file_size=2G).
I have disabled the automated Jira re-index process, so later this week I will see whether that was indeed the culprit.

One question I have, which may or may not relate to the issue at hand: would mariabackup at any stage cause sessions to be disconnected? At the end of the backup (around 20 minutes after it finishes) I see about 250 of the messages below, coming from various MaxScale instances:
Jun 14 02:33:22 localhost mariadbd[2644]: 2023-06-14 2:33:22 27533228 [Warning] Aborted connection 27533228 to db: 'unconnected' user: 'maxscale' host: 'x.x.x.x' (Got an error reading communication packets)
It looks like MaxScale is trying to connect to the database but is not able to do so successfully, and retries a few times.

Comment by Marko Mäkelä [ 2023-12-14 ]

Possibly, some buffer pool related changes such as MDEV-26055 may have helped here. It could also be useful to modify the backup so that the server would execute a log checkpoint at the start (MDEV-30000).

markus makela, can you comment on the MaxScale question? At the end of the backup, some locks will be acquired, which will disrupt mainly DDL operations on the server.

Comment by markus makela [ 2023-12-14 ]

In general, MaxScale shouldn't be executing any DDL, especially on servers not labeled as Master in MaxScale. The only thing I can think of that might cause something to be interrupted is if the backup blocks some operations that the mariadbmon monitor is doing and auto_failover is enabled. However, this seems very unlikely.

That particular error in the server logs usually indicates that MaxScale closed a TCP connection before authentication completed. This should only happen if MaxScale opened connections to multiple servers and another server was able to fully serve the client. Once the client disconnects, it may be that some of the unused connections were still trying to authenticate, and closing them is not ideal: closing the TCP socket before successfully authenticating counts as an error and increments the max_connect_errors counter, which in turn may end up blocking the whole MaxScale host from accessing the server. If the database was slow, that would explain why this happened, as there is a hard-coded time limit that allows "stale connections" to complete the authentication.
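If a blocked MaxScale host is suspected, a quick check on the server side might look like the following (standard MariaDB statements, shown as a sketch rather than specific advice for this deployment):

```sql
-- How many connection errors are tolerated before a host is blocked
SHOW GLOBAL VARIABLES LIKE 'max_connect_errors';

-- Clear the per-host error counters if the MaxScale host has been blocked
FLUSH HOSTS;
```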

Another, somewhat far-fetched, possibility is that the client is using an older MaxScale version that doesn't prevent reads from being routed to non-Master servers when the SERIALIZABLE isolation level is in use. As far as I know, this would result in open transactions on the replicas, which may affect the backups.

Generated at Thu Feb 08 10:23:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.