[MDEV-31399] mariabackup fails with large innodb redo log - block is overwritten Created: 2023-06-05 Updated: 2023-12-14 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | mariabackup |
| Affects Version/s: | 10.5.17 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephan Vos | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | Backup | ||
| Issue Links: |
|
||||||||
| Description |
|
We had a recent failure where mariabackup failed shortly after starting a full backup of a 1.7 TB database:

[01] 2023-06-04 02:16:20 Copying ./accounting/das_accounting_stats#P#p_202206.ibd to /mnt/local_db_backups/galera_cluster/20230604_full/accounting/das_accounting_stats#P#p_202206.ibd

I saw some older issues referring to something similar, where the suggestion was that the redo log was too small. However, our redo log is quite large (2 GB), and the backup is taken on a quiet node during off-peak times with little or no traffic expected, so I cannot imagine that this is due to an influx of transactions that mariabackup cannot keep up with. |
| Comments |
| Comment by Stephan Vos [ 2023-06-05 ] |
|
Apologies, this should be of type bug but I cannot change it. |
| Comment by Marko Mäkelä [ 2023-06-05 ] |
|
This report seems to duplicate MDEV-19492. Because no affected version was stated in this report, I can't tell for sure, but this might also duplicate another issue. I wouldn't call innodb_log_file_size=2G large nowadays. For write-heavy systems, I would recommend setting the log file size to at least the size of the buffer pool. |
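As an illustration of that sizing advice (the buffer pool value here is hypothetical, not taken from this report), the recommendation could look like this in my.cnf:

```ini
# Hypothetical sizing sketch: redo log at least as large as the buffer pool,
# so a burst of writes is less likely to wrap the log during a backup.
[mariadb]
innodb_buffer_pool_size = 4G
innodb_log_file_size    = 4G
```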
| Comment by Stephan Vos [ 2023-06-05 ] |
|
The version is 10.5.17. Node 2 is part of a 3-node cluster and is not a write-heavy node, especially at 2 a.m. |
| Comment by Marko Mäkelä [ 2023-06-05 ] |
|
Write spikes could also be caused by some internal processing, such as the purge of committed transaction history. Any long-running transaction or long-open read view could cause the history list length (number of committed but not-yet-purged transactions) to increase due to any concurrent write transactions. |
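To see whether purge lag could explain such write spikes, the history list length and any long-running transactions can be inspected; a diagnostic sketch, assuming the standard INFORMATION_SCHEMA tables are available:

```sql
-- History list length: committed but not-yet-purged transactions.
SELECT variable_value AS history_list_length
FROM information_schema.GLOBAL_STATUS
WHERE variable_name = 'INNODB_HISTORY_LIST_LENGTH';

-- Long-running transactions (or long-open read views) that can hold back purge.
SELECT trx_id, trx_started, trx_state
FROM information_schema.INNODB_TRX
ORDER BY trx_started ASC;
```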
| Comment by Stephan Vos [ 2023-06-05 ] |
|
Thanks Marko. |
| Comment by Stephan Vos [ 2023-06-14 ] |
|
It still fails consistently once a week during the full backup, but I think I have figured out that it is caused by a Jira Service Management re-index operation which runs at the same time and seems to be EXTREMELY intensive. Even so, I would not expect it to overwrite a 2 GB redo log (innodb_log_file_size=2G). One question I have, which may or may not relate to the issue at hand: would mariabackup at any stage cause sessions to be disconnected? At the end of the backup (around 20 minutes after), I see about 250 of the messages below coming from various MaxScale instances: |
| Comment by Marko Mäkelä [ 2023-12-14 ] |
|
Possibly; some buffer pool related changes could be relevant here. markus makela, can you comment on the MaxScale question? At the end of the backup, some locks will be acquired, which will mainly disrupt DDL operations on the server. |
| Comment by markus makela [ 2023-12-14 ] |
|
In general, MaxScale shouldn't be issuing any DDL, especially on servers not labeled as Master in MaxScale. The only thing I can think of that might cause something to be interrupted is if the backup blocks some operations that the mariadbmon monitor is doing while auto_failover is enabled; however, this seems very unlikely.

That particular error in the server logs usually indicates that MaxScale closed a TCP connection before authentication completed. This should only happen if MaxScale opened connections to multiple servers and another server was able to fully serve the client. Once the client disconnects, some of the unused connections may still be trying to authenticate, and closing them is not ideal: closing the TCP socket before successfully authenticating counts as an error and increments the max_connect_errors counter, which in turn may end up blocking the whole MaxScale host from accessing the server. If the database was slow, that would explain why it happened, as there is a hard-coded time limit that allows "stale connections" to complete the authentication.

Another, somewhat far-fetched possibility is that the client is using an older MaxScale version that doesn't prevent reads from being routed to non-Master servers when the SERIALIZABLE isolation level is in use. As far as I know, this would result in open transactions on the replicas, which may affect the backups. |
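If the max_connect_errors limit described above is ever reached, the state can be inspected and cleared on the server side; a sketch (exact counter names may vary by server version):

```sql
-- Current threshold for failed connection attempts per host.
SHOW GLOBAL VARIABLES LIKE 'max_connect_errors';

-- Connections that died before authentication completed.
SHOW GLOBAL STATUS LIKE 'Aborted_connects';

-- Reset the per-host error counters, unblocking any blocked hosts
-- (such as a MaxScale host blocked by repeated aborted handshakes).
FLUSH HOSTS;
```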