[MDEV-36159] mariabackup failed after upgrade 10.11.10 - Jira

Details

Type: Bug
Status: Stalled (View Workflow)
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.11.10
Fix Version/s: 10.11
Component/s: Backup
Labels:
- performance

Description

Hi Team,

We upgraded from 10.11.8 to 10.11.10 two weeks ago and since then mariabackup keeps failing with the following logs.

[00] 2025-02-24 00:41:42 Waiting for log copy thread to read lsn 160568393706001

[00] 2025-02-24 00:41:43 Retrying read of log at LSN=160515356496404

[00] 2025-02-24 00:41:44 Retrying read of log at LSN=160515356496404

[00] 2025-02-24 00:41:45 Retrying read of log at LSN=160515356496404

[00] 2025-02-24 00:41:46 Retrying read of log at LSN=160515356496404

[00] 2025-02-24 00:41:47 Retrying read of log at LSN=160515356496404

[00] 2025-02-24 00:41:47 Was only able to copy log from 160487092534834 to 160515356496404, not 160568393706001; try increasing innodb_log_file_size

mariabackup: Stopping log copying thread.

I judged that this was caused by a small innodb_log_file_size value, so I tested to change it to an appropriate value and got the following results.

MariaDB engine ver : 10.11.10 / mariabackup engine ver : 10.11.10
innodb_log_file_size = 1G - failed
innodb_log_file_size = 4G - failed
innodb_log_file_size = 8G - failed

MariaDB engine ver : 10.11.8 / mariabackup engine ver : 10.11.8
innodb_log_file_size = 1G - success
innodb_log_file_size = 4G - success

MariaDB engine ver : 10.11.10 / mariabackup engine ver : 10.11.8
innodb_log_file_size = 1G - success

Is this a new bug different from MDEV-34062?
I would like to know if there is any impact on the acceptability of backups from engine 10.11.10 to mariabackup 10.11.8 in production environments.

Thanks and Regard.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

2025-02-22_pnucedbm02_backup_failed.log
5.87 MB
2025-03-05 01:54
2025-03-01_pnucedbm02_backup_success.log
1.26 MB
2025-03-05 01:54
backup_failed.log
631 kB
2025-03-13 01:31
backup_success.log
409 kB
2025-03-13 01:31
innodb_parameter.txt
25 kB
2025-02-25 08:54
MDEV_36159.txt
1 kB
2025-04-04 07:46
MDEV-36159.zip
5.87 MB
2025-04-04 08:03

Issue Links

duplicates

MDEV-35791 mariadb-backup 10.11.10 failed to create backup

Closed

relates to

MDEV-14992 BACKUP: in-server backup

Open

MDEV-30000 make mariadb-backup to force an innodb checkpoint

Closed

MDEV-34062 mariadb-backup --backup is extremely slow at copying ib_logfile0

Closed

MDEV-34070 mariadb-backup --prepare with larger --use-memory may be slower

Confirmed

MDEV-14992 BACKUP: in-server backup

Open

MDEV-21105 Port clone plugin API (MYSQL_CLONE_PLUGIN) from MySQL

Closed

MDEV-21923 LSN allocation is a bottleneck

Closed

(3 relates to)

Activity

Ascending order - Click to sort in descending order

View 25 older comments

Axel Schwenke added a comment - 2025-04-04 08:55 - edited

I had a closer look at the backup logs for

16GB REDO, SSD->HDD
workload with 8 threads running concurrenty
comparing 10.11.8 (1594s) and 10.11.10 (66s)

10.11.8:

[00] 2025-04-03 16:29:37 Connecting to MariaDB server host: localhost, user: root,

  password: not set, port: not set, socket: /tmp/mysqld.sock.sysbench

[00] 2025-04-03 16:29:37 BACKUP STAGE START

[00] 2025-04-03 16:29:37 Acquiring BACKUP LOCKS...

(copying table spaces)

[00] 2025-04-03 16:30:33 BACKUP STAGE FLUSH

(copying more table spaces)

[00] 2025-04-03 16:30:33 Finished backing up non-InnoDB tables and files

[00] 2025-04-03 16:30:33 Waiting for log copy thread to read lsn 14309104946

-- what happens in those 11 min?

[00] 2025-04-03 16:41:28 Start copying aria log file tail: /home/galera/var/regression//aria_log.00000001

[00] 2025-04-03 16:41:28 Stop copying aria log file tail: /home/galera/var/regression//aria_log.00000001, copied 0 bytes

[00] 2025-04-03 16:41:28 BACKUP STAGE BLOCK_COMMIT

(copy log tables)

(copy stats tables)

[00] 2025-04-03 16:41:28 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...

[00] 2025-04-03 16:41:28 mariabackup: The latest check point (for incremental): '23536505333'

mariabackup: Stopping log copying thread......... (many more dots)

-- what happens in those 15 min?

[00] 2025-04-03 16:56:11 BACKUP STAGE END

[00] 2025-04-03 16:56:11 Executing BACKUP STAGE END

There are two periods where the backup process is mostly idle. The HDD holding the copied redo log sees ~2MB/s writes. That is both lower than the redo log is written on the server (~16MB/s) and what the HDD can do (~170MB/s).

In comparison the log for 10.11.10:

[00] 2025-04-03 19:38:34 Connecting to MariaDB server host: localhost, user: root,

  password: not set, port: not set, socket: /tmp/mysqld.sock.sysbench

...

[00] 2025-04-03 19:38:41 BACKUP STAGE START

[00] 2025-04-03 19:38:41 Acquiring BACKUP LOCKS...

(copying table spaces)

[00] 2025-04-03 19:39:38 BACKUP STAGE FLUSH

(copying more table spaces)

[00] 2025-04-03 19:39:38 Finished backing up non-InnoDB tables and files

[00] 2025-04-03 19:39:38 Waiting for log copy thread to read lsn 14353301653

[00] 2025-04-03 19:39:38 Start copying aria log file tail: /home/galera/var/regression//aria_log.00000001

[00] 2025-04-03 19:39:38 Stop copying aria log file tail: /home/galera/var/regression//aria_log.00000001, copied 0 bytes

[00] 2025-04-03 19:39:38 BACKUP STAGE BLOCK_COMMIT

(copy log tables)

[00] 2025-04-03 19:39:38 mariabackup: The latest check point (for incremental): '12786279177'

[00] 2025-04-03 19:39:38 Waiting for log copy thread to read lsn 18446744073709551615

(copy stats tables)

[00] 2025-04-03 19:39:39 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...

[00] 2025-04-03 19:39:39 BACKUP STAGE END

[00] 2025-04-03 19:39:39 Executing BACKUP STAGE END

Also look at the sequence of events. 10.11.10 prints out the latest check point before copying the stats tables. 10.11.8 does that imediately after copying the log tables and reports the latest checkpoint later.

Axel Schwenke added a comment - 2025-04-04 08:55 - edited I had a closer look at the backup logs for 16GB REDO, SSD->HDD workload with 8 threads running concurrenty comparing 10.11.8 (1594s) and 10.11.10 (66s) 10.11.8: [00] 2025-04-03 16:29:37 Connecting to MariaDB server host: localhost, user: root, password: not set, port: not set, socket: /tmp/mysqld.sock.sysbench [00] 2025-04-03 16:29:37 BACKUP STAGE START [00] 2025-04-03 16:29:37 Acquiring BACKUP LOCKS... (copying table spaces) [00] 2025-04-03 16:30:33 BACKUP STAGE FLUSH (copying more table spaces) [00] 2025-04-03 16:30:33 Finished backing up non-InnoDB tables and files [00] 2025-04-03 16:30:33 Waiting for log copy thread to read lsn 14309104946 -- what happens in those 11 min? [00] 2025-04-03 16:41:28 Start copying aria log file tail: /home/galera/var/regression//aria_log.00000001 [00] 2025-04-03 16:41:28 Stop copying aria log file tail: /home/galera/var/regression//aria_log.00000001, copied 0 bytes [00] 2025-04-03 16:41:28 BACKUP STAGE BLOCK_COMMIT (copy log tables) (copy stats tables) [00] 2025-04-03 16:41:28 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS... [00] 2025-04-03 16:41:28 mariabackup: The latest check point (for incremental): '23536505333' mariabackup: Stopping log copying thread......... (many more dots) -- what happens in those 15 min? [00] 2025-04-03 16:56:11 BACKUP STAGE END [00] 2025-04-03 16:56:11 Executing BACKUP STAGE END There are two periods where the backup process is mostly idle. The HDD holding the copied redo log sees ~2MB/s writes. That is both lower than the redo log is written on the server (~16MB/s) and what the HDD can do (~170MB/s). In comparison the log for 10.11.10: [00] 2025-04-03 19:38:34 Connecting to MariaDB server host: localhost, user: root, password: not set, port: not set, socket: /tmp/mysqld.sock.sysbench ... [00] 2025-04-03 19:38:41 BACKUP STAGE START [00] 2025-04-03 19:38:41 Acquiring BACKUP LOCKS... (copying table spaces) [00] 2025-04-03 19:39:38 BACKUP STAGE FLUSH (copying more table spaces) [00] 2025-04-03 19:39:38 Finished backing up non-InnoDB tables and files [00] 2025-04-03 19:39:38 Waiting for log copy thread to read lsn 14353301653 [00] 2025-04-03 19:39:38 Start copying aria log file tail: /home/galera/var/regression//aria_log.00000001 [00] 2025-04-03 19:39:38 Stop copying aria log file tail: /home/galera/var/regression//aria_log.00000001, copied 0 bytes [00] 2025-04-03 19:39:38 BACKUP STAGE BLOCK_COMMIT (copy log tables) [00] 2025-04-03 19:39:38 mariabackup: The latest check point (for incremental): '12786279177' [00] 2025-04-03 19:39:38 Waiting for log copy thread to read lsn 18446744073709551615 (copy stats tables) [00] 2025-04-03 19:39:39 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS... [00] 2025-04-03 19:39:39 BACKUP STAGE END [00] 2025-04-03 19:39:39 Executing BACKUP STAGE END Also look at the sequence of events. 10.11.10 prints out the latest check point before copying the stats tables. 10.11.8 does that imediately after copying the log tables and reports the latest checkpoint later.

Marko Mäkelä added a comment - 2025-04-04 09:51

Thank you, axel. Your results are completely in line with my expectations.

Like I wrote earlier, the new format of ib_logfile0 allows the server to write changes much faster. With the fix of ~~MDEV-21923~~ it will be even faster. At the same time, the very same format change is necessarily slowing down mariadb-backup --backup.

The proper way to fix this is to avoid the parsing, which is necessarily single-threaded. This is best done by implementing backup in the server or making the mariadbd process responsible for creating another copy of the log for backup. In that way, the copy could be written in larger chunks, just like the crash recovery log is being written. An even better alternative might be to implement backups using something like the MySQL 8.0 Clone plugin functionality (~~MDEV-21105~~).

Marko Mäkelä added a comment - 2025-04-04 09:51 Thank you, axel . Your results are completely in line with my expectations. Like I wrote earlier, the new format of ib_logfile0 allows the server to write changes much faster. With the fix of MDEV-21923 it will be even faster. At the same time, the very same format change is necessarily slowing down mariadb-backup --backup . The proper way to fix this is to avoid the parsing, which is necessarily single-threaded. This is best done by implementing backup in the server or making the mariadbd process responsible for creating another copy of the log for backup. In that way, the copy could be written in larger chunks, just like the crash recovery log is being written. An even better alternative might be to implement backups using something like the MySQL 8.0 Clone plugin functionality ( MDEV-21105 ).

Debarun Banerjee added a comment - 2 days ago

~~MDEV-34062~~ in 10.11.10 introduced innodb_log_file_mmap and makes it default for backup. While it improves the performance for default case when the redo log is written directly to disk by server it could impact the case when innodb_log_file_buffering=ON in server when redo log write is done via OS buffer cache. It is apparent in such case doing a buffered read would synchronize read-write operations better.

Looking at the history of changes, we can see that buffered IO for redo logging was removed sometime back for linux and windows. A bit later we introduce a parameter back to get back the buffered write behaviour.

MDEV-28111: 10.8.3 [Redo log writes are being buffered on Linux for no good reason]

MDEV-28766: 10.8.4 [MDEV-28111 breaks innodb_flush_log_at_trx_commit=2]

        - innodb_log_file_buffering=OFF [default]

        - innodb_log_file_buffering=ON  [if innodb_flush_log_at_trx_commit=2]

Now from the customer's server configuration, we see ...

innodb_flush_log_at_trx_commit  2

innodb_log_file_buffering  ON

innodb_log_file_mmap   OFF

Where as for mariadb-backup, innodb_log_file_mmap = ON (default),

So, to get back the 10.11.8 behaviour, it is advisable to set innodb_log_file_mmap=OFF while running mariadb-backup. This should be recommended whenever server innodb_log_file_buffering is set to ON. AFAICS, mariadb-backup default should still stay as innodb_log_file_mmap=ON as it performs significantly better with server default innodb_log_file_buffering (=OFF).

Debarun Banerjee added a comment - 2 days ago MDEV-34062 in 10.11.10 introduced innodb_log_file_mmap and makes it default for backup. While it improves the performance for default case when the redo log is written directly to disk by server it could impact the case when innodb_log_file_buffering=ON in server when redo log write is done via OS buffer cache. It is apparent in such case doing a buffered read would synchronize read-write operations better. Looking at the history of changes, we can see that buffered IO for redo logging was removed sometime back for linux and windows. A bit later we introduce a parameter back to get back the buffered write behaviour. MDEV-28111: 10.8.3 [Redo log writes are being buffered on Linux for no good reason] MDEV-28766: 10.8.4 [MDEV-28111 breaks innodb_flush_log_at_trx_commit=2] - innodb_log_file_buffering=OFF [default] - innodb_log_file_buffering=ON [if innodb_flush_log_at_trx_commit=2] Now from the customer's server configuration, we see ... innodb_flush_log_at_trx_commit 2 innodb_log_file_buffering ON innodb_log_file_mmap OFF Where as for mariadb-backup, innodb_log_file_mmap = ON (default), So, to get back the 10.11.8 behaviour, it is advisable to set innodb_log_file_mmap=OFF while running mariadb-backup. This should be recommended whenever server innodb_log_file_buffering is set to ON. AFAICS, mariadb-backup default should still stay as innodb_log_file_mmap=ON as it performs significantly better with server default innodb_log_file_buffering (=OFF).

Vladislav Vaintroub added a comment - 2 days ago

as mariabackup can (and does) read server's parameters, it probably can optimize its redo log access, to be compatible with server's redo buffered-ness.

Vladislav Vaintroub added a comment - 2 days ago as mariabackup can (and does) read server's parameters, it probably can optimize its redo log access, to be compatible with server's redo buffered-ness.

Debarun Banerjee added a comment - 2 days ago

It could be an idea but is not completely void of impacts. It would mean ignoring user specified choices and defaults and could result in a change in behaviour. With MDEV-35248 in horizon, I am not sure if we should go for behaviour changes in current backup.

Debarun Banerjee added a comment - 2 days ago It could be an idea but is not completely void of impacts. It would mean ignoring user specified choices and defaults and could result in a change in behaviour. With MDEV-35248 in horizon, I am not sure if we should go for behaviour changes in current backup.

MariaDB Server

mariabackup failed after upgrade 10.11.10

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration