[MDEV-16883] MariaBackup very slow after upgrading from 10.2.14 to 10.3.8 Created: 2018-08-02  Updated: 2018-10-05  Resolved: 2018-10-05

Status: Closed
Project: MariaDB Server
Component/s: Backup
Affects Version/s: 10.3.8
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Peter Warmer Assignee: Vladislav Vaintroub
Resolution: Cannot Reproduce Votes: 0
Labels: mariabackup
Environment:

Fedora Server 27, Galera cluster with 3 nodes, DB size ~ 1.4TB


Attachments: Text File 10_2_14.log     Text File 10_3_8.log     File backup-DB.sh     File backup-mysql.sh     File my.cnf    
Issue Links:
Relates
relates to MDEV-15730 mariabackup: rename --stream=xbstream... Closed

 Description   

We have a 3 node Galera cluster and a separate backup server where we run MariaBackup. The backup worked well with version 10.2.14. After upgrading the 4 servers to MariaDB / MariaBackup 10.3.8 the backup is much slower (as you can see in the attached log files).
On the backup server we run this from a script:

/usr/bin/ssh backup@$IP "/usr/bin/sudo /usr/local/bin/backup-mysql.sh" | mbstream -x -C $BACKUPDIR/$1/$WEEKDAY

/usr/bin/mariabackup --host=$IP --user=backup --password=xxxx --use-memory=10G --prepare --target-dir $BACKUPDIR/$1/$WEEKDAY

and this is called on the DB server:

/usr/bin/mariabackup --backup --use-memory=10G --galera-info --ftwrl-wait-threshold=40 --ftwrl-wait-query-type=all --ftwrl-wait-timeout=180 --kill-long-queries-timeout=20 --kill-long-query-type=all --user=backup --password=xxxx --stream=xbstream --target-dir /var/tmp



 Comments   
Comment by Vladislav Vaintroub [ 2018-08-02 ]

There is no big difference in Mariabackup between 10.2 and 10.3, apart from the code changes necessary to keep it compilable after the InnoDB refactoring.

(you can compare extra/mariabackup subtrees in the source code)

To exclude external factors such as networking, can you, peterw17, back up the same dataset in 10.2 and 10.3 locally, i.e. without --stream, and measure the difference?
Please do the same for --prepare and measure the difference there too.

Comment by Peter Warmer [ 2018-08-02 ]

Hi Vladislav,
Thank you for the quick answer. We no longer have a 10.2 installation after upgrading to 10.3. So I will do a local backup with 10.3 tonight and send you the log file tomorrow.

Comment by Peter Warmer [ 2018-08-03 ]

Hi Vladislav,
the local backup with 10.3 has finished now and took about 6 hours, roughly twice as long as with 10.2. Is it possible that the --use-memory parameter doesn't work correctly?

Comment by Vladislav Vaintroub [ 2018-08-03 ]

What takes more time compared to 10.2: --backup or --prepare?
--use-memory has no effect on --backup, because --backup does not use the buffer pool.

Comment by Peter Warmer [ 2018-08-03 ]

The --backup took much longer, the --prepare not so much. So it's not --use-memory.

Comment by Vladislav Vaintroub [ 2018-08-03 ]

Can you attach the exact command line and the server's my.cnf file?

Comment by Peter Warmer [ 2018-08-06 ]

Three additional files are attached now: the my.cnf of the DB server, backup-DB.sh (which runs on the backup server and starts backup-mysql.sh on one of the DB nodes via ssh), and backup-mysql.sh itself.

Comment by Peter Warmer [ 2018-08-08 ]

I switched back to our original backup setup with --stream, and the --prepare part takes much longer with --stream than it did during the local backup I ran for testing. So maybe the MariaDB client software is slowing down the backup? There is no my.cnf configuration for the client on the backup server, and I think there was none before the update to 10.3 either. Is there any way to improve the client configuration?

Comment by Vladislav Vaintroub [ 2018-08-08 ]

--stream has nothing to do with the MariaDB client software; --stream writes the output to stdout instead of a directory.
The MariaDB client is used only minimally; there are no more than maybe a couple of queries during a backup.

If something is slowing things down, it can be anything on your network, because people like to combine --stream with netcat etc.
Slower streaming leads to a slower backup; a slower backup leads to a larger redo log; a larger redo log leads to longer recovery; and longer recovery leads to a slower --prepare.

Comment by Peter Warmer [ 2018-08-10 ]

So the real problem is the --backup part, which for us takes twice as long in 10.3 as in 10.2, both with --stream and with a local backup. We only made the update from 10.2 to 10.3 and didn't change anything else in the configuration of the DB, OS, or network. Could you find anything suspicious in the attached config and log files?

Comment by Vladislav Vaintroub [ 2018-08-10 ]

To be honest, there is not enough information I can gather from this bug report: there is no "how to reproduce" info, only some speculation about --stream and an "it is slow for me" section. As I mentioned, there is no difference between 10.2 and 10.3.
Mariabackup's parameter innodb_checksum_algorithm changed from "innodb" to "crc32" (in 10.2.16 / 10.3.8, due to MDEV-16457), and this could affect performance, but you would have to check for yourself what happens if you set the parameter back.
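
To make that concrete, a hedged sketch of such a test (paths and credentials here are placeholders, not taken from this report):

```sh
# Re-run the backup forcing the pre-10.2.16/10.3.8 default checksum
# algorithm; if most pages carry old-style "innodb" checksums, this
# avoids the extra verification work the new crc32 default incurs.
mariabackup --backup \
    --innodb-checksum-algorithm=innodb \
    --user=backup --password=xxxx \
    --target-dir=/var/tmp/checksum-test
```

Comparing the wall-clock time of this run against a run with the default crc32 setting would show whether the changed default explains the slowdown.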

If you would like further help, I think you'd need to add a "how to reproduce" section or collect profiler info. We would also need information on the size of the backup (the data and the redo log). The current bug description is not very actionable.

Comment by Peter Warmer [ 2018-08-13 ]

For the moment I have no additional information for you, but one last question: if you compare the beginnings of the attached logs 10_2_14.log and 10_3_8.log, there are many more "log scanned up to ..." lines in 10.3 than in 10.2. Is it possible that some buffer size(s) should be increased? I tried --use-memory and --innodb-buffer-pool-size; neither changed this. Do you have any other suggestions for what to increase?

Comment by Vladislav Vaintroub [ 2018-08-13 ]

I gave the suggestion last time: change innodb_checksum_algorithm in mariabackup to "innodb" and see if it fares better. The default has changed, and if you happen to have a lot of pages checksummed with the old checksum algorithm (which is possible if you upgraded e.g. from 10.1), the new default will slow down the checksumming process, which is relatively time-consuming. Maybe your case is the reverse of MDEV-16457, which was correctly solved to favor the new 10.2-style CRC32 page checksums.

"log scanned up to ..." is printed periodically by the redo log copying thread as a progress message.
I do not think any buffer sizes need to be increased. I'm not even aware of any large buffers during --backup, except the ones used for copying files, and those have not changed. The InnoDB buffer pool plays no role during --backup.

You can use --parallel to speed up the backup processing, and you can back up when there is not a lot of write activity. I have no other suggestions for the moment; a bug database is not the right place for tuning advice, so the mailing list, for example, would be a better venue.
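
As an illustration of the --parallel suggestion, a sketch based on the command line from the description (the thread count is an example value, to be matched to the server's cores and disk bandwidth):

```sh
# Copy tablespace files with 4 worker threads instead of 1.
mariabackup --backup \
    --parallel=4 \
    --user=backup --password=xxxx \
    --stream=xbstream --target-dir=/var/tmp
```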

Comment by Vladislav Vaintroub [ 2018-09-11 ]

winstone, could you try to reproduce the "very slow" mariabackup in 10.3 vs 10.2? If you do not find anything, feel free to close this one.

Comment by Zdravelina Sokolovska (Inactive) [ 2018-09-11 ]

peterw17
hello Peter,
you mentioned that you deploy 3 Galera nodes, but you have 5 nodes in the configuration. Can you check the wsrep_cluster_size variable?

Comment by Peter Warmer [ 2018-09-12 ]

Hello winstone,
running SHOW GLOBAL STATUS LIKE 'wsrep_%'; I get:

Variable_name Value

.
.

wsrep_cluster_size 3

.
.

wsrep_incoming_addresses 10.0.50.241:3306,10.0.50.240:3306,10.0.50.242:3306

The wsrep_incoming_addresses match "wsrep_cluster_address=gcomm://10.0.50.240,10.0.50.241,10.0.50.242" in my.cnf. Is there any other configuration for the cluster members that I missed?

Comment by Zdravelina Sokolovska (Inactive) [ 2018-09-12 ]

peterw17 ,
can you please send the output of
du --summarize /var/lib/mysql/
and
du --summarize $BACKUPDIR/$1/$WEEKDAY
from your backups with versions 10.2.14 and 10.3.8?

Comment by Peter Warmer [ 2018-09-12 ]

Hello winstone,
Because our servers were upgraded from 10.2.14 to 10.3.8, I don't have a 10.2.14 installation anymore. For 10.3.8, last night's backup gives:

root@dbcluster02:/var/backup/DB/db02# du --summarize Tue/
1542557240 Tue/

and the current disk usage of the DB is:

root@dbcluster02:/var/backup/DB/db02# du --summarize /var/lib/mysql/
1551101564 /var/lib/mysql/

Comment by Stijn [ 2018-09-26 ]

We had a similar issue with mysqldump being 2-3 times slower after the upgrade (https://jira.mariadb.org/browse/MDEV-14772). The fix was to put "stdbuf -o 10MB" in front of the backup command.
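
For context, applying that workaround to the streamed mariabackup pipeline from the description might look like this (a sketch; the restore directory is a placeholder):

```sh
# Enlarge the producer's stdout buffer to 10 MB so the pipe is fed
# in large chunks instead of small default-sized writes.
stdbuf -o 10MB mariabackup --backup --stream=xbstream \
    --user=backup --password=xxxx --target-dir=/var/tmp \
  | mbstream -x -C /var/backup/restore-dir
```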

Comment by Peter Warmer [ 2018-09-27 ]

Hello Stijn,
thank you very much for your hint. Unfortunately "stdbuf -o 10MB" did not speed up our backup; it took nearly the same time as before.

Comment by Zdravelina Sokolovska (Inactive) [ 2018-10-05 ]

hello stijn, excuse me for the delayed feedback. You are evidently operating with growing data volumes. I intended to reproduce this with a data set loaded to near your size, but unfortunately my environment has network bottlenecks, and in addition such a huge data load performs very slowly.
So I started measuring mariabackup times on smaller DBs.
The test results show no difference in timing between backups taken with or without the --stream=xbstream option,
but running the backup with --rsync --parallel= improves the times several-fold.
I checked the best time for a regression, and none was seen.

10.2.14-MariaDB, mariabackup on an 80G data set:

Backup without --stream            Duration: 1 hours 29 minutes 15 seconds
Prepare                            Duration: 0 hours 0 minutes 1 seconds
Backup with --stream=xbstream      Duration: 1 hours 29 minutes 25 seconds
mbstream operation                 Duration: 0 hours 26 minutes 47 seconds
Backup with --rsync --parallel=6   Duration: 0 hours 39 minutes 13 seconds

10.3.10-MariaDB:

Backup with --rsync --parallel=6   Duration: 0 hours 35 minutes 3 seconds
Prepare                            Duration: 0 hours 0 minutes 2 seconds

Comment by Vladislav Vaintroub [ 2018-10-05 ]

I'm going to close this one, because nobody could localize the problem. Backup time in your case can depend on many factors, such as network load and database load.

Check whether anything changes if you use --innodb-checksum-algorithm=innodb (for mariabackup); it could have an impact, since MDEV-16457 changed the default. Use --parallel to speed up the backup. Back up when database load is low. Those are generic tips, since nothing specific is known about your case. To learn anything specific, we'd need some kind of performance data, which you could perhaps collect with perf (while doing an on-disk backup, to isolate it from external networking problems). If you have performance data, you can attach it to the bug report and we can look further.
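
A hedged sketch of such a perf collection during a local, on-disk backup (file and directory names are placeholders):

```sh
# Record call-graph samples while mariabackup runs, then summarize
# where CPU time was spent.
perf record -g -o mariabackup.perf.data -- \
    mariabackup --backup --user=backup --password=xxxx \
    --target-dir=/var/tmp/perf-test
perf report --stdio -i mariabackup.perf.data > mariabackup-perf.txt
```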

As I mentioned, there is practically no difference in backup between 10.2 and 10.3, so the difference you see will most likely be something external.

Generated at Thu Feb 08 08:32:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.