[MDEV-20219] PURGE BINARY LOGS does not always purge Created: 2019-07-30 Updated: 2023-12-05 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Replication, Server |
| Affects Version/s: | 10.3.16 |
| Fix Version/s: | 10.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Oli Sennhauser | Assignee: | Brandon Nesterenko |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | binary, logging, logs, purge | ||
| Environment: |
Ubuntu 16.04 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
Purging binary logs does not always purge and delete it. This typically happens when only one binary log is there (little traffic or short purge interval). This is reproducible and does NOT happen with a competitive product. Good case: ll ~/database/mariadb-103/binlog/ ~/bck/mariadb-103/binlog/ ; cat ~/database/mariadb-103/binlog/binlog.index /home/mysql/bck/mariadb-103/binlog/: /home/mysql/database/mariadb-103/binlog/: /home/mysql/database/mariadb-103/binlog/binlog.000393 bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 Start backup at 2019-07-30_18:12:39 Copy /home/mysql/database/mariadb-103/binlog/binlog.000393 to /home/mysql/bck/mariadb-103/binlog/bck_binlog.000393 Copy /home/mysql/database/mariadb-103/binlog/binlog.000394 to /home/mysql/bck/mariadb-103/binlog/bck_binlog.000394 ll ~/database/mariadb-103/binlog/ ~/bck/mariadb-103/binlog/ ; cat ~/database/mariadb-103/binlog/binlog.index /home/mysql/bck/mariadb-103/binlog/: /home/mysql/database/mariadb-103/binlog/: /home/mysql/database/mariadb-103/binlog/binlog.000395 Bad case: bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 Start backup at 2019-07-30_18:12:56 Copy /home/mysql/database/mariadb-103/binlog/binlog.000395 to /home/mysql/bck/mariadb-103/binlog/bck_binlog.000395 ll ~/database/mariadb-103/binlog/ ~/bck/mariadb-103/binlog/ ; cat ~/database/mariadb-103/binlog/binlog.index /home/mysql/bck/mariadb-103/binlog/: /home/mysql/database/mariadb-103/binlog/: /home/mysql/database/mariadb-103/binlog/binlog.000395 This makes behaviour completely unpredictable for more sophisticated backup software and follow-up backup jobs will fail. |
| Comments |
| Comment by Andrei Elkin [ 2021-06-08 ] |
|
Might relate to |
| Comment by Brandon Nesterenko [ 2021-09-02 ] |
|
Hi oli! If you are still following this issue, I have a question. I see in the brman source code in lib/doBinlogBackup.inc, before the PURGE BINARY LOGS TO call, there is a comment. Does your environment have any slaves when you encounter this issue? In particular, if an old binlog is being read/used anywhere when the PURGE is issued, it will be ignored in the purge. A few examples of this are: Given your observation that this issue occurs with short purge intervals, if something is reading an old binlog file during the purge, I could see this issue happening. I have a proposed fix, but would first like to confirm the cause of the issue. Thanks! |
| Comment by Oli Sennhauser [ 2021-09-13 ] |
|
Hello Brandon Sorry for my late replay but I was in vacation... The comment is more about that one should not purge binary logs if you have slaves and they did not read/apply those binary logs yet... That was a complete testing environment. Typically mariadb-103 is a stand-alone instance. Otherwise I would call it something like mariadb-103m or so. But I cannot guarantee 100% for this... So most probably NO slave. According to my notes this should simulate the problem: "This typically happens when only one binary log is there (little traffic or short purge interval)." So I can try again, if I still can reproduce the issue without our brman... BR, |
| Comment by Brandon Nesterenko [ 2021-09-21 ] |
|
Hi Oli! Thanks for the reply! I have a thought, but would like to ask a few follow-up questions first: 1) In your test environment, do you remember how the data was being ingested? I saw that the utl folder in the brman source code has some scripts that create/populate tables. Do you remember if you used one of those? (to aid in my own attempts to correctly reproduce your behavior). 2) Do you have information on the cfg file and options used for running your MariaDB server instance? 3) If you could reproduce this, that would be great. Could you also attach the last two binlogs, e.g. binlog.000395 and binlog.000396, to aid in my analysis? Additionally if you can only reproduce this using brman, could you additionally attach the backup version of the problem binlog, e.g. bck_binlog.000395.gz? Thanks! |
| Comment by Oli Sennhauser [ 2021-09-22 ] |
|
Hello Brandon OK. It tried to reproduce it. But without success! There were to many things changing since 2019: Machine?, Ubuntu version (16.04 vs 18.04), MariaDB version 10.3.16 vs. 10.3.20, brman version, config, etc...
while [ 1 ] ; do bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 | grep 'End backup' ; sleep 60 ; done
Do you have an idea what I shall try to provoke the problem? I will keep the system in the current state for a while... Regards, |
| Comment by Oli Sennhauser [ 2021-09-22 ] |
|
Before the interval was 60 seconds. But if I do the interval to random I get an error... while [ 1 ] ; do bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 | grep 'End backup' ; sleep $(echo $(($RANDOM % 60))) ; done I try to narrow down the problem. |
| Comment by Oli Sennhauser [ 2021-09-22 ] |
|
It happens randomly: End backup at 2021-09-22 10:35:53 (rc=0) |
| Comment by Oli Sennhauser [ 2021-09-22 ] |
|
I just found, that we optimized the code yet in new brman releases: |
| Comment by Oli Sennhauser [ 2021-09-22 ] |
|
Fix was introduced in v2.2.1 |
| Comment by Oli Sennhauser [ 2021-09-22 ] |
|
With v2.2.0 it looks better: |
| Comment by Oli Sennhauser [ 2021-09-22 ] |
|
Files attached... |
| Comment by Oli Sennhauser [ 2021-09-22 ] |
|
So I can break it now regularly but not at will... |
| Comment by Brandon Nesterenko [ 2021-09-22 ] |
|
Thanks for your effort into reproducing this error! For clarification, in an earlier comment you mentioned that you added a slave into your testing environment. Did you stop/remove it afterwards? That is, was it running when you were able to reproduce the error? |
| Comment by Oli Sennhauser [ 2021-09-23 ] |
|
Slave WAS running when I reproduced the problem. Running it again now with a shutdown slave. |
| Comment by Oli Sennhauser [ 2021-09-23 ] |
|
I tried it again with and without a Slave. The results lead me to the conclusion that it is about 95% sure that it is the Slave: Without a Slave: mysql@chef:~ [mariadb-103, 3318]> while [ 1 ] ; do /home/mysql/product/brman-2.2.0/bin/bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 | grep 'End backup' ; sleep $(echo $(($RANDOM % 20))) ; done Starting the Slave: mysql@chef:~ [mariadb-103, 3318]> while [ 1 ] ; do /home/mysql/product/brman-2.2.0/bin/bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 | grep 'End backup' ; sleep $(echo $(($RANDOM % 20))) ; done Removed the problematic file. Then: End backup at 2021-09-23 08:20:40 (rc=0) Removed the problematic file. Then: End backup at 2021-09-23 08:26:50 (rc=0) Stop slave. Then: End backup at 2021-09-23 08:38:02 (rc=0) |
| Comment by Oli Sennhauser [ 2021-09-23 ] |
|
It was running now for 6 hours without a problem... If we calculate every 30 seconds a run in average this makes about 2400 runs. But then it broke again! WITHOUT a Slave attached... ... So it looks like this happens more probable WITH Slave. But it also happens rarely WITHOUT Slave... Does this make sense? |
| Comment by Brandon Nesterenko [ 2021-09-23 ] |
|
Interesting! Can you attach the problematic binary log file (live from the base directory), its gzipped backup version (assuming it was not overwritten?), and the next binary log file? So that would be binlog.003351, bck_binlog.003351.gz, and binlog.003352. Edit: The previous binlog may be helpful too, binlog.003350 |
| Comment by Brandon Nesterenko [ 2021-09-24 ] |
|
Hmm, my hunch didn't pan out. Thanks for attaching the binlog files. Can you also attach the applicable server/error logs? |
| Comment by Oli Sennhauser [ 2021-09-27 ] |
|
Uploaded. |