[MDEV-20219] PURGE BINARY LOGS does not always purge Created: 2019-07-30  Updated: 2023-12-05

Status: Stalled
Project: MariaDB Server
Component/s: Replication, Server
Affects Version/s: 10.3.16
Fix Version/s: 10.3

Type: Bug Priority: Major
Reporter: Oli Sennhauser Assignee: Brandon Nesterenko
Resolution: Unresolved Votes: 0
Labels: binary, logging, logs, purge
Environment:

Ubuntu 16.04


Attachments: File bck_binlog.000604.gz     File bck_binlog.000605     File bck_binlog.000605.gz     File binlog.000606     File chef_mariadb-103_error.log     File problematic_binary_logs_335x.tar.gz     File purge.sh    
Issue Links:
Relates
relates to MDEV-32953 main.rpl_mysqldump_slave Fails with "... Closed

 Description   

Purging binary logs does not always purge and delete it. This typically happens when only one binary log is there (little traffic or short purge interval). This is reproducible and does NOT happen with a competitive product.

Good case:

ll ~/database/mariadb-103/binlog/ ~/bck/mariadb-103/binlog/ ; cat ~/database/mariadb-103/binlog/binlog.index

/home/mysql/bck/mariadb-103/binlog/:
total 1560
rw-rr- 1 mysql mysql 779746 Jul 30 18:02 bck_binlog.000390.gz
rw-rr- 1 mysql mysql 424587 Jul 30 18:03 bck_binlog.000391.gz
rw-rr- 1 mysql mysql 388928 Jul 30 18:04 bck_binlog.000392.gz

/home/mysql/database/mariadb-103/binlog/:
total 30912
rw-rw--- 1 mysql mysql 13367733 Jul 30 18:07 binlog.000393
rw-rw--- 1 mysql mysql 18268189 Jul 30 18:12 binlog.000394
rw-rw--- 1 mysql mysql 108 Jul 30 18:07 binlog.index
rw-rw--- 1 mysql mysql 12 Jul 25 09:56 binlog.state

/home/mysql/database/mariadb-103/binlog/binlog.000393
/home/mysql/database/mariadb-103/binlog/binlog.000394

bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103

Start backup at 2019-07-30_18:12:39
Binlog Index file is: /home/mysql/database/mariadb-103/binlog/binlog.index
Getting lock: /home/oli/fromdual_devel/brman/lck/binlog-logical-binlog.lock
Releasing lock: /home/oli/fromdual_devel/brman/lck/binlog-logical-binlog.lock
FLUSH /*!50503 BINARY */ LOGS

Copy /home/mysql/database/mariadb-103/binlog/binlog.000393 to /home/mysql/bck/mariadb-103/binlog/bck_binlog.000393
Binary log binlog.000393 begin datetime is: 2019-07-30 18:04:10 and end datetime is: 2019-07-30 18:07:34
Do MD5 checksum of /home/mysql/bck/mariadb-103/binlog/bck_binlog.000393
md5sum --binary /home/mysql/bck/mariadb-103/binlog/bck_binlog.000393
md5 = 54c59d6ec5f14d04176065f50634c7dd
/usr/bin/pigz -6 /home/mysql/bck/mariadb-103/binlog/bck_binlog.000393
PURGE BINARY LOGS TO 'binlog.000394'

Copy /home/mysql/database/mariadb-103/binlog/binlog.000394 to /home/mysql/bck/mariadb-103/binlog/bck_binlog.000394
Binary log binlog.000394 begin datetime is: 2019-07-30 18:07:34 and end datetime is: 2019-07-30 18:12:39
Do MD5 checksum of /home/mysql/bck/mariadb-103/binlog/bck_binlog.000394
md5sum --binary /home/mysql/bck/mariadb-103/binlog/bck_binlog.000394
md5 = 01538bee0e2de03eaa42b385d369387a
/usr/bin/pigz -6 /home/mysql/bck/mariadb-103/binlog/bck_binlog.000394
PURGE BINARY LOGS TO 'binlog.000395'
End backup at 2019-07-30 18:12:40 (rc=0)

ll ~/database/mariadb-103/binlog/ ~/bck/mariadb-103/binlog/ ; cat ~/database/mariadb-103/binlog/binlog.index

/home/mysql/bck/mariadb-103/binlog/:
total 7152
rw-rr- 1 mysql mysql 779746 Jul 30 18:02 bck_binlog.000390.gz
rw-rr- 1 mysql mysql 424587 Jul 30 18:03 bck_binlog.000391.gz
rw-rr- 1 mysql mysql 388928 Jul 30 18:04 bck_binlog.000392.gz
rw-rr- 1 mysql mysql 2330609 Jul 30 18:07 bck_binlog.000393.gz
rw-rr- 1 mysql mysql 3390627 Jul 30 18:12 bck_binlog.000394.gz

/home/mysql/database/mariadb-103/binlog/:
total 540
rw-rw--- 1 mysql mysql 542899 Jul 30 18:12 binlog.000395
rw-rw--- 1 mysql mysql 54 Jul 30 18:12 binlog.index
rw-rw--- 1 mysql mysql 12 Jul 25 09:56 binlog.state

/home/mysql/database/mariadb-103/binlog/binlog.000395

Bad case:

bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103

Start backup at 2019-07-30_18:12:56
Binlog Index file is: /home/mysql/database/mariadb-103/binlog/binlog.index
Getting lock: /home/oli/fromdual_devel/brman/lck/binlog-logical-binlog.lock
Releasing lock: /home/oli/fromdual_devel/brman/lck/binlog-logical-binlog.lock
FLUSH /*!50503 BINARY */ LOGS

Copy /home/mysql/database/mariadb-103/binlog/binlog.000395 to /home/mysql/bck/mariadb-103/binlog/bck_binlog.000395
Binary log binlog.000395 begin datetime is: 2019-07-30 18:12:39 and end datetime is: 2019-07-30 18:12:56
Do MD5 checksum of /home/mysql/bck/mariadb-103/binlog/bck_binlog.000395
md5sum --binary /home/mysql/bck/mariadb-103/binlog/bck_binlog.000395
md5 = 6b83ea44eb4a86cff21aa0e68d78e7d0
/usr/bin/pigz -6 /home/mysql/bck/mariadb-103/binlog/bck_binlog.000395
PURGE BINARY LOGS TO 'binlog.000396'
End backup at 2019-07-30 18:12:56 (rc=0)

ll ~/database/mariadb-103/binlog/ ~/bck/mariadb-103/binlog/ ; cat ~/database/mariadb-103/binlog/binlog.index

/home/mysql/bck/mariadb-103/binlog/:
total 7344
rw-rr- 1 mysql mysql 779746 Jul 30 18:02 bck_binlog.000390.gz
rw-rr- 1 mysql mysql 424587 Jul 30 18:03 bck_binlog.000391.gz
rw-rr- 1 mysql mysql 388928 Jul 30 18:04 bck_binlog.000392.gz
rw-rr- 1 mysql mysql 2330609 Jul 30 18:07 bck_binlog.000393.gz
rw-rr- 1 mysql mysql 3390627 Jul 30 18:12 bck_binlog.000394.gz
rw-rr- 1 mysql mysql 195160 Jul 30 18:12 bck_binlog.000395.gz

/home/mysql/database/mariadb-103/binlog/:
total 1260
rw-rw--- 1 mysql mysql 1116483 Jul 30 18:12 binlog.000395
rw-rw--- 1 mysql mysql 161749 Jul 30 18:12 binlog.000396
rw-rw--- 1 mysql mysql 108 Jul 30 18:12 binlog.index
rw-rw--- 1 mysql mysql 12 Jul 25 09:56 binlog.state

/home/mysql/database/mariadb-103/binlog/binlog.000395
/home/mysql/database/mariadb-103/binlog/binlog.000396

This makes behaviour completely unpredictable for more sophisticated backup software and follow-up backup jobs will fail.



 Comments   
Comment by Andrei Elkin [ 2021-06-08 ]

Might relate to MDEV-20215.

Comment by Brandon Nesterenko [ 2021-09-02 ]

Hi oli!

If you are still following this issue, I have a question. I see in the brman source code in lib/doBinlogBackup.inc, before the PURGE BINARY LOGS TO call, there is a comment.
// Be carefull with this step if you have slaves!

Does your environment have any slaves when you encounter this issue? In particular, if an old binlog is being read/used anywhere when the PURGE is issued, it will be ignored in the purge. A few examples of this are:
1. A slave is reading the binlog from the master during the same time period that the PURGE is issued
2. mysql-binlog, invoked with --read-from-remote-server, is active while the PURGE is issued
3. The SHOW BINLOG EVENTS command is running while the purge is issued

Given your observation that this issue occurs with short purge intervals, if something is reading an old binlog file during the purge, I could see this issue happening. I have a proposed fix, but would first like to confirm the cause of the issue.

Thanks!

Comment by Oli Sennhauser [ 2021-09-13 ]

Hello Brandon

Sorry for my late replay but I was in vacation...

The comment is more about that one should not purge binary logs if you have slaves and they did not read/apply those binary logs yet...

That was a complete testing environment. Typically mariadb-103 is a stand-alone instance. Otherwise I would call it something like mariadb-103m or so. But I cannot guarantee 100% for this... So most probably NO slave.
--read-from-remote-server I did NOT in this case.
SHOW BINLOG EVENTS I did NOT run and I cannot find it anywhere in the code either.

According to my notes this should simulate the problem: "This typically happens when only one binary log is there (little traffic or short purge interval)."

So I can try again, if I still can reproduce the issue without our brman...

BR,
Oli

Comment by Brandon Nesterenko [ 2021-09-21 ]

Hi Oli!

Thanks for the reply! I have a thought, but would like to ask a few follow-up questions first:

1) In your test environment, do you remember how the data was being ingested? I saw that the utl folder in the brman source code has some scripts that create/populate tables. Do you remember if you used one of those? (to aid in my own attempts to correctly reproduce your behavior).

2) Do you have information on the cfg file and options used for running your MariaDB server instance?

3) If you could reproduce this, that would be great. Could you also attach the last two binlogs, e.g. binlog.000395 and binlog.000396, to aid in my analysis? Additionally if you can only reproduce this using brman, could you additionally attach the backup version of the problem binlog, e.g. bck_binlog.000395.gz?

Thanks!

Comment by Oli Sennhauser [ 2021-09-22 ]

Hello Brandon

OK. It tried to reproduce it. But without success! There were to many things changing since 2019: Machine?, Ubuntu version (16.04 vs 18.04), MariaDB version 10.3.16 vs. 10.3.20, brman version, config, etc...

  • Load: I am not sure any more but typically I use our simple insert_test.sh script from myenv/utl. It just does INSERTs in a loop.
  • Then I wrote a script to simulate brman (attached). But it did not provoke the problem
  • Then I used brman instead (newest version). But it did not provoke the problem either...
  • Then I added a 10.3 Slave to it. But it also did not provoke the problem.

while [ 1 ] ; do bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 | grep 'End backup' ; sleep 60 ; done
End backup at 2021-09-22 10:03:00 (rc=0)
End backup at 2021-09-22 10:04:00 (rc=0)
End backup at 2021-09-22 10:05:00 (rc=0)
End backup at 2021-09-22 10:06:00 (rc=0)
End backup at 2021-09-22 10:07:01 (rc=0)
End backup at 2021-09-22 10:08:01 (rc=0)
End backup at 2021-09-22 10:09:01 (rc=0)
End backup at 2021-09-22 10:10:01 (rc=0)
End backup at 2021-09-22 10:11:02 (rc=0)
End backup at 2021-09-22 10:12:02 (rc=0)
End backup at 2021-09-22 10:13:02 (rc=0)
End backup at 2021-09-22 10:14:02 (rc=0)
End backup at 2021-09-22 10:15:03 (rc=0)
End backup at 2021-09-22 10:16:03 (rc=0)
End backup at 2021-09-22 10:17:03 (rc=0)
End backup at 2021-09-22 10:18:03 (rc=0)
End backup at 2021-09-22 10:19:04 (rc=0)
End backup at 2021-09-22 10:20:04 (rc=0)
End backup at 2021-09-22 10:21:04 (rc=0)

  • Config I do not know any more. This is a testing system and the config there is very dynamic.

Do you have an idea what I shall try to provoke the problem? I will keep the system in the current state for a while...

Regards,
Oli

Comment by Oli Sennhauser [ 2021-09-22 ]

Before the interval was 60 seconds. But if I do the interval to random I get an error...

while [ 1 ] ; do bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 | grep 'End backup' ; sleep $(echo $(($RANDOM % 60))) ; done
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000510.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-22 10:29:12 (rc=0)
End backup at 2021-09-22 10:29:19 (rc=0)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000512.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-22 10:29:23 (rc=0)
End backup at 2021-09-22 10:29:26 (rc=0)
End backup at 2021-09-22 10:29:55 (rc=0)

I try to narrow down the problem.

Comment by Oli Sennhauser [ 2021-09-22 ]

It happens randomly:

End backup at 2021-09-22 10:35:53 (rc=0)
End backup at 2021-09-22 10:36:04 (rc=0)
End backup at 2021-09-22 10:36:06 (rc=0)
End backup at 2021-09-22 10:36:12 (rc=0)
End backup at 2021-09-22 10:36:30 (rc=0)
End backup at 2021-09-22 10:36:34 (rc=0)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000537.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-22 10:36:40 (rc=0)
End backup at 2021-09-22 10:37:00 (rc=0)
End backup at 2021-09-22 10:37:05 (rc=0)
End backup at 2021-09-22 10:37:11 (rc=0)
End backup at 2021-09-22 10:37:11 (rc=0)

Comment by Oli Sennhauser [ 2021-09-22 ]

I just found, that we optimized the code yet in new brman releases:
$ret = compressFile($aOptions, $dst);
if ( (4178 == $ret) && (1 == $lBinlogCount) )

{ // This step may fail in certain cases on MariaDB 10.3 because the // PURGE BINARY LOGS does not currently purge the file. So in the next // run we copy/compress a file which is already in the backup // location... // TODO: possibly we can do it more safe by checking if binary log still exists in the server? // runRemote as well (ssh)? unlink($dst); }
Comment by Oli Sennhauser [ 2021-09-22 ]

Fix was introduced in v2.2.1

Comment by Oli Sennhauser [ 2021-09-22 ]

With v2.2.0 it looks better:
End backup at 2021-09-22 11:06:00 (rc=0)
End backup at 2021-09-22 11:06:08 (rc=0)
End backup at 2021-09-22 11:06:11 (rc=0)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000605.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-22 11:06:20 (rc=4178)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000605.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-22 11:06:36 (rc=4178)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000605.gz already exists and you did not specify --backup-overwrite. (rc=4178)

Comment by Oli Sennhauser [ 2021-09-22 ]

Files attached...

Comment by Oli Sennhauser [ 2021-09-22 ]

So I can break it now regularly but not at will...

Comment by Brandon Nesterenko [ 2021-09-22 ]

Thanks for your effort into reproducing this error! For clarification, in an earlier comment you mentioned that you added a slave into your testing environment. Did you stop/remove it afterwards? That is, was it running when you were able to reproduce the error?

Comment by Oli Sennhauser [ 2021-09-23 ]

Slave WAS running when I reproduced the problem. Running it again now with a shutdown slave.

Comment by Oli Sennhauser [ 2021-09-23 ]

I tried it again with and without a Slave. The results lead me to the conclusion that it is about 95% sure that it is the Slave:

Without a Slave:

mysql@chef:~ [mariadb-103, 3318]> while [ 1 ] ; do /home/mysql/product/brman-2.2.0/bin/bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 | grep 'End backup' ; sleep $(echo $(($RANDOM % 20))) ; done
End backup at 2021-09-23 08:07:05 (rc=0)
... 79 times the same result
End backup at 2021-09-23 08:18:39 (rc=0)

Starting the Slave:

mysql@chef:~ [mariadb-103, 3318]> while [ 1 ] ; do /home/mysql/product/brman-2.2.0/bin/bman --target=brman:secret@127.0.0.1:3318 --type=binlog --policy=binlog --mode=logical --backupdir=/home/mysql/bck/mariadb-103 | grep 'End backup' ; sleep $(echo $(($RANDOM % 20))) ; done
End backup at 2021-09-23 08:19:03 (rc=0)
... 3 times the same result
End backup at 2021-09-23 08:19:34 (rc=0)
End backup at 2021-09-23 08:19:39 (rc=0)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000774.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-23 08:19:56 (rc=4178)

Removed the problematic file. Then:

End backup at 2021-09-23 08:20:40 (rc=0)
... 24 times the same result
End backup at 2021-09-23 08:24:52 (rc=0)
End backup at 2021-09-23 08:24:52 (rc=0)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000804.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-23 08:25:03 (rc=4178)

Removed the problematic file. Then:

End backup at 2021-09-23 08:26:50 (rc=0)
... 56 times the same result
End backup at 2021-09-23 08:36:25 (rc=0)
End backup at 2021-09-23 08:36:30 (rc=0)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.000872.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-23 08:36:40 (rc=4178)

Stop slave. Then:

End backup at 2021-09-23 08:38:02 (rc=0)
... 270 times the same result
End backup at 2021-09-23 09:20:43 (rc=0)

Comment by Oli Sennhauser [ 2021-09-23 ]

It was running now for 6 hours without a problem... If we calculate every 30 seconds a run in average this makes about 2400 runs. But then it broke again! WITHOUT a Slave attached...

...
End backup at 2021-09-23 15:18:40 (rc=0)
End backup at 2021-09-23 15:18:47 (rc=0)
ERROR: Zipped file /home/mysql/bck/mariadb-103/binlog/bck_binlog.003351.gz already exists and you did not specify --backup-overwrite. (rc=4178)
End backup at 2021-09-23 15:18:51 (rc=4178)

So it looks like this happens more probable WITH Slave. But it also happens rarely WITHOUT Slave... Does this make sense?

Comment by Brandon Nesterenko [ 2021-09-23 ]

Interesting! Can you attach the problematic binary log file (live from the base directory), its gzipped backup version (assuming it was not overwritten?), and the next binary log file? So that would be binlog.003351, bck_binlog.003351.gz, and binlog.003352.

Edit: The previous binlog may be helpful too, binlog.003350

Comment by Brandon Nesterenko [ 2021-09-24 ]

Hmm, my hunch didn't pan out.

Thanks for attaching the binlog files. Can you also attach the applicable server/error logs?

Comment by Oli Sennhauser [ 2021-09-27 ]

Uploaded.

Generated at Thu Feb 08 08:57:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.