[MDEV-29981] Replica stops with "Found invalid event in binary log" Created: 2022-11-08  Updated: 2023-03-16

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4.25, 10.6.9
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Anton Assignee: Andrei Elkin
Resolution: Unresolved Votes: 1
Labels: None

Attachments: CS0474981_maxscale.log (text file), server.cfg

 Description   

I'm seeing strange behaviour with replication.
Sometimes (I can't reproduce it on demand) the replica's IO thread stops with the following error in the log:

2022-11-08 13:56:47 87893133 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
2022-11-08 13:56:47 87893133 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
2022-11-08 13:56:47 87893133 [Note] Slave I/O thread exiting, read up to log 'mysqld-bin.202006', position 4; GTID position 11-11-48058108793

The errors always appear at position 4, i.e. at the start of a binlog.
The queries in GTID 11-11-48058108793 and the preceding statements look like correct insert/update requests.
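For context on why "position 4" means the very beginning of a file: binary log files start with a 4-byte magic sequence, so the first event always begins at offset 4. A minimal illustrative check (not MariaDB code):

```python
# Illustrative sketch: binlog files begin with the 4-byte magic
# b'\xfebin', so the first real event starts at offset 4 -- which is
# exactly the "position 4" reported in the error log above.

BINLOG_MAGIC = b"\xfebin"

def first_event_offset(header: bytes) -> int:
    """Return the offset of the first event, or raise if the file
    does not look like a binary log."""
    if header[:4] != BINLOG_MAGIC:
        raise ValueError("not a binary log (bad magic)")
    return 4

print(first_event_offset(b"\xfebin" + b"..."))  # -> 4
```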

Replication starts working again after restarting it with `stop slave; start slave`.

I've attached the config file.



 Comments   
Comment by Angelique Sklavounos (Inactive) [ 2022-11-09 ]

Hi R,

Could you please upload to the private FTP server:

  1. the error logs of both the primary and replica for when the replica stops
  2. the output of `show global variables` for both primary and replica
  3. binary logs mysqld-bin.202006 and mysqld-bin.202005

Thank you

Comment by Luigi [ 2022-11-09 ]

Adding a case with the same issue on 10.4.24-15-MariaDB-enterprise-log.

Comment by Andrei Elkin [ 2022-11-10 ]

R, it'd be very helpful to see the `SHOW SLAVE STATUS` output if you still have it to share. Thank you.

Comment by Luigi [ 2022-11-10 ]

Uploaded 'MDEV-29981_logs.tgz' to the FTP server with the error logs of both primary and replica, the `show global variables` output of both, and the two binlogs involved in the issue.

Comment by Anton [ 2022-11-10 ]

Hello Angelique, Andrei

I've uploaded the files to your private FTP server: ftp://ftp.mariadb.org/private/
Note that I've cropped the binary logs and renamed personal/private information:

MDEV-29981.master.vars
MDEV-29981.mysqld-bin.202005.cutted
MDEV-29981.mysqld-bin.202006.cutted
MDEV-29981.mysqld_master.log
MDEV-29981.mysqld_replica.log
MDEV-29981.replica.vars
MDEV-29981.slave_status.log

Comment by Anton [ 2022-11-10 ]

I've also uploaded everything as a single archive file:
MDEV-29981.zip

Comment by Andrei Elkin [ 2022-11-11 ]

R, thanks for the uploads!

The `SHOW SLAVE STATUS` output is not from the time of the failure; we still need one that displays the error from the error log.
I also see traces of MaxScale (MXS) usage, which raises further questions about its configuration and its error log.
Could you also share those (feel free to attach them to the MDEV instead of the FTP)?

Comment by Andrei Elkin [ 2022-11-11 ]

Having found no way to get to the root of the slave error, we need an MXS-side analysis of these two events:

2022-11-07 16:02:44   notice : Server changed state: cmkprod02-mdb1[10.0.3.120:3306]: master_down. [Master, Running] -> 

2022-11-07 16:02:49 18635141 [ERROR] Slave I/O: Relay log write failure: could not queue event from master

R, your MXS log might have a similar failover message just before the slave error. I am looking for a confirmation or denial.

I can't yet rule out that something is being injected into the event stream the slave receives.

Comment by Anton [ 2022-11-11 ]

Hi Elkin

On the evening of 11.11 I got the same error again.
I've uploaded the logs and configs to the FTP as MDEV-29981_20221111_v2.zip.
I didn't find anything interesting in the MXS logs: when replication (the IO thread) breaks, MXS just marks this host as failed and connections to it are stopped.

But I noticed that the problem always appears when there is a heavy transaction (I mean a long text transaction) at the end of the binlog.
(The transactions themselves are correct, and when they land in the middle of a binlog everything is fine.)

The binlog with the heavy transaction at its end is a little bigger than the others (mysqld-bin.204621),
and replication stops at the moment of reading the next binlog (mysqld-bin.204622).
It looks like the replica tried to read a file of unexpected size and failed.

-rw-rw---- 1 mysql mysql 1073779380 Nov 11 17:42 /var/lib/mysql/mysqld-bin.204618
-rw-rw---- 1 mysql mysql 1074212072 Nov 11 17:43 /var/lib/mysql/mysqld-bin.204619
-rw-rw---- 1 mysql mysql 1073991721 Nov 11 17:45 /var/lib/mysql/mysqld-bin.204620
-rw-rw---- 1 mysql mysql 1075319895 Nov 11 17:46 /var/lib/mysql/mysqld-bin.204621
-rw-rw---- 1 mysql mysql 1073754302 Nov 11 17:48 /var/lib/mysql/mysqld-bin.204622
-rw-rw---- 1 mysql mysql 1073754552 Nov 11 17:49 /var/lib/mysql/mysqld-bin.204623
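To make the size observation concrete, here is a small hypothetical helper (not part of any MariaDB tooling) that flags a binlog noticeably larger than the median of its neighbours, using the sizes listed above:

```python
# Hypothetical helper illustrating the size observation: flag any binlog
# whose size exceeds the median of the group by more than a tolerance.
# The sizes below are taken from the ls -l listing in this comment.
from statistics import median

sizes = {
    "mysqld-bin.204618": 1073779380,
    "mysqld-bin.204619": 1074212072,
    "mysqld-bin.204620": 1073991721,
    "mysqld-bin.204621": 1075319895,
    "mysqld-bin.204622": 1073754302,
    "mysqld-bin.204623": 1073754552,
}

def outliers(sizes, tolerance=1_000_000):
    """Return names of binlogs more than `tolerance` bytes above the
    median size of the whole group."""
    med = median(sizes.values())
    return [name for name, size in sizes.items() if size - med > tolerance]

print(outliers(sizes))  # -> ['mysqld-bin.204621']
```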

Comment by Anton [ 2022-11-11 ]

I also checked with version 10.4.26: the same problem.

Comment by Andrei Elkin [ 2022-11-14 ]

R, thank you for the reply! I am analyzing it more deeply now.
It looks like the slave receives an unexpected packet following the Rotate event from the "old" binlog.
The next event must be a Format_description_log_event, but the actual network packet brings in a different one,
with the value 1 in its 5th byte. That maps to a very old event type which modern masters do not generate, so I have strong reasons to doubt the packet was sent by the master server.
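For readers following along, the event type referred to here sits at a fixed offset in the common 19-byte event header. The sketch below is an illustrative decoder, not server code; the type codes used (1 = START_EVENT_V3, 4 = ROTATE_EVENT, 15 = FORMAT_DESCRIPTION_EVENT) come from the binlog format:

```python
# Illustrative sketch (not server code): the fixed v4 replication event
# header is timestamp(4) type(1) server_id(4) event_len(4) next_pos(4)
# flags(2), so the event type is the 5th byte (offset 4) of the packet.
import struct

EVENT_NAMES = {1: "START_EVENT_V3",            # ancient, pre-5.0 only
               4: "ROTATE_EVENT",
               15: "FORMAT_DESCRIPTION_EVENT"}

def decode_header(hdr: bytes) -> dict:
    """Unpack the 19-byte fixed header of a replication event."""
    ts, ev_type, server_id, ev_len, next_pos, flags = struct.unpack(
        "<IBIIIH", hdr[:19])
    return {"timestamp": ts,
            "type": EVENT_NAMES.get(ev_type, f"unknown({ev_type})"),
            "server_id": server_id, "length": ev_len,
            "next_pos": next_pos, "flags": flags}

# A packet with 1 in its 5th byte decodes to the obsolete START_EVENT_V3
# instead of the expected FORMAT_DESCRIPTION_EVENT:
bogus = struct.pack("<IBIIIH", 1667912207, 1, 11, 19, 4, 0)
print(decode_header(bogus)["type"])  # -> START_EVENT_V3
```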

In this regard, could you please also look into the master and slave servers' syslogs to possibly identify anything unusual at the failure time?

I also notice your `log_warnings` is at the default value of 2; since you're "lucky" enough to be able to summon the failure, may I ask you to raise it to at least 3 on both master and slave?

It'd be useful to try reproducing the failure in a pure master-slave setup without MXS; that surely makes sense if you can afford it and the failure occurs rather reliably. If it's reliably reproducible but MXS can't be touched, I'd love to see a network dump.

Comment by Andrei Elkin [ 2022-11-14 ]

R, and one more request for you (I am copying from markus makela, who asked it elsewhere):

We need to know the MaxScale version that this happens with as well as the configuration used with it.
The output of `maxctrl create report mxs-report.txt` would also be of use: run the command and upload the
`mxs-report.txt` file it creates.

Comment by Andrei Elkin [ 2022-11-14 ]

R, could you please also consider setting `master_verify_checksum = 1`, which activates checksum verification on the master (and of course slows it down a bit)? This may help clear up doubts about what the master actually sends.
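For background, with binlog_checksum=CRC32 each event carries a trailing 4-byte CRC-32 over the rest of the event, and master_verify_checksum makes the master re-verify events before sending them. A conceptual sketch of that verification (illustrative only, not server code):

```python
# Conceptual sketch of binlog event checksum verification: with
# binlog_checksum=CRC32, the last 4 bytes of each event are a
# little-endian CRC-32 (zlib polynomial) over the rest of the event.
import struct
import zlib

def event_checksum_ok(event: bytes) -> bool:
    """Recompute the CRC over the event body and compare with the
    stored trailing checksum."""
    body, (stored,) = event[:-4], struct.unpack("<I", event[-4:])
    return zlib.crc32(body) & 0xFFFFFFFF == stored

# Build a well-formed example event (19-byte header stub + payload):
body = b"\x00" * 19 + b"payload"
event = body + struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)
print(event_checksum_ok(event))               # -> True
print(event_checksum_ok(b"\xff" + event[1:]))  # corrupted body -> False
```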

Comment by Anton [ 2022-11-14 ]

Hello Elkin

Over the weekend I got the same replication-stopping error, with the same events in the logs,
but this time the last transaction in the binlog before the failure was not as long as before: it was an ordinary update, and the binlog was the same size as its neighbours.
So my assumption was not confirmed this time.

I've uploaded MDEV-29981-mxs-report.txt to the FTP.

I've changed `log_warnings` to 3; with 4, a lot of 'Aborted connection' warnings are generated (those connections closed normally).
In syslog/dmesg I didn't find any interesting related events; I'll provide them when the error appears again.

Comment by Andrei Elkin [ 2022-11-23 ]

R, hello. Did you by any chance observe this one?

 Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event;

Comment by Anton [ 2022-11-24 ]

Hi Elkin ,

I've never seen the error `bogus data in log event` in the logs.

But I often see the following warning in the MaxScale logs:

2022-11-04 13:06:28   warning: [mariadbmon] Error during monitor update of server 'sc-db-06': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'.
2022-11-04 13:24:29   warning: [mariadbmon] Error during monitor update of server 'sc-db-13': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'.

I'm not sure, but I assume this could be related to
https://jira.mariadb.org/browse/MDEV-21087

I'm waiting for approval to update the master to version 10.4.26.

Comment by Andrei Elkin [ 2022-11-25 ]

Thanks for the reply, R. MDEV-21087 is certainly of no concern here. I am considering the possibility of a corrupted replication event being created, which would be caught with `set @@global.master_verify_checksum = 1` (on the master).

I am all ears as to how things proceed once you have upgraded.
I'd also like to know whether your server may be running out of swap space.
It's hard to suspect a networking issue; nevertheless, could you please look into that too?

Comment by Anton [ 2022-12-19 ]

Hello Elkin
Sorry for the delayed answer.

I made the following changes a few weeks ago:
global.master_verify_checksum: 0 -> 1
global.slave_net_timeout: 10 -> 60

These changes reduced the number of replica stops; I got the error only once in the last week, but it still occurs.

Yes, I had swap on the master and I'm going to turn it off; on the replicas I've already switched it off.
At the moment the replica has 10.4.26 and the master 10.4.25.

Comment by Andrei Elkin [ 2022-12-19 ]

Thank you R!
I am not sure how increasing global.slave_net_timeout would decrease the error rate, and even less so the checksum verification.

As to my swap question, it was about whether you might observe a correlation between very active swap usage and the appearance of the error. We've been investigating the potentially related MDEV-28986.

All the best to you for the upcoming holidays!

Comment by Anton [ 2023-02-20 ]

Hi Elkin

Sorry for the long silence.

I switched over the master last week (I waited a long time for approval).
The new master is on version 10.4.27 and swap is turned off.
On the previous master I reinstalled the OS (CentOS 8), updated to 10.4.27, and turned off swap;
it is now working as a replica.

I'll provide information if the replication error appears again,
and in any case I'll give a status update in 1-2 weeks.

Comment by Anton [ 2023-03-15 ]

Hello Elkin

The problem has not reproduced for more than 3 weeks.
I don't have a definite answer as to what the reason for the error was;
maybe it was related to version 10.4.25, maybe to swap, the network, or something else.

I think the current ticket can be closed.
If you need any additional information/logs, just let me know.

Comment by Andrei Elkin [ 2023-03-16 ]

Thank you, R! My teammate bnestere and I have a plan to print replication event dumps into the server error log near binlog rotation, with log_warnings > 2. I think we'd close the ticket when that's done.

Generated at Thu Feb 08 10:12:44 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.