Details
-
Bug
-
Status: Confirmed (View Workflow)
-
Major
-
Resolution: Unresolved
-
10.4.25, 10.6.9
-
None
Description
I got a strange behaviour with replication
Sometimes (I can't reproduce ) the replica stops IO_thread with next error in the log
2022-11-08 13:56:47 87893133 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug |
2022-11-08 13:56:47 87893133 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595 |
2022-11-08 13:56:47 87893133 [Note] Slave I/O thread exiting, read up to log 'mysqld-bin.202006', position 4; GTID position 11-11-48058108793 |
The errors always appear at position 4 , at the start of binlog.
The queries in 11-11-48058108793 and previous statement looks as correct insert/update requests
The replication begins to work after restart replications `stop slave; start slave`
I've attached config file
Attachments
- CS0474981_maxscale.log
- 231 kB
- server.cfg
- 2 kB
Issue Links
- is duplicated by
-
MDEV-16900 [Draft] Read invalid event from master: 'Found invalid event in binary log'
-
- Closed
-
- relates to
-
MDEV-16146 MariaDB slave stops with incompatible heartbeat
-
- Closed
-
Activity
adding a case with the same issue on 10.4.24-15-MariaDB-enterprise-log
R, it'd be very helpful to see SHOW-SLAVE-STATUS log if you still have one for sharing. Thank you.
uploaded 'MDEV-29981_logs.tgz' to the ftp server with the error logs of both primary and replica, the show global variables of both and the 2 binlogs involved in the issue
Hello Angelique, Andrei
I've upload files your private ftp ftp://ftp.mariadb.org/private/
I need to notice that I've cropped binary logs and renamed personal/private information
MDEV-29981.master.vars
|
MDEV-29981.mysqld-bin.202005.cutted
|
MDEV-29981.mysqld-bin.202006.cutted
|
MDEV-29981.mysqld_master.log
|
MDEV-29981.mysqld_replica.log
|
MDEV-29981.replica.vars
|
MDEV-29981.slave_status.log
|
R , thanks for uploads!
Show-slave-status is not from the time of the failure. We still need the one that displays the error from the error log.
I also see traces of using MXS, which asks further questions about
Its configuration and its error log.
Could you also share those (feel free to attach it to the MDEV instead of ftp)?
After having found no ways to get to the slave error we need MXS analysis of these two events:
2022-11-07 16:02:44 notice : Server changed state: cmkprod02-mdb1[10.0.3.120:3306]: master_down. [Master, Running] ->
|
2022-11-07 16:02:49 18635141 [ERROR] Slave I/O: Relay log write failure: could not queue event from master
|
R, your MXS log might have a similar failover message just before the slave error one. I am looking for confirmation/denial.
I can't dismiss yet that something is inserted into slave.
Hi Elkin
At 11.11 evening I've got the same error again
I've uploader logs and configs to ftp MDEV-29981_20221111_v2.zip
I din't find something interesting in MSX logs - when replication (IO_thread) is broken , then MSX just marked this host as failed and connections to this host are stoped
But I notice that the problem always appear with heavy transaction (I mean long text transaction) in the end of the binlog .
(The transactions themselves are correct and when they are located in the middle of the binlog all fine)
The size of the binlog with heavy transaction in the end a little bit bigger then other (mysqld-bin.204621)
And replication stops at the moment of reading next binlog (mysqld-bin.204622)
Looks like replica tried to read file with unexpected size and failed
-rw-rw---- 1 mysql mysql 1073779380 Nov 11 17:42 /var/lib/mysql/mysqld-bin.204618 |
-rw-rw---- 1 mysql mysql 1074212072 Nov 11 17:43 /var/lib/mysql/mysqld-bin.204619 |
-rw-rw---- 1 mysql mysql 1073991721 Nov 11 17:45 /var/lib/mysql/mysqld-bin.204620 |
-rw-rw---- 1 mysql mysql 1075319895 Nov 11 17:46 /var/lib/mysql/mysqld-bin.204621 |
-rw-rw---- 1 mysql mysql 1073754302 Nov 11 17:48 /var/lib/mysql/mysqld-bin.204622 |
-rw-rw---- 1 mysql mysql 1073754552 Nov 11 17:49 /var/lib/mysql/mysqld-bin.204623 |
R, thank you for the reply! I am analyzing it deeper now.
It looks like slave receives an unexpected packet following the Rotate event from the "OLD" binlog.
The next event must be a Format_description_log_event but the actual network packet brings in a different one
having in its 5th byte the value 1. That maps to a very old event type which modern masters do not generate so I have reasons to strongly doubt the packet was sent by the master server.
In this regard, could you please also look into the master and slave servers' syslogs to possibly identify anything unusual at the failure time?
I also notice your log_warnings is of the default value 2, and since you're "lucky" to summon the failure, may I ask you to raise it to at least 3, on both master and slave?
It'd be useful to try reproducing the failure in a pure no-MXS master-slave. Sure makes sense if you can afford that and the failure occurs rather reliably. If it's reproducible reliably but MXS can't be touched, I'd love to see network dump.
R, and one request for you (I am copying from markus makela who asked it elsewhere),
We need to know the MaxScale version that this happens with as well as the configuration used with it.
The output of `maxctrl create report mxs-report.txt` would also be of use: run the command and upload the
`mxs-report.txt` file it creates.
R, could you please also consider to set master_verify_checksum = 1 which activates checksum verification on master, which of course would slow it a bit. This may help to clear out doubts about what the master actually sends.
Hello Elkin
At the weekend I got the same error with stop replication , with the same events in the logs
but on this time last transaction in binlog before fail was not so long as previous . It was usual update. And logsize had the same size as binlogs near him.
So my assumption is not confirmed in this time .
I've upload MDEV-29981-mxs-report.txt to ftp
I've changed log warning to 3 , with 4 are generated a log of Aborted connection warnings (This connection closed normally)
In syslog/dmesg I didn't find interesting related events , I'll provide them when error appear again
R, hello. Did you observe by any chance this one?
Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event;
|
Hi Elkin ,
I've never seen this error in the logs `bogus data in log event`
But I often see the next error in the maxscale logs
2022-11-04 13:06:28 warning: [mariadbmon] Error during monitor update of server 'sc-db-06': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'. |
2022-11-04 13:24:29 warning: [mariadbmon] Error during monitor update of server 'sc-db-13': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'. |
I'm not sure , but I can assume that this can be related with
https://jira.mariadb.org/browse/MDEV-21087
I'm waiting approve for update master to 4.26 version
Thanks for the reply, R. MDEV-21087 is certainly out of concern here. I am considering a corrupted replication event created which would be caught with {{ set @@global.master_verify_checksum = 1}} (on master).
I am all ears how it would proceed should you have upgraded.
Also I'd like to know whether your server may experience out of swap space.
It's hard to suspect networking issue, nevertheless could you please look after that too.
Hello Elkin
Sorry for delay with answer
I did next changes few weeks ago
global.master_verify_checksum: 0 -> 1
global.slave_net_timeout: 10 -> 60
This actions are reduced count of issues with stop replica . I got this error only one time at last week . But the error still exists
Yes, I had swap on master , and I'm going to turn off it . On replicas I've already switched off
At this moment replica have 10.4.26 and master is 10.4.25
Thank you R!
I am not sure how would global.slave_net_timeout increase decrease the error rate. And even more to the checksum verification.
To my swap question, it was about weather you might observe correlation of a very active swap usage and the error appearance. We've been investigating potentially related MDEV-28986.
All the best to you on the upcoming holidays !
Hi Elkin
Sorry for long silence
I've just switched master at the previous week (waited approve for a long time)
New master has 10.4.27 version and swap is turned off
On previous master I've resetup OS (centos8) , updated to 10.4.27 and turned off swap
Now it's working as replica .
I'll provide information if the error with replication appear again
And anyway I'll give a status in 1-2 weeks
Hello Elkin
The problem didn't reproduce for more than 3 weeks
I don't have proper answer what reason of the error was
Maybe it was related with 10.4.25 version , maybe swap, network or something else
I think the current ticket can be closed
If you need any additional information/logs just let me know
Thank you R! With my teammate bnestere we have had a plan to print replication event dumps into the server error log near binlog rotation, with log_warnings > 2. I think we'd close the ticket when it's done.
Elkin I'm able to reproduce this, and indeed this only happens at rotation.
When the server rotates logs, the slave sees this and initializes its Format_description_log_event with binlog version 3 (this is hard-coded), see the following code snippet from process_io_rotate in 10.6:
if (mi->rli.relay_log.description_event_for_queue->binlog_version >= 4)
|
{
|
DBUG_ASSERT(mi->rli.relay_log.description_event_for_queue->checksum_alg ==
|
mi->rli.relay_log.relay_log_checksum_alg);
|
|
delete mi->rli.relay_log.description_event_for_queue;
|
/* start from format 3 (MySQL 4.0) again */
|
mi->rli.relay_log.description_event_for_queue= new
|
Format_description_log_event(3);
|
mi->rli.relay_log.description_event_for_queue->checksum_alg=
|
mi->rli.relay_log.relay_log_checksum_alg;
|
}
|
So immediately after rotation (and until a new Format_descriptor with binlog_format 4 is sent from the master), the IO thread is expecting binlog_format 3. This invalidates any events that are sent with an event type higher than 14. In theory, we wouldn't expect any events to be sent in between a rotate and the next format descriptor log event, but if a long enough period of time passes between then, the primary will generate and send a Heartbeat event (of type 27). In such case, the slave will see the heartbeat event of type 27, see it is higher than 14, and results in the 'Found invalid event in binary log' error, with the expected log coordinates of the new log (which is optimistically populated from the Rotate log event, not the new event). For example, from my testing:
2024-10-16 9:47:02 10 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
|
2024-10-16 9:47:02 10 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
|
2024-10-16 9:47:02 10 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4, master 127.0.0.1:19000
|
Note that after legacy binlog code was removed (I think starting in version 11.X), the error changes to an invalid heartbeat event.
2024-10-16 9:42:45 6 [ERROR] Slave I/O: Unexpected master's heartbeat data: heartbeat is not compatible with local info;the event's data: log_file_name master-bin.000001 log_pos 377, Internal MariaDB error code: 1623
|
2024-10-16 9:42:45 6 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
|
2024-10-16 9:42:45 6 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4; GTID position , master 127.0.0.1:19000
|
Hi R,
Could you please upload to the private FTP server:
Thank you