[MDEV-29981] Replica stops with "Found invalid event in binary log" - Jira

Details

Type: Bug
Status: Confirmed (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.4.25, 10.6.9
Fix Version/s: 10.6
Component/s: Replication
Labels:
None

Description

I got a strange behaviour with replication
Sometimes (I can't reproduce ) the replica stops IO_thread with next error in the log

2022-11-08 13:56:47 87893133 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug

2022-11-08 13:56:47 87893133 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595

2022-11-08 13:56:47 87893133 [Note] Slave I/O thread exiting, read up to log 'mysqld-bin.202006', position 4; GTID position 11-11-48058108793

The errors always appear at position 4 , at the start of binlog.
The queries in 11-11-48058108793 and previous statement looks as correct insert/update requests

The replication begins to work after restart replications `stop slave; start slave`

I've attached config file

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

CS0474981_maxscale.log
231 kB
2022-11-11 14:37
server.cfg
2 kB
2022-11-08 18:42

Issue Links

is duplicated by

MDEV-16900 [Draft] Read invalid event from master: 'Found invalid event in binary log'

Closed

relates to

MDEV-16146 MariaDB slave stops with incompatible heartbeat

Closed

Activity

Ascending order - Click to sort in descending order

View 19 older comments

Anton added a comment - 2023-02-20 08:32 - edited

Hi Elkin

Sorry for long silence

I've just switched master at the previous week (waited approve for a long time)
New master has 10.4.27 version and swap is turned off
On previous master I've resetup OS (centos8) , updated to 10.4.27 and turned off swap
Now it's working as replica .

I'll provide information if the error with replication appear again
And anyway I'll give a status in 1-2 weeks

Anton added a comment - 2023-02-20 08:32 - edited Hi Elkin Sorry for long silence I've just switched master at the previous week (waited approve for a long time) New master has 10.4.27 version and swap is turned off On previous master I've resetup OS (centos8) , updated to 10.4.27 and turned off swap Now it's working as replica . I'll provide information if the error with replication appear again And anyway I'll give a status in 1-2 weeks

Anton added a comment - 2023-03-15 09:17

Hello Elkin

The problem didn't reproduce for more than 3 weeks
I don't have proper answer what reason of the error was
Maybe it was related with 10.4.25 version , maybe swap, network or something else

I think the current ticket can be closed
If you need any additional information/logs just let me know

Anton added a comment - 2023-03-15 09:17 Hello Elkin The problem didn't reproduce for more than 3 weeks I don't have proper answer what reason of the error was Maybe it was related with 10.4.25 version , maybe swap, network or something else I think the current ticket can be closed If you need any additional information/logs just let me know

Andrei Elkin added a comment - 2023-03-16 12:40

Thank you R! With my teammate bnestere we have had a plan to print replication event dumps into the server error log near binlog rotation, with log_warnings > 2. I think we'd close the ticket when it's done.

Andrei Elkin added a comment - 2023-03-16 12:40 Thank you R ! With my teammate bnestere we have had a plan to print replication event dumps into the server error log near binlog rotation, with log_warnings > 2 . I think we'd close the ticket when it's done.

Brandon Nesterenko added a comment - 2024-10-16 16:21 - edited

Elkin I'm able to reproduce this, and indeed this only happens at rotation.

When the server rotates logs, the slave sees this and initializes its Format_description_log_event with binlog version 3 (this is hard-coded), see the following code snippet from process_io_rotate in 10.6:

  if (mi->rli.relay_log.description_event_for_queue->binlog_version >= 4)

    DBUG_ASSERT(mi->rli.relay_log.description_event_for_queue->checksum_alg ==

                mi->rli.relay_log.relay_log_checksum_alg);

    delete mi->rli.relay_log.description_event_for_queue;

    /* start from format 3 (MySQL 4.0) again */

    mi->rli.relay_log.description_event_for_queue= new

      Format_description_log_event(3);

    mi->rli.relay_log.description_event_for_queue->checksum_alg=

      mi->rli.relay_log.relay_log_checksum_alg;

So immediately after rotation (and until a new Format_descriptor with binlog_format 4 is sent from the master), the IO thread is expecting binlog_format 3. This invalidates any events that are sent with an event type higher than 14. In theory, we wouldn't expect any events to be sent in between a rotate and the next format descriptor log event, but if a long enough period of time passes between then, the primary will generate and send a Heartbeat event (of type 27). In such case, the slave will see the heartbeat event of type 27, see it is higher than 14, and results in the 'Found invalid event in binary log' error, with the expected log coordinates of the new log (which is optimistically populated from the Rotate log event, not the new event). For example, from my testing:

2024-10-16  9:47:02 10 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug

2024-10-16  9:47:02 10 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595

2024-10-16  9:47:02 10 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4, master 127.0.0.1:19000

Note that after legacy binlog code was removed (I think starting in version 11.X), the error changes to an invalid heartbeat event.

2024-10-16  9:42:45 6 [ERROR] Slave I/O: Unexpected master's heartbeat data: heartbeat is not compatible with local info;the event's data: log_file_name master-bin.000001 log_pos 377, Internal MariaDB error code: 1623

2024-10-16  9:42:45 6 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595

2024-10-16  9:42:45 6 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4; GTID position , master 127.0.0.1:19000

Brandon Nesterenko added a comment - 2024-10-16 16:21 - edited Elkin I'm able to reproduce this, and indeed this only happens at rotation. When the server rotates logs, the slave sees this and initializes its Format_description_log_event with binlog version 3 (this is hard-coded), see the following code snippet from process_io_rotate in 10.6: if (mi->rli.relay_log.description_event_for_queue->binlog_version >= 4) { DBUG_ASSERT(mi->rli.relay_log.description_event_for_queue->checksum_alg == mi->rli.relay_log.relay_log_checksum_alg); delete mi->rli.relay_log.description_event_for_queue; /* start from format 3 (MySQL 4.0) again */ mi->rli.relay_log.description_event_for_queue= new Format_description_log_event(3); mi->rli.relay_log.description_event_for_queue->checksum_alg= mi->rli.relay_log.relay_log_checksum_alg; } So immediately after rotation (and until a new Format_descriptor with binlog_format 4 is sent from the master), the IO thread is expecting binlog_format 3. This invalidates any events that are sent with an event type higher than 14. In theory, we wouldn't expect any events to be sent in between a rotate and the next format descriptor log event, but if a long enough period of time passes between then, the primary will generate and send a Heartbeat event (of type 27). In such case, the slave will see the heartbeat event of type 27, see it is higher than 14, and results in the 'Found invalid event in binary log' error, with the expected log coordinates of the new log (which is optimistically populated from the Rotate log event, not the new event). For example, from my testing: 2024-10-16 9:47:02 10 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug 2024-10-16 9:47:02 10 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595 2024-10-16 9:47:02 10 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4, master 127.0.0.1:19000 Note that after legacy binlog code was removed (I think starting in version 11.X), the error changes to an invalid heartbeat event. 2024-10-16 9:42:45 6 [ERROR] Slave I/O: Unexpected master's heartbeat data: heartbeat is not compatible with local info;the event's data: log_file_name master-bin.000001 log_pos 377, Internal MariaDB error code: 1623 2024-10-16 9:42:45 6 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595 2024-10-16 9:42:45 6 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4; GTID position , master 127.0.0.1:19000

Andrei Elkin added a comment - 2024-10-17 16:34

bnestere, you seem to have nailed it. Awesome! Many thanks!

Andrei Elkin added a comment - 2024-10-17 16:34 bnestere , you seem to have nailed it. Awesome! Many thanks!

MariaDB Server

Replica stops with "Found invalid event in binary log"

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration