Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-29981

Replica stops with "Found invalid event in binary log"

Details

    • Bug
    • Status: Confirmed (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.4.25, 10.6.9
    • 10.6
    • Replication
    • None

    Description

      I got a strange behaviour with replication
      Sometimes (I can't reproduce ) the replica stops IO_thread with next error in the log

      2022-11-08 13:56:47 87893133 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
      2022-11-08 13:56:47 87893133 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
      2022-11-08 13:56:47 87893133 [Note] Slave I/O thread exiting, read up to log 'mysqld-bin.202006', position 4; GTID position 11-11-48058108793
      

      The errors always appear at position 4 , at the start of binlog.
      The queries in 11-11-48058108793 and previous statement looks as correct insert/update requests

      The replication begins to work after restart replications `stop slave; start slave`

      I've attached config file

      Attachments

        Issue Links

          Activity

            R Anton added a comment - - edited

            Hi Elkin

            Sorry for long silence

            I've just switched master at the previous week (waited approve for a long time)
            New master has 10.4.27 version and swap is turned off
            On previous master I've resetup OS (centos8) , updated to 10.4.27 and turned off swap
            Now it's working as replica .

            I'll provide information if the error with replication appear again
            And anyway I'll give a status in 1-2 weeks

            R Anton added a comment - - edited Hi Elkin Sorry for long silence I've just switched master at the previous week (waited approve for a long time) New master has 10.4.27 version and swap is turned off On previous master I've resetup OS (centos8) , updated to 10.4.27 and turned off swap Now it's working as replica . I'll provide information if the error with replication appear again And anyway I'll give a status in 1-2 weeks
            R Anton added a comment -

            Hello Elkin

            The problem didn't reproduce for more than 3 weeks
            I don't have proper answer what reason of the error was
            Maybe it was related with 10.4.25 version , maybe swap, network or something else

            I think the current ticket can be closed
            If you need any additional information/logs just let me know

            R Anton added a comment - Hello Elkin The problem didn't reproduce for more than 3 weeks I don't have proper answer what reason of the error was Maybe it was related with 10.4.25 version , maybe swap, network or something else I think the current ticket can be closed If you need any additional information/logs just let me know
            Elkin Andrei Elkin added a comment -

            Thank you R! With my teammate bnestere we have had a plan to print replication event dumps into the server error log near binlog rotation, with log_warnings > 2. I think we'd close the ticket when it's done.

            Elkin Andrei Elkin added a comment - Thank you R ! With my teammate bnestere we have had a plan to print replication event dumps into the server error log near binlog rotation, with log_warnings > 2 . I think we'd close the ticket when it's done.
            bnestere Brandon Nesterenko added a comment - - edited

            Elkin I'm able to reproduce this, and indeed this only happens at rotation.

            When the server rotates logs, the slave sees this and initializes its Format_description_log_event with binlog version 3 (this is hard-coded), see the following code snippet from process_io_rotate in 10.6:

              if (mi->rli.relay_log.description_event_for_queue->binlog_version >= 4)
              {
                DBUG_ASSERT(mi->rli.relay_log.description_event_for_queue->checksum_alg ==
                            mi->rli.relay_log.relay_log_checksum_alg);
                
                delete mi->rli.relay_log.description_event_for_queue;
                /* start from format 3 (MySQL 4.0) again */
                mi->rli.relay_log.description_event_for_queue= new
                  Format_description_log_event(3);
                mi->rli.relay_log.description_event_for_queue->checksum_alg=
                  mi->rli.relay_log.relay_log_checksum_alg;    
              }
            

            So immediately after rotation (and until a new Format_descriptor with binlog_format 4 is sent from the master), the IO thread is expecting binlog_format 3. This invalidates any events that are sent with an event type higher than 14. In theory, we wouldn't expect any events to be sent in between a rotate and the next format descriptor log event, but if a long enough period of time passes between then, the primary will generate and send a Heartbeat event (of type 27). In such case, the slave will see the heartbeat event of type 27, see it is higher than 14, and results in the 'Found invalid event in binary log' error, with the expected log coordinates of the new log (which is optimistically populated from the Rotate log event, not the new event). For example, from my testing:

            2024-10-16  9:47:02 10 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
            2024-10-16  9:47:02 10 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
            2024-10-16  9:47:02 10 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4, master 127.0.0.1:19000
            

            Note that after legacy binlog code was removed (I think starting in version 11.X), the error changes to an invalid heartbeat event.

            2024-10-16  9:42:45 6 [ERROR] Slave I/O: Unexpected master's heartbeat data: heartbeat is not compatible with local info;the event's data: log_file_name master-bin.000001 log_pos 377, Internal MariaDB error code: 1623
            2024-10-16  9:42:45 6 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
            2024-10-16  9:42:45 6 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4; GTID position , master 127.0.0.1:19000
            

            bnestere Brandon Nesterenko added a comment - - edited Elkin I'm able to reproduce this, and indeed this only happens at rotation. When the server rotates logs, the slave sees this and initializes its Format_description_log_event with binlog version 3 (this is hard-coded), see the following code snippet from process_io_rotate in 10.6: if (mi->rli.relay_log.description_event_for_queue->binlog_version >= 4) { DBUG_ASSERT(mi->rli.relay_log.description_event_for_queue->checksum_alg == mi->rli.relay_log.relay_log_checksum_alg); delete mi->rli.relay_log.description_event_for_queue; /* start from format 3 (MySQL 4.0) again */ mi->rli.relay_log.description_event_for_queue= new Format_description_log_event(3); mi->rli.relay_log.description_event_for_queue->checksum_alg= mi->rli.relay_log.relay_log_checksum_alg; } So immediately after rotation (and until a new Format_descriptor with binlog_format 4 is sent from the master), the IO thread is expecting binlog_format 3. This invalidates any events that are sent with an event type higher than 14. In theory, we wouldn't expect any events to be sent in between a rotate and the next format descriptor log event, but if a long enough period of time passes between then, the primary will generate and send a Heartbeat event (of type 27). In such case, the slave will see the heartbeat event of type 27, see it is higher than 14, and results in the 'Found invalid event in binary log' error, with the expected log coordinates of the new log (which is optimistically populated from the Rotate log event, not the new event). For example, from my testing: 2024-10-16 9:47:02 10 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug 2024-10-16 9:47:02 10 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595 2024-10-16 9:47:02 10 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4, master 127.0.0.1:19000 Note that after legacy binlog code was removed (I think starting in version 11.X), the error changes to an invalid heartbeat event. 2024-10-16 9:42:45 6 [ERROR] Slave I/O: Unexpected master's heartbeat data: heartbeat is not compatible with local info;the event's data: log_file_name master-bin.000001 log_pos 377, Internal MariaDB error code: 1623 2024-10-16 9:42:45 6 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595 2024-10-16 9:42:45 6 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4; GTID position , master 127.0.0.1:19000
            Elkin Andrei Elkin added a comment -

            bnestere, you seem to have nailed it. Awesome! Many thanks!

            Elkin Andrei Elkin added a comment - bnestere , you seem to have nailed it. Awesome! Many thanks!

            People

              bnestere Brandon Nesterenko
              R Anton
              Votes:
              2 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.