  MariaDB Server / MDEV-29981

Replica stops with "Found invalid event in binary log"

Details

    • Type: Bug
    • Status: Confirmed
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.4.25, 10.6.9
    • Fix Version/s: 10.6
    • Component/s: Replication
    • Labels: None

    Description

      I'm seeing strange behaviour with replication.
      Sometimes (I can't reproduce it on demand) the replica's IO thread stops with the following error in the log:

      2022-11-08 13:56:47 87893133 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
      2022-11-08 13:56:47 87893133 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
      2022-11-08 13:56:47 87893133 [Note] Slave I/O thread exiting, read up to log 'mysqld-bin.202006', position 4; GTID position 11-11-48058108793
      

      The errors always appear at position 4, i.e. at the very start of a binlog.
      The transaction at GTID 11-11-48058108793 and the preceding statements look like correct insert/update requests.
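
      One way to double-check what the primary actually wrote at the start of the binlog named in the error (a sketch; the file name is taken from the log above):

        -- Run on the primary; a healthy binlog starts with a Format_description event (binlog v4).
        SHOW BINLOG EVENTS IN 'mysqld-bin.202006' LIMIT 5;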

      Replication starts working again after restarting it with `stop slave; start slave`.

      I've attached the config file.

      Attachments

        Issue Links

          Activity

            angelique.sklavounos Angelique Sklavounos (Inactive) added a comment -

            Hi R,

            Could you please upload to the private FTP server:

            1. the error logs of both the primary and replica from when the replica stops
            2. the output of SHOW GLOBAL VARIABLES for both the primary and the replica
            3. the binary logs mysqld-bin.202006 and mysqld-bin.202005

            Thank you


            Bongini Luigi (Inactive) added a comment -

            Adding a case with the same issue on 10.4.24-15-MariaDB-enterprise-log.

            Elkin Andrei Elkin added a comment -

            R, it'd be very helpful to see the SHOW SLAVE STATUS output if you still have it to share. Thank you.
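
            For example, a minimal sketch of capturing it on the replica right after the IO thread stops:

              -- Last_IO_Errno / Last_IO_Error and Master_Log_File / Read_Master_Log_Pos
              -- are the interesting columns here.
              SHOW SLAVE STATUS\G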


            Bongini Luigi (Inactive) added a comment -

            Uploaded 'MDEV-29981_logs.tgz' to the FTP server with the error logs of both primary and replica, the SHOW GLOBAL VARIABLES output of both, and the two binlogs involved in the issue.

            R Anton added a comment -

            Hello Angelique, Andrei

            I've uploaded the files to your private FTP, ftp://ftp.mariadb.org/private/
            Note that I've cropped the binary logs and redacted personal/private information.

            MDEV-29981.master.vars
            MDEV-29981.mysqld-bin.202005.cutted
            MDEV-29981.mysqld-bin.202006.cutted
            MDEV-29981.mysqld_master.log
            MDEV-29981.mysqld_replica.log
            MDEV-29981.replica.vars
            MDEV-29981.slave_status.log
            

            R Anton added a comment -

            I've also uploaded everything as a single archive file:
            MDEV-29981.zip

            Elkin Andrei Elkin added a comment - - edited

            R, thanks for the uploads!

            The SHOW SLAVE STATUS output is not from the time of the failure; we still need one that displays the error seen in the error log.
            I also see traces of MaxScale (MXS) being used, which raises further questions about
            its configuration and its error log.
            Could you also share those (feel free to attach them to the MDEV instead of the FTP)?

            Elkin Andrei Elkin added a comment - - edited

            Since we have found no way to get to the bottom of the slave error yet, we need MaxScale-side analysis of these two events:

            2022-11-07 16:02:44   notice : Server changed state: cmkprod02-mdb1[10.0.3.120:3306]: master_down. [Master, Running] -> 
            

            2022-11-07 16:02:49 18635141 [ERROR] Slave I/O: Relay log write failure: could not queue event from master
            

            R, your MXS log might have a similar failover message just before the slave error one. I am looking for confirmation/denial.

            I can't yet dismiss the possibility that something is being inserted into the slave's stream.

            R Anton added a comment - - edited

            Hi Elkin

            On the evening of 11.11 I got the same error again.
            I've uploaded logs and configs to the FTP as MDEV-29981_20221111_v2.zip.
            I didn't find anything interesting in the MXS logs - when replication (the IO thread) breaks, MXS just marks this host as failed and connections to it are stopped.

            But I notice that the problem always appears when there is a heavy transaction (I mean a long text transaction) at the end of the binlog.
            (The transactions themselves are correct, and when they are located in the middle of a binlog everything is fine.)

            The binlog with the heavy transaction at the end is a little bigger than the others (mysqld-bin.204621),
            and replication stops at the moment of reading the next binlog (mysqld-bin.204622).
            It looks like the replica tried to read a file of an unexpected size and failed.

            -rw-rw---- 1 mysql mysql 1073779380 Nov 11 17:42 /var/lib/mysql/mysqld-bin.204618
            -rw-rw---- 1 mysql mysql 1074212072 Nov 11 17:43 /var/lib/mysql/mysqld-bin.204619
            -rw-rw---- 1 mysql mysql 1073991721 Nov 11 17:45 /var/lib/mysql/mysqld-bin.204620
            -rw-rw---- 1 mysql mysql 1075319895 Nov 11 17:46 /var/lib/mysql/mysqld-bin.204621
            -rw-rw---- 1 mysql mysql 1073754302 Nov 11 17:48 /var/lib/mysql/mysqld-bin.204622
            -rw-rw---- 1 mysql mysql 1073754552 Nov 11 17:49 /var/lib/mysql/mysqld-bin.204623
            

            R Anton added a comment -

            I also checked on version 10.4.26 - the same problem.

            Elkin Andrei Elkin added a comment - - edited

            R, thank you for the reply! I am analyzing it deeper now.
            It looks like the slave receives an unexpected packet following the Rotate event from the "old" binlog.
            The next event must be a Format_description_log_event, but the actual network packet brings in a different one,
            having the value 1 in its 5th byte. That maps to a very old event type which modern masters do not generate, so I have strong reasons to doubt the packet was sent by the master server.

            In this regard, could you please also look into the master and slave servers' syslogs to possibly identify anything unusual at the failure time?

            I also notice your log_warnings is at the default value of 2; since you're "lucky" enough to be able to summon the failure, may I ask you to raise it to at least 3 on both master and slave?
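
            A minimal sketch of that change (assuming the SUPER privilege; persist it in the server config as well so it survives a restart):

              -- Run on both the master and the slave:
              SET GLOBAL log_warnings = 3;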

            It'd also be useful to try reproducing the failure in a pure master-slave setup without MXS; that only makes sense if you can afford it and the failure occurs fairly reliably. If it's reliably reproducible but MXS can't be touched, I'd love to see a network dump.

            Elkin Andrei Elkin added a comment -

            R, one more request (I am copying from markus makela, who asked this elsewhere):

            We need to know the MaxScale version that this happens with as well as the configuration used with it.
            The output of `maxctrl create report mxs-report.txt` would also be of use: run the command and upload the
            `mxs-report.txt` file it creates.

            Elkin Andrei Elkin added a comment -

            R, could you please also consider setting master_verify_checksum = 1, which activates checksum verification on the master (and of course slows it down a bit)? This may help clear up doubts about what the master actually sends.
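
            A sketch of that change plus a quick sanity check (this assumes binlog checksums are enabled, which is the default CRC32):

              -- Run on the master:
              SET GLOBAL master_verify_checksum = 1;
              SHOW GLOBAL VARIABLES LIKE 'binlog_checksum';          -- expected: CRC32
              SHOW GLOBAL VARIABLES LIKE 'master_verify_checksum';   -- expected: ON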

            R Anton added a comment - - edited

            Hello Elkin

            Over the weekend I got the same replication-stopping error, with the same events in the logs,
            but this time the last transaction in the binlog before the failure was not as long as before - it was an ordinary update - and the binlog was the same size as its neighbours.
            So my assumption was not confirmed this time.

            I've uploaded MDEV-29981-mxs-report.txt to the FTP.

            I've changed log_warnings to 3; with 4, a lot of 'Aborted connection' warnings are generated (This connection closed normally).
            In syslog/dmesg I didn't find any interesting related events; I'll provide them when the error appears again.

            Elkin Andrei Elkin added a comment -

            R, hello. Did you observe by any chance this one?

             Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event;
            

            R Anton added a comment - - edited

            Hi Elkin ,

            I've never seen that error (`bogus data in log event`) in the logs.

            But I often see the following error in the MaxScale logs:

            2022-11-04 13:06:28   warning: [mariadbmon] Error during monitor update of server 'sc-db-06': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'.
            2022-11-04 13:24:29   warning: [mariadbmon] Error during monitor update of server 'sc-db-13': Query 'SHOW ALL SLAVES STATUS;' failed: 'Lost connection to server during query'.
            

            I'm not sure, but I suspect this could be related to
            https://jira.mariadb.org/browse/MDEV-21087

            I'm waiting for approval to update the master to version 10.4.26.

            Elkin Andrei Elkin added a comment - - edited

            Thanks for the reply, R. MDEV-21087 is certainly not a concern here. I am considering the possibility that a corrupted replication event is being created, which would be caught with `set @@global.master_verify_checksum = 1` (on the master).

            I am all ears as to how things proceed once you have upgraded.
            I'd also like to know whether your server may be running out of swap space.
            It's hard to suspect a networking issue, but nevertheless could you please look into that too?

            R Anton added a comment -

            Hello Elkin
            Sorry for the delayed answer.

            I made the following changes a few weeks ago:
            global.master_verify_checksum: 0 -> 1
            global.slave_net_timeout: 10 -> 60

            These changes reduced the number of replica-stop incidents: I got this error only once in the last week. But the error still occurs.

            Yes, I had swap enabled on the master, and I'm going to turn it off; on the replicas it is already switched off.
            At the moment the replica runs 10.4.26 and the master 10.4.25.

            Elkin Andrei Elkin added a comment -

            Thank you R!
            I am not sure how an increase of global.slave_net_timeout would decrease the error rate, and even less so how the checksum verification would.

            As to my swap question, it was about whether you might observe a correlation between very active swap usage and the appearance of the error. We've been investigating the potentially related MDEV-28986.

            All the best to you for the upcoming holidays!

            R Anton added a comment - - edited

            Hi Elkin

            Sorry for the long silence.

            I switched the master over last week (I had waited a long time for approval).
            The new master runs 10.4.27 and swap is turned off.
            On the previous master I reinstalled the OS (CentOS 8), updated to 10.4.27, and turned off swap;
            it now works as a replica.

            I'll provide more information if the replication error appears again,
            and in any case I'll give a status update in 1-2 weeks.

            R Anton added a comment -

            Hello Elkin

            The problem has not reproduced for more than 3 weeks.
            I don't have a definitive answer as to what caused the error;
            maybe it was related to version 10.4.25, maybe to swap, the network, or something else.

            I think this ticket can be closed.
            If you need any additional information/logs, just let me know.

            Elkin Andrei Elkin added a comment -

            Thank you, R! My teammate bnestere and I have a plan to print replication event dumps into the server error log around binlog rotation when log_warnings > 2. I think we'll close the ticket once that's done.

            bnestere Brandon Nesterenko added a comment - - edited

            Elkin I'm able to reproduce this, and indeed this only happens at rotation.

            When the primary rotates its logs, the slave sees this and re-initializes its Format_description_log_event with binlog version 3 (this is hard-coded); see the following code snippet from process_io_rotate in 10.6:

              if (mi->rli.relay_log.description_event_for_queue->binlog_version >= 4)
              {
                DBUG_ASSERT(mi->rli.relay_log.description_event_for_queue->checksum_alg ==
                            mi->rli.relay_log.relay_log_checksum_alg);
                
                delete mi->rli.relay_log.description_event_for_queue;
                /* start from format 3 (MySQL 4.0) again */
                mi->rli.relay_log.description_event_for_queue= new
                  Format_description_log_event(3);
                mi->rli.relay_log.description_event_for_queue->checksum_alg=
                  mi->rli.relay_log.relay_log_checksum_alg;    
              }
            

            So immediately after rotation (and until a new Format_description event with binlog version 4 is sent from the master), the IO thread expects binlog version 3. This invalidates any event sent with an event type higher than 14. In theory, we wouldn't expect any events to be sent between a Rotate and the next Format_description log event, but if a long enough period of time passes in between, the primary will generate and send a Heartbeat event (of type 27). In that case, the slave sees the heartbeat event of type 27, sees that it is higher than 14, and produces the 'Found invalid event in binary log' error, with the expected log coordinates of the new log (which are optimistically populated from the Rotate log event, not from the new event). For example, from my testing:

            2024-10-16  9:47:02 10 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
            2024-10-16  9:47:02 10 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
            2024-10-16  9:47:02 10 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4, master 127.0.0.1:19000
            

            Note that after the legacy binlog code was removed (I think starting in version 11.X), the error changes to an invalid-heartbeat-event error:

            2024-10-16  9:42:45 6 [ERROR] Slave I/O: Unexpected master's heartbeat data: heartbeat is not compatible with local info;the event's data: log_file_name master-bin.000001 log_pos 377, Internal MariaDB error code: 1623
            2024-10-16  9:42:45 6 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
            2024-10-16  9:42:45 6 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 4; GTID position , master 127.0.0.1:19000
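
            Based on this analysis, a possible stopgap sketch for affected setups (not an official fix - it only makes the race window less likely to be hit by lengthening the heartbeat interval; MASTER_HEARTBEAT_PERIOD defaults to slave_net_timeout / 2, which would also explain why raising slave_net_timeout earlier in this ticket reduced the error rate):

              -- On the replica; the 30-second value is illustrative.
              STOP SLAVE;
              CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 30;
              START SLAVE;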
            

            Elkin Andrei Elkin added a comment -

            bnestere, you seem to have nailed it. Awesome! Many thanks!


            People

              Assignee: bnestere Brandon Nesterenko
              Reporter: R Anton
              Votes: 2
              Watchers: 9

