MariaDB Server / MDEV-7237

Parallel replication: incorrect relaylog position after stop/start the slave


    Description

      141201  8:59:25 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000116' at position 1146, relay log './frigg-relay-bin.000256' position: 1780
      141201  8:59:25 [ERROR] Error in Log_event::read_log_event(): 'Event too big', data_len: 1734964078, event_type: 105
      141201  8:59:25 [ERROR] Error reading relay log event: slave SQL thread aborted because of I/O error
      141201  8:59:25 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Internal MariaDB error code: 1594
      141201  8:59:25 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000116' position 1146
      141201  8:59:25 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3310',replication started in log 'mysql-bin.000323' at position 4

      I reproduced this by first preparing a binlog using RQG replication-dml. I
      then set up a master-slave pair using this binlog. While the slave was
      running, I executed STOP SLAVE and START SLAVE a couple of times, and
      eventually got this error.

      This was reproduced in non-GTID mode.

      I have had a couple of reports of a similar problem from two users. Given
      how easy this problem is to reproduce, it seems likely that it is the same
      as at least some of the cases those users experienced.


        Activity

          knielsen Kristian Nielsen created issue -
          knielsen Kristian Nielsen made changes -
          Status: Open → In Progress

          knielsen Kristian Nielsen added a comment -

          Ok, so the problem is when a transaction has committed and goes to
          update the group relay log position. The relay log file name to set is
          taken from Relay_log_info, so it is whatever position the SQL driver
          thread has reached. Thus, if the SQL driver thread has already
          scheduled a transaction from a following relay log file, position
          updates for transactions in the previous relay log file will be
          written with the wrong relay log file name.
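
          A schematic C++ sketch may make the race concrete. This is
          illustrative pseudocode under my reading of the above, not the actual
          server source; the field names are only modeled on the real
          Relay_log_info members:

            #include <string>

            // Illustrative sketch of the race, not the actual MariaDB source.
            struct Relay_log_info_sketch {
              std::string event_relay_log_name;       // file the SQL driver thread
                                                      // is currently reading from
              std::string group_relay_log_name;       // file recorded for the last
                                                      // committed transaction
              unsigned long long group_relay_log_pos; // offset within that file
            };

            // SQL driver thread: reads ahead and schedules transactions to
            // worker threads; it can move on to the next relay log file before
            // earlier transactions have committed.
            void driver_switches_file(Relay_log_info_sketch *rli) {
              rli->event_relay_log_name = "./frigg-relay-bin.000257";
            }

            // Worker thread: commits a transaction that was read from the
            // previous relay log file. The bug: the file name comes from the
            // driver thread's current state, while the offset belongs to the
            // file the transaction was actually read from.
            void worker_updates_position(Relay_log_info_sketch *rli,
                                         unsigned long long pos_in_old_file) {
              rli->group_relay_log_name = rli->event_relay_log_name; // wrong file
              rli->group_relay_log_pos  = pos_in_old_file;           // old offset
            }

          On restart the SQL thread seeks to that stale offset inside the newer
          file, which typically lands in the middle of an event, producing a
          bogus length like the data_len in the "Event too big" error above.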

          The result is that if the SQL thread is stopped (by STOP SLAVE or due to an
          error) near the end of a relay log, then the slave position can have the wrong
          relay log file. Then when the SQL thread is later started again, it will start
          in the wrong relay log file, probably getting a relay log read error. If we
          are unlucky, the position in the previous relay log file might happen to work
          in the following relay log file, and then the SQL thread will skip a number of
          events, causing serious replication corruption.

          This problem does not happen as easily in GTID mode (stopping both
          slave threads resets the relay log position), which is probably why
          this serious problem was not fixed until now.

          knielsen Kristian Nielsen added a comment -

          Patch: http://lists.askmonty.org/pipermail/commits/2014-December/007104.html
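
          Judging from the analysis above, the fix must make each transaction
          carry the relay log coordinates it was actually read from, so the
          commit-time update no longer consults the driver thread's current
          state. Continuing the earlier sketch (hypothetical names, not the
          actual patch):

            // Fix idea (hypothetical, simplified): capture the coordinates at
            // scheduling time and use exactly those after commit.
            struct trx_relay_coords {
              std::string relay_log_name;        // file this transaction came from
              unsigned long long relay_log_pos;  // end position within that file
            };

            void worker_updates_position_fixed(Relay_log_info_sketch *rli,
                                               const trx_relay_coords &coords) {
              rli->group_relay_log_name = coords.relay_log_name; // matching file
              rli->group_relay_log_pos  = coords.relay_log_pos;  // and offset
            }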
          knielsen Kristian Nielsen made changes -
          Fix Version/s: 10.0.16
          knielsen Kristian Nielsen made changes -
          Resolution: Fixed
          Status: In Progress → Closed
          arjen Arjen Lentz added a comment -

          Sounds like our problem. Good catch Kristian!
          And indeed we already had the suspicion that when it did work, it was actually skipping (lots of) events.


          People

            Assignee: knielsen Kristian Nielsen
            Reporter: knielsen Kristian Nielsen