Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-10863

parallel replication tries to continue from wrong position after stopping and restarting slave threads

Details

    Description

      When doing STOP SLAVE SQL_THREAD and START SLAVE SQL_THREAD repeatedly parallel replication sometimes fails on START with a duplicate key error. SHOW SLAVE STATUS then always shows a relay log position at the beginning of a log file, e.g. "Relay_Log_Pos: 4". This only seems to happen if a large transaction is split across more than one relay log file.

      How to reproduce:

      Master config:

      [mysqld]
      server-id=1
      log-bin=master-bin
      binlog-format=statement
      

      Slave config:

      [mysqld]
      server-id=2
      slave_parallel_threads=20
      max_relay_log_size=98304
      

      Create replication user on the master:

      CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
      GRANT ALL PRIVILEGES ON *.* TO 'repl'@'%';
      RESET MASTER;
      

      Start slave:

      SET GLOBAL gtid_slave_pos='0-1-2';
      CHANGE MASTER TO Master_host='mymasterip', Master_user='repl', Master_password='password', master_use_gtid=slave_pos;
      START SLAVE;
      

      On the master: create a simple PHP script "test.php" to populate a table using given domain ID (modulo 3) and table name in transactions of 1000 rows each (transactions will take about 40K binlog space each, so talking about half of the max_relaylog_size which makes hitting a transaction split very likely)

      <?php
       
      $domain_id = $argv[1]%3 + 1;
      $table     = $argv[2];
       
      echo "SET gtid_domain_id=$domain_id;";
       
      echo "DROP TABLE IF EXISTS test.$table;";
      echo "CREATE TABLE test.$table(id int primary key, msg varchar(100));"; 
       
      $n=1;
       
      while (true) {
        echo "BEGIN;\n"; 
        for ($i = 1; $i < 1000; $i++) {
          echo "INSERT INTO test.$table VALUES($n, MD5(RAND()));\n";
          $n++;
        }
        echo "COMMIT;\n";
      }
      

      Then run several instances of it in parallel:

      for a in $(seq 20); do (php test.php $a t$a | mysql &) ; done
      

      Now on the slave run this script to repeatedly stop and restart the SQL thread:

      ( sudo mysql <<< "select version(); start slave sql_thread"; while date; do
      sudo mysql <<< 'stop slave sql_thread; show slave status\G'
      gsp_stopped="$(sudo mysql <<< 'SELECT @@GLOBAL.gtid_slave_pos;')"; echo "$gsp_stopped"; 
      sudo mysql <<< 'start slave sql_thread;'; sleep 5;
      sss="$(sudo mysql <<< 'show slave status\G')"
      gsp="$(sudo mysql <<< 'SELECT @@GLOBAL.gtid_slave_pos')"; echo "$sss"; echo "$gsp"; 
      if test "$(awk '$1 == "Slave_SQL_Running:"{print $2}' <<< "$sss")" == "Yes"; then echo; echo "sleeping..."; sleep 5; else echo "things are broken..."; break; fi; done; ) | tee test.log
      

      This usually fails with a duplicate key error after a small number of iterations already

      Attachments

        1. binlogs.tar.gz
          599 kB
        2. test.log
          14 kB

        Issue Links

          Activity

            > I checked your patch, the main idea is ignore the first part without a
            > begining GTID event in the first relay log, right?

            The way the code works is, it scans over the earlier GTIDs in the relay log
            file. If the GTID was already applied, it sets the GTID_SKIP_TRANSACTION (or
            GTID_SKIP_STANDALONE) flag, so that the associated events are skipped.

            But the code starts scanning at the beginning of a relay log, which can be
            in the middle of an event group that was already applied. So it does not see
            the GTID for the first event group, thus the gtid_skip_flag needs to be
            explicitly initialised to GTID_SKIP_TRANSACTION.

            We can be sure that any GTID from an earlier relay log was already applied,
            so that skipping it is correct. This is ensured by the inuse_relaylog
            mechanism. See struct inuse_relaylog in sql/rpl_rli.h.

            An alternative might be to keep track of the position of the first GTID (or
            other event outside of an event group), and then start scanning from that
            point. Then we need to handle also the case where a relay log file contains
            no GTID at all because both the start and the end of a big transaction is
            outside that file. I am not sure that is any simpler or better.

            > I'm not sure if this logic always be right under parallels replication.

            Yes. This code which handles restarting the SQL thread is really
            complicated, and I was always very unhappy about it. There have been a
            number of bugs related to it. I would certainly welcome any suggestions for
            a simpler solution.

            I think the better overall approach would be that the SQL thread was able to
            correctly start in the relay log file from a GTID position, using a
            mechanism similar to how this works when mysql_binlog_send() runs on the
            master. Then, if relay log recovery was implemented on the slave similar to
            binlog recovery on the master, we would also be able to re-use the relay log
            after a slave server restart, without having to re-fetch it from the
            master. But that is a much bigger change, probably too much for 10.0.

            (And it would still be necessary to handle partiel event groups at the start
            of the relay log).

            > Do you think keep the previous relay log file is a better way? I mean, if
            > Event[0] be splitted into realy log File[0] and File[1], and GTID moved to
            > Evnet[1], should we keep the File[0] until Evnet[1] be applied? Then
            > parallels replication can always get the start postion in the exist relay
            > log file.

            I don't see how this will help. There might still be another earlier
            Event[-1] in File[0] that was split and does not have its beginning GTID in
            File[0]. No matter what, I think we need to be able to handle the
            possibility that there is a partial event group at the start of a relay log
            file without the starting GTID.

            Thanks,

            • Kristian.
            knielsen Kristian Nielsen added a comment - > I checked your patch, the main idea is ignore the first part without a > begining GTID event in the first relay log, right? The way the code works is, it scans over the earlier GTIDs in the relay log file. If the GTID was already applied, it sets the GTID_SKIP_TRANSACTION (or GTID_SKIP_STANDALONE) flag, so that the associated events are skipped. But the code starts scanning at the beginning of a relay log, which can be in the middle of an event group that was already applied. So it does not see the GTID for the first event group, thus the gtid_skip_flag needs to be explicitly initialised to GTID_SKIP_TRANSACTION. We can be sure that any GTID from an earlier relay log was already applied, so that skipping it is correct. This is ensured by the inuse_relaylog mechanism. See struct inuse_relaylog in sql/rpl_rli.h. An alternative might be to keep track of the position of the first GTID (or other event outside of an event group), and then start scanning from that point. Then we need to handle also the case where a relay log file contains no GTID at all because both the start and the end of a big transaction is outside that file. I am not sure that is any simpler or better. > I'm not sure if this logic always be right under parallels replication. Yes. This code which handles restarting the SQL thread is really complicated, and I was always very unhappy about it. There have been a number of bugs related to it. I would certainly welcome any suggestions for a simpler solution. I think the better overall approach would be that the SQL thread was able to correctly start in the relay log file from a GTID position, using a mechanism similar to how this works when mysql_binlog_send() runs on the master. Then, if relay log recovery was implemented on the slave similar to binlog recovery on the master, we would also be able to re-use the relay log after a slave server restart, without having to re-fetch it from the master. But that is a much bigger change, probably too much for 10.0. (And it would still be necessary to handle partiel event groups at the start of the relay log). > Do you think keep the previous relay log file is a better way? I mean, if > Event [0] be splitted into realy log File [0] and File [1] , and GTID moved to > Evnet [1] , should we keep the File [0] until Evnet [1] be applied? Then > parallels replication can always get the start postion in the exist relay > log file. I don't see how this will help. There might still be another earlier Event [-1] in File [0] that was split and does not have its beginning GTID in File [0] . No matter what, I think we need to be able to handle the possibility that there is a partial event group at the start of a relay log file without the starting GTID. Thanks, Kristian.
            plinux Lixun Peng added a comment -

            Hi Kritian,

            I think you can merge your code into the master branch.

            Thanks,
            Lixun

            plinux Lixun Peng added a comment - Hi Kritian, I think you can merge your code into the master branch. Thanks, Lixun

            > I think you can merge your code into the master branch.

            Ok, will do.

            • Kristian.
            knielsen Kristian Nielsen added a comment - > I think you can merge your code into the master branch. Ok, will do. Kristian.

            Pushed to 10.0.

            Thanks to Hartmut for an excellent bug report. Detailed analysis, and a
            stand-alone testcase to reproduce a complex failure scenario.

            knielsen Kristian Nielsen added a comment - Pushed to 10.0. Thanks to Hartmut for an excellent bug report. Detailed analysis, and a stand-alone testcase to reproduce a complex failure scenario.

            Actually most of it was originally done by Jean-François, the only part really done by me was the PHP generator script that made hitting the situation described by J-F much more likely ...

            hholzgra Hartmut Holzgraefe added a comment - Actually most of it was originally done by Jean-François, the only part really done by me was the PHP generator script that made hitting the situation described by J-F much more likely ...

            People

              knielsen Kristian Nielsen
              hholzgra Hartmut Holzgraefe
              Votes:
              2 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.