Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6386

Assertion `thd->transaction.stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)' fails with parallel replication

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.12
    • 10.0.13
    • None

    Description

      While replicating from the attached binary logs, a slave started with slave-parallel-threads=8 fails and then goes down with the assertion failure.

      The binary log contains changes made by a bootstrap process on master, so the replication failure with the duplicate key is expected, since the system table on the slave already exists and is populated.

      Assertion failure is the problem.

      140625 15:01:47 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='127.0.0.1', master_port='3306', master_log_file='', master_log_pos='4'.
      140625 15:01:47 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos
      140625 15:01:55 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './wheezy-64-relay-bin.000001' position: 4; GTID position ''
      140625 15:01:55 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3306',replication starts at GTID position ''
      140625 15:02:15 [ERROR] Slave SQL: Error 'Duplicate entry '%-test-' for key 'PRIMARY'' on query. Default database: 'mysql'. Query: 'INSERT INTO db SELECT * FROM tmp_db WHERE @had_db_table=0;', Internal MariaDB error code: 1062
      140625 15:02:15 [Warning] Slave: Duplicate entry '%-test-' for key 'PRIMARY' Error_code: 1062
      140625 15:02:15 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 63928; GTID position '0-1-85'
      mysqld: 10.0/sql/sql_base.cc:902: void close_thread_tables(THD*): Assertion `thd->transaction.stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)' failed.
      140625 15:02:15 [ERROR] mysqld got signal 6 ;

      #6  0x00007fcfc01ff621 in *__GI___assert_fail (assertion=0xf16a20 "thd->transaction.stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)", file=<optimized out>, line=902, function=0xf19fb0 "void close_thread_tables(THD*)") at assert.c:81
      #7  0x0000000000614865 in close_thread_tables (thd=0x7fcfa3049070) at 10.0/sql/sql_base.cc:901
      #8  0x00000000007eae3f in rpl_slave_state::record_gtid (this=0x1886760, thd=0x7fcfa3049070, gtid=0x7fcf969ab090, sub_id=89, in_transaction=false, in_statement=false) at 10.0/sql/rpl_gtid.cc:688
      #9  0x00000000007e9950 in rpl_slave_state::record_and_update_gtid (this=0x1886760, thd=0x7fcfa3049070, rgi=0x7fcf969ab070) at 10.0/sql/rpl_gtid.cc:80
      #10 0x00000000007825a2 in Relay_log_info::stmt_done (this=0x7fcfad7fccb8, event_master_log_pos=64952, event_creation_time=1403693523, thd=0x7fcfa3049070, rgi=0x7fcf969ab070) at 10.0/sql/rpl_rli.cc:1244
      #11 0x00000000009484e4 in Log_event::do_update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.cc:975
      #12 0x000000000094ea7b in Query_log_event::do_update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.cc:4487
      #13 0x00000000005ed13a in Log_event::update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.h:1357
      #14 0x00000000005e4538 in apply_event_and_update_pos (ev=0x7fcf96998f30, thd=0x7fcfa3049070, rgi=0x7fcf969ab070, rpt=0x7fcfb661a930) at 10.0/sql/slave.cc:3271
      #15 0x00000000007eebd6 in rpt_handle_event (qev=0x7fcf969d3b70, rpt=0x7fcfb661a930) at 10.0/sql/rpl_parallel.cc:46
      #16 0x00000000007efad4 in handle_rpl_parallel_thread (arg=0x7fcfb661a930) at 10.0/sql/rpl_parallel.cc:502
      #17 0x00007fcfc1db6b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
      #18 0x00007fcfc02aea7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

      I couldn't reproduce it without parallel replication or with smaller number of parallel threads, but with slave-parallel-threads=8 the failure seems to be reliably reproducible for me.

      RQG test to reproduce, in case replaying replication from binary logs does not do it:

      perl ./runall-new.pl  --threads=1 --duration=10 --queries=100M --grammar=conf/optimizer/optimizer_subquery.yy --rpl_mode=row  --use-gtid=current_pos  --mysqld=--slave_parallel_threads=8 --basedir=<basedir> --vardir=<vardir>

      It is to be run with MariaDB's flavor of RQG (lp:~elenst/randgen/mariadb-patches)

      Attachments

        1. mysql-bin.000001
          1.01 MB
        2. mysql-bin.000002
          62 kB
        3. mysql-bin.index
          0.0 kB

        Activity

          knielsen Kristian Nielsen added a comment - - edited

          Great test, thanks.

          I actually found two issues that together result in the assertion:

          1. Given two transactions T1 and T2 in the master's binlog, not in the same
          group commit. On the slave, T2 is only allowed to run once T1 has started to
          commit. However, there was a bug, so that if T1 failed, then T2 was still
          allowed to start (T2 would then catch the error at its own commit and roll
          back, so the problem was often not visible).

          2. When T2 is DDL (drop temporary table in this case), record_gtid() needs to
          update mysql.gtid_slave_pos in an auto-committed transaction. This autocommit
          ends up noticing the error in T1 and failing. There was missing error handling
          in this case (missing ha_rollback_trans()), which triggered the assertion.

          Note that it is necessary to run without binlogging on the slave
          (--skip-log-bin or --log-slave-updates=0) to trigger this.

          Here is an MTR test case that reproduces the assertion:

          --source include/have_innodb.inc
          --source include/master-slave.inc
           
          --connection master
          # ToDo: Remove this FLUSH LOGS when MDEV-6403 is fixed.
          ALTER TABLE mysql.gtid_slave_pos ENGINE = InnoDB;
          FLUSH LOGS;
          CREATE TABLE t1 (a INT PRIMARY KEY, b INT) Engine=InnoDB;
          --sync_slave_with_master
           
          --connection slave
          --source include/stop_slave.inc
          # Provoke a duplicate key error on replication.
          SET sql_log_bin= 0;
          INSERT INTO t1 VALUES (1, 2);
          SET sql_log_bin= 1;
          CHANGE MASTER TO master_use_gtid= current_pos;
          --echo Contents on slave before:
          SELECT * FROM t1 ORDER BY a;
           
          SET @old_parallel= @@GLOBAL.slave_parallel_threads;
          SET GLOBAL slave_parallel_threads=8;
           
          --connection master
           
          CREATE TEMPORARY TABLE t2 LIKE t1;
          INSERT INTO t2 VALUE (1, 1);
          INSERT INTO t2 VALUE (2, 1);
          INSERT INTO t2 VALUE (3, 1);
          INSERT INTO t2 VALUE (4, 1);
          INSERT INTO t2 VALUE (5, 1);
          INSERT INTO t1 SELECT * FROM t2;
          DROP TEMPORARY TABLE t2;
          --save_master_pos
          --echo Contents on master:
          SELECT * FROM t1 ORDER BY a;
           
          --connection slave
          START SLAVE;
          # The slave will stop with a duplicate key error.
          # The bug was 1) that the next DROP TEMPORARY TABLE would be allowed to run
          # anyway, and 2) that then record_gtid() would get an error during commit
          # (since the prior commit failed), and this error was not correctly handled,
          # which caused an assertion about closing tables while a statement was still
          # active.
          --let $slave_sql_errno=1062
          --source include/wait_for_slave_sql_error.inc
           
          STOP SLAVE IO_THREAD;
          --echo Contents on slave on slave error:
          SELECT * FROM t1 ORDER BY a;
           
          # Resolve the duplicate key error so replication can be resumed.
          SET sql_log_bin= 0;
          DELETE FROM t1 WHERE a=1;
          SET sql_log_bin= 1;
           
          --source include/start_slave.inc
          --sync_with_master
          --echo Contents on slave after:
          SELECT * FROM t1 ORDER BY a;
           
          --connection master
          DROP TABLE t1;
           
          --connection slave
          --source include/stop_slave.inc
          SET GLOBAL slave_parallel_threads= @old_parallel;
          --source include/start_slave.inc
           
          --source include/rpl_end.inc

          -slave.opt file:

          --disable-log-slave-updates

          knielsen Kristian Nielsen added a comment - - edited Great test, thanks. I actually found two issues that together result in the assertion: 1. Given two transactions T1 and T2 in the master's binlog, not in the same group commit. On the slave, T2 is only allowed to run once T1 has started to commit. However, there was a bug, so that if T1 failed, then T2 was still allowed to start (T2 would then catch the error at its own commit and roll back, so the problem was often not visible). 2. When T2 is DDL (drop temporary table in this case), record_gtid() needs to update mysql.gtid_slave_pos in an auto-committed transaction. This autocommit ends up noticing the error in T1 and failing. There was missing error handling in this case (missing ha_rollback_trans()), which triggered the assertion. Note that it is necessary to run without binlogging on the slave (--skip-log-bin or --log-slave-updates=0) to trigger this. Here is an MTR test case that reproduces the assertion: --source include/have_innodb.inc --source include/master-slave.inc   --connection master # ToDo: Remove this FLUSH LOGS when MDEV-6403 is fixed. ALTER TABLE mysql.gtid_slave_pos ENGINE = InnoDB; FLUSH LOGS; CREATE TABLE t1 (a INT PRIMARY KEY, b INT) Engine=InnoDB; --sync_slave_with_master   --connection slave --source include/stop_slave.inc # Provoke a duplicate key error on replication. SET sql_log_bin= 0; INSERT INTO t1 VALUES (1, 2); SET sql_log_bin= 1; CHANGE MASTER TO master_use_gtid= current_pos; --echo Contents on slave before: SELECT * FROM t1 ORDER BY a;   SET @old_parallel= @@GLOBAL.slave_parallel_threads; SET GLOBAL slave_parallel_threads=8;   --connection master   CREATE TEMPORARY TABLE t2 LIKE t1; INSERT INTO t2 VALUE (1, 1); INSERT INTO t2 VALUE (2, 1); INSERT INTO t2 VALUE (3, 1); INSERT INTO t2 VALUE (4, 1); INSERT INTO t2 VALUE (5, 1); INSERT INTO t1 SELECT * FROM t2; DROP TEMPORARY TABLE t2; --save_master_pos --echo Contents on master: SELECT * FROM t1 ORDER BY a;   --connection slave START SLAVE; # The slave will stop with a duplicate key error. # The bug was 1) that the next DROP TEMPORARY TABLE would be allowed to run # anyway, and 2) that then record_gtid() would get an error during commit # (since the prior commit failed), and this error was not correctly handled, # which caused an assertion about closing tables while a statement was still # active. --let $slave_sql_errno=1062 --source include/wait_for_slave_sql_error.inc   STOP SLAVE IO_THREAD; --echo Contents on slave on slave error: SELECT * FROM t1 ORDER BY a;   # Resolve the duplicate key error so replication can be resumed. SET sql_log_bin= 0; DELETE FROM t1 WHERE a=1; SET sql_log_bin= 1;   --source include/start_slave.inc --sync_with_master --echo Contents on slave after: SELECT * FROM t1 ORDER BY a;   --connection master DROP TABLE t1;   --connection slave --source include/stop_slave.inc SET GLOBAL slave_parallel_threads= @old_parallel; --source include/start_slave.inc   --source include/rpl_end.inc -slave.opt file: --disable-log-slave-updates

          Pushed to 10.0

          knielsen Kristian Nielsen added a comment - Pushed to 10.0

          People

            knielsen Kristian Nielsen
            elenst Elena Stepanova
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.