Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.0.12
-
None
Description
While replicating from the attached binary logs, a slave started with slave-parallel-threads=8 fails and then goes down with the assertion failure.
The binary log contains changes made by a bootstrap process on master, so the replication failure with the duplicate key is expected, since the system table on the slave already exists and is populated.
Assertion failure is the problem.
140625 15:01:47 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='127.0.0.1', master_port='3306', master_log_file='', master_log_pos='4'.
|
140625 15:01:47 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos
|
140625 15:01:55 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './wheezy-64-relay-bin.000001' position: 4; GTID position ''
|
140625 15:01:55 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3306',replication starts at GTID position ''
|
140625 15:02:15 [ERROR] Slave SQL: Error 'Duplicate entry '%-test-' for key 'PRIMARY'' on query. Default database: 'mysql'. Query: 'INSERT INTO db SELECT * FROM tmp_db WHERE @had_db_table=0;', Internal MariaDB error code: 1062
|
140625 15:02:15 [Warning] Slave: Duplicate entry '%-test-' for key 'PRIMARY' Error_code: 1062
|
140625 15:02:15 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 63928; GTID position '0-1-85'
|
mysqld: 10.0/sql/sql_base.cc:902: void close_thread_tables(THD*): Assertion `thd->transaction.stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)' failed.
|
140625 15:02:15 [ERROR] mysqld got signal 6 ;
|
#6 0x00007fcfc01ff621 in *__GI___assert_fail (assertion=0xf16a20 "thd->transaction.stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)", file=<optimized out>, line=902, function=0xf19fb0 "void close_thread_tables(THD*)") at assert.c:81
|
#7 0x0000000000614865 in close_thread_tables (thd=0x7fcfa3049070) at 10.0/sql/sql_base.cc:901
|
#8 0x00000000007eae3f in rpl_slave_state::record_gtid (this=0x1886760, thd=0x7fcfa3049070, gtid=0x7fcf969ab090, sub_id=89, in_transaction=false, in_statement=false) at 10.0/sql/rpl_gtid.cc:688
|
#9 0x00000000007e9950 in rpl_slave_state::record_and_update_gtid (this=0x1886760, thd=0x7fcfa3049070, rgi=0x7fcf969ab070) at 10.0/sql/rpl_gtid.cc:80
|
#10 0x00000000007825a2 in Relay_log_info::stmt_done (this=0x7fcfad7fccb8, event_master_log_pos=64952, event_creation_time=1403693523, thd=0x7fcfa3049070, rgi=0x7fcf969ab070) at 10.0/sql/rpl_rli.cc:1244
|
#11 0x00000000009484e4 in Log_event::do_update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.cc:975
|
#12 0x000000000094ea7b in Query_log_event::do_update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.cc:4487
|
#13 0x00000000005ed13a in Log_event::update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.h:1357
|
#14 0x00000000005e4538 in apply_event_and_update_pos (ev=0x7fcf96998f30, thd=0x7fcfa3049070, rgi=0x7fcf969ab070, rpt=0x7fcfb661a930) at 10.0/sql/slave.cc:3271
|
#15 0x00000000007eebd6 in rpt_handle_event (qev=0x7fcf969d3b70, rpt=0x7fcfb661a930) at 10.0/sql/rpl_parallel.cc:46
|
#16 0x00000000007efad4 in handle_rpl_parallel_thread (arg=0x7fcfb661a930) at 10.0/sql/rpl_parallel.cc:502
|
#17 0x00007fcfc1db6b50 in start_thread (arg=<optimized out>) at pthread_create.c:304
|
#18 0x00007fcfc02aea7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
|
I couldn't reproduce it without parallel replication or with smaller number of parallel threads, but with slave-parallel-threads=8 the failure seems to be reliably reproducible for me.
RQG test to reproduce, in case replaying replication from binary logs does not do it:
perl ./runall-new.pl --threads=1 --duration=10 --queries=100M --grammar=conf/optimizer/optimizer_subquery.yy --rpl_mode=row --use-gtid=current_pos --mysqld=--slave_parallel_threads=8 --basedir=<basedir> --vardir=<vardir>
|
It is to be run with MariaDB's flavor of RQG (lp:~elenst/randgen/mariadb-patches)
Great test, thanks.
I actually found two issues that together result in the assertion:
1. Given two transactions T1 and T2 in the master's binlog, not in the same
group commit. On the slave, T2 is only allowed to run once T1 has started to
commit. However, there was a bug, so that if T1 failed, then T2 was still
allowed to start (T2 would then catch the error at its own commit and roll
back, so the problem was often not visible).
2. When T2 is DDL (drop temporary table in this case), record_gtid() needs to
update mysql.gtid_slave_pos in an auto-committed transaction. This autocommit
ends up noticing the error in T1 and failing. There was missing error handling
in this case (missing ha_rollback_trans()), which triggered the assertion.
Note that it is necessary to run without binlogging on the slave
(--skip-log-bin or --log-slave-updates=0) to trigger this.
Here is an MTR test case that reproduces the assertion:
--source include/have_innodb.inc
--source include/master-slave.inc
--connection master
# ToDo: Remove this FLUSH LOGS when MDEV-6403 is fixed.
ALTER TABLE mysql.gtid_slave_pos ENGINE = InnoDB;
FLUSH LOGS;
CREATE TABLE t1 (a INT PRIMARY KEY, b INT) Engine=InnoDB;
--sync_slave_with_master
--connection slave
--source include/stop_slave.inc
# Provoke a duplicate key error on replication.
SET sql_log_bin= 0;
INSERT INTO t1 VALUES (1, 2);
SET sql_log_bin= 1;
CHANGE MASTER TO master_use_gtid= current_pos;
--echo Contents on slave before:
SELECT * FROM t1 ORDER BY a;
SET @old_parallel= @@GLOBAL.slave_parallel_threads;
SET GLOBAL slave_parallel_threads=8;
--connection master
CREATE TEMPORARY TABLE t2 LIKE t1;
INSERT INTO t2 VALUE (1, 1);
INSERT INTO t2 VALUE (2, 1);
INSERT INTO t2 VALUE (3, 1);
INSERT INTO t2 VALUE (4, 1);
INSERT INTO t2 VALUE (5, 1);
INSERT INTO t1 SELECT * FROM t2;
DROP TEMPORARY TABLE t2;
--save_master_pos
--echo Contents on master:
SELECT * FROM t1 ORDER BY a;
--connection slave
START SLAVE;
# The slave will stop with a duplicate key error.
# The bug was 1) that the next DROP TEMPORARY TABLE would be allowed to run
# anyway, and 2) that then record_gtid() would get an error during commit
# (since the prior commit failed), and this error was not correctly handled,
# which caused an assertion about closing tables while a statement was still
# active.
--let $slave_sql_errno=1062
--source include/wait_for_slave_sql_error.inc
STOP SLAVE IO_THREAD;
--echo Contents on slave on slave error:
SELECT * FROM t1 ORDER BY a;
# Resolve the duplicate key error so replication can be resumed.
SET sql_log_bin= 0;
DELETE FROM t1 WHERE a=1;
SET sql_log_bin= 1;
--source include/start_slave.inc
--sync_with_master
--echo Contents on slave after:
SELECT * FROM t1 ORDER BY a;
--connection master
DROP TABLE t1;
--connection slave
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads= @old_parallel;
--source include/start_slave.inc
--source include/rpl_end.inc
-slave.opt file:
--disable-log-slave-updates