[MDEV-14822] binlog.binlog_killed fails with wrong result Created: 2017-12-30  Updated: 2018-01-10  Resolved: 2018-01-10

Status: Closed
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.3
Fix Version/s: 10.3.4

Type: Bug Priority: Major
Reporter: Vicențiu Ciorbaru Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None


 Description   

In travis, we build the server with the following config line:

cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SSL=system -DWITH_ZLIB=system -DPLUGIN_AWS_KEY_MANAGEMENT=DYNAMIC -DAWS_SDK_EXTERNAL_PROJECT=ON -DWITHOUT_MROONGA_STORAGE_ENGINE=True

Due to how travis environment is set up, the test fails when compiling using clang 5.0.1 or with GCC 6.3.0

Locally, I was able the reproduce it with the following failure:

+++ /home/vicentiu/Workspace/MariaDB-10.3/mysql-test/suite/binlog/r/binlog_killed.reject	2017-12-30 15:35:00.079888311 +0200
@@ -264,12 +264,6 @@
 master-bin.000002	#	Intvar	#	#	INSERT_ID=7
 master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t2 VALUES (NULL, 1)
 master-bin.000002	#	Query	#	#	COMMIT
-master-bin.000002	#	Gtid	#	#	BEGIN GTID #-#-#
-master-bin.000002	#	Intvar	#	#	INSERT_ID=8
-master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t1 VALUES (NULL, 1)
-master-bin.000002	#	Intvar	#	#	INSERT_ID=9
-master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t1 VALUES (NULL, 1)
-master-bin.000002	#	Query	#	#	ROLLBACK
 connection default;
 disconnect con1;
 disconnect con2;
 
mysqltest: Result length mismatch

However, at times the failure is like this:

57589899 +0200
+++ /home/vicentiu/Workspace/MariaDB-10.3/mysql-test/suite/binlog/r/binlog_killed.reject	2017-12-30 15:36:12.180412517 +0200
@@ -198,10 +198,6 @@
 master-bin.000002	#	Intvar	#	#	INSERT_ID=3
 master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t2 VALUES (NULL, 1)
 master-bin.000002	#	Query	#	#	COMMIT
-master-bin.000002	#	Gtid	#	#	BEGIN GTID #-#-#
-master-bin.000002	#	Intvar	#	#	INSERT_ID=4
-master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t1 VALUES (NULL, 1)
-master-bin.000002	#	Query	#	#	ROLLBACK
 disconnect con3;
 connect  con3, localhost, root,,;
 connection con3;
@@ -264,12 +260,6 @@
 master-bin.000002	#	Intvar	#	#	INSERT_ID=7
 master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t2 VALUES (NULL, 1)
 master-bin.000002	#	Query	#	#	COMMIT
-master-bin.000002	#	Gtid	#	#	BEGIN GTID #-#-#
-master-bin.000002	#	Intvar	#	#	INSERT_ID=8
-master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t1 VALUES (NULL, 1)
-master-bin.000002	#	Intvar	#	#	INSERT_ID=9
-master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t1 VALUES (NULL, 1)
-master-bin.000002	#	Query	#	#	ROLLBACK
 connection default;
 disconnect con1;
 disconnect con2;
 
mysqltest: Result length mismatch

The common denominator it seems is that when running a transaction with a transactional storage engine (Innodb), in some cases we don't immediately flush to the binlog. If the kill query signal happens earlier, we don't write the transaction at all (including the ROLLBACK event). The problem is that the test doesn't always fail. With --repeat=100, I was able to get it to reproduce once every 3-4 runs.

Ultimately, the binlog contents are valid either way, but the test needs to be stabilized in some way.

The sequence of queries is roughly this:

create table t1 (a int auto_increment, b int, PRIMARY KEY (a)) ENGINE=InnoDB;
create table t2 (a int auto_increment, b int, PRIMARY KEY (a)) ENGINE=MyISAM;
flush logs;
connect  con3, localhost, root,,;
connection con3;
BEGIN;
INSERT INTO t2 VALUES (NULL, 1);
INSERT INTO t1 VALUES (NULL, 1);
connection con3;
KILL $id_of_con3;
-- Since t2 is non-transactional, the binlog should have those rows, however t1 changes should be rolled back.



 Comments   
Comment by Vicențiu Ciorbaru [ 2017-12-30 ]

CC: Elkin monty

Comment by Vicențiu Ciorbaru [ 2017-12-30 ]

Reverting c4581735d0210beba0733b30df8dd994786663fe partially seems to have solved the problem.

 
diff --git a/sql/log.cc b/sql/log.cc
index bdf0b6fdc59d..fc4b961e136b 100644
--- a/sql/log.cc
+++ b/sql/log.cc
@@ -4178,6 +4178,7 @@ bool MYSQL_BIN_LOG::reset_logs(THD *thd, bool create_new_log,
     mysql_mutex_unlock(&LOCK_xid_list);
   }
 
+  mysql_mutex_lock(&LOCK_thread_count);
   /* Save variables so that we can reopen the log */
   save_name=name;
   name=0;					// Protect against free
@@ -4285,6 +4286,8 @@ bool MYSQL_BIN_LOG::reset_logs(THD *thd, bool create_new_log,
   if (error == 1)
     name= const_cast<char*>(save_name);
 
+  mysql_mutex_unlock(&LOCK_thread_count);
+
   if (!is_relay_log)
   {
     xid_count_per_binlog *b;

Comment by Vicențiu Ciorbaru [ 2017-12-30 ]

While trying to reproduce the problem on GCC 7.2.1, I've gotten the following failure:

--- /home/vicentiu/Workspace/MariaDB-10.3/mysql-test/suite/binlog/r/binlog_killed.result	2017-12-16 18:24:27.257589899 +0200
+++ /home/vicentiu/Workspace/MariaDB-10.3/mysql-test/suite/binlog/r/binlog_killed.reject	2017-12-30 19:20:03.784691846 +0200
@@ -193,11 +193,11 @@
 master-bin.000002	#	Format_desc	#	#	SERVER_VERSION, BINLOG_VERSION
 master-bin.000002	#	Gtid_list	#	#	[#-#-#]
 master-bin.000002	#	Binlog_checkpoint	#	#	master-bin.000001
-master-bin.000002	#	Binlog_checkpoint	#	#	master-bin.000002
 master-bin.000002	#	Gtid	#	#	BEGIN GTID #-#-#
 master-bin.000002	#	Intvar	#	#	INSERT_ID=3
 master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t2 VALUES (NULL, 1)
 master-bin.000002	#	Query	#	#	COMMIT
+master-bin.000002	#	Binlog_checkpoint	#	#	master-bin.000002
 master-bin.000002	#	Gtid	#	#	BEGIN GTID #-#-#
 master-bin.000002	#	Intvar	#	#	INSERT_ID=4
 master-bin.000002	#	Query	#	#	use `test`; INSERT INTO t1 VALUES (NULL, 1)

Comment by Michael Widenius [ 2018-01-09 ]

Binlog checkpoints can happen any time. They are more or less random.

To get rid of them, one should start scripts with 'reset master' or use the new mtr variable "--let $skip_checkpoint_events=1" which removes them from show_binlog_events

Comment by Michael Widenius [ 2018-01-10 ]

Problem was timing between the thread that was killed and the thread that was
reading the binary log.

Updated the test to wait until the killed thread was properly terminated
before checking what's in the binary log.

To make check safe, I changed "threads_connected" to be updated after
thd::cleanup() is done, to ensure that all binary logs updates are done
before the variable is changed. This was mainly done to get the
test deterministic and have now other real influence in how the server
works.

Generated at Thu Feb 08 08:16:33 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.