[MDEV-8301] Test failures in rpl.rpl_000011 Created: 2015-06-11  Updated: 2023-12-11  Resolved: 2023-12-11

Status: Closed
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.0.20, 10.1.6
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Kristian Nielsen Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None

Sprint: 10.1.6-2, 10.0.21

 Description   

rpl.rpl_000011 'row' w1 [ fail ]
Test ended at 2015-06-10 17:26:28

CURRENT_TEST: rpl.rpl_000011
— /usr/share/mysql/mysql-test/suite/rpl/r/rpl_000011.result 2015-06-10 13:40:30.000000000 +0300
+++ /run/shm/var/1/log/rpl_000011.reject 2015-06-10 17:26:28.339504149 +0300
@@ -8,7 +8,7 @@
stop slave;
show global status like 'com_insert';
Variable_name Value
-Com_insert 1
+Com_insert 2
include/wait_for_slave_to_stop.inc
start slave;
include/wait_for_slave_to_start.inc

mysqltest: Result content mismatch

Fails sporadically, but on many hosts.



 Comments   
Comment by Kristian Nielsen [ 2015-06-11 ]

So this turns out to actually be an old bug.
It just happened to trigger more easily after patch for MDEV-8294.

The race is that there is a small window where SHOW GLOBAL STATUS may count
twice the values of a thread that is exiting. This happens also for normal
threads, not just the slave thread.

Here is a test case for latest 10.0, triggers with the following sleep
patched in:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index e05c0b6..75af8f7 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -2759,6 +2759,7 @@ void unlink_thd(THD *thd)
 
   thd->add_status_to_global();
 
+my_sleep(2000000);
   mysql_mutex_lock(&LOCK_thread_count);
   thd->unlink();
   /*

CREATE TABLE t1 (a INT PRIMARY KEY);
 
connect (c1,localhost,root,,);
INSERT INTO t1 VALUES (1);
INSERT INTO t1 VALUES (2);
INSERT INTO t1 VALUES (3);
--let $cid= `SELECT CONNECTION_ID()`
 
--connection default
eval KILL CONNECTION $cid;
--sleep 1
SHOW GLOBAL STATUS LIKE 'Com_insert';
--sleep 2
SHOW GLOBAL STATUS LIKE 'Com_insert';
 
DROP TABLE t1;

The first SHOW GLOBAL STATUS counts the thread twice:

CREATE TABLE t1 (a INT PRIMARY KEY);
INSERT INTO t1 VALUES (1);
INSERT INTO t1 VALUES (2);
INSERT INTO t1 VALUES (3);
KILL CONNECTION 4;
SHOW GLOBAL STATUS LIKE 'Com_insert';
Variable_name	Value
Com_insert	6
SHOW GLOBAL STATUS LIKE 'Com_insert';
Variable_name	Value
Com_insert	3
DROP TABLE t1;

Comment by Kristian Nielsen [ 2015-06-11 ]

The race occurs because there is a window of time between a thread does THD::add_status_to_global() and THD::unlink(). During this window, calc_sum_of_all_status() can run and count the contribution of the thread twice.

Comment by Kristian Nielsen [ 2015-06-11 ]

The original buildbot test failure can be reproduced easily by injecting the
following sleep in the code:

diff --git a/sql/slave.cc b/sql/slave.cc
index 4129c4c..32e6bd4 100644
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -4837,6 +4837,7 @@ log '%s' at position %s, relay log '%s' position: %s%s", RPL_LOG_NAME,
   mysql_cond_broadcast(&rli->stop_cond);
   DBUG_EXECUTE_IF("simulate_slave_delay_at_terminate_bug38694", sleep(5););
   mysql_mutex_unlock(&rli->run_lock);  // tell the world we are done
+my_sleep(500000);
 
   /*
     Deactivate the parallel replication thread pool, if there are now no more

Comment by Kristian Nielsen [ 2015-06-11 ]

Monty said he would fix this during this week

Comment by Kristian Nielsen [ 2015-06-15 ]

Monty, I've pushed a temporary work-around for the test failure in rpl.rpl_000011:

http://lists.askmonty.org/pipermail/commits/2015-June/008039.html

Please remove that work-around from the testcase when you push the fix for this underlying issue.

Comment by Michael Widenius [ 2015-06-26 ]

Fixed by adding a marker if we have added the thread counters to the global counters.
Fix will be pushed into 5.5 tree

Comment by Michael Widenius [ 2015-06-26 ]

Fix pushed into 5.5
When merged to 10.0, the above temporary fix should be removed and this bug closed.

Comment by Julien Fritsch [ 2023-12-05 ]

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

Comment by Kristian Nielsen [ 2023-12-11 ]

This was completed long ago. T bug is fixed, as described in comments. And the temporary work-around was removed as part of commit 1e0f09cacbb05eec95078e10df1ea610736b9c1a.

Generated at Thu Feb 08 07:26:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.