[MDEV-5206] Slave with parallel threads is overly optimistic about its master log position after FLUSH LOGS Created: 2013-10-30  Updated: 2013-10-31  Resolved: 2013-10-31

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: 10.0.5

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-4506 MWL#184: Parallel replication of grou... Closed

 Description   

The test case runs a number ($n) of INSERTs on master, then flushes logs, and tries to synchronize the slave with the master.
If it's executed with slave_parallel_threads > 0, we can see in the slave status that the exec master position is the same as on the master right away, although the number of inserted values differs.
After a while, the position is still the same, but the count increases (up to the expected value, eventually).

The test case shows the problem reliably for me with $n = 100 (the first count on the slave is 20 or so), but if it does not for you, please try to increase $n, I suppose it should increase the probability.

--source include/master-slave.inc
--source include/have_innodb.inc
--source include/have_binlog_format_mixed.inc
 
--enable_connect_log
 
--connection slave
--source include/stop_slave.inc
 
--connection master
--disable_warnings
DROP TABLE IF EXISTS t1;
--enable_warnings
 
CREATE TABLE t1 (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY) ENGINE=MyISAM;
let $n = 100;
--disable_query_log
--echo 
--echo # Running $n single inserts on master
--echo 
while ($n) 
{
  INSERT INTO t1 VALUES();
  dec $n;
}
--enable_query_log
FLUSH LOGS;
 
SHOW MASTER STATUS;
--save_master_pos
 
--connection slave
--source include/start_slave.inc
 
--sync_with_master
 
--echo 
--echo # The slave thinks it's synchronized, but the count is off
--echo 
 
query_vertical SHOW SLAVE STATUS;
select count(*) from t1;
 
--echo 
--echo # After waiting a bit, the position is the same, but the count increased
--echo 
sleep 2;
 
query_vertical SHOW SLAVE STATUS;
select count(*) from t1;
 
connection master;
DROP TABLE t1;
sync_slave_with_master;
 
--disable_connect_log
 
--source include/rpl_end.inc

Output:

CREATE TABLE t1 (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY) ENGINE=MyISAM;
 
# Running 100 single inserts on master
 
FLUSH LOGS;
SHOW MASTER STATUS;
File	Position	Binlog_Do_DB	Binlog_Ignore_DB
master-bin.000002	367		
connection slave;
include/start_slave.inc
connection slave;
 
# The slave thinks it's synchronized, but the count is off
 
SHOW SLAVE STATUS;
Slave_IO_State	Waiting for master to send event
Master_Host	127.0.0.1
Master_User	root
Master_Port	16000
Connect_Retry	1
Master_Log_File	master-bin.000002
Read_Master_Log_Pos	367
Relay_Log_File	slave-relay-bin.000005
Relay_Log_Pos	7051
Relay_Master_Log_File	master-bin.000002
Slave_IO_Running	Yes
Slave_SQL_Running	Yes
Replicate_Do_DB	
Replicate_Ignore_DB	
Replicate_Do_Table	
Replicate_Ignore_Table	
Replicate_Wild_Do_Table	
Replicate_Wild_Ignore_Table	
Last_Errno	0
Last_Error	
Skip_Counter	0
Exec_Master_Log_Pos	367
Relay_Log_Space	996
Until_Condition	None
Until_Log_File	
Until_Log_Pos	0
Master_SSL_Allowed	No
Master_SSL_CA_File	
Master_SSL_CA_Path	
Master_SSL_Cert	
Master_SSL_Cipher	
Master_SSL_Key	
Seconds_Behind_Master	0
Master_SSL_Verify_Server_Cert	No
Last_IO_Errno	0
Last_IO_Error	
Last_SQL_Errno	0
Last_SQL_Error	
Replicate_Ignore_Server_Ids	
Master_Server_Id	1
Using_Gtid	No
select count(*) from t1;
count(*)
29
 
# After waiting a bit, the position is the same, but the count increased
 
SHOW SLAVE STATUS;
Slave_IO_State	Waiting for master to send event
Master_Host	127.0.0.1
Master_User	root
Master_Port	16000
Connect_Retry	1
Master_Log_File	master-bin.000002
Read_Master_Log_Pos	367
Relay_Log_File	slave-relay-bin.000005
Relay_Log_Pos	22963
Relay_Master_Log_File	master-bin.000002
Slave_IO_Running	Yes
Slave_SQL_Running	Yes
Replicate_Do_DB	
Replicate_Ignore_DB	
Replicate_Do_Table	
Replicate_Ignore_Table	
Replicate_Wild_Do_Table	
Replicate_Wild_Ignore_Table	
Last_Errno	0
Last_Error	
Skip_Counter	0
Exec_Master_Log_Pos	367
Relay_Log_Space	996
Until_Condition	None
Until_Log_File	
Until_Log_Pos	0
Master_SSL_Allowed	No
Master_SSL_CA_File	
Master_SSL_CA_Path	
Master_SSL_Cert	
Master_SSL_Cipher	
Master_SSL_Key	
Seconds_Behind_Master	2
Master_SSL_Verify_Server_Cert	No
Last_IO_Errno	0
Last_IO_Error	
Last_SQL_Errno	0
Last_SQL_Error	
Replicate_Ignore_Server_Ids	
Master_Server_Id	1
Using_Gtid	No
select count(*) from t1;
count(*)
100
connection master;
DROP TABLE t1;
connection slave;
include/rpl_end.inc

revision-id: knielsen@knielsen-hq.org-20131030065230-kp8dykgyeth6ma55
revno: 3690
branch-nick: 10.0-knielsen
BUILD/compile-pentium-debug-max-no-ndb



 Comments   
Comment by Kristian Nielsen [ 2013-10-31 ]

In parallel replication, there are two kinds of events which are
executed in different ways.

Normal events that are part of event groups/transactions are executed
asynchroneously by being queued for a worker thread.

Other events like format description and rotate and such are executed
directly in the driver SQL thread.

If the direct execution of the other events were to update the old-style
position, then the position gets updated too far ahead, before the normal
events that have been queued for a worker thread have been executed.

So I now pushed a patch that adds some special cases to prevent such position
updates ahead of time, and instead queues dummy events for the worker threads,
so that they will at an appropriate time do the position updates instead.

With this patch, the count is the correct one, 100, in both cases in the
reported test case.

Generated at Thu Feb 08 07:02:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.