[MDEV-5206] Slave with parallel threads is overly optimistic about its master log position after FLUSH LOGS - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 10.0.5
Component/s: None
Labels:
None

Description

The test case runs a number ($n) of INSERTs on master, then flushes logs, and tries to synchronize the slave with the master.
If it's executed with slave_parallel_threads > 0, we can see in the slave status that the exec master position is the same as on the master right away, although the number of inserted values differs.
After a while, the position is still the same, but the count increases (up to the expected value, eventually).

The test case shows the problem reliably for me with $n = 100 (the first count on the slave is 20 or so), but if it does not for you, please try to increase $n, I suppose it should increase the probability.

--source include/master-slave.inc

--source include/have_innodb.inc

--source include/have_binlog_format_mixed.inc

--enable_connect_log

--connection slave

--source include/stop_slave.inc

--connection master

--disable_warnings

DROP TABLE IF EXISTS t1;

--enable_warnings

CREATE TABLE t1 (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY) ENGINE=MyISAM;

let $n = 100;

--disable_query_log

--echo

--echo # Running $n single inserts on master

--echo

while ($n)

  INSERT INTO t1 VALUES();

  dec $n;

--enable_query_log

FLUSH LOGS;

SHOW MASTER STATUS;

--save_master_pos

--connection slave

--source include/start_slave.inc

--sync_with_master

--echo

--echo # The slave thinks it's synchronized, but the count is off

--echo

query_vertical SHOW SLAVE STATUS;

select count(*) from t1;

--echo

--echo # After waiting a bit, the position is the same, but the count increased

--echo

sleep 2;

query_vertical SHOW SLAVE STATUS;

select count(*) from t1;

connection master;

DROP TABLE t1;

sync_slave_with_master;

--disable_connect_log

--source include/rpl_end.inc

Output:

CREATE TABLE t1 (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY) ENGINE=MyISAM;

# Running 100 single inserts on master

FLUSH LOGS;

SHOW MASTER STATUS;

File	Position	Binlog_Do_DB	Binlog_Ignore_DB

master-bin.000002	367

connection slave;

include/start_slave.inc

connection slave;

# The slave thinks it's synchronized, but the count is off

SHOW SLAVE STATUS;

Slave_IO_State	Waiting for master to send event

Master_Host	127.0.0.1

Master_User	root

Master_Port	16000

Connect_Retry	1

Master_Log_File	master-bin.000002

Read_Master_Log_Pos	367

Relay_Log_File	slave-relay-bin.000005

Relay_Log_Pos	7051

Relay_Master_Log_File	master-bin.000002

Slave_IO_Running	Yes

Slave_SQL_Running	Yes

Replicate_Do_DB

Replicate_Ignore_DB

Replicate_Do_Table

Replicate_Ignore_Table

Replicate_Wild_Do_Table

Replicate_Wild_Ignore_Table

Last_Errno	0

Last_Error

Skip_Counter	0

Exec_Master_Log_Pos	367

Relay_Log_Space	996

Until_Condition	None

Until_Log_File

Until_Log_Pos	0

Master_SSL_Allowed	No

Master_SSL_CA_File

Master_SSL_CA_Path

Master_SSL_Cert

Master_SSL_Cipher

Master_SSL_Key

Seconds_Behind_Master	0

Master_SSL_Verify_Server_Cert	No

Last_IO_Errno	0

Last_IO_Error

Last_SQL_Errno	0

Last_SQL_Error

Replicate_Ignore_Server_Ids

Master_Server_Id	1

Using_Gtid	No

select count(*) from t1;

count(*)

# After waiting a bit, the position is the same, but the count increased

SHOW SLAVE STATUS;

Slave_IO_State	Waiting for master to send event

Master_Host	127.0.0.1

Master_User	root

Master_Port	16000

Connect_Retry	1

Master_Log_File	master-bin.000002

Read_Master_Log_Pos	367

Relay_Log_File	slave-relay-bin.000005

Relay_Log_Pos	22963

Relay_Master_Log_File	master-bin.000002

Slave_IO_Running	Yes

Slave_SQL_Running	Yes

Replicate_Do_DB

Replicate_Ignore_DB

Replicate_Do_Table

Replicate_Ignore_Table

Replicate_Wild_Do_Table

Replicate_Wild_Ignore_Table

Last_Errno	0

Last_Error

Skip_Counter	0

Exec_Master_Log_Pos	367

Relay_Log_Space	996

Until_Condition	None

Until_Log_File

Until_Log_Pos	0

Master_SSL_Allowed	No

Master_SSL_CA_File

Master_SSL_CA_Path

Master_SSL_Cert

Master_SSL_Cipher

Master_SSL_Key

Seconds_Behind_Master	2

Master_SSL_Verify_Server_Cert	No

Last_IO_Errno	0

Last_IO_Error

Last_SQL_Errno	0

Last_SQL_Error

Replicate_Ignore_Server_Ids

Master_Server_Id	1

Using_Gtid	No

select count(*) from t1;

count(*)

connection master;

DROP TABLE t1;

connection slave;

include/rpl_end.inc

revision-id: knielsen@knielsen-hq.org-20131030065230-kp8dykgyeth6ma55

revno: 3690

branch-nick: 10.0-knielsen

BUILD/compile-pentium-debug-max-no-ndb

Attachments

Issue Links

relates to

MDEV-4506 MWL#184: Parallel replication of group-committed transactions

Closed

Activity

Kristian Nielsen added a comment - 2013-10-31 15:13

In parallel replication, there are two kinds of events which are
executed in different ways.

Normal events that are part of event groups/transactions are executed
asynchroneously by being queued for a worker thread.

Other events like format description and rotate and such are executed
directly in the driver SQL thread.

If the direct execution of the other events were to update the old-style
position, then the position gets updated too far ahead, before the normal
events that have been queued for a worker thread have been executed.

So I now pushed a patch that adds some special cases to prevent such position
updates ahead of time, and instead queues dummy events for the worker threads,
so that they will at an appropriate time do the position updates instead.

With this patch, the count is the correct one, 100, in both cases in the
reported test case.

Kristian Nielsen added a comment - 2013-10-31 15:13 In parallel replication, there are two kinds of events which are executed in different ways. Normal events that are part of event groups/transactions are executed asynchroneously by being queued for a worker thread. Other events like format description and rotate and such are executed directly in the driver SQL thread. If the direct execution of the other events were to update the old-style position, then the position gets updated too far ahead, before the normal events that have been queued for a worker thread have been executed. So I now pushed a patch that adds some special cases to prevent such position updates ahead of time, and instead queues dummy events for the worker threads, so that they will at an appropriate time do the position updates instead. With this patch, the count is the correct one, 100, in both cases in the reported test case.

People

Assignee:: Kristian Nielsen

Reporter:: Elena Stepanova

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2013-10-30 12:01

Updated:: 2013-10-31 15:13

Resolved:: 2013-10-31 15:13

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server