Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-5206

Slave with parallel threads is overly optimistic about its master log position after FLUSH LOGS

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • None
    • 10.0.5
    • None
    • None

    Description

      The test case runs a number ($n) of INSERTs on master, then flushes logs, and tries to synchronize the slave with the master.
      If it's executed with slave_parallel_threads > 0, we can see in the slave status that the exec master position is the same as on the master right away, although the number of inserted values differs.
      After a while, the position is still the same, but the count increases (up to the expected value, eventually).

      The test case shows the problem reliably for me with $n = 100 (the first count on the slave is 20 or so), but if it does not for you, please try to increase $n, I suppose it should increase the probability.

      --source include/master-slave.inc
      --source include/have_innodb.inc
      --source include/have_binlog_format_mixed.inc
       
      --enable_connect_log
       
      --connection slave
      --source include/stop_slave.inc
       
      --connection master
      --disable_warnings
      DROP TABLE IF EXISTS t1;
      --enable_warnings
       
      CREATE TABLE t1 (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY) ENGINE=MyISAM;
      let $n = 100;
      --disable_query_log
      --echo 
      --echo # Running $n single inserts on master
      --echo 
      while ($n) 
      {
        INSERT INTO t1 VALUES();
        dec $n;
      }
      --enable_query_log
      FLUSH LOGS;
       
      SHOW MASTER STATUS;
      --save_master_pos
       
      --connection slave
      --source include/start_slave.inc
       
      --sync_with_master
       
      --echo 
      --echo # The slave thinks it's synchronized, but the count is off
      --echo 
       
      query_vertical SHOW SLAVE STATUS;
      select count(*) from t1;
       
      --echo 
      --echo # After waiting a bit, the position is the same, but the count increased
      --echo 
      sleep 2;
       
      query_vertical SHOW SLAVE STATUS;
      select count(*) from t1;
       
      connection master;
      DROP TABLE t1;
      sync_slave_with_master;
       
      --disable_connect_log
       
      --source include/rpl_end.inc

      Output:

      CREATE TABLE t1 (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY) ENGINE=MyISAM;
       
      # Running 100 single inserts on master
       
      FLUSH LOGS;
      SHOW MASTER STATUS;
      File	Position	Binlog_Do_DB	Binlog_Ignore_DB
      master-bin.000002	367		
      connection slave;
      include/start_slave.inc
      connection slave;
       
      # The slave thinks it's synchronized, but the count is off
       
      SHOW SLAVE STATUS;
      Slave_IO_State	Waiting for master to send event
      Master_Host	127.0.0.1
      Master_User	root
      Master_Port	16000
      Connect_Retry	1
      Master_Log_File	master-bin.000002
      Read_Master_Log_Pos	367
      Relay_Log_File	slave-relay-bin.000005
      Relay_Log_Pos	7051
      Relay_Master_Log_File	master-bin.000002
      Slave_IO_Running	Yes
      Slave_SQL_Running	Yes
      Replicate_Do_DB	
      Replicate_Ignore_DB	
      Replicate_Do_Table	
      Replicate_Ignore_Table	
      Replicate_Wild_Do_Table	
      Replicate_Wild_Ignore_Table	
      Last_Errno	0
      Last_Error	
      Skip_Counter	0
      Exec_Master_Log_Pos	367
      Relay_Log_Space	996
      Until_Condition	None
      Until_Log_File	
      Until_Log_Pos	0
      Master_SSL_Allowed	No
      Master_SSL_CA_File	
      Master_SSL_CA_Path	
      Master_SSL_Cert	
      Master_SSL_Cipher	
      Master_SSL_Key	
      Seconds_Behind_Master	0
      Master_SSL_Verify_Server_Cert	No
      Last_IO_Errno	0
      Last_IO_Error	
      Last_SQL_Errno	0
      Last_SQL_Error	
      Replicate_Ignore_Server_Ids	
      Master_Server_Id	1
      Using_Gtid	No
      select count(*) from t1;
      count(*)
      29
       
      # After waiting a bit, the position is the same, but the count increased
       
      SHOW SLAVE STATUS;
      Slave_IO_State	Waiting for master to send event
      Master_Host	127.0.0.1
      Master_User	root
      Master_Port	16000
      Connect_Retry	1
      Master_Log_File	master-bin.000002
      Read_Master_Log_Pos	367
      Relay_Log_File	slave-relay-bin.000005
      Relay_Log_Pos	22963
      Relay_Master_Log_File	master-bin.000002
      Slave_IO_Running	Yes
      Slave_SQL_Running	Yes
      Replicate_Do_DB	
      Replicate_Ignore_DB	
      Replicate_Do_Table	
      Replicate_Ignore_Table	
      Replicate_Wild_Do_Table	
      Replicate_Wild_Ignore_Table	
      Last_Errno	0
      Last_Error	
      Skip_Counter	0
      Exec_Master_Log_Pos	367
      Relay_Log_Space	996
      Until_Condition	None
      Until_Log_File	
      Until_Log_Pos	0
      Master_SSL_Allowed	No
      Master_SSL_CA_File	
      Master_SSL_CA_Path	
      Master_SSL_Cert	
      Master_SSL_Cipher	
      Master_SSL_Key	
      Seconds_Behind_Master	2
      Master_SSL_Verify_Server_Cert	No
      Last_IO_Errno	0
      Last_IO_Error	
      Last_SQL_Errno	0
      Last_SQL_Error	
      Replicate_Ignore_Server_Ids	
      Master_Server_Id	1
      Using_Gtid	No
      select count(*) from t1;
      count(*)
      100
      connection master;
      DROP TABLE t1;
      connection slave;
      include/rpl_end.inc

      revision-id: knielsen@knielsen-hq.org-20131030065230-kp8dykgyeth6ma55
      revno: 3690
      branch-nick: 10.0-knielsen
      BUILD/compile-pentium-debug-max-no-ndb

      Attachments

        Issue Links

          Activity

            In parallel replication, there are two kinds of events which are
            executed in different ways.

            Normal events that are part of event groups/transactions are executed
            asynchroneously by being queued for a worker thread.

            Other events like format description and rotate and such are executed
            directly in the driver SQL thread.

            If the direct execution of the other events were to update the old-style
            position, then the position gets updated too far ahead, before the normal
            events that have been queued for a worker thread have been executed.

            So I now pushed a patch that adds some special cases to prevent such position
            updates ahead of time, and instead queues dummy events for the worker threads,
            so that they will at an appropriate time do the position updates instead.

            With this patch, the count is the correct one, 100, in both cases in the
            reported test case.

            knielsen Kristian Nielsen added a comment - In parallel replication, there are two kinds of events which are executed in different ways. Normal events that are part of event groups/transactions are executed asynchroneously by being queued for a worker thread. Other events like format description and rotate and such are executed directly in the driver SQL thread. If the direct execution of the other events were to update the old-style position, then the position gets updated too far ahead, before the normal events that have been queued for a worker thread have been executed. So I now pushed a patch that adds some special cases to prevent such position updates ahead of time, and instead queues dummy events for the worker threads, so that they will at an appropriate time do the position updates instead. With this patch, the count is the correct one, 100, in both cases in the reported test case.

            People

              knielsen Kristian Nielsen
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.