MariaDB Server / MDEV-4329

CHANGE MASTER ... master_gtid_pos='' does not reset the position

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed

    Description

      I'm trying to tweak the test case initially described in MDEV-4325 to make it work. As discussed in the comments (https://mariadb.atlassian.net/browse/MDEV-4325?focusedCommentId=30821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-30821), I'm now setting master_gtid_pos='' after slave reset. It still does not seem to work:

      [connection master]
      RESET MASTER;
      include/stop_slave.inc
      RESET SLAVE ALL;
      CHANGE MASTER TO master_host='127.0.0.1', master_port=16000, master_user='root', master_gtid_pos=auto;
      include/start_slave.inc
      CREATE TABLE t1 (i INT);
      include/stop_slave.inc
      DROP TABLE t1;
      RESET SLAVE;
      CHANGE MASTER TO master_gtid_pos='';
      ####################################################
      # We have set master_gtid_pos to '', so it's 
      # expected to be empty now (and it is)
      ####################################################
      SHOW ALL SLAVES STATUS;
      Connection_name	
      Slave_SQL_State	
      Slave_IO_State	
      Master_Host	127.0.0.1
      Master_User	root
      Master_Port	16000
      Connect_Retry	1
      Master_Log_File	
      Read_Master_Log_Pos	0
      Relay_Log_File	slave-relay-bin.000001
      Relay_Log_Pos	4
      Relay_Master_Log_File	
      Slave_IO_Running	No
      Slave_SQL_Running	No
      Replicate_Do_DB	
      Replicate_Ignore_DB	
      Replicate_Do_Table	
      Replicate_Ignore_Table	
      Replicate_Wild_Do_Table	
      Replicate_Wild_Ignore_Table	
      Last_Errno	0
      Last_Error	
      Skip_Counter	0
      Exec_Master_Log_Pos	0
      Relay_Log_Space	248
      Until_Condition	None
      Until_Log_File	
      Until_Log_Pos	0
      Master_SSL_Allowed	No
      Master_SSL_CA_File	
      Master_SSL_CA_Path	
      Master_SSL_Cert	
      Master_SSL_Cipher	
      Master_SSL_Key	
      Seconds_Behind_Master	NULL
      Master_SSL_Verify_Server_Cert	No
      Last_IO_Errno	0
      Last_IO_Error	
      Last_SQL_Errno	0
      Last_SQL_Error	
      Replicate_Ignore_Server_Ids	
      Master_Server_Id	1
      Using_Gtid	1
      Retried_transactions	0
      Max_relay_log_size	1073741824
      Executed_log_entries	16
      Slave_received_heartbeats	0
      Slave_heartbeat_period	60.000
      Gtid_Pos	
      ####################################################
      # But it still claims we are using an invalid value 
      ####################################################
      include/start_slave.inc
      SHOW SLAVE STATUS;
      Slave_IO_State	
      Master_Host	127.0.0.1
      Master_User	root
      Master_Port	16000
      Connect_Retry	1
      Master_Log_File	
      Read_Master_Log_Pos	0
      Relay_Log_File	slave-relay-bin.000001
      Relay_Log_Pos	4
      Relay_Master_Log_File	
      Slave_IO_Running	No
      Slave_SQL_Running	Yes
      Replicate_Do_DB	
      Replicate_Ignore_DB	
      Replicate_Do_Table	
      Replicate_Ignore_Table	
      Replicate_Wild_Do_Table	
      Replicate_Wild_Ignore_Table	
      Last_Errno	0
      Last_Error	
      Skip_Counter	0
      Exec_Master_Log_Pos	0
      Relay_Log_Space	248
      Until_Condition	None
      Until_Log_File	
      Until_Log_Pos	0
      Master_SSL_Allowed	No
      Master_SSL_CA_File	
      Master_SSL_CA_Path	
      Master_SSL_Cert	
      Master_SSL_Cipher	
      Master_SSL_Key	
      Seconds_Behind_Master	NULL
      Master_SSL_Verify_Server_Cert	No
      Last_IO_Errno	1236
      Last_IO_Error	Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-2-2, which is not in the master's binlog'
      Last_SQL_Errno	0
      Last_SQL_Error	
      Replicate_Ignore_Server_Ids	
      Master_Server_Id	1
      Using_Gtid	1

      Test case:

      --source include/master-slave.inc
      --source include/have_xtradb.inc
      --source include/have_binlog_format_mixed.inc
       
      RESET MASTER;
       
      --connection slave
      --source include/stop_slave.inc
      RESET SLAVE ALL;
      eval CHANGE MASTER TO master_host='127.0.0.1', master_port=$MASTER_MYPORT, 
           master_user='root', master_gtid_pos=auto;
      --source include/start_slave.inc
       
      --connection master
      CREATE TABLE t1 (i INT);
      --save_master_pos
       
      --sync_slave_with_master
      --source include/stop_slave.inc
      DROP TABLE t1;
      RESET SLAVE;
      # We can optionally delete the contents of the table,
      # it doesn't help anyway
      # DELETE FROM mysql.rpl_slave_state;
      eval CHANGE MASTER TO master_gtid_pos='';
       
      --echo ####################################################
      --echo # We have set master_gtid_pos to '', so it's 
      --echo # expected to be empty now (and it is)
      --echo ####################################################
      query_vertical SHOW ALL SLAVES STATUS;
       
      --echo ####################################################
      --echo # But it still claims we are using an invalid value 
      --echo ####################################################
       
      --source include/start_slave.inc
      --sleep 1
      query_vertical SHOW SLAVE STATUS;

      revision-id: knielsen@knielsen-hq.org-20130322102628-hxohewmbfyd1wig6
      revno: 3538
      branch-nick: 10.0-mdev26

          Activity

            elenst Elena Stepanova added a comment

            Please also note that the expected position looks corrupted: how did it suddenly become 0-2-...? If anything, it should still be 0-1-...

            knielsen Kristian Nielsen added a comment

            It's the same problem again; I need to get this fixed properly.

            So the issue here is that the slave has GTID 0-2-2 in its binlog (the DROP
            TABLE t1). When the slave connects, it sees that the binlog has something
            newer, and appends it to the slave state. If one adds RESET MASTER on the
            slave, it works.

            But this is highly unacceptable behaviour, of course. I thought I had
            implemented something such that CHANGE MASTER TO MASTER_GTID_POS="" gives an
            error in this case, suggesting RESET MASTER. I will see if I can get this to
            work properly in your case.

            But I'm starting to think the root problem is deeper. There are two different
            situations that conflict here. One is when a master is changed to a slave, and
            I want it to automagically resume from the position in its binlog. The other
            is the user explicitly setting a start position manually, which should not be
            overridden by the binlog, of course.

            I'm wondering if I'm trying to make things too magic. Maybe it would be better
            if I never automatically used the binlog to determine where the slave
            starts.

            Instead, if the user wants to make an old master into a new slave, they can
            explicitly do SHOW MASTER STATUS (when that is implemented) to get the binlog
            state and use that for CHANGE MASTER TO MASTER_GTID_POS='xxx'. Or I could
            implement a special MASTER_GTID_POS=MASTER_STATE.

            Or maybe I can fix it so they get an error instead of surprising behaviour.

            Let's discuss this on IRC or something; I really want to get this working
            properly!
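The origin of 0-2-2 described in the comment above can be illustrated with a toy model. This is a minimal sketch in Python, not the server's code: the function names and the dictionary representation are assumptions made for illustration, and the slave binlog contents (a replicated 0-1-1 followed by the local 0-2-2) are inferred from the sequence numbers in the report. A GTID is treated as `domain-server_id-sequence`, and a newer entry in the slave's own binlog is allowed to override the requested (empty) position.

```python
# Toy model (assumption: not MariaDB's implementation) of the pre-fix behaviour.

def parse_gtid(text):
    """Split 'domain-server_id-seq' into a tuple of three integers."""
    domain, server_id, seq = (int(part) for part in text.split("-"))
    return domain, server_id, seq


def effective_start_position(slave_state, own_binlog):
    """Merge the slave state with the newest GTID per domain of the slave's binlog.

    slave_state: dict domain -> (server_id, seq); empty after master_gtid_pos=''.
    own_binlog:  list of GTID strings found in the slave's own binlog, in order.
    """
    position = dict(slave_state)
    for text in own_binlog:
        domain, server_id, seq = parse_gtid(text)
        current = position.get(domain)
        # A binlog entry with a higher sequence number is treated as "newer".
        if current is None or seq > current[1]:
            position[domain] = (server_id, seq)
    return position


def format_position(position):
    """Render the position as a comma-separated GTID list."""
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(position.items()))


# Assumed slave binlog: replicated CREATE TABLE (0-1-1), local DROP TABLE (0-2-2).
slave_binlog = ["0-1-1", "0-2-2"]
print(format_position(effective_start_position({}, slave_binlog)))  # 0-2-2
```

With an empty binlog (the situation after RESET MASTER on the slave), the same function returns an empty position, which matches the observation that adding RESET MASTER makes the test work.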
            elenst Elena Stepanova added a comment - edited

            >> So the issue here is that the slave has GTID 0-2-2 in its binlog (the DROP
            >> TABLE t1). When the slave connects, it sees that the binlog has something
            >> newer, and appends it to the slave state. If one adds RESET MASTER on the
            >> slave, it works.

            Okay, now I understand where 0-2-2 comes from, but the error message itself is highly confusing.
            'Error: connecting slave requested to start from GTID 0-2-2, which is not in the master's binlog'
            First, we didn't request slave to start from GTID 0-2-2; secondly, of course it's not in the master's binlog – why would it be, master has 0-1-...
            We need to re-word it somehow.

            >> But this is highly unacceptable behaviour, of course. I thought I implemented
            >> something such that the CHANGE MASTER MASTER_GTID_POS="" gives an error in
            >> this case, suggesting the RESET MASTER. I will see if I can get this to work
            >> properly in your case.

            But in this case, I do NOT want to do RESET MASTER! On the contrary, I want to replay the existing master binlog from the beginning, which is why I drop table t1 (so that it doesn't cause an error when slave attempts to execute the create table event).
            I tried to describe the scenarios that I had in mind in more detail (maybe in excessive detail) in MDEV-4325.

            >> The other
            >> is user explicitly setting manually a start position, which should not be
            >> overridden by the binlog, of course.

            That's right, in this particular case I expected my explicit setting to work rather than be overridden by auto magic; especially since, as it was discussed before, it's the only way to actually reset the GTID position.

            >> I'm wondering if I'm trying to make things too magic. Maybe it would be better
            >> if I never automatically use the binlog to determine where slave
            >> starts.

            'auto' mode is still a mystery for me, so I don't have a strong opinion yet.

            >> Let's discuss this on IRC or something, I really want to get this working
            >> properly!

            Yep, let's. At this point I'm especially interested in figuring out the difference between the three cases:
            1) we use old-fashioned way to configure the slave (master_log_pos/master_log_file);
            2) we use an explicit value of GTID position to start replication;
            3) we use master_gtid_pos=auto

            How are these three cases supposed to differ, what are the expected limitations of (1) compared to (2) and of (2) compared to (3), etc.?


            knielsen Kristian Nielsen added a comment

            > But in this case, I do NOT want to do RESET MASTER! On the contrary, I want
            > to replay the existing master binlog from the beginning, which is why I drop
            > table t1 (so that it doesn't cause an error when slave attempts to execute
            > the create table event).

            Yes, I understand. You need to RESET MASTER on the slave, not on the master.
            In fact, RESET MASTER on the slave is a good idea anyway. Without it, you
            would get duplicate events in the binlog on the slave, which would cause
            trouble if you were to use the slave as a master for a third server.

            A fundamental concept for MariaDB GTID is that binlog order must be identical
            across all servers (hence the "global"). When using multiple domains, the
            order must be identical only within each domain.

            I think it is getting to the point where I should use your feedback so far and
            write up some proper documentation. This will force me to think the whole
            thing through properly, and once written will allow you to work without
            fumbling too much in the dark.

            I worry that I still have so many gotchas in the user interface after several
            iterations, but hopefully we can find some way to make it work reasonably.
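The concern about duplicate events and a third server can be sketched as follows. This is a toy model in Python under stated assumptions, not server code: binlogs are plain lists of GTID strings, `events_after` is a hypothetical helper, and the slave binlog shown is what one would presumably get by replaying the master binlog from the beginning without RESET MASTER on the slave.

```python
# Toy model (assumption for illustration): a binlog is a list of GTID strings.

def events_after(binlog, start_gtid):
    """Return the events that follow the first occurrence of start_gtid."""
    if start_gtid not in binlog:
        raise ValueError(f"GTID {start_gtid} is not in the binlog")
    return binlog[binlog.index(start_gtid) + 1:]


master_binlog = ["0-1-1"]
# Assumed slave binlog without RESET MASTER: the first replicated CREATE TABLE,
# the local DROP TABLE, then the replayed CREATE TABLE logged again.
slave_binlog = ["0-1-1", "0-2-2", "0-1-1"]

start = "0-1-1"
print(events_after(master_binlog, start))  # []
print(events_after(slave_binlog, start))   # ['0-2-2', '0-1-1']
```

A third server starting from the same GTID would receive different event sequences depending on which of the two servers it connects to, which is the inconsistency the identical-order requirement is meant to rule out.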

            knielsen Kristian Nielsen added a comment

            OK, so it turns out I made a simple mistake in the code; it is fixed now.
            Now the testcase gets this error message:

            mysqltest: At line 26: query 'CHANGE MASTER TO master_gtid_pos=''' failed: 1947: Requested MASTER_GTID_POS contains no value for replication domain 0. This conflicts with the binary log which contains GTID 0-2-2. To use the requested MASTER_GTID_POS, the old binlog must be removed with RESET MASTER to avoid out-of-order binlog

            So this is the new testcase:

            --connection master
            CREATE TABLE t1 (i INT);
            --sync_slave_with_master
            --source include/stop_slave.inc
            DROP TABLE t1;
            RESET SLAVE;
            --error ER_MASTER_GTID_POS_MISSING_DOMAIN
            eval CHANGE MASTER TO master_gtid_pos='';
            RESET MASTER;
            eval CHANGE MASTER TO master_gtid_pos='';
            --source include/start_slave.inc
            --sleep 1
            query_vertical SHOW ALL SLAVES STATUS;
            SELECT * FROM t1;

            So I've pushed this fix. However, I'm still open to discussing the deeper
            issue of whether this is the best way to handle things.

            It was a fundamental design decision I made early that I wanted the slave GTID
            state to be just a position in the binlog (or one per replication domain),
            not a set of all applied GTIDs, like in the MySQL 5.6 design. This makes
            things simpler for the user, but it also gives the user a great
            responsibility: to ensure that binlogs are identical on all servers that can
            at some point become a master.

            This is because GTID promises that any server can be made a slave of any other
            server with just MASTER_GTID_POS=AUTO, and the only thing the slave knows is
            the single GTID to start at. So starting from this GTID has to return the
            exact same sequence of events, no matter which master server is
            selected. Otherwise inconsistent/incorrect replication will result.

            So the lesson I took from your previous extensive feedback was to try much
            harder to protect the user from mistakes with inconsistent binlogs and
            configurations, and to give errors in many more cases. This is one of them;
            unfortunately I missed the case MASTER_GTID_POS='', but fortunately you caught
            it immediately.

            Basically, with GTID, you can no longer make local changes on the slave without
            thinking about what goes into the slave binlog, because only the current
            master within a domain is allowed to write to the binlog. One needs to make such
            local changes with SQL_LOG_BIN=0 if they are not meant to be replicated
            elsewhere, or clean them up afterwards with RESET MASTER (on the slave).

            There is definitely still a need for improvement with this and the user
            interface in general. So I suggest we continue the discussion in the context
            of your excellent analysis in MDEV-4325.
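The rule behind the new error can be sketched with a toy model. This is a minimal illustration in Python, not the server's implementation: `MissingDomainError`, `parse_position` and `check_requested_position` are hypothetical names, and only the missing-domain case quoted in the error message is modeled. A requested position is rejected when the local binlog contains a GTID for a replication domain that the requested position does not mention.

```python
# Toy model (assumption: not MariaDB's code) of the missing-domain check.

class MissingDomainError(Exception):
    """Stands in for ER_MASTER_GTID_POS_MISSING_DOMAIN (hypothetical Python name)."""


def parse_position(text):
    """Parse 'd-s-n,d-s-n,...' into {domain: (server_id, seq)}; '' gives {}."""
    position = {}
    for gtid in filter(None, text.split(",")):
        domain, server_id, seq = (int(p) for p in gtid.strip().split("-"))
        position[domain] = (server_id, seq)
    return position


def check_requested_position(requested, binlog_state):
    """Return the parsed requested position, or raise if a binlog domain is missing."""
    requested_pos = parse_position(requested)
    for domain, (server_id, seq) in sorted(parse_position(binlog_state).items()):
        if domain not in requested_pos:
            raise MissingDomainError(
                f"requested position has no value for replication domain {domain}, "
                f"but the binlog contains GTID {domain}-{server_id}-{seq}"
            )
    return requested_pos


# Before RESET MASTER on the slave: the binlog still holds the local DROP TABLE.
try:
    check_requested_position("", "0-2-2")
except MissingDomainError as exc:
    print("rejected:", exc)

# After RESET MASTER on the slave: the binlog is empty, so '' is accepted.
print(check_requested_position("", ""))  # {}
```

This mirrors the two CHANGE MASTER statements in the new testcase: the first one is expected to fail, and the second one, issued after RESET MASTER, is expected to succeed.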

            People

              knielsen Kristian Nielsen
              elenst Elena Stepanova