[MDEV-10703] SHOW SLAVE STATUS does not correctly reflect broken replication Created: 2016-08-30  Updated: 2016-11-30  Resolved: 2016-11-30

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1.16, 5.5
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Oli Sennhauser Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: need_feedback, replication
Environment:

CentOS 7, Ubuntu 14.04



 Description   

We intentionally inserted some data on a slave which should arrive from the master at the same time. The following Query was running on Master all the time and twice on the Slave:

insert into test values (null, 'stop repl', null);

this was to show how replication will stop and how this will look like in SHOW SLAVE STATUS. Unfortunately SHOW SLAVE STATUS die NOT show this behaviour but was displaying:

             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

But replication clearly stopped. No more rows were inserted on Slave in table test.

After stopping and starting slave this state was also reflected in the SHOW SLAVE STATUS command.

Last_Errno: 1062
Last_Error: Error 'Duplicate entry '174985' for key 'PRIMARY'' on query. Default database: 'test'. Query: 'insert into test values (NULL, 'Test data insert', CURRENT_TIMESTAMP())'

Unfortunately we could not reproduce this behaviour nor create a reproducible test case.

The steps we did on the Slave were as follows:

zcat /tmp/fulldump.sql.gz | mysql -uroot
set global innodb_flush_log_at_trx_commit = 0;
 
change master to master_host='notebook12', master_user='slave', master_password='password';
show slave status\G
CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000002', MASTER_LOG_POS=368;
show slave status\G
start slave;
show slave status\G
set global read_only = on;
use test
show create table test\G
-- 2x
insert into test values (null, 'stop repl', null);
select * from test where data = 'stop repl';
show slave status\G
select * from test where data = 'stop repl';
select * from test where id between 174980 and 174990;
select * from test where id between 174980 and 174995;
select * from test where id between 174980 and 175000;
-- We were confused here why rows did not show up
show slave status\G
select * from test where id between 174980 and 175000;
select * from test where id between 174980 and 176000;
select * from test where id between 174980;
select * from test;
stop slave;
start slave;
show slave status\G --> Error

Table looks as follows:

CREATE TABLE `test` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `data` varchar(64) DEFAULT NULL,
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
;



 Comments   
Comment by Oli Sennhauser [ 2016-08-30 ]

Customer reported, that this happens from time to time on his MariaDB replication farm. AND this is very dangerous because all replication monitoring solutions will not reflect proper replication state. You also have to watch SQL Thread pos constantly...

Comment by Elena Stepanova [ 2016-08-30 ]

oli,
As you can imagine, there are many questions about the described scenario.

  • what exactly is at MASTER_LOG_FILE='master-bin.000002', MASTER_LOG_POS=368? Please quote or attach the relevant portion of the binary log (the one which you expect a conflict with);
  • what '– 2x' in this context means? That you execute this fragment both on master and slave?
  • "– We were confused here why rows did not show up" – where did they not show up, on master or on slave? Which rows are we talking about, from what I see you only inserted one? Why are you sure that the inserted row(s) should fall into this particular range of IDs – nothing in the scenario seems to make it certain? I can only assume you take it from AUTO_INCREMENT in SHOW CREATE statement that you ran before, but this is unfounded guess because it's not actually said anywhere? Do the rows appear in the query select * from test where data = 'stop repl'?
  • how exactly do you check that replication has stopped? How long do you wait? What happens if you run any other query, not involving the test table? Please paste the output of SHOW SLAVE STATUS;
  • What's the value of autocommit in your session on the slave? Please attach the output of SHOW VARIABLES and your cnf file(s);
  • You've put 5.5 and 10.1.16 in the 'Affected version' field – which is it? If it's 10.1.16, are you using parallel replication?

After we clear it out, maybe we'll have a better picture of what's happening.

Comment by Oli Sennhauser [ 2016-08-31 ]

This Morning we had a similar problem. Master was running over night. Slave was hybernated. After we woke up Slave it still shows SQL/IO Thread running but did not make any progress in processing the statements. After waiting for MORE than 2 minutes (tcp timeout) the situation was still the same.

This was not the same master/slave system as before but was running more or less the same traffic.

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.221
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin-log.000004
          Read_Master_Log_Pos: 216073261
               Relay_Log_File: master11-relay-bin.000002
                Relay_Log_Pos: 22749221
        Relay_Master_Log_File: bin-log.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 216073261
              Relay_Log_Space: 22749522
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 21
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: conservative
1 row in set (0.00 sec)

MariaDB [(none)]> do sleep(10);
Query OK, 0 rows affected (10.00 sec)

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.221
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin-log.000004
          Read_Master_Log_Pos: 216073261
               Relay_Log_File: master11-relay-bin.000002
                Relay_Log_Pos: 22749221
        Relay_Master_Log_File: bin-log.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 216073261
              Relay_Log_Space: 22749522
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 21
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: conservative
1 row in set (0.00 sec)

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> start slave; 
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 192.168.1.221
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin-log.000004
          Read_Master_Log_Pos: 226924419
               Relay_Log_File: master11-relay-bin.000003
                Relay_Log_Pos: 65047
        Relay_Master_Log_File: bin-log.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 216137773
              Relay_Log_Space: 33600966
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 54154
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 21
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: conservative
1 row in set (0.00 sec)

Comment by Oli Sennhauser [ 2016-08-31 ]

Hi Elena

To your questions from yesterday:

  • LOG_FILE/POS are the ones from the backup (mysqldump) line 22 CHANGE MASTER TO...
  • 2x means executed 2 time on the slave
  • the rows inserted on the master did not show up on the slave any more (this is because repl. stopped)
  • we continuously insert rows on the master as follows: insert into test values (null, 'some data', null) on master. the same we did twice on the slave to break replication.
  • the general way to check replication is SHOW SLAVE STATUS
  • We waited as long as it took to type all the statements from the insert up to the show slave status (1 to 2 minutes?)
  • we did not try any other queries/tables. we wanted to break the replication
  • auto_commit is default
  • we use here in the training 10.1 but customer told me that he has the same phenomena also in 5.5 in his systems at home.
  • we used MariaDB out of the box. no special configuration. so I assume no parallel replication
    Parallel_Mode: conservative
Comment by Elena Stepanova [ 2016-08-31 ]

oli,
Please provide the output of SHOW FULL PROCESSLIST at the time when the replication has supposedly stopped (but before you stopped the slave). In your comment above, it's the time when you ran do sleep(10). Please also attach or paste the relevant portion of the error log (or better still, the whole error log from the slave start).

Normally with a complaint like that, one would expect to see that the slave stopped executing events on whatever reason. But the slave status clearly shows that in fact the slave was not reading any events during this 10 second period, which a very different story.

Comment by Elena Stepanova [ 2016-11-02 ]

oli, would you be able to provide the information as above?

Comment by Elena Stepanova [ 2016-11-30 ]

Please comment to re-open if you have new information on the issue.

Generated at Thu Feb 08 07:44:17 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.