[MDEV-7323] Replica with GTID stopping replication on some transactions when no gtid replica didn't stop Created: 2014-12-15  Updated: 2020-02-13  Resolved: 2015-01-06

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.14
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Peter McLarty Assignee: Kristian Nielsen
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Redhat Linux 2.6.32-431.29.2.el6.x86_64
VM on Citrix Xen 16GB RAM 8vcpu



 Description   

We have had replication stop on our GTID enabled slave when the same events are passing through
Things such as this have failed and skip has been executed to allow replication to continue

Last_Errno: 1396
Last_Error: Error 'Operation CREATE USER failed for 'user1'@'192.168.x.xx'' on query. Default database: 'information_schema'. Query: 'CREATE USER ''user1'@'192.168.x.xx''
Actual user details changed for security reasons

Drop user has stalled and a error from what I believe to be a stored procedure which inserts some records into a table has caused issues.



 Comments   
Comment by Kristian Nielsen [ 2014-12-15 ]

We need some more details to understand what the problem here is.

1. You mention "slave when the same events are passing through". Do you mean that you have a setup S1->S2->S3, where S2 is using GTID but S3 is not? Or something else? What makes you think the problem is related to GTID?

2. Might there be some reasonable explanation for the CREATE USER statement to fail, like the user existing before or something like that? What is in the error log? What happens if the CREATE USER statement is run manually on the server?

3. With respect to the "Drop user has stalled and ..." - we really need a detailed, precise description of each problem to be able to say anything meaningful.

Comment by Peter McLarty [ 2014-12-15 ]

I have

 master -> slave 1 
        -> slave 2

My thinking around gtid is that the problems are only occurring on the gtid enabled slave

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 192.168.x.xx
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000010
          Read_Master_Log_Pos: 442468943
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 1967113
        Relay_Master_Log_File: mysql-bin.000010
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 320379884
              Relay_Log_Space: 124056469
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 25425
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 148
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-148-5430559
1 row in set (0.00 sec)

I cannot access ftp server at present from internal network will upload the error log later this evening. A number of errors which have stopped the slave replication in that log

Comment by Peter McLarty [ 2014-12-15 ]

Uploaded the error log to ftp.askmonty.org. Let me know what else to investigate and what tools and I will endeavour to find the information

Comment by Kristian Nielsen [ 2014-12-15 ]

Thank you for uploading the error log. I took a quick look at it.

The error log contains a lot of replication errors, like "Can't find record in
..." errors, and messages that such errors were skipped over using
@@sql_slave_skip_counter. This suggests that the slave is diverged from the
master (the slave does not have the same data as the master had at the
corresponding replication position).

A diverged slave can in general lead to various replication failures, if some
query or operation executed on the slave sees different data from what was
originally on the master. It seems likely that this could also be the cause of
the failure of CREATE USER, though I cannot know this for sure one way or the
other from the data available.

If the problem can be reproduced on a slave that has been re-provisioned to be
identical to the master, then that would make it easier to say more.

Or alternatively, the actual problem needs to be narrowed down more
clearly. For example, if the CREATE USER statement fails in replication, but
then the exact same statement executed manually on the slave server succeeds,
then that would be something that could be investigated further. Or if it can
be shown that GTID replication somehow tries to replicate the wrong statement.

From the information given, the most likely explanation is just that the slave
data is sufficiently different from the master data to cause the errors
mentioned.

Comment by Kristian Nielsen [ 2015-01-06 ]

Without further information, I will have to assume that this is caused by the slave being out of sync with the master. So closing ...

Comment by Peter McLarty [ 2015-01-07 ]

Thanks guys it was a replication issue pt-table-sync was not able to repair and we have rebuilt the slave

Generated at Thu Feb 08 07:18:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.