[MDEV-20637] MariaDb upgrade leads to fatal error 1236 after startup Master Slave Created: 2019-09-20  Updated: 2020-12-01

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.2.26, 10.1.41
Fix Version/s: 10.2

Type: Bug Priority: Major
Reporter: Ragul Assignee: Andrei Elkin
Resolution: Unresolved Votes: 1
Labels: master-slave, replication
Environment:

Ubuntu Rhel 7
Master Slave Cluster


Attachments: PNG File DifferenceInPositionsBWTwoVersions.png     PNG File Without_upgrade.png     File mariaTest.sh    
Issue Links:
PartOf
includes MDEV-20894 Gtid Position difference between Mari... Closed

 Description   

Hi,

I have upgraded MariaDB from 10.1.38 to 10.1.41 version and 10.2.18 to 10.2.26 in my project for different branches, which caused the below problem during the start of Database Replication

MariaDB [(none)]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.99.32.8
                  Master_User: repl
                  Master_Port: 3307
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: mysql-1-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB: mysql
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4
              Relay_Log_Space: 256
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-1, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 0-1-1
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: optimistic
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

In earlier versions, it was working properly. Are there any changes went related to replication which we need to adapt explicitly during installation?

Binary Logs details:

Master bin logs events

MariaDB [(none)]> show binlog events in 'binary-log.000001';
Log_name        Pos     Event_type      Server_id       End_log_pos     Info
binary-log.000001       4       Format_desc     1       256     Server ver: 10.2.26-MariaDB-log, Binlog ver: 4
binary-log.000001       256     Gtid_list       1       285     []
binary-log.000001       285     Binlog_checkpoint       1       329     binary-log.000001
binary-log.000001       329     Gtid    1       371     GTID 0-1-1
binary-log.000001       371     Query   1       473     create database if not exists test;
binary-log.000001       473     Stop    1       496
MariaDB [(none)]> show binlog events in 'binary-log.000002';
Log_name        Pos     Event_type      Server_id       End_log_pos     Info
binary-log.000002       4       Format_desc     1       256     Server ver: 10.2.26-MariaDB-log, Binlog ver: 4
binary-log.000002       256     Gtid_list       1       299     [0-1-1]
binary-log.000002       299     Binlog_checkpoint       1       343     binary-log.000002
binary-log.000002       343     Stop    1       366
MariaDB [(none)]> show binlog events in 'binary-log.000003';
Log_name        Pos     Event_type      Server_id       End_log_pos     Info
binary-log.000003       4       Format_desc     1       256     Server ver: 10.2.26-MariaDB-log, Binlog ver: 4
binary-log.000003       256     Gtid_list       1       299     [0-1-1]
binary-log.000003       299     Binlog_checkpoint       1       343     binary-log.000003



 Comments   
Comment by Ragul [ 2019-09-26 ]

Is there anything we need to adapt explicitly during installation?

Comment by Andrei Elkin [ 2019-09-26 ]

RagulR: Thanks for the report!

What are the slave's @@global.gtid_slave_pos? I suggest to use CHANGE MASTER TO .. MASTER_USE_GTID=slave_pos unless you already do that.

Comment by Ragul [ 2019-09-26 ]

In the earlier version, we had MASTER_USE_GTID=Current_Pos and the above mentioned issue didn't occur. But only with the latest version upgrade, we are facing this issue.

Any recent change has been done regarding this?

What will be the impact on changing to MASTER_USE_GTID=slave_pos ?

Comment by Andrei Elkin [ 2019-09-26 ]

RagulR In this sort of failure, sorry, it's not enough to refer to any successful past without providing more details,
including those that I've just asked for.

Your slave error suggests that either "slave"@@global.gtid_slave_pos or "slave"@@global.gtid_binlog_pos has a gtid value that does not exist in the master binlog.
Could you please paste result of SELECT of the two as well as a pair of variables from the master:
"master"@@global.gtid_binlog_pos and "master"@@global.gtid_binlog_state ?

I think it's reasonable that we first explain what is happening to your slave currently, and after that answer the question of why it had worked before.

Thank you.

Comment by Ragul [ 2019-09-27 ]

Thanks. Here are the details of the server's and master's pos & state. Is there any other details needed

Slave Details:

>  mysql --defaults-file=/etc/na/my-1.cnf -uroot -pmysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 22
Server version: 10.2.26-MariaDB-log MariaDB Server
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> select @@GLOBAL.gtid_binlog_pos;
+--------------------------+
| @@GLOBAL.gtid_binlog_pos |
+--------------------------+
| 0-1-1                    |
+--------------------------+
1 row in set (0.00 sec)
 
MariaDB [(none)]> select @@GLOBAL.gtid_binlog_state;
+----------------------------+
| @@GLOBAL.gtid_binlog_state |
+----------------------------+
| 0-1-1                      |
+----------------------------+
1 row in set (0.00 sec)

Master Detail:

> mysql --defaults-file=/etc/na/my-1.cnf -uroot -pmysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 75
Server version: 10.2.26-MariaDB-log MariaDB Server
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> select @@GLOBAL.gtid_binlog_state;
+----------------------------+
| @@GLOBAL.gtid_binlog_state |
+----------------------------+
| 0-2-976                    |
+----------------------------+
1 row in set (0.00 sec)
 
MariaDB [(none)]> select @@GLOBAL.gtid_binlog_pos;
+--------------------------+
| @@GLOBAL.gtid_binlog_pos |
+--------------------------+
| 0-2-980                  |
+--------------------------+
1 row in set (0.00 sec)

Comment by Andrei Elkin [ 2019-09-27 ]

Now with the gtid states available it's clear that the master refused to feed the slave correctly.
As why it may have worked before, there are few options. E.g slave@@GLOBAL.gtid_current_pos was empty, unlike as it is currently.

Comment by Ragul [ 2019-09-27 ]

Thanks, Elkin. But still, I don't understand why this change I could see only after MariaDB upgrade

Before the upgrade, I could see the value of gtid_binlog_pos as empty in slave but gtid_current_pos has some value.

But after upgrade the state is different. Is there any recent change has been done regarding this and that should adapt during installation because no explicit change has been made from our end except the MariaDB version change?

What will be the impact on changing to MASTER_USE_GTID=slave_pos?

Comment by Ragul [ 2019-10-01 ]

Hi Elkin. Is there any information needed from our end.

Comment by Andrei Elkin [ 2019-10-02 ]

> gtid_binlog_pos as empty in slave but gtid_current_pos has some value

This may be fine 'cos of the definition of gtid_current_pos. Please check with the docs that defines both.
I can't really comment on your snapshot, except that it really looks nonsensical.
Specifically to the slave, where gtid_current_pos < gtid_slave_pos which goes against the definition.

If you have a reproducible test case that creates such a state we would really appreciate for sharing it,
otherwise I don't see how we can be helpful, sorry.

To your question of 'recent changes', I can't point to any that may relate to your description.

To 'What will be the impact on changing to MASTER_USE_GTID=slave_pos', I suggest you to read the mariadb docs' GTID section. Normally, the gtid tracking by slave should base on 'slave_pos'. To a normal case I relate for instance one that the slave does not update its state on its own.

Comment by Ragul [ 2019-10-24 ]

Hi Elkin, Thanks for the clarification. But still, we have some doubts regarding this.
As u said, I tried to reproduce the scenario as simple as possible by a shell script mariaTest.sh with the below my.cnf file.

[mysqld]
server-id=1
log-bin=binary-log.
binlog-ignore-db=mysql
binlog-ignore-db=performance_schema

from where I could see the difference in the output of gtid_positions in v10.2.18 where first 3 positions are null while in v10.2.27 it has 0-1-1.

As part of my analysis, I could see the above difference only after adding "log-bin=binary-log." properties in my cnf file. while without the property I could not see the above difference.
Confused about the behavior after adding this property.

Can you please clarify the above behavior? Why the position get affected after adding "log-bin" ?
Or Is there any mistake I am doing in the my.cnf file?

Comment by Ragul [ 2019-11-12 ]

Hi Elkin, can you comment on the below behaviour
On my understand and to summarize the above issue,

Note that when the binlog is empty (such as on a fresh install or after RESET MASTER), there are no event groups written in any replication domain, so in this case the value of gtid_binlog_pos will be the *empty string*.

As per the above note found in https://mariadb.com/kb/en/library/gtid/#gtid_binlog_pos.
Binlog-pos should be *empty string on fresh install but why the value is set as 0-1-1 *

Generated at Thu Feb 08 09:01:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.