[MDEV-12290] Wrong timestamps in binary log causes replication issues - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.2.4, 10.0(EOL), 10.1(EOL), 10.2(EOL)
Fix Version/s: 10.2.6
Component/s: Replication
Labels:
- 10.2-ga

Description

I have configured a MariaDB 10.2.4 GTID Slave with master_delay=3600, and it seems to cause replication to lag behind more and more. Waits trigger even though Seconds_Behind_Master is > 3600.

        Seconds_Behind_Master: 21476

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 3

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: No

                  Gtid_IO_Pos:

      Replicate_Do_Domain_Ids:

  Replicate_Ignore_Domain_Ids:

                Parallel_Mode: conservative

                    SQL_Delay: 3600

          SQL_Remaining_Delay: 3598

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

patch
0.6 kB
2017-05-02 11:53
patch
0.4 kB
2017-05-02 11:31

Activity

Ascending order - Click to sort in descending order

Guillaume Lefranc created issue - 2017-03-17 09:01

Elena Stepanova added a comment - 2017-03-18 23:37

Could you please paste or attach a few consequent slave status outputs, to see the dynamics, including Read_Master_Log_Pos and Exec_Master_Log_Pos (or just the full output).
Please also attach cnf file(s).

Elena Stepanova added a comment - 2017-03-18 23:37 Could you please paste or attach a few consequent slave status outputs, to see the dynamics, including Read_Master_Log_Pos and Exec_Master_Log_Pos (or just the full output). Please also attach cnf file(s).

Elena Stepanova made changes - 2017-03-18 23:38

Field	Original Value	New Value
Labels		need_feedback

Guillaume Lefranc added a comment - 2017-03-20 14:21

The cnf files are unaltered defaults.

I forgot to mention that the master is a 10.1 MariaDB, but the manual doesn't mention incompatibility.

Will try to gather more info/write a reproduceable test case.

Guillaume Lefranc added a comment - 2017-03-20 14:21 The cnf files are unaltered defaults. I forgot to mention that the master is a 10.1 MariaDB, but the manual doesn't mention incompatibility. Will try to gather more info/write a reproduceable test case.

Guillaume Lefranc added a comment - 2017-03-20 15:03

In the example below, even I am really far behind the master, SQL_Delay still triggers.

                  Master_Host: 10.0.5.13

                  Master_User: replicator

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mariadb-bin.000246

          Read_Master_Log_Pos: 40136843

               Relay_Log_File: mysqld-relay-bin.000007

                Relay_Log_Pos: 10794

        Relay_Master_Log_File: mariadb-bin.000122

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

              Replicate_Do_DB:

          Replicate_Ignore_DB:

           Replicate_Do_Table:

       Replicate_Ignore_Table:

      Replicate_Wild_Do_Table:

  Replicate_Wild_Ignore_Table:

                   Last_Errno: 0

                   Last_Error:

                 Skip_Counter: 0

          Exec_Master_Log_Pos: 10497

              Relay_Log_Space: 13043090327

              Until_Condition: None

               Until_Log_File:

                Until_Log_Pos: 0

           Master_SSL_Allowed: No

           Master_SSL_CA_File:

           Master_SSL_CA_Path:

              Master_SSL_Cert:

            Master_SSL_Cipher:

               Master_SSL_Key:

        Seconds_Behind_Master: 289621

Master_SSL_Verify_Server_Cert: No

                Last_IO_Errno: 0

                Last_IO_Error:

               Last_SQL_Errno: 0

               Last_SQL_Error:

  Replicate_Ignore_Server_Ids:

             Master_Server_Id: 3

               Master_SSL_Crl:

           Master_SSL_Crlpath:

                   Using_Gtid: No

                  Gtid_IO_Pos:

      Replicate_Do_Domain_Ids:

  Replicate_Ignore_Domain_Ids:

                Parallel_Mode: conservative

                    SQL_Delay: 10

          SQL_Remaining_Delay: 6

      Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event

Guillaume Lefranc added a comment - 2017-03-20 15:03 In the example below, even I am really far behind the master, SQL_Delay still triggers. Master_Host: 10.0.5.13 Master_User: replicator Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mariadb-bin.000246 Read_Master_Log_Pos: 40136843 Relay_Log_File: mysqld-relay-bin.000007 Relay_Log_Pos: 10794 Relay_Master_Log_File: mariadb-bin.000122 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 10497 Relay_Log_Space: 13043090327 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 289621 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 3 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: No Gtid_IO_Pos: Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: conservative SQL_Delay: 10 SQL_Remaining_Delay: 6 Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event

Guillaume Lefranc added a comment - 2017-03-20 15:13 - edited

OK, I think I've got something.

If I look at the binary log entry where the delay is triggered (this entry is one of the first after binlog rotation) here is what I get:

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;

/*!40019 SET @@session.max_insert_delayed_threads=0*/;

/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;

DELIMITER /*!*/;

# at 4

#170317  7:34:44 server id 3  end_log_pos 249   Start: binlog v 4, server v 10.1.21-MariaDB-1~jessie created 170317  7:34:44

BINLOG '

hIPLWA8DAAAA9QAAAPkAAAAAAAQAMTAuMS4yMS1NYXJpYURCLTF+amVzc2llAGxvZwAAAAAAAAAA

AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAA3QAEGggAAAAICAgCAAAACgoKAAAAAAAA

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA                                                                                                                               AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAEEwQAAKnQd0I=

'/*!*/;

*# at 10497

#700101  1:00:00 server id 3  end_log_pos 10538         Binlog checkpoint mariadb-bin.000122*

# at 10538

#170317  7:34:45 server id 0  end_log_pos 10576         GTID 1-0-8845285 trans

/*!100101 SET @@session.skip_parallel_replication=0*//*!*/;

/*!100001 SET @@session.gtid_domain_id=1*//*!*/;

/*!100001 SET @@session.server_id=0*//*!*/;

/*!100001 SET @@session.gtid_seq_no=8845285*//*!*/;

BEGIN

/*!*/;

# at 10576

The timestamp of binlog entry 10497 seems completely wrong at "700101 1:00:00" (Unix time anyone?)

So, at each binlog rotation and/or binlog checkpoint, my slave gets an additional N seconds of delay...

Guillaume Lefranc added a comment - 2017-03-20 15:13 - edited OK, I think I've got something. If I look at the binary log entry where the delay is triggered (this entry is one of the first after binlog rotation) here is what I get: /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/; /*!40019 SET @@session.max_insert_delayed_threads=0*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #170317 7:34:44 server id 3 end_log_pos 249 Start: binlog v 4, server v 10.1.21-MariaDB-1~jessie created 170317 7:34:44 BINLOG ' hIPLWA8DAAAA9QAAAPkAAAAAAAQAMTAuMS4yMS1NYXJpYURCLTF+amVzc2llAGxvZwAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAA3QAEGggAAAAICAgCAAAACgoKAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAEEwQAAKnQd0I= '/*!*/; *# at 10497 #700101 1:00:00 server id 3 end_log_pos 10538 Binlog checkpoint mariadb-bin.000122* # at 10538 #170317 7:34:45 server id 0 end_log_pos 10576 GTID 1-0-8845285 trans /*!100101 SET @@session.skip_parallel_replication=0*//*!*/; /*!100001 SET @@session.gtid_domain_id=1*//*!*/; /*!100001 SET @@session.server_id=0*//*!*/; /*!100001 SET @@session.gtid_seq_no=8845285*//*!*/; BEGIN /*!*/; # at 10576 The timestamp of binlog entry 10497 seems completely wrong at "700101 1:00:00" (Unix time anyone?) So, at each binlog rotation and/or binlog checkpoint, my slave gets an additional N seconds of delay...

Elena Stepanova made changes - 2017-03-20 15:23

Labels

need_feedback

Guillaume Lefranc added a comment - 2017-03-20 15:49

Before you ask, the time of my servers is in sync.

Guillaume Lefranc added a comment - 2017-03-20 15:49 Before you ask, the time of my servers is in sync.

Elena Stepanova made changes - 2017-03-20 17:07

Status

Open [ 1 ]

Confirmed [ 10101 ]

Elena Stepanova added a comment - 2017-03-20 17:20

Great catch, thanks.

plinux, so, the problem is two-fold. The timestamp obviously needs to be fixed in all 10.x versions; but also, we need to somehow make sure that 10.2 is able to replicate even from older 10.0/10.1 versions, before the timestamp fix.

Elena Stepanova added a comment - 2017-03-20 17:20 Great catch, thanks. plinux , so, the problem is two-fold. The timestamp obviously needs to be fixed in all 10.x versions; but also, we need to somehow make sure that 10.2 is able to replicate even from older 10.0/10.1 versions, before the timestamp fix.

Elena Stepanova made changes - 2017-03-20 17:20

Fix Version/s		10.0 [ 16000 ]
Fix Version/s		10.1 [ 16100 ]
Fix Version/s		10.2 [ 14601 ]
Affects Version/s		10.0 [ 16000 ]
Affects Version/s		10.1 [ 16100 ]
Affects Version/s		10.2 [ 14601 ]
Assignee		Lixun Peng [ plinux ]

Guillaume Lefranc made changes - 2017-03-21 03:16

Summary

CHANGE MASTER TO ... MASTER_DELAY causes the slave to lag behind indefinitely

Wrong timestamps in binary log causes replication issues

Elena Stepanova made changes - 2017-03-21 14:25

Labels

10.2-ga

Sachin Setiya (Inactive) made changes - 2017-03-23 12:52

Assignee

Lixun Peng [ plinux ]

Sachin Setiya [ sachin.setiya.007 ]

Elena Stepanova made changes - 2017-04-23 21:36

Priority

Major [ 3 ]

Critical [ 2 ]

Rasmus Johansson (Inactive) made changes - 2017-04-27 10:35

Sprint

10.2.6 [ 158 ]

Rasmus Johansson (Inactive) made changes - 2017-04-27 10:35

Rank

Ranked lower

Rasmus Johansson (Inactive) made changes - 2017-04-28 07:21

Sprint

10.2.6 [ 158 ]

Rasmus Johansson (Inactive) made changes - 2017-04-28 07:21

Rank

Ranked higher

Sachin Setiya (Inactive) made changes - 2017-05-02 11:31

Attachment

patch [ 43603 ]

Sachin Setiya (Inactive) added a comment - 2017-05-02 11:35

Hi tanj
I have created a patch for this issue which solves this wrong timestamp issue, This is because when binlog_background thread logs BInlog_checkpoint_log_event time is not set in the thread so , Binlog_checkpoint_log_event->when is zero.
But I am not able to simulate seconds behind master case. Can you please provide some reproducible test case ?
I tried this
master - slave configuration
both use binlog_format = row
I have used gtid
sql_delay=60

And run lacs of (insert )query on server 1 , with after every 10 insert there was flush logs

patch

Sachin Setiya (Inactive) added a comment - 2017-05-02 11:35 Hi tanj I have created a patch for this issue which solves this wrong timestamp issue, This is because when binlog_background thread logs BInlog_checkpoint_log_event time is not set in the thread so , Binlog_checkpoint_log_event->when is zero. But I am not able to simulate seconds behind master case. Can you please provide some reproducible test case ? I tried this master - slave configuration both use binlog_format = row I have used gtid sql_delay=60 And run lacs of (insert )query on server 1 , with after every 10 insert there was flush logs patch

Kristian Nielsen added a comment - 2017-05-02 11:42

Doesn't the set_time() need to go inside the loop, before the call to
mysql_bin_log.mark_xid_done()? Otherwise the time will always be the time
the binlog thread started, which is not correct - it needs to be the current
time.

But nice catch, good to get this fixed.

Kristian Nielsen added a comment - 2017-05-02 11:42 Doesn't the set_time() need to go inside the loop, before the call to mysql_bin_log.mark_xid_done()? Otherwise the time will always be the time the binlog thread started, which is not correct - it needs to be the current time. But nice catch, good to get this fixed.

Sachin Setiya (Inactive) made changes - 2017-05-02 11:53

Attachment

patch [ 43606 ]

Sachin Setiya (Inactive) added a comment - 2017-05-02 11:53 - edited

Hi knielsen,

Patch updated
Regards
sachin patch

Sachin Setiya (Inactive) added a comment - 2017-05-02 11:53 - edited Hi knielsen , Patch updated Regards sachin patch

Sachin Setiya (Inactive) made changes - 2017-05-09 11:41

Fix Version/s		10.2.6 [ 22527 ]
Fix Version/s	10.2 [ 14601 ]
Fix Version/s	10.0 [ 16000 ]
Fix Version/s	10.1 [ 16100 ]
Resolution		Fixed [ 1 ]
Status	Confirmed [ 10101 ]	Closed [ 6 ]

Marcus Zambrano added a comment - 2017-09-20 22:55

I am experiencing this issue when attaching a 10.2.8 Slave to a 10.0.24 Master. At every #700101 Binlog checkpoint event, the SQL Delay counter is reapplied and the Slave is continually delayed.

master# mysql -e 'select @@version;'
-----------------

@@version

-----------------

10.0.24-MariaDB

-----------------

master# mysqlbinlog bin-log.023901 | grep ^#700
#700101 2:00:00 server id 123 end_log_pos 423 Binlog checkpoint bin-log.023901

slave# mysql -e 'select @@version;'
--------------------

@@version

--------------------

10.2.8-MariaDB-log

--------------------

Seconds_Behind_Master: 2170
SQL_Delay: 900
SQL_Remaining_Delay: 825
Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event

Even if the Master is upgraded to 10.0.32 (latest), the bogus timestamps will still be present - as confirmed in a sandbox:

../sandbox.sh mysql -e 'select @@version;'
-----------------

@@version

-----------------

10.0.32-MariaDB

-----------------

mysqlbinlog bin-log.00000* | grep ^#700
#700101 2:00:00 server id 10032 end_log_pos 347 Binlog checkpoint bin-log.000002
#700101 2:00:00 server id 10032 end_log_pos 347 Binlog checkpoint bin-log.000003

Marcus Zambrano added a comment - 2017-09-20 22:55 I am experiencing this issue when attaching a 10.2.8 Slave to a 10.0.24 Master. At every #700101 Binlog checkpoint event, the SQL Delay counter is reapplied and the Slave is continually delayed. master# mysql -e 'select @@version;' ----------------- @@version ----------------- 10.0.24-MariaDB ----------------- master# mysqlbinlog bin-log.023901 | grep ^#700 #700101 2:00:00 server id 123 end_log_pos 423 Binlog checkpoint bin-log.023901 slave# mysql -e 'select @@version;' -------------------- @@version -------------------- 10.2.8-MariaDB-log -------------------- Seconds_Behind_Master: 2170 SQL_Delay: 900 SQL_Remaining_Delay: 825 Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event Even if the Master is upgraded to 10.0.32 (latest), the bogus timestamps will still be present - as confirmed in a sandbox: ../sandbox.sh mysql -e 'select @@version;' ----------------- @@version ----------------- 10.0.32-MariaDB ----------------- mysqlbinlog bin-log.00000* | grep ^#700 #700101 2:00:00 server id 10032 end_log_pos 347 Binlog checkpoint bin-log.000002 #700101 2:00:00 server id 10032 end_log_pos 347 Binlog checkpoint bin-log.000003

Sachin Setiya (Inactive) added a comment - 2017-09-21 08:00

Hi zam1081!

This patched was pushed only to 10.2, I have back-ported this patch to 10.0 , by this merge this will also be available to 10.1.

Sachin

Sachin Setiya (Inactive) added a comment - 2017-09-21 08:00 Hi zam1081 ! This patched was pushed only to 10.2, I have back-ported this patch to 10.0 , by this merge this will also be available to 10.1. Sachin

Sergei Golubchik made changes - 2021-12-06 21:44

Workflow

MariaDB v3 [ 80017 ]

MariaDB v4 [ 151820 ]

People

Assignee:: Sachin Setiya (Inactive)

Reporter:: Guillaume Lefranc

Votes:: 1 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 2017-03-17 09:01

Updated:: 2017-09-21 08:00

Resolved:: 2017-05-09 11:41

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration