[MDEV-4937] sql_slave_skip_counter does not work with GTID Created: 2013-08-22  Updated: 2017-07-27  Resolved: 2014-07-09

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.6
Fix Version/s: 10.0.13

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: gtid, replication

Issue Links:
Relates
relates to MDEV-26 Global transaction ID Closed

 Description   

sql_slave_skip_counter seems to be completely ignored when slave is configured to use GTID. It might be not such a bad thing, but then I suppose a warning or an error should be produced on slave start if sql_slave_skip_counter was previously set to a non-zero value.

--source include/have_innodb.inc
--source include/have_binlog_format_statement.inc
 
--let $rpl_topology=1->2
--source include/rpl_init.inc
 
--echo [master]
create table t1 (i int) engine=InnoDB;
--save_master_pos
 
--echo [slave]
--connection server_2
--sync_with_master
--source include/stop_slave.inc
change master to master_use_gtid=current_pos;
set global sql_slave_skip_counter = 4;
--source include/start_slave.inc
 
--connection server_1
insert into t1 values (1);
insert into t1 values (2);
insert into t1 values (3);
--save_master_pos
 
--connection server_2
--sync_with_master
select * from t1;
show binlog events;
 
--connection server_1
#show binlog events;
drop table t1;
--source include/rpl_end.inc

cnf

!include suite/rpl/rpl_1slave_base.cnf
!include include/default_client.cnf
 
[mysqld.1]
log-slave-updates
 
[mysqld.2]
log-slave-updates

bzr version-info

revision-id: timour@askmonty.org-20130821075108-33rptvhha6vfjzd8
revno: 3681
branch-nick: 10.0-base



 Comments   
Comment by Jeremy Cole [ 2013-09-28 ]

I would say that START SLAVE should produce a fatal error and not start the slave if sql_slave_skip_counter is non-zero. Otherwise, the DBA may be surprised at the outcome.

Comment by Kristian Nielsen [ 2014-02-11 ]

> I would say that START SLAVE should produce a fatal error and not start the
> slave if sql_slave_skip_counter is non-zero. Otherwise, the DBA may be
> surprised at the outcome.

Are you saying that slave_skip_counter should not be supported at all for
GTID?

Clearly, it should not be just ignored. But if it worked correctly (skipping
the requested number of events and continuing replication normally after
that), would you still want to prevent it?

We probably want to allow it in GTID non-strict mode (otherwise how would one
recover from a bad event in the binlog?), but we could forbid it in strict
mode, if it makes sense...

Comment by Kristian Nielsen [ 2014-02-11 ]

After some discussions with Elena, I'm thinking that giving an error for sql_slave_skip_counter
in GTID mode seems right.

In GTID mode, one can always set @@gtid_slave_pos explicitly to skip to after a given GTID, that seems rather safer anyway. And it is a better fit with GTID, which has a global position (slave_skip_counter is per-master-connection), deals with whole event groups (slave_skip_counter counts events not GTIDs), and has multiple binlog streams interleaved using domain_id (slave_skip_counter cannot be told which stream/domain_id to skip within).

Comment by Kristian Nielsen [ 2014-02-11 ]

Pushed to 10.0-base a fix that throws an error if trying to set sql_slave_skip_counter for a master connection that is configured to use GTID.

Comment by Kristian Nielsen [ 2014-02-11 ]

Pushed to 10.0-base

Comment by Kristian Nielsen [ 2014-06-19 ]

Unfortunately, the first proposed solution is not sufficient.

For example, suppose an INCIDENT_EVENT is generated for some reason on the
master. Such event will cause the slave to halt to alert the DBA about some
error. To restart replication, it will be necessary to skip over the incident
event. But an incident event has no GTID, so it cannot be skipped by setting
@@gtid_slave_pos. So sql_slave_skip_counter needs to work in this case.

Comment by Kristian Nielsen [ 2014-06-20 ]

Patch: http://lists.askmonty.org/pipermail/commits/2014-June/006234.html

Comment by Michael Widenius [ 2014-07-09 ]

Review

revno: 4260
revision-id: knielsen at knielsen-hq.org-20140620104939-xdvn985cgmwy3odd
parent: knielsen at knielsen-hq.org-20140619125043-acw2gx7rw90y3vqp
committer: Kristian Nielsen <knielsen at knielsen-hq.org>
branch nick: work-10.0
timestamp: Fri 2014-06-20 12:49:39 +0200
message:
  MDEV-4937: sql_slave_skip_counter does not work with GTID
  
  The sql_slave_skip_counter is important to be able to recover replication from
  certain errors. Often, an appropriate solution is to set
  sql_slave_skip_counter to skip over a problem event. But setting
  sql_slave_skip_counter produced an error in GTID mode, with a suggestion to
  instead set @@gtid_slave_pos to point past the problem event. This however is
  not always possible; for example, in case of an INCIDENT event, that event
  does not have any GTID to assign to @@gtid_slave_pos.
  
  With this patch, sql_slave_skip_counter now works in GTID mode the same was as
  in non-GTID mode. When set, that many initial events are skipped when the SQL
  thread starts, plus as many extra events are needed to completely skip any
  partially skipped event group. The GTID position is updated to point past the
  skipped event(s).
 

<cut>

 
=== modified file 'sql/slave.cc'
--- a/sql/slave.cc	2014-06-19 12:50:43 +0000
+++ b/sql/slave.cc	2014-06-20 10:49:39 +0000
@@ -3521,9 +3521,6 @@ static int exec_relay_log_event(THD* thd
 
       if (opt_gtid_ignore_duplicates)
       {
-        serial_rgi->current_gtid.domain_id= gev->domain_id;
-        serial_rgi->current_gtid.server_id= gev->server_id;
-        serial_rgi->current_gtid.seq_no= gev->seq_no;
         int res= rpl_global_gtid_slave_state.check_duplicate_gtid
           (&serial_rgi->current_gtid, serial_rgi);
         if (res < 0)
 

What was the reason for removing the above assignments?
Was it a bug in the old code ?
(I have to ask as I am not 100% of what the old code did).

 
@@ -4366,6 +4363,7 @@ pthread_handler_t handle_slave_sql(void
   char saved_master_log_name[FN_REFLEN];
   my_off_t UNINIT_VAR(saved_log_pos);
   my_off_t UNINIT_VAR(saved_master_log_pos);
+  String saved_skip_gtid_pos;
   my_off_t saved_skip= 0;
   Master_info *mi= ((Master_info*)arg);
   Relay_log_info* rli = &mi->rli;
@@ -4571,6 +4569,12 @@ log '%s' at position %s, relay log '%s'
     strmake_buf(saved_master_log_name, rli->group_master_log_name);
     saved_log_pos= rli->group_relay_log_pos;
     saved_master_log_pos= rli->group_master_log_pos;
+    if (mi->using_gtid != Master_info::USE_GTID_NO)
+    {
+      saved_skip_gtid_pos.append(STRING_WITH_LEN(", GTID '"));
+      rpl_append_gtid_state(&saved_skip_gtid_pos, false);
+      saved_skip_gtid_pos.append(STRING_WITH_LEN("'; "));
+    }
     saved_skip= rli->slave_skip_counter;
   }
   if ((rli->until_condition == Relay_log_info::UNTIL_MASTER_POS ||
@@ -4594,16 +4598,27 @@ log '%s' at position %s, relay log '%s'
 
     if (saved_skip && rli->slave_skip_counter == 0)
     {
+      String tmp;
+      if (mi->using_gtid != Master_info::USE_GTID_NO)
+      {
+        tmp.append(STRING_WITH_LEN(", GTID '"));
+        rpl_append_gtid_state(&tmp, false);
+        tmp.append(STRING_WITH_LEN("'; "));
+      }
 

If you want to avoid checking if tmp.append() fails, you could
create tmp with a static buffer.

char tmp_buff[MAX_GTID_LENGTH+10];
String tmp(tmp_buff, sizeof(tmp_buff), &my_charset_bin);
tmp.length(0);

Same goes for saved_skip_gtid_pos().

 
       sql_print_information("'SQL_SLAVE_SKIP_COUNTER=%ld' executed at "
         "relay_log_file='%s', relay_log_pos='%ld', master_log_name='%s', "
-        "master_log_pos='%ld' and new position at "
+        "master_log_pos='%ld'%s and new position at "
         "relay_log_file='%s', relay_log_pos='%ld', master_log_name='%s', "
-        "master_log_pos='%ld' ",
+        "master_log_pos='%ld'%s ",
         (ulong) saved_skip, saved_log_name, (ulong) saved_log_pos,
         saved_master_log_name, (ulong) saved_master_log_pos,
+        saved_skip_gtid_pos.c_ptr_safe(),
         rli->group_relay_log_name, (ulong) rli->group_relay_log_pos,
-        rli->group_master_log_name, (ulong) rli->group_master_log_pos);
+        rli->group_master_log_name, (ulong) rli->group_master_log_pos,
+        tmp.c_ptr_safe());
       saved_skip= 0;
+      saved_skip_gtid_pos.free();
 

If you free saved_skip_gtid_pos, it good to also free tmp!

 
     }
     
     if (exec_relay_log_event(thd, rli, serial_rgi))
 

<cut>

 
=== modified file 'sql/sys_vars.cc'
--- a/sql/sys_vars.cc	2014-06-09 18:00:23 +0000
+++ b/sql/sys_vars.cc	2014-06-20 10:49:39 +0000
@@ -4287,11 +4287,6 @@ bool update_multi_source_variable(sys_va
 
 static bool update_slave_skip_counter(sys_var *self, THD *thd, Master_info *mi)
 {
-  if (mi->using_gtid != Master_info::USE_GTID_NO)
-  {
-    my_error(ER_SLAVE_SKIP_NOT_IN_GTID, MYF(0));
-    return true;
-  }
 

Shouldn't you also remove the error ER_SLAVE_SKIP_NOT_IN_GTID from the
error file?

Regards,
Monty

Comment by Kristian Nielsen [ 2014-07-09 ]

Pushed to 10.0.13

Generated at Thu Feb 08 07:00:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.