[MDEV-9573] 'Stop slave' hangs on replication slave Created: 2016-02-17  Updated: 2023-06-28  Resolved: 2017-05-22

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1.10, 10.1.11, 10.1.13
Fix Version/s: 10.0.30, 10.1.22

Type: Bug Priority: Major
Reporter: Markus Nägele Assignee: Michael Widenius
Resolution: Fixed Votes: 2
Labels: replication
Environment:

CentOS release 6.7 (Final) x64 on Dell PowerEdge R510


Attachments: File my.cnf     File myMaster.cnf     Text File mysqld-DEBUG.log     Text File mysqld.log     Text File mysqlprocesslist.txt     Text File processlistVerbose.txt    
Issue Links:
PartOf
includes MDEV-10644 One of parallel replication threads r... Closed
Relates
relates to MDEV-12104 Testing for MDEV-9573 and extra repli... Stalled
relates to MDEV-17346 parallel slave start and stop races t... Closed
relates to MDEV-31572 STOP SLAVE hangs on 10.3.39 Closed

 Description   

Since we switched from MariaDB 10.0.x to MariaDB 10.1 we have been having trouble with the statement 'stop slave;': it just hangs and does not return, even after hours. A 'show slave status\G' issued from another connection blocks at that moment as well and produces no output.
This happens randomly; with low load on the server it is quite hard to reproduce the issue. Stopping and starting the slave in short intervals may succeed 30 times or more without problems.
If the server is under heavy load, only a very few tries are needed to reproduce it. I can reproduce it very quickly while table checksums are being created with pt-table-checksum.
The only way to stop MariaDB while 'stop slave' is hanging is 'kill -9'.
We are using parallel replication, as you can see in the attached my.cnf.
A backtrace is also attached, created as described on mariadb.org. If necessary, I could repeat it with a DEBUG build.
I also attached the list of mysql processes running at that moment.
We never had this issue with MariaDB 10.0.x with the same configuration, except slave_run_triggers_for_rbr = 1 of course, since that option is only available in MariaDB 10.1.

Please let me know, if I can provide any more details that might be helpful.



 Comments   
Comment by Elena Stepanova [ 2016-02-17 ]

optonaegele, are you able to reproduce it when slave_run_triggers_for_rbr is not enabled?

Comment by Markus Nägele [ 2016-03-14 ]

Sorry for the delayed answer. I needed to set up a testing machine first.

If slave_run_triggers_for_rbr is not enabled I can reproduce it as well; it seems to make no difference.
The issue cannot be reproduced if parallel replication is disabled (setting slave_parallel_threads = 0).
There are three kinds of triggers defined only on the slave: AFTER INSERT, AFTER UPDATE and AFTER DELETE. I have already seen statements from the AFTER INSERT and AFTER UPDATE triggers hanging in the process list while 'stop slave;' hangs. For the AFTER DELETE statement I can't say for sure.
If no triggers are defined at all, the issue does not come up either.
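As a side note for anyone hitting the same hang: the observation above amounts to falling back to single-threaded apply. A sketch of how that could be done at runtime (standard MariaDB 10.1 variables; this only helps if applied before the hang occurs, since STOP SLAVE itself is what hangs):

```sql
-- Workaround sketch: disable parallel apply on the slave.
-- slave_parallel_threads is dynamic, but the slave must be stopped first.
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 0;
START SLAVE;
```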

Comment by Elena Stepanova [ 2016-03-19 ]

optonaegele,

Is the config the same on master and slave?
Do you use multi-source replication? Do you use GTID?
How heavy does the load need to be in order to reproduce it on your side? What kind of load is it?
What kind of tables, and of what size, are involved? Are the transactions long or short?
In the processlist I see that some activity happens directly on the slave; is it necessary in order to reproduce the problem? If so, can you show the full query? It is truncated in the process list.
You mentioned pt-table-checksum; do you run it on the master or on the slave while reproducing?
You also said that triggers seem to make a difference; what do they do exactly? Do they update other tables?
Did you upgrade to 10.1.12 since you reported the issue? There was a fix for a problem on slave shutdown; it does not look particularly close to what you're facing, but you never know.

If you can reproduce it with a debug build, and take all threads' stack trace on the hanging slave, it could indeed be helpful.

Comment by Markus Nägele [ 2016-04-13 ]

The config on master and slave is basically the same; I attached the configuration of the master as well.
We use neither multi-source replication nor GTID.
To reproduce it, it is sufficient to stop the slave for some minutes and then start it again so it catches up. The tables involved see short-running transactions.
I reproduced it with a debug build and attached the stack trace mysqld-DEBUG.log. I also made another process list, processlistVerbose.txt, where you can see the complete queries running.
This activity on the slave is indeed necessary to reproduce the problem.
We are using an additional tool for replication: SymmetricDS. It runs on the slave only and creates triggers on several tables to monitor changes on them.
A table where we monitor changes this way is defined like this (the table holds around 10 GB of data and receives several updates per second):

CREATE TABLE `jobdata` (
	`OID` BIGINT(20) NOT NULL AUTO_INCREMENT,
	`IDJOB` INT(11) NOT NULL,
	`IDATTR` INT(11) NOT NULL,
	`SIDE` CHAR(1) NOT NULL,
	`VALUE` VARCHAR(100) NULL DEFAULT NULL,
	`BIGVALUE` LONGTEXT NULL,
	`OptimisticLockField` INT(11) NULL DEFAULT NULL,
	`GCRecord` INT(11) NULL DEFAULT NULL,
	`DVALUE` DOUBLE NULL DEFAULT NULL,
	PRIMARY KEY (`OID`),
	INDEX `iDValue` (`IDATTR`, `SIDE`, `DVALUE`),
	INDEX `Search` (`IDJOB`, `IDATTR`, `SIDE`, `VALUE`),
	INDEX `iValue` (`IDATTR`, `SIDE`, `VALUE`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=2136040630
;

SymmetricDS creates triggers like this on the table (there are two more for this table, basically the same but for UPDATE and DELETE):

CREATE DEFINER=`SymmetricDS`@`127.0.0.1` TRIGGER `SYM_ON_I_FOR_JBDT_V_LNGN` AFTER INSERT ON `jobdata` FOR EACH ROW
begin
  if 1=1 and @sync_triggers_disabled is null then
    insert into `replication_info`.sym_data (table_name, event_type, trigger_hist_id, row_data, channel_id, transaction_id, source_node_id, external_data, create_time)
    values(
      'jobdata',
      'I',
      157,
      concat(
        if(new.`OID` is null,'',concat('"',cast(new.`OID` as char),'"')),',',
        if(new.`IDJOB` is null,'',concat('"',cast(new.`IDJOB` as char),'"')),',',
        if(new.`IDATTR` is null,'',concat('"',cast(new.`IDATTR` as char),'"')),',',
        if(new.`SIDE` is null,'',concat('"',replace(replace(new.`SIDE`,'\\','\\\\'),'"','\\"'),'"')),',',
        if(new.`VALUE` is null,'',concat('"',replace(replace(new.`VALUE`,'\\','\\\\'),'"','\\"'),'"')),',',
        if(new.`BIGVALUE` is null,'',concat('"',replace(replace(new.`BIGVALUE`,'\\','\\\\'),'"','\\"'),'"')),',',
        if(new.`OptimisticLockField` is null,'',concat('"',cast(new.`OptimisticLockField` as char),'"')),',',
        if(new.`GCRecord` is null,'',concat('"',cast(new.`GCRecord` as char),'"')),',',
        if(new.`DVALUE` is null,'',concat('"',cast(new.`DVALUE` as char),'"'))
      ),
      'default', `replication_info`.sym_transaction_id_post_5_1_23(), @sync_node_disabled,
      (select c.KDLKZ from optovision.cusd001 c
       join optovision.jobhead h on new.IDJOB = h.OID
       where c.A_KDNR = h.colACCN),
      CURRENT_TIMESTAMP
    );
  end if;
end

These triggers insert into this table:

CREATE TABLE `sym_data` (
	`data_id` BIGINT(20) NOT NULL AUTO_INCREMENT,
	`table_name` VARCHAR(255) NOT NULL COLLATE 'utf8_unicode_ci',
	`event_type` CHAR(1) NOT NULL COLLATE 'utf8_unicode_ci',
	`row_data` MEDIUMTEXT NULL COLLATE 'utf8_unicode_ci',
	`pk_data` MEDIUMTEXT NULL COLLATE 'utf8_unicode_ci',
	`old_data` MEDIUMTEXT NULL COLLATE 'utf8_unicode_ci',
	`trigger_hist_id` INT(11) NOT NULL,
	`channel_id` VARCHAR(128) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
	`transaction_id` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
	`source_node_id` VARCHAR(50) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
	`external_data` VARCHAR(50) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
	`node_list` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
	`create_time` DATETIME NULL DEFAULT NULL,
	PRIMARY KEY (`data_id`),
	INDEX `sym_idx_d_channel_id` (`data_id`, `channel_id`)
)
COLLATE='utf8_unicode_ci'
ENGINE=InnoDB
AUTO_INCREMENT=1045978428
;

In the queries shown in the process list and in the triggers you can find a function, sym_transaction_id_post_5_1_23, which is defined as follows:

CREATE DEFINER=`SymmetricDS`@`127.0.0.1` FUNCTION `sym_transaction_id_post_5_1_23`()
	RETURNS varchar(50) CHARSET utf8 COLLATE utf8_unicode_ci
	LANGUAGE SQL
	NOT DETERMINISTIC
	READS SQL DATA
	SQL SECURITY DEFINER
	COMMENT ''
begin
    declare comm_value varchar(50);
    declare comm_cur cursor for select VARIABLE_VALUE from INFORMATION_SCHEMA.SESSION_STATUS where VARIABLE_NAME='COM_COMMIT';
    open comm_cur;
    fetch comm_cur into comm_value;
    close comm_cur;
    return concat(concat(connection_id(), '.'), comm_value);
end

pt-table-checksum is running on a separate server, neither on the master nor on the slave. As you can see, I have now tried with the latest version, 10.1.13.

Comment by Kristian Nielsen [ 2016-04-15 ]

Thank you for providing the stack traces, that is really useful to find the
cause of the problem.

I think I see what is happening, seems like an obvious deadlock problem in
the code:

In mysql_execute_command() for SQLCOM_SLAVE_STOP, the code takes
LOCK_active_mi and then starts waiting for the replication threads to stop.

But something like INSERT ... SELECT FROM INFORMATION_SCHEMA might also need
to take LOCK_active_mi. In the problem reported here, it happens when
populating the SESSION_STATUS table with the Slave_heartbeat_period value -
this code (in show_heartbeat_period()) also takes LOCK_active_mi.

The result is that STOP SLAVE is waiting forever for the SQL thread to
stop, while the SQL thread is waiting forever inside
show_heartbeat_period() for LOCK_active_mi to become available.

It doesn't seem that this is specific to parallel replication, though
certainly parallel replication would make it easier to hit the race that
triggers the bug. I will see if I can make a simple test case that
reproduces it.

Frankly, I have no idea how this code was ever supposed to work, I don't see
how STOP SLAVE can hold LOCK_active_mi while waiting for slave to stop
without causing deadlocks. I guess I'll try asking on the mailing list if
nothing comes up from my work on a test case.

My compliments to Markus Nägele for a very good report on a complicated bug,
which made it much easier to understand what the problem was.

Comment by Kristian Nielsen [ 2016-04-15 ]

Yep, easy to reproduce also without parallel replication or triggers or
such. Here is a test case, using a patch that injects a sleep in the code to
make it easy to trigger the race:

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index ced9225..2358d6b 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -7765,6 +7765,7 @@ static int show_heartbeat_period(THD *thd, SHOW_VAR *var, char *buff,
 
   var->type= SHOW_CHAR;
   var->value= buff;
+if (thd->slave_thread) my_sleep(5000000);
   mysql_mutex_lock(&LOCK_active_mi);
   if (master_info_index) 
   {

--source include/have_innodb.inc
--source include/master-slave.inc
 
--connection master
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
Call mtr.add_suppression("Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT");
--save_master_pos
 
--connection slave
--sync_with_master
--source include/stop_slave.inc
 
--connection master
INSERT INTO t1
SELECT VARIABLE_VALUE
  FROM INFORMATION_SCHEMA.SESSION_STATUS
 WHERE VARIABLE_NAME='COM_COMMIT';
--save_master_pos
 
--connection slave
--source include/start_slave.inc
# Wait a bit to hit a sleep() in code after taking LOCK_active_mi.
SELECT SLEEP(2);
STOP SLAVE;
SELECT * FROM t1 ORDER BY a;
 
START SLAVE;
--sync_with_master
SELECT * FROM t1 ORDER BY a;
 
# Clean up.
--connection master
DROP TABLE t1;
--source include/rpl_end.inc

This triggers a hang with a similar stack trace as in this report.

Comment by Kristian Nielsen [ 2016-04-15 ]

Mailing list thread:

https://lists.launchpad.net/maria-developers/msg09486.html

So the thing that seems to trigger the hang here is accessing information_schema.session_status from a slave-replicated query (in this case from a trigger on a table modified during replication).

Comment by Michael Widenius [ 2017-01-29 ]

Agree with Kristian that the issue is mainly due to keeping LOCK_active_mi held over STOP SLAVE.

I now have a fix that introduces object counting for Master_info and no longer takes LOCK_active_mi over stop slave, or even stop_all_slaves().

This seems to fix the problem and gives us some other benefits:

  • Multiple threads can run SHOW SLAVE STATUS at the same time
    (There are still some internal locks between sql level and slave level that locks, but less than before)
  • START/STOP/RESET/SLAVE STATUS on a slave will not block other slaves
Comment by Michael Widenius [ 2017-02-28 ]

Have been working on this and related bugs found by Elena.
The patch I have been working on should fix most cases of parallel start slave, stop slave, change master combined with global read lock and shutdown.

An added benefit of the new approach is that one will now be able to run start/stop/change master on different connections in parallel. Before, these were serialized with the LOCK_active_mi mutex.

Will update this Jira entry with a full set of commit logs when I am done.

Comment by Michael Widenius [ 2017-03-13 ]

A fix is now pushed into the 10.0 tree that fixes a major part of the
problems with START/STOP SLAVE when running in parallel and with server
shutdown or FLUSH TABLES WITH READ LOCK. There are still a few edge
cases that I will try to work out over time. However, the code is now
much more robust than before.

Here follows a list of the commits related to fixing this issue.

Fixed dead locks when doing stop slave while slave was starting.

  • Added a separate lock for protecting start/stop/reset of a specific slave.
    This solves some possible deadlocks when one calls stop slave while
    the slave is starting, as the old run_locks was overused for other things.
  • Set hash->records to 0 before calling free on all hash elements.
    This was done to stop concurrent threads from looping over hash elements
    and accessing members that were already freed.
    This was a problem especially in start_all_slaves/stop_all_slaves,
    as the mutex protecting the hash was temporarily released while a slave
    was started/stopped.
  • Because of the change to hash->records during hash_reset(),
    any_slave_sql_running() will return 1 during shutdown, as one can't
    loop over master_info_index->master_info_hash while hash_reset() of it
    is in progress.
    This also fixes a potential old bug in any_slave_sql_running() where,
    during shutdown and my_hash_free() in ~Master_info_index(), we could
    potentially try to access elements that were already freed.

Fixed hang doing FLUSH TABLES WITH READ LOCK and parallel replication

  • The problem was that waiting for pause_for_ftwrl was done before
    event_group was completed. This caused rpl_pause_for_ftwrl() to wait
    forever during FLUSH TABLES WITH READ LOCK.
    Now we only wait for FLUSH TABLES WITH READ LOCK when we are changing
    to a new event group.

Add protection to not access is_open() without the LOCK_log mutex

  • Protection added to reopen_file() and new_file_impl().
    Without this we could get an assert in fn_format() because name == 0,
    as the file was closed and the name reset at the same time
    new_file_impl() was called.

Don't allow one to kill START SLAVE while the slave's IO_THREAD or SQL_THREAD
is starting.

  • This is needed because if we kill the START SLAVE thread too early during
    shutdown, the IO_THREAD or SQL_THREAD will not have time to properly
    initialize its replication or THD structures, and clean_up() will try to
    delete master_info structures that are still in use.

Add protection for reinitialization of a mutex in parallel replication

  • Added mutex_lock/mutex_unlock of the mutex that is to be destroyed in
    wait_for_commit::reinit(), in a similar fashion to what we do in
    ~wait_for_commit().

MDEV-9573 'Stop slave' hangs on replication slave

  • The reason for this is that stop slave takes LOCK_active_mi over the
    whole operation, while some slave operations also need LOCK_active_mi,
    which causes deadlocks.
  • Fixed by introducing object counting for Master_info and not taking
    LOCK_active_mi over stop slave, or even stop_all_slaves().
  • Other benefits of this approach:
  • Multiple threads can run SHOW SLAVE STATUS at the same time
  • START/STOP/RESET/SLAVE STATUS on a slave will not block other slaves
  • Simpler interface for handling get_master_info()
  • Added some missing unlocks of 'log_lock' in error conditions
  • Moved rpl_parallel_inactivate_pool(&global_rpl_thread_pool) to the end
    of stop_slave() so as not to have to use LOCK_active_mi inside
    terminate_slave_threads()
  • Changed the argument of remove_master_info() to Master_info, as we
    always have it available
  • Fixed a core dump when doing FLUSH TABLES WITH READ LOCK with parallel
    replication. The problem was that waiting for pause_for_ftwrl was not
    done when deleting rpt->current_owner after a force_abort.
Comment by Michael Widenius [ 2017-05-22 ]

Pushed to 10.0 at the end of January. It should be in all newer releases.

Generated at Thu Feb 08 07:35:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.