[MDEV-9573] 'Stop slave' hangs on replication slave

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.1.10, 10.1.11, 10.1.13
Fix Version/s: 10.0.30, 10.1.22
Component/s: Replication
Labels:
- replication
Environment:
CentOS release 6.7 (Final) x64 on Dell PowerEdge R510

Description

Since we switched from MariaDB 10.0.x to MariaDB 10.1, we have been having trouble with the statement 'stop slave;': it just hangs and doesn't return, even after hours. A 'show slave status\G' statement issued from another connection also blocks at this point and produces no output.
This happens randomly. If there is low load on the server, it is quite hard to reproduce the issue; stopping and starting the slave in short intervals may succeed 30 times or more without problems.
If the server is under heavy load, it takes only a few tries to reproduce it. I can reproduce it very quickly when table checksums are being created with pt-table-checksum.
The only way to stop MariaDB when 'stop slave' is hanging is 'kill -9'.
We are using parallel replication, as you can see in the attached my.cnf.
A back trace is also attached, created as described on mariadb.org. If necessary, I could repeat it with a DEBUG build.
I also attached the list of running mysql processes at that moment.
We never had this issue with MariaDB 10.0.x with the same configuration, except for slave_run_triggers_for_rbr = 1 of course, since it is available in MariaDB 10.1 only.

Please let me know, if I can provide any more details that might be helpful.
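
The start/stop cycling described above could be scripted roughly as follows. This is only a sketch under assumptions: `run_sql` is a hypothetical callable that sends one statement to the slave (it is not part of any MariaDB client library), and the timeout is an arbitrary stand-in for "does not return". It reports the attempt at which STOP SLAVE first failed to come back.

```python
import threading


def run_with_timeout(fn, timeout):
    """Run fn() in a daemon thread; return True if it finished within timeout seconds."""
    done = threading.Event()

    def worker():
        fn()
        done.set()

    # A daemon thread is used so a hung statement cannot block interpreter exit.
    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout)


def cycle_slave(run_sql, attempts, timeout=30.0):
    """Repeat STOP SLAVE / START SLAVE.

    run_sql is a hypothetical callable taking one SQL string (an assumption
    of this sketch). Returns the 1-based attempt at which STOP SLAVE did not
    return within `timeout`, or None if every attempt completed.
    """
    for attempt in range(1, attempts + 1):
        if not run_with_timeout(lambda: run_sql("STOP SLAVE"), timeout):
            return attempt
        run_sql("START SLAVE")
    return None
```

With a real connection, `run_sql` would wrap whatever client call executes a statement; running the loop while the server is under load (for example during a pt-table-checksum run) matches the conditions under which the hang was reported.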

Attachments

- my.cnf (1.0 kB, 2016-02-17 13:56)
- myMaster.cnf (0.8 kB, 2016-04-13 06:59)
- mysqld.log (459 kB, 2016-02-17 13:57)
- mysqld-DEBUG.log (343 kB, 2016-04-13 07:21)
- mysqlprocesslist.txt (11 kB, 2016-02-17 13:57)
- processlistVerbose.txt (96 kB, 2016-04-13 07:21)

Issue Links

includes
- MDEV-10644 One of parallel replication threads remains active after STOP SLAVE SQL_THREAD completes (Closed)

relates to
- MDEV-12104 Testing for MDEV-9573 and extra replication bugfixes (Stalled)
- MDEV-17346 parallel slave start and stop races to workers disappeared (Closed)
- MDEV-31572 STOP SLAVE hangs on 10.3.39 (Closed)

Activity

Kristian Nielsen added a comment - 2016-04-15 12:39

Mailing list thread:

https://lists.launchpad.net/maria-developers/msg09486.html

So the thing that seems to trigger the hang here is accessing information_schema.session_status from a slave-replicated query (in this case from a trigger on a table modified during replication).

Michael Widenius added a comment - 2017-01-29 20:10

I agree with Kristian that the issue is mainly due to holding LOCK_active_mi across STOP SLAVE.

I now have a fix that introduces object counting for Master_info and no longer takes LOCK_active_mi over stop slave or even stop_all_slaves().

This seems to fix the problem and gives us some other benefits:

- Multiple threads can run SHOW SLAVE STATUS at the same time.
  (There are still some internal locks between the SQL level and the slave level, but fewer than before.)
- START/STOP/RESET/SLAVE STATUS on a slave will not block other slaves.
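
The object-counting idea can be illustrated with a small model. This is a hedged sketch in Python, not the server's C++ code: `MasterInfo`, `MasterInfoIndex`, `users`, and `slow_stop` are hypothetical names chosen for illustration, and the index lock only plays the role of LOCK_active_mi. The point is that the global lock is held just long enough to look up and pin the object, and is released before the long-running stop.

```python
import threading


class MasterInfo:
    """Hypothetical stand-in for the server's Master_info object."""

    def __init__(self, name):
        self.name = name
        self.users = 0        # number of threads currently pinning this object
        self.running = True


class MasterInfoIndex:
    """Registry whose lock plays the role of LOCK_active_mi (an assumption)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cond = threading.Condition(self._lock)
        self._by_name = {}

    def add(self, mi):
        with self._lock:
            self._by_name[mi.name] = mi

    def acquire(self, name):
        """Look up and pin an object; the lock is held only for the lookup."""
        with self._lock:
            mi = self._by_name.get(name)
            if mi is not None:
                mi.users += 1
            return mi

    def release(self, mi):
        """Unpin an object and wake anyone waiting to remove it."""
        with self._cond:
            mi.users -= 1
            self._cond.notify_all()

    def remove(self, name):
        """Unregister an object, then wait until no thread still pins it."""
        with self._cond:
            mi = self._by_name.pop(name, None)
            if mi is None:
                return False
            while mi.users > 0:
                self._cond.wait()
            return True


def stop_slave(index, name, slow_stop):
    """Stop one slave without holding the index lock during the slow part."""
    mi = index.acquire(name)
    if mi is None:
        return False
    try:
        slow_stop(mi)         # long-running; the index lock is NOT held here
        mi.running = False
    finally:
        index.release(mi)
    return True
```

Because `stop_slave` holds only a pin during `slow_stop`, another thread can still call `acquire` on the same or a different entry (the analogue of SHOW SLAVE STATUS) while the stop is in progress.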

Michael Widenius added a comment - 2017-02-28 14:19 - edited

I have been working on this and related bugs found by Elena.
The patch I have been working on should fix most cases of parallel start slave, stop slave, change master combined with global read lock and shutdown.

An added benefit of the new approach is that one will now be able to run start/stop/change master on different connections in parallel. Previously these were serialized by the LOCK_active_mi mutex.

I will update this Jira entry with a full set of commit logs when I am done.

Michael Widenius added a comment - 2017-03-13 10:53

A fix has now been pushed into the 10.0 tree that fixes a major part of the
problems with START/STOP SLAVE when running in parallel and with stop
of server or FLUSH TABLES WITH READ LOCK. There are still a few edge
cases that I will try to work out over time. However, the code is now
much more robust than ever before.

Here follows a list of the commits related to my fixing this issue.

Fixed deadlocks when doing stop slave while slave was starting.

- Added a separate lock for protecting start/stop/reset of a specific slave.
  This solves some possible deadlocks when one calls stop slave while
  the slave is starting, as the old run_locks was overused for other things.
- Set hash->records to 0 before calling free of all hash elements.
  This was done to stop concurrent threads from looping over hash elements
  and accessing members that were already freed.
  This was a problem especially in start_all_slaves/stop_all_slaves,
  as the mutex protecting the hash was temporarily released while a slave
  was started/stopped.
- Because of the change to hash->records during hash_reset(),
  any_slave_sql_running() will return 1 during shutdown, as one can't
  loop over master_info_index->master_info_hash while hash_reset() of it
  is in progress.
  This also fixes a potential old bug in any_slave_sql_running() where
  during shutdown and ~Master_info_index(), my_hash_free() we could
  potentially try to access elements that were already freed.

Fixed hang doing FLUSH TABLES WITH READ LOCK and parallel replication

The problem was that waiting for pause_for_ftwrl was done before
event_group was completed. This caused rpl_pause_for_ftwrl() to wait
forever during FLUSH TABLES WITH READ LOCK.
Now we only wait for FLUSH TABLES WITH READ LOCK when we are changing
to a new event group.

Add protection to not access is_open() without LOCK_log mutex

Protection added to reopen_file() and new_file_impl().
Without this we could get an assert in fn_format() as name == 0,
because the file was closed and name reset at the same time
new_file_impl() was called.

Don't allow one to kill START SLAVE while the slave's IO_THREAD or SQL_THREAD
is starting.

This is needed because if we kill the START SLAVE thread too early during
shutdown, the IO_THREAD or SQL_THREAD will not have time to properly
initialize its replication or THD structures, and clean_up() will try to
delete master_info structures that are still in use.

Add protection for reinitialization of mutex in parallel replication

Added mutex_lock/mutex_unlock of the mutex that is to be destroyed in
wait_for_commit::reinit(), in a similar fashion to what we do in
~wait_for_commit

MDEV-9573 'Stop slave' hangs on replication slave

The reason for this is that stop slave takes LOCK_active_mi over the
whole operation while some slave operations will also need LOCK_active_mi,
which causes deadlocks.

Fixed by introducing object counting for Master_info and not taking
LOCK_active_mi over stop slave or even stop_all_slaves()

Another benefit of this approach is that it allows:
- Multiple threads can run SHOW SLAVE STATUS at the same time
- START/STOP/RESET/SLAVE STATUS on a slave will not block other slaves
- Simpler interface for handling get_master_info()
- Added some missing unlock of 'log_lock' in error conditions
- Moved rpl_parallel_inactivate_pool(&global_rpl_thread_pool) to end
  of stop_slave() to not have to use LOCK_active_mi inside
  terminate_slave_threads()
- Changed argument for remove_master_info() to Master_info, as we always
  have this available
- Fixed core dump when doing FLUSH TABLES WITH READ LOCK and parallel
  replication. Problem was that waiting for pause_for_ftwrl was not done
  when deleting rpt->current_owner after a force_abort.
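
The deadlock described in the MDEV-9573 commit (the stopper holds LOCK_active_mi for the whole operation while a slave operation also needs it) can be modelled in a few lines. This is only an illustrative sketch: the function and parameter names are hypothetical, a plain Python lock stands in for LOCK_active_mi, and a timeout replaces the indefinite wait so the example terminates instead of hanging.

```python
import threading


def stop_slave_holding_lock(active_mi_lock, worker_needs_lock, join_timeout=0.5):
    """Model of a stop that waits for a worker while holding a global lock.

    Returns True if the worker finished while the lock was still held,
    False if it could not (the situation that hangs forever without a
    timeout). All names here are hypothetical, not the server's own.
    """
    worker_done = threading.Event()

    def worker():
        if worker_needs_lock:
            # E.g. a replicated query that needs the same global lock;
            # it blocks for as long as the stopper holds it.
            with active_mi_lock:
                pass
        worker_done.set()

    with active_mi_lock:
        t = threading.Thread(target=worker, daemon=True)
        t.start()
        # The stopper waits for the worker while still holding the lock.
        stopped = worker_done.wait(join_timeout)
    # Once the lock is released the worker can always finish.
    t.join()
    return stopped
```

When the worker needs the lock, the wait can only end by timing out, which corresponds to the reported hang; when it does not, the stop completes immediately. Releasing the lock before waiting, as the fix does, removes the dependency cycle.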

Michael Widenius added a comment - 2017-05-22 12:50

Pushed to 10.0 at end of January. It should be in all newer releases.
