MariaDB Server / MDEV-9573: 'Stop slave' hangs on replication slave

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.1.10, 10.1.11, 10.1.13
    • Fix Version/s: 10.0.30, 10.1.22
    • Component/s: Replication
    • Environment: CentOS release 6.7 (Final) x64 on Dell PowerEdge R510

    Description

      Since we switched from MariaDB 10.0.x to MariaDB 10.1 we have been having trouble with the statement 'stop slave;': it just hangs and does not return, even after hours. A 'show slave status\G' issued from another connection blocks at that moment as well and produces no output.
      This happens randomly; with low load on the server it is quite hard to reproduce. Stopping and starting the slave in short intervals may succeed 30 times or more without problems.
      If the server is under heavy load, only very few tries are needed to reproduce it. I can reproduce it very quickly while table checksums are being created with pt-table-checksum.
      The only way to stop MariaDB while 'stop slave' is hanging is 'kill -9'.
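
      For illustration, the symptom looks like this (a sketch; the second session is any ordinary client connection):

        -- Session 1, on the slave while under load:
        STOP SLAVE;              -- hangs and never returns

        -- Session 2, opened while session 1 is stuck:
        SHOW SLAVE STATUS\G      -- blocks as well, no output
        SHOW PROCESSLIST;        -- still answers (see mysqlprocesslist.txt)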
      We are using parallel replication, as you can see in the attached my.cnf.
      A backtrace is also attached, created as described on mariadb.org. If necessary, I could repeat it with a DEBUG build.
      I have also attached the list of mysql processes running at that moment.
      We never had this issue with MariaDB 10.0.x with the same configuration, except for slave_run_triggers_for_rbr = 1 of course, since that option is only available in MariaDB 10.1.

      Please let me know if I can provide any more details that might be helpful.

      Attachments

        1. my.cnf
          1.0 kB
          Markus Nägele
        2. myMaster.cnf
          0.8 kB
          Markus Nägele
        3. mysqld.log
          459 kB
          Markus Nägele
        4. mysqld-DEBUG.log
          343 kB
          Markus Nägele
        5. mysqlprocesslist.txt
          11 kB
          Markus Nägele
        6. processlistVerbose.txt
          96 kB
          Markus Nägele


          Activity

            knielsen Kristian Nielsen added a comment

            Mailing list thread:

            https://lists.launchpad.net/maria-developers/msg09486.html

            So the thing that seems to trigger the hang here is accessing information_schema.session_status from a slave-replicated query (in this case from a trigger on a table modified during replication).
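
            For illustration, a trigger of roughly this shape would hit the problem (the table, trigger, and audit_log names here are made up; with slave_run_triggers_for_rbr = 1 the slave SQL thread fires triggers while applying row-based events):

                -- Hypothetical audit trigger on a replicated table t1:
                CREATE TRIGGER t1_ai AFTER INSERT ON t1 FOR EACH ROW
                  INSERT INTO audit_log (var_name, var_value)
                    SELECT VARIABLE_NAME, VARIABLE_VALUE
                      FROM information_schema.SESSION_STATUS
                     WHERE VARIABLE_NAME = 'Slave_running';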


            monty Michael Widenius added a comment

            Agree with Kristian that the issue is mainly due to keeping LOCK_active_mi locked over STOP SLAVE.

            I now have a fix that introduces object counting for Master_info and no longer takes LOCK_active_mi over STOP SLAVE or even stop_all_slaves().

            This seems to fix the problem and gives us some other benefits:

            • Multiple threads can run SHOW SLAVE STATUS at the same time
              (There are still some internal locks between the SQL level and the slave level, but fewer than before)
            • START/STOP/RESET SLAVE and SHOW SLAVE STATUS on one slave will not block other slaves (see the sketch below)
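
            For illustration, on a multi-source slave the difference looks like this (a sketch; the connection names are made up):

                -- Before the fix, these two statements issued from different
                -- connections serialized on LOCK_active_mi; with the fix they
                -- can run concurrently:
                STOP SLAVE 'master1';            -- connection A
                SHOW SLAVE 'master2' STATUS\G    -- connection B, no longer blocked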
            monty Michael Widenius added a comment - edited

            Have been working on this and related bugs found by Elena.
            The patch I have been working on should fix most cases of parallel start slave, stop slave, change master combined with global read lock and shutdown.

            An added benefit of the new approach is that one will now be able to run start/stop/change master on different connections in parallel. Before, these were serialized with the LOCK_active_mi mutex.

            Will update this Jira entry with a full set of commit logs when I am done.


            monty Michael Widenius added a comment

            A fix is now pushed into the 10.0 tree that fixes a major part of the
            problems with START/STOP SLAVE when running in parallel and with server
            shutdown or FLUSH TABLES WITH READ LOCK. There are still a few edge
            cases that I will try to work out over time. However, the code is now
            much more robust than ever before.

            Here follows a list of the commits related to fixing this issue.

            Fixed deadlocks when doing STOP SLAVE while the slave was starting.

            • Added a separate lock for protecting start/stop/reset of a specific slave.
              This solves some possible deadlocks when one calls STOP SLAVE while
              the slave is starting, as the old run_locks were overused for other things.
            • Set hash->records to 0 before freeing all hash elements.
              This stops concurrent threads from looping over hash elements and
              accessing members that were already freed.
              This was a problem especially in start_all_slaves()/stop_all_slaves(),
              as the mutex protecting the hash was temporarily released while a slave
              was started/stopped.
            • Because of the change to hash->records during hash_reset(),
              any_slave_sql_running() will return 1 during shutdown, as one can't
              loop over master_info_index->master_info_hash while hash_reset() of it
              is in progress.
              This also fixes a potential old bug in any_slave_sql_running() where,
              during shutdown and ~Master_info_index()/my_hash_free(), we could
              potentially access elements that were already freed.

            Fixed hang when doing FLUSH TABLES WITH READ LOCK with parallel replication

            • The problem was that waiting for pause_for_ftwrl was done before the
              event group was completed. This caused rpl_pause_for_ftwrl() to wait
              forever during FLUSH TABLES WITH READ LOCK.
              Now we only wait for FLUSH TABLES WITH READ LOCK when we are changing
              to a new event group.
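
            For context, this is the statement sequence used by, for example, backup tools (a sketch):

                FLUSH TABLES WITH READ LOCK;   -- on a busy parallel slave this
                                               -- could previously wait forever in
                                               -- rpl_pause_for_ftwrl()
                -- ... take the backup or snapshot here ...
                UNLOCK TABLES;                 -- with the fix, the applier pauses
                                               -- only at event-group boundaries,
                                               -- so FTWRL completes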

            Added protection to not access is_open() without the LOCK_log mutex

            • Protection added to reopen_file() and new_file_impl().
              Without this we could get an assert in fn_format() because name == 0:
              the file was closed and name reset at the same time
              new_file_impl() was called.

            Don't allow one to kill START SLAVE while the slave's IO_THREAD or SQL_THREAD
            is starting.

            • This is needed because if we kill the START SLAVE thread too early during
              shutdown, the IO_THREAD or SQL_THREAD will not have time to properly
              initialize its replication or THD structures, and clean_up() will try to
              delete master_info structures that are still in use.

            Added protection for reinitialization of a mutex in parallel replication

            • Added mutex_lock/mutex_unlock of the mutex that is to be destroyed in
              wait_for_commit::reinit(), in a similar fashion to what we do in
              ~wait_for_commit()

            MDEV-9573 'Stop slave' hangs on replication slave

            • The reason for this is that STOP SLAVE takes LOCK_active_mi over the
              whole operation, while some slave operations also need LOCK_active_mi,
              which causes deadlocks.
            • Fixed by introducing object counting for Master_info and not taking
              LOCK_active_mi over STOP SLAVE or even stop_all_slaves().
            • Another benefit of this approach is that it allows:
              • multiple threads to run SHOW SLAVE STATUS at the same time;
              • START/STOP/RESET SLAVE and SHOW SLAVE STATUS on one slave without
                blocking other slaves.
            • Simpler interface for handling get_master_info()
            • Added some missing unlocks of 'log_lock' in error conditions
            • Moved rpl_parallel_inactivate_pool(&global_rpl_thread_pool) to the end
              of stop_slave() to not have to use LOCK_active_mi inside
              terminate_slave_threads()
            • Changed the argument of remove_master_info() to Master_info, as we
              always have this available
            • Fixed a core dump when doing FLUSH TABLES WITH READ LOCK with parallel
              replication. The problem was that waiting for pause_for_ftwrl was not
              done when deleting rpt->current_owner after a force_abort.

            monty Michael Widenius added a comment

            Pushed to 10.0 at the end of January. Should be in all newer releases.


            People

              Assignee: monty Michael Widenius
              Reporter: optonaegele Markus Nägele
              Votes: 2
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved:
