MDEV-27850 (MariaDB Server)

MTR tests can hang due to DEBUG_SYNC race condition

Details

    • Type: Bug
    • Status: Stalled
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
    • Fix Version/s: 10.5, 10.6, 10.11
    • Component/s: Replication, Tests
    • Labels: None

    Description

      If a DEBUG_SYNC signal is overwritten before the target thread acknowledges it, the thread becomes stuck (until timeout) waiting for the missed signal. The test rpl.rpl_seconds_behind_master_spike demonstrates the problem; commit cdf19cd is an example fix.

      Other tests that may be affected by this issue are rpl.rpl_dump_request_retry_warning, main.query_cache_debug, and main.partition_debug_sync. A comprehensive list of affected tests should be compiled, and those tests should then be fixed.
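
      To make the race concrete, here is a minimal Python sketch of a single-slot signal, assuming (as the "single signal limitation" mentioned in the comments suggests) that a newly emitted signal replaces the previous one. The names SingleSlotSignal and run_schedule are hypothetical and are not part of the server or of MTR; the sketch only replays an interleaving and reports which waits would have blocked until timeout.

```python
class SingleSlotSignal:
    """Toy model (an assumption, not server code): one global slot
    holds the most recently emitted signal name."""

    def __init__(self):
        self.current = None  # last emitted signal; a new signal replaces it

    def signal(self, name):
        # Emitting a signal overwrites whatever was in the slot before.
        self.current = name

    def is_set(self, name):
        # A waiter wakes only if the slot holds exactly the awaited name.
        return self.current == name


def run_schedule(schedule):
    """Replay an interleaving of ('signal', name) and ('wait', name) steps.

    Returns the names of the waits that would block until timeout because
    the awaited signal was no longer in the slot.
    """
    slot = SingleSlotSignal()
    stuck = []
    for action, name in schedule:
        if action == "signal":
            slot.signal(name)
        elif action == "wait" and not slot.is_set(name):
            stuck.append(name)
    return stuck


# The waiter observes sig_a before sig_b is emitted: nothing is missed.
good = [("signal", "sig_a"), ("wait", "sig_a"),
        ("signal", "sig_b"), ("wait", "sig_b")]

# sig_b is emitted before the waiter has observed sig_a: sig_a is lost.
racy = [("signal", "sig_a"), ("signal", "sig_b"),
        ("wait", "sig_a"), ("wait", "sig_b")]

print(run_schedule(good))  # []
print(run_schedule(racy))  # ['sig_a']
```

      In a real test the ordering of the two signals relative to the wait depends on thread scheduling, which is why such hangs tend to be sporadic.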

      Edit:
      The following is an ongoing list of tests that are potentially impacted by this race condition, with a note on those already fixed. Part of this work extended the debug_sync mechanism to automatically detect when an unacknowledged signal is overwritten or reset; the list contains all tests that fail under that detection:

      • innodb.innodb-table-online
      • innodb.innodb-index-online
      • binlog_encryption.rpl_parallel
      • binlog_encryption.rpl_parallel_ignored_errors
      • rpl.rpl_get_master_version_and_clock
      • rpl.rpl_parallel
      • rpl.rpl_parallel_ignored_errors
      • rpl.kill_race_condition
      • rpl.rpl_seconds_behind_master_spike (fixed)
      • rpl.rpl_dump_request_retry_warning (fixed)
      • main.query_cache_debug (fixed)
      • main.partition_debug_sync (fixed)
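
      The detection described above can be illustrated with a small Python sketch. The issue does not say how the debug_sync extension is implemented, so everything here (CheckedSignalSlot, UnacknowledgedSignalError, first_violation) is a hypothetical model: it tracks whether the pending signal has been seen by a waiter and raises if a new signal or a reset would discard it.

```python
class UnacknowledgedSignalError(RuntimeError):
    """Raised when a pending signal would be lost (hypothetical name)."""


class CheckedSignalSlot:
    """Toy single-slot signal that refuses to drop an unseen signal."""

    def __init__(self):
        self.current = None
        self.acknowledged = True  # nothing is pending initially

    def _check_pending(self, action):
        if self.current is not None and not self.acknowledged:
            raise UnacknowledgedSignalError(
                f"{action} would discard unacknowledged signal {self.current!r}"
            )

    def signal(self, name):
        self._check_pending(f"signal {name!r}")
        self.current = name
        self.acknowledged = False

    def wait_for(self, name):
        # A successful wait counts as the acknowledgement.
        if self.current == name:
            self.acknowledged = True
            return True
        return False

    def reset(self):
        self._check_pending("reset")
        self.current = None
        self.acknowledged = True


def first_violation(steps):
    """Return the index of the first step that would lose a signal, or None."""
    slot = CheckedSignalSlot()
    for index, (action, name) in enumerate(steps):
        try:
            if action == "signal":
                slot.signal(name)
            elif action == "wait":
                slot.wait_for(name)
            elif action == "reset":
                slot.reset()
        except UnacknowledgedSignalError:
            return index
    return None


clean = [("signal", "a"), ("wait", "a"), ("signal", "b"),
         ("wait", "b"), ("reset", None)]
print(first_violation(clean))                              # None
print(first_violation([("signal", "a"), ("signal", "b")])) # 1
print(first_violation([("signal", "a"), ("reset", None)])) # 1
```

      Failing at the point of the overwrite, rather than waiting for a timeout, is what lets a check of this kind list every test with the problem instead of only those that happen to hang.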

          Activity

            bnestere Brandon Nesterenko created issue -
            bnestere Brandon Nesterenko made changes -
            Field Original Value New Value
            Fix Version/s 10.2 [ 14601 ]
            bnestere Brandon Nesterenko made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            bnestere Brandon Nesterenko added a comment -

            Hey Andrei!

            Can you review my patch for fixing tests main.query_cache_debug, main.partition_debug_sync, and rpl.rpl_dump_request_retry_warning?

            Commit: 883fe83

            Buildbot: bb-10.2-MDEV-27850

            Thanks!

            bnestere Brandon Nesterenko made changes -
            Assignee Brandon Nesterenko [ JIRAUSER48702 ] Andrei Elkin [ elkin ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            Elkin Andrei Elkin added a comment -

            The test changes look good. Let's address DEBUG_SYNC's single signal limitation as a followup.

            Elkin Andrei Elkin made changes -
            Assignee Andrei Elkin [ elkin ] Brandon Nesterenko [ JIRAUSER48702 ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            bnestere Brandon Nesterenko made changes -
            Description: appended the "Edit:" paragraph and the list of potentially impacted tests; the original and new text otherwise match the Description above.
            bnestere Brandon Nesterenko made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.2 [ 14601 ]
            marko Marko Mäkelä made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.11 [ 27614 ]
            Affects Version/s 10.10 [ 27530 ]
            Affects Version/s 10.11 [ 27614 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.7 [ 24805 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.3 [ 22126 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.8 [ 26121 ]
            bnestere Brandon Nesterenko made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            bnestere Brandon Nesterenko made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.9 [ 26905 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.10 [ 27530 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.4 [ 22408 ]

            People

              Assignee: Brandon Nesterenko
              Reporter: Brandon Nesterenko
              Votes: 0
              Watchers: 6