Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-36302

Add option for SQL thread to kill any blocking DML

Details

    • New Feature
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 12.1
    • Replication
    • None

    Description

      DML executed on primary may be blocked by any long running DML that takes write locks, like SELECT FOR UPDATE or SELECT IN SHARE MODE, on the replica, thus breaking the "high availability" of this setup.

      Please, add a server variable to force priority for SQL thread in such cases,, similar to --slave-abort-blocking-timeout.

      Attachments

        Issue Links

          Activity

            Can we improve the logging for this type of error by recording which thread id blocked the slave SQL thread? In the case of a user thread, it’s easy to identify.

            2025-03-04 10:04:29 525132 [Warning] Aborted connection 525132 to db: 'database' user: 'appuser' host: '127.0.0.1' real ip: '10.232.147.130' (Got timeout reading communication packets)
            2025-03-04 10:04:32 526165 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
            2025-03-04 10:04:32 526165 [ERROR] Slave SQL: Lock wait timeout exceeded; try restarting transaction, Gtid 1-1-1191236814, Internal MariaDB error code: 1205
            

            pandi.gurusamy Pandikrishnan Gurusamy added a comment - Can we improve the logging for this type of error by recording which thread id blocked the slave SQL thread? In the case of a user thread, it’s easy to identify. 2025-03-04 10:04:29 525132 [Warning] Aborted connection 525132 to db: 'database' user: 'appuser' host: '127.0.0.1' real ip: '10.232.147.130' (Got timeout reading communication packets) 2025-03-04 10:04:32 526165 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. 2025-03-04 10:04:32 526165 [ERROR] Slave SQL: Lock wait timeout exceeded; try restarting transaction, Gtid 1-1-1191236814, Internal MariaDB error code: 1205

            The description says: "DML executed on primary may be blocked by any long
            running DML that takes write locks, like SELECT FOR UPDATE or SELECT IN
            SHARE MODE, on the replica".

            I do not understand what is meant. How can DML on the master be blocked by
            DML running on a slave?

            From the title, it seems like the intention is to give the slave threads the
            ability to kill non-slave queries if it gets blocked on any metadata lock or
            engine table/row lock. It is not clear if this kill should be immediate, or
            after some timeout like --slave-abort-blocking-timeout.

            Also it is not clear what 'breaking the "high availability"' means here.
            From other discussions I think this comes from a specific case at a user,
            where some MaxScale switchover was delayed or blocked, but I am not sure the
            actual root cause in this scenario was ever fully understood.

            I would like to see a detailed description of the intended use case or use
            cases here, be able to understand if this is the best and correct solution
            for these use cases.

            In a normal read-only slave, there should not normally be a need to use
            SELECT FOR UPDATE or IN SHARE MODE in user queries. REPEATABLE READ will
            provide snapshot isolation that should give consistent queries without
            needing any locks that would block slave threads. This would be a better
            solution that avoids user SELECTs getting randomly killed.

            If the slave is non-readonly and there are conflicting user updates, the
            issue is more complex, and merely killing off user queries will likely not
            avoid problems, eg. the slave diverging from the master for example.

            Also, even if there are user queries temporarily blocking a slave thread,
            those queries would release their locks at transaction commit. So if they
            are permanently blocking the slave, something else seems to be wrong, again
            suggesting the real solution may lie elsewhere.

            Now, onto possible implementation details.

            There are (at least) two different kinds of locks that can block a slave
            thread: metadata locks and storage engine locks (typically InnoDB row- or
            table-level locks).

            The new --slave-abort-blocking-timeout option works on metadata locks.
            Metadata locks can be exclusive; these can block against other queries, and
            are used mostly by DDL, I think. Normal DML would not block each other on
            metadata locks, IIUC.

            There is an existing facility for blocking storage engine locks, using the
            functions thd_need_wait_reports() and thd_rpl_deadlock_check(). This
            facility is used for detecting deadlocks between in-order committed
            transactions in parallel replication, so currently only active for slave
            threads.

            This facility could be extended so that it would kill any non-slave
            transaction that blocks any slave transaction. This needs support in each
            storage engine, support is there currently for InnoDB. There is no support
            in this facility to have a timeout before the kill happens, it would be
            immediate whenever a slave thread gets blocked on an InnoDB lock.

            The thd_rpl_deadlock_check() can have performance impact on non-slave
            threads, see MDEV-24948, and there is a pending patch for a performance bug
            in branch origin/knielsen_mdev24948. Therefore, using this to kill
            non-slave queries would need to be configurable. Enabling the facility would
            only affect new connections, as thd_rpl_deadlock_check() is called at
            connection start and cached, for efficiency.

            knielsen Kristian Nielsen added a comment - The description says: "DML executed on primary may be blocked by any long running DML that takes write locks, like SELECT FOR UPDATE or SELECT IN SHARE MODE, on the replica". I do not understand what is meant. How can DML on the master be blocked by DML running on a slave? From the title, it seems like the intention is to give the slave threads the ability to kill non-slave queries if it gets blocked on any metadata lock or engine table/row lock. It is not clear if this kill should be immediate, or after some timeout like --slave-abort-blocking-timeout . Also it is not clear what 'breaking the "high availability"' means here. From other discussions I think this comes from a specific case at a user, where some MaxScale switchover was delayed or blocked, but I am not sure the actual root cause in this scenario was ever fully understood. I would like to see a detailed description of the intended use case or use cases here, be able to understand if this is the best and correct solution for these use cases. In a normal read-only slave, there should not normally be a need to use SELECT FOR UPDATE or IN SHARE MODE in user queries. REPEATABLE READ will provide snapshot isolation that should give consistent queries without needing any locks that would block slave threads. This would be a better solution that avoids user SELECTs getting randomly killed. If the slave is non-readonly and there are conflicting user updates, the issue is more complex, and merely killing off user queries will likely not avoid problems, eg. the slave diverging from the master for example. Also, even if there are user queries temporarily blocking a slave thread, those queries would release their locks at transaction commit. So if they are permanently blocking the slave, something else seems to be wrong, again suggesting the real solution may lie elsewhere. Now, onto possible implementation details. There are (at least) two different kinds of locks that can block a slave thread: metadata locks and storage engine locks (typically InnoDB row- or table-level locks). The new --slave-abort-blocking-timeout option works on metadata locks. Metadata locks can be exclusive; these can block against other queries, and are used mostly by DDL, I think. Normal DML would not block each other on metadata locks, IIUC. There is an existing facility for blocking storage engine locks, using the functions thd_need_wait_reports() and thd_rpl_deadlock_check() . This facility is used for detecting deadlocks between in-order committed transactions in parallel replication, so currently only active for slave threads. This facility could be extended so that it would kill any non-slave transaction that blocks any slave transaction. This needs support in each storage engine, support is there currently for InnoDB. There is no support in this facility to have a timeout before the kill happens, it would be immediate whenever a slave thread gets blocked on an InnoDB lock. The thd_rpl_deadlock_check() can have performance impact on non-slave threads, see MDEV-24948 , and there is a pending patch for a performance bug in branch origin/knielsen_mdev24948 . Therefore, using this to kill non-slave queries would need to be configurable. Enabling the facility would only affect new connections, as thd_rpl_deadlock_check() is called at connection start and cached, for efficiency.
            knielsen Kristian Nielsen added a comment - - edited

            > Pandikrishnan Gurusamy added a comment

            > Kristian Nielsen It looks like the description was by mistake. The actual
            > issue was that some user connection in. the slave was blocking the slave
            > SQL thread.

            What exactly was that user connection doing that was blocking the slave, and
            why did the user need to have such connection?

            > In the read-only slave, we are not allowing SELECT FOR UPDATE, but it
            > seems that LOCK IN SHARE MODE is possible for a non-super user in the
            > slave (read-only=1)

            Is there any evidence that LOCK IN SHARE MODE was used here? And if so, why
            is the application doing LOCK IN SHARE MODE?

            > Connection blocks with SQL thread ( It looks like a direct Database connection).
            >  
            > | 249175 | XXX | X.X.X.X:59912 | XXX | Sleep | 16 | | NULL | 0.000 |
            

            So this is not a running query, it seems to be an idle connection?

            > +---------------------+--------+-----------+------+---------------------+------------------+-----------------------+--------------------+-----------------+---------------------+--------------+---------------------------+
            > | trx_mysql_thread_id | trx_id | trx_state | time | trx_started | trx_is_read_only | trx_active_in_seconds | trx_active_in_days | lock_mode | lock_type | table_schema | table_name |
            > +---------------------+--------+-----------+------+---------------------+------------------+-----------------------+--------------------+-----------------+---------------------+--------------+---------------------------+
            > | 249175 | 0 | RUNNING | 17 | 2025-02-27 22:16:37 | 0 | 3386 | 0.03918981 | MDL_SHARED_READ | Table metadata lock | XXX | XXX |
            

            So this is a transaction that has been active for almost an hour, while the
            connection seems to be idle. Why is there an open transaction running for
            that long without completing? Isn't that the real problem here?

            If the problem is idle connections that are holding a transaction open, then
            maybe using wait_timeout (or interactive_timeout) and/or
            idle_transaction_timeout is the solution. Also consider using autocommit on
            the client.

            Killing a blocking DML, as suggested in the title of this MDEV, will not
            help if an idle connection is holding a transaction open. Should the slave
            kill (and thus disconnect) any transaction holding a blocking lock? That
            seems overly drastic. The --slave-abort-blocking-timeout option, for
            comparison, kills a blocking query, not a connection.

            knielsen Kristian Nielsen added a comment - - edited > Pandikrishnan Gurusamy added a comment > Kristian Nielsen It looks like the description was by mistake. The actual > issue was that some user connection in. the slave was blocking the slave > SQL thread. What exactly was that user connection doing that was blocking the slave, and why did the user need to have such connection? > In the read-only slave, we are not allowing SELECT FOR UPDATE, but it > seems that LOCK IN SHARE MODE is possible for a non-super user in the > slave (read-only=1) Is there any evidence that LOCK IN SHARE MODE was used here? And if so, why is the application doing LOCK IN SHARE MODE? > Connection blocks with SQL thread ( It looks like a direct Database connection). > > | 249175 | XXX | X.X.X.X:59912 | XXX | Sleep | 16 | | NULL | 0.000 | So this is not a running query, it seems to be an idle connection? > +---------------------+--------+-----------+------+---------------------+------------------+-----------------------+--------------------+-----------------+---------------------+--------------+---------------------------+ > | trx_mysql_thread_id | trx_id | trx_state | time | trx_started | trx_is_read_only | trx_active_in_seconds | trx_active_in_days | lock_mode | lock_type | table_schema | table_name | > +---------------------+--------+-----------+------+---------------------+------------------+-----------------------+--------------------+-----------------+---------------------+--------------+---------------------------+ > | 249175 | 0 | RUNNING | 17 | 2025-02-27 22:16:37 | 0 | 3386 | 0.03918981 | MDL_SHARED_READ | Table metadata lock | XXX | XXX | So this is a transaction that has been active for almost an hour, while the connection seems to be idle. Why is there an open transaction running for that long without completing? Isn't that the real problem here? If the problem is idle connections that are holding a transaction open, then maybe using wait_timeout (or interactive_timeout) and/or idle_transaction_timeout is the solution. Also consider using autocommit on the client. Killing a blocking DML, as suggested in the title of this MDEV, will not help if an idle connection is holding a transaction open. Should the slave kill (and thus disconnect) any transaction holding a blocking lock? That seems overly drastic. The --slave-abort-blocking-timeout option, for comparison, kills a blocking query, not a connection.

            People

              knielsen Kristian Nielsen
              monty Michael Widenius
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.