MDEV-21953 (MariaDB Server)

deadlock between BACKUP STAGE BLOCK_COMMIT and parallel replication

Details

    Description

      There appears to be a race condition between BACKUP STAGE BLOCK_COMMIT (issued by mariabackup) and parallel replication.

      Replication deadlocked, and we had to kill the backup process to recover.
      The output of SHOW PROCESSLIST and SHOW ENGINE INNODB STATUS follows.

      Rick

      +----------+--------------+--------------------+------------------------+--------------+--------+-----------------------------------------------+----------------------------------------------------------------------------------------------+----------+
      | Id       | User         | Host               | db                     | Command      | Time   | State                                         | Info                                                                                         | Progress |
      +----------+--------------+--------------------+------------------------+--------------+--------+-----------------------------------------------+----------------------------------------------------------------------------------------------+----------+
      |        1 | system user  |                    | NULL                   | Daemon       |   NULL | InnoDB purge coordinator                      | NULL                                                                                         |    0.000 |
      |        2 | system user  |                    | NULL                   | Daemon       |   NULL | InnoDB purge worker                           | NULL                                                                                         |    0.000 |
      |        3 | system user  |                    | NULL                   | Daemon       |   NULL | InnoDB purge worker                           | NULL                                                                                         |    0.000 |
      |        4 | system user  |                    | NULL                   | Daemon       |   NULL | InnoDB purge worker                           | NULL                                                                                         |    0.000 |
      |        5 | system user  |                    | NULL                   | Daemon       |   NULL | InnoDB shutdown handler                       | NULL                                                                                         |    0.000 |
      |       10 | system user  |                    | NULL                   | Slave_IO     | 913176 | Waiting for master to send event              | NULL                                                                                         |    0.000 |
      |       12 | system user  |                    | NULL                   | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                                                                                         |    0.000 |
      |       13 | system user  |                    | NULL                   | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                                                                                         |    0.000 |
      |       14 | system user  |                    | NULL                   | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                                                                                         |    0.000 |
      |       15 | system user  |                    | NULL                   | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                                                                                         |    0.000 |
      |       16 | system user  |                    | NULL                   | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                                                                                         |    0.000 |
      |       17 | system user  |                    | NULL                   | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                                                                                         |    0.000 |
      |       18 | system user  |                    | NULL                   | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                                                                                         |    0.000 |
      |       19 | system user  |                    | NULL                   | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                                                                                         |    0.000 |
      |       20 | system user  |                    | NULL                   | Slave_worker |  31214 | Waiting for backup lock                       | NULL                                                                                         |    0.000 |
      |       21 | system user  |                    | NULL                   | Slave_worker |  31215 | Waiting for backup lock                       | UPDATE `customer_schema`.`heartbeat` SET ts='2020-03-16T01:12:54.009860' WHERE id='1'        |    0.000 |
      |       22 | system user  |                    | NULL                   | Slave_worker |  31215 | Waiting for backup lock                       | NULL                                                                                         |    0.000 |
      |       23 | system user  |                    | NULL                   | Slave_worker |  31215 | Waiting for prior transaction to commit       | NULL                                                                                         |    0.000 |
      |       25 | system user  |                    | NULL                   | Slave_worker |  31215 | Waiting for backup lock                       | NULL                                                                                         |    0.000 |
      |       26 | system user  |                    | NULL                   | Slave_worker |  31215 | Waiting for backup lock                       | NULL                                                                                         |    0.000 |
      |       24 | system user  |                    | NULL                   | Slave_worker |  31215 | Waiting for prior transaction to commit       | NULL                                                                                         |    0.000 |
      |       27 | system user  |                    | NULL                   | Slave_worker |  31215 | Waiting for backup lock                       | NULL                                                                                         |    0.000 |
      |       11 | system user  |                    | NULL                   | Slave_SQL    |  31224 | Waiting for room in worker thread event queue | NULL                                                                                         |    0.000 |
      |       63 | monyogmon    | 192.168.4.5:32729  | NULL                   | Sleep        |      0 |                                               | NULL                                                                                         |    0.000 |
      |     2473 | newrelic     | 127.0.0.1:39359    | NULL                   | Sleep        |      5 |                                               | NULL                                                                                         |    0.000 |
      |     5259 | monyogmon    | 192.168.4.5:32790  | NULL                   | Sleep        |     18 |                                               | NULL                                                                                         |    0.000 |
      | 29385185 | mariabackup  | localhost          | NULL                   | Query        |  31215 | Waiting for backup lock                       | BACKUP STAGE BLOCK_COMMIT                                                                    |    0.000 |
      | 29941894 | user         | 192.168.4.40:21990 | customer_schema        | Sleep        |      0 |                                               | NULL                                                                                         |    0.000 |
      | 31892219 | root         | localhost          | mysql                  | Query        |   2552 | Waiting for backup lock                       | CREATE USER IF NOT EXISTS 'user'@'192.168.4.40'                                              |    0.000 |
      | 31994422 | root         | localhost          | NULL                   | Sleep        |   1282 |                                               | NULL                                                                                         |    0.000 |
      | 31994463 | mariadbadmin | localhost          | NULL                   | Query        |      0 | Init                                          | show processlist                                                                             |    0.000 |
      +----------+--------------+--------------------+------------------------+--------------+--------+-----------------------------------------------+----------------------------------------------------------------------------------------------+----------+
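
When triaging a processlist like the one above, it can help to tally which threads have been stuck in backup-related wait states for a long time. The following is a minimal Python sketch, not part of the original report: it parses a hand-copied subset of the rows above, and the names `parse_processlist`, `long_waiters`, `SUSPECT_STATES` and the one-hour threshold are illustrative assumptions.

```python
from collections import Counter

# A representative subset of rows copied from the processlist in this report
# (column padding shortened; values unchanged).
PROCESSLIST = """
| Id       | User        | Host      | db   | Command      | Time   | State                                         | Info                      | Progress |
|       12 | system user |           | NULL | Slave_worker | 913176 | Waiting for work from SQL thread              | NULL                      |    0.000 |
|       20 | system user |           | NULL | Slave_worker |  31214 | Waiting for backup lock                       | NULL                      |    0.000 |
|       23 | system user |           | NULL | Slave_worker |  31215 | Waiting for prior transaction to commit       | NULL                      |    0.000 |
|       24 | system user |           | NULL | Slave_worker |  31215 | Waiting for prior transaction to commit       | NULL                      |    0.000 |
|       11 | system user |           | NULL | Slave_SQL    |  31224 | Waiting for room in worker thread event queue | NULL                      |    0.000 |
| 29385185 | mariabackup | localhost | NULL | Query        |  31215 | Waiting for backup lock                       | BACKUP STAGE BLOCK_COMMIT |    0.000 |
"""

# Wait states seen in this report; chosen for illustration, not an official list.
SUSPECT_STATES = {
    "Waiting for backup lock",
    "Waiting for prior transaction to commit",
}


def parse_processlist(text):
    """Turn '|'-delimited processlist rows into dicts, skipping header and borders."""
    rows = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 9 or cells[0] == "Id":
            continue  # border line, header, or a row whose Info contains '|'
        rows.append({
            "id": int(cells[0]),
            "user": cells[1],
            "command": cells[4],
            "time": None if cells[5] == "NULL" else int(cells[5]),
            "state": cells[6],
            "info": cells[7],
        })
    return rows


def long_waiters(rows, min_seconds):
    """Return rows that sat in a suspect wait state for at least min_seconds."""
    return [
        r for r in rows
        if r["state"] in SUSPECT_STATES
        and r["time"] is not None
        and r["time"] >= min_seconds
    ]


rows = parse_processlist(PROCESSLIST)
waiters = long_waiters(rows, min_seconds=3600)  # hypothetical one-hour threshold
summary = Counter(r["state"] for r in waiters)
for state, count in summary.items():
    print(f"{count} thread(s): {state}")
```

The idle worker (Id 12) has a large Time value but is not counted, because its state is an ordinary idle state rather than a lock wait.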
      

      ---TRANSACTION 99732954594, ACTIVE (PREPARED) 31143 sec
      6 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 2
      MySQL thread id 24, OS thread handle 139941516424960, query id 2210114407 Waiting for prior transaction to commit
      ---TRANSACTION 421513941321144, not started
      0 lock struct(s), heap size 1136, 0 row lock(s)
      ---TRANSACTION 99732954650, ACTIVE 31143 sec
      3 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 2
      MySQL thread id 26, OS thread handle 139941516629760, query id 2210114591 Waiting for backup lock
      ---TRANSACTION 99732954630, ACTIVE 31143 sec
      3 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 3
      MySQL thread id 25, OS thread handle 139941516220160, query id 2210114496 Waiting for backup lock
      ---TRANSACTION 99732954619, ACTIVE 31143 sec
      3 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 3
      MySQL thread id 22, OS thread handle 139941517039360, query id 2210114460 Waiting for backup lock
      ---TRANSACTION 99732954598, ACTIVE (PREPARED) 31143 sec
      2 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 2
      MySQL thread id 23, OS thread handle 139941516834560, query id 2210114412 Waiting for prior transaction to commit
      ---TRANSACTION 99732954623, ACTIVE 31143 sec
      3 lock struct(s), heap size 1136, 1 row lock(s), undo log entries 2
      MySQL thread id 27, OS thread handle 139941516015360, query id 2210114468 Waiting for backup lock
      ---TRANSACTION 99732954626, ACTIVE 31143 sec
      2 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 4451
      MySQL thread id 20, OS thread handle 139941517448960, query id 2210114475 Waiting for backup lock
      ---TRANSACTION 421513941291352, not started
      0 lock struct(s), heap size 1136, 0 row lock(s)
      ---TRANSACTION 421513941287096, not started
      0 lock struct(s), heap size 1136, 0 row lock(s)
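
One way to read the two outputs above is as a cycle in a wait-for graph. The sketch below is an interpretation, not something the report states. It assumes that a worker in "Waiting for prior transaction to commit" already holds a commit-phase backup lock that BACKUP STAGE BLOCK_COMMIT must wait for, and that the earlier worker is queued behind the backup's pending lock request. The node names `committing_worker`, `earlier_worker` and `backup` are hypothetical labels, not thread ids from the report.

```python
def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    position = {}
    node = start
    while node is not None:
        if node in position:
            return path[position[node]:]  # nodes from first revisit onward
        position[node] = len(path)
        path.append(node)
        node = waits_for.get(node)  # None when the node waits for nothing
    return None


# Assumed wait-for edges (each key waits for its value).
WAITS_FOR = {
    "committing_worker": "earlier_worker",  # "Waiting for prior transaction to commit"
    "earlier_worker": "backup",             # "Waiting for backup lock"
    "backup": "committing_worker",          # BACKUP STAGE BLOCK_COMMIT waiting on commits
}

cycle = find_cycle(WAITS_FOR, "backup")
print("deadlock cycle:", cycle)

# Killing the backup connection removes its node and every edge touching it,
# which matches the recovery described in the report.
after_kill = {
    waiter: target
    for waiter, target in WAITS_FOR.items()
    if "backup" not in (waiter, target)
}
print("after kill:", find_cycle(after_kill, "committing_worker"))
```

Under these assumptions, none of the three parties can proceed until one is removed, which is consistent with replication resuming only after the backup process was killed.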
      

          Activity

            While merging this to 10.5, I omitted the changes to sql_class.cc:

            diff --git a/sql/sql_class.cc b/sql/sql_class.cc
            index 40e606425c5..15088148e02 100644
            --- a/sql/sql_class.cc
            +++ b/sql/sql_class.cc
            @@ -1383,7 +1383,11 @@ void THD::update_all_stats()
             void THD::init_for_queries()
             {
               set_time(); 
            -  ha_enable_transaction(this,TRUE);
            +  /*
            +    We don't need to call ha_enable_transaction() as we can't have
            +    any active transactions that has to be commited
            +  */
            +  transaction.on= TRUE;
             
               reset_root_defaults(mem_root, variables.query_alloc_block_size,
                                   variables.query_prealloc_size);
            

            With the above change (or transaction->on instead of transaction.on), replication XA tests would crash in 10.5. I believed that the change is not wanted in 10.5 due to MDEV-22531 and related changes. All tests passed with that omission.

            marko Marko Mäkelä added the comment above.

            marko Marko Mäkelä added a comment - An after-merge fix was part of the 10.5.5 release.

            rpizzi Rick Pizzi (Inactive) added a comment - marko, is this fixed in 10.4.14? The release notes do not mention it.

            rpizzi Rick Pizzi (Inactive) added a comment - This fix introduced a regression: the GTID position is now wrong when using 10.4.14. Filing a new bug.

            marko Marko Mäkelä added a comment - Yes, the change is included in the MariaDB 10.4.14 and 10.5.5 releases.

            People

              monty Michael Widenius
              rpizzi Rick Pizzi (Inactive)