Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21255

Deadlock of parallel slave and mariabackup (with failed log copy thread)

Details

    Description

      Read only slave with slave_parallel_threads=10 got deadlocked when mariabackup executed FTWRL and tried to copy non-InnoDB files and remaining part of the redo log. Neither replication, no mariabackup, nor later mariabackup calls could proceed.

      In the processlist we see:

      ...
      |     10 | system user |                                    | NULL | Slave_IO     | 1626750 | Waiting for master to send event                                               | NULL                                          |    0.000 |
      |     13 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     12 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     14 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     15 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     16 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     17 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for global read lock                                                   | NULL                                          |    0.000 |
      |     18 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     19 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     20 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     21 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |
      |     11 | system user |                                    | NULL | Slave_SQL    |  187284 | Waiting for room in worker thread event queue                                  | NULL                                          |    0.000 |
      | 271217 | dbcleaner   | some_host:51410 | sa2  | Query        |       0 | Waiting in MASTER_GTID_WAIT() (primary waiter)                                 | SELECT MASTER_GTID_WAIT('0-11-5381276620', 4) |    0.000 |
      ...
      | 402858 | bkpuser     | localhost                          | NULL | Sleep        |  184970 |                                                                                | NULL                                          |    0.000 |
      | 427762 | bkpuser     | localhost                          | NULL | Query        |   98674 | Waiting for worker threads to pause for global read lock                       | FLUSH TABLES WITH READ LOCK                   |    0.000 |
      | 452639 | bkpuser     | localhost                          | NULL | Query        |   12636 | Waiting while replication worker thread pool is busy                           | FLUSH TABLES WITH READ LOCK                   |    0.000 |
      | 469510 | dbcleaner   | some_host:63978 | sa2  | Sleep        |      19 |                                                                                | NULL                                          |    0.000 |
      | 469541 | dbcleaner   | localhost                          | NULL | Query        |       0 | init                                                                           | show processlist                              |    0.000 |
      

      See backtrace of all threads attached.

      Attachments

        1. backtrace1.txt
          100 kB
          Valerii Kravchuk

        Activity

          The oldest (3rd) maraibackup session (working as bkpuser) hangs with these last messages in the log:

          ...
          [01] 2019-12-07 05:55:36         ...done
          [01] 2019-12-07 05:55:36 Copying ./performance_schema/db.opt to /nfs/backup/2019-12-06/FULL/performance_schema/db.opt
          [01] 2019-12-07 05:55:36         ...done
          [00] 2019-12-07 05:55:36 Finished backing up non-InnoDB tables and files
          [01] 2019-12-07 05:55:36 Copying aria_log_control to /nfs/backup/2019-12-06/FULL/aria_log_control
          [01] 2019-12-07 05:55:36         ...done
          [01] 2019-12-07 05:55:36 Copying aria_log.00000001 to /nfs/backup/2019-12-06/FULL/aria_log.00000001
          [01] 2019-12-07 05:55:36         ...done
          [00] 2019-12-07 05:55:36 Waiting for log copy thread to read lsn 78996210453049
          

          valerii Valerii Kravchuk added a comment - The oldest (3rd) maraibackup session (working as bkpuser) hangs with these last messages in the log: ... [01] 2019-12-07 05:55:36 ...done [01] 2019-12-07 05:55:36 Copying ./performance_schema/db.opt to /nfs/backup/2019-12-06/FULL/performance_schema/db.opt [01] 2019-12-07 05:55:36 ...done [00] 2019-12-07 05:55:36 Finished backing up non-InnoDB tables and files [01] 2019-12-07 05:55:36 Copying aria_log_control to /nfs/backup/2019-12-06/FULL/aria_log_control [01] 2019-12-07 05:55:36 ...done [01] 2019-12-07 05:55:36 Copying aria_log.00000001 to /nfs/backup/2019-12-06/FULL/aria_log.00000001 [01] 2019-12-07 05:55:36 ...done [00] 2019-12-07 05:55:36 Waiting for log copy thread to read lsn 78996210453049
          Elkin Andrei Elkin added a comment -

          According to a stacktrace
          a slave replication worker is hanging at MDL lock on
          `mysql.gtid_slave_pos` table request.
          It's unclear what thread is holding the lock.
          There are two FLUSH-TABLES-WITH-READ-LOCK waiters at that time, but
          neither seem can be suspected.

          If the hanging slave server remains in that state we might try to query
          out with gdb which tread is the MDL lock owner.
          Such thread e.g could have run

          FLUSH TABLES ... WITH READ LOCK with `gtid_slave_pos` in the ... list.

          I suggest to explore this possibility.

          Elkin Andrei Elkin added a comment - According to a stacktrace a slave replication worker is hanging at MDL lock on `mysql.gtid_slave_pos` table request. It's unclear what thread is holding the lock. There are two FLUSH-TABLES-WITH-READ-LOCK waiters at that time, but neither seem can be suspected. If the hanging slave server remains in that state we might try to query out with gdb which tread is the MDL lock owner. Such thread e.g could have run FLUSH TABLES ... WITH READ LOCK with `gtid_slave_pos` in the ... list. I suggest to explore this possibility.

          That confirms my interpretation of the backtrace. Unfortunately we had to kill the oldest running mariabackup process:

          | 402858 | bkpuser     | localhost                          | NULL | Sleep        |  184970 | 
          

          after getting backtrace, so no way to verify the FTWRL owner, but it had to be that thread above, as killing it allowed all threads to run.

          Now two questions remain:

          1. Why mariabackup hanged while holding FTWRL and NOT released it. i am studying this elsewhere.

          2. This question is to you (maybe): why one of slave threads waiting for FTWRL blocks other mariabackup thread from doing FTWRL? According to https://jira.mariadb.org/browse/MDEV-11709 this block is going to last forever as it does not depend on lock_wait_timeout. Maybe replication thread waiting for FTWRL should not prevent later FTWRL forever?

          valerii Valerii Kravchuk added a comment - That confirms my interpretation of the backtrace. Unfortunately we had to kill the oldest running mariabackup process: | 402858 | bkpuser | localhost | NULL | Sleep | 184970 | after getting backtrace, so no way to verify the FTWRL owner, but it had to be that thread above, as killing it allowed all threads to run. Now two questions remain: 1. Why mariabackup hanged while holding FTWRL and NOT released it. i am studying this elsewhere. 2. This question is to you (maybe): why one of slave threads waiting for FTWRL blocks other mariabackup thread from doing FTWRL? According to https://jira.mariadb.org/browse/MDEV-11709 this block is going to last forever as it does not depend on lock_wait_timeout. Maybe replication thread waiting for FTWRL should not prevent later FTWRL forever?

          mariabackup hangs like this:

          Thread 1 (Thread 0x7f0d8676c8c0 (LWP 111845)):
          #0  0x00007f0d86355945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1  0x0000563e32c8265b in backup_wait_for_lsn (lsn=79170560097335) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:2787
          #2  0x0000563e32c9fee8 in backup_start () at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/backup_copy.cc:1508
          #3  0x0000563e32c85501 in xtrabackup_backup_func () at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:4344
          #4  main_low (argv=<optimized out>) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:6401
          #5  0x0000563e32c62765 in main (argc=9, argv=0x7ffc57f04d18) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:6200
          

          valerii Valerii Kravchuk added a comment - mariabackup hangs like this: Thread 1 (Thread 0x7f0d8676c8c0 (LWP 111845)): #0 0x00007f0d86355945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000563e32c8265b in backup_wait_for_lsn (lsn=79170560097335) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:2787 #2 0x0000563e32c9fee8 in backup_start () at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/backup_copy.cc:1508 #3 0x0000563e32c85501 in xtrabackup_backup_func () at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:4344 #4 main_low (argv=<optimized out>) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:6401 #5 0x0000563e32c62765 in main (argc=9, argv=0x7ffc57f04d18) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:6200
          Elkin Andrei Elkin added a comment -

          > 2. This question is to you (maybe): why one of slave threads ... blocks other mariabackup thread

          There's a decision protocol to make some of slave workers commit before FTWRL thread locks them out.
          See MDEV-7818 that was last to refine it and comment on.
          In our case the slave thread 17 is apparently allowed in committing but can't 'cos of the MDL lock conflict
          most probably from an earlier FTWRL thread.

          Elkin Andrei Elkin added a comment - > 2. This question is to you (maybe): why one of slave threads ... blocks other mariabackup thread There's a decision protocol to make some of slave workers commit before FTWRL thread locks them out. See MDEV-7818 that was last to refine it and comment on. In our case the slave thread 17 is apparently allowed in committing but can't 'cos of the MDL lock conflict most probably from an earlier FTWRL thread.

          On mariabackup side the problem is the same as in MDEV-18611. The workaround is to increase redo log size. But the issue is when log copying thread is finished during backup for some reason (redo log block read error in our case), mariabackup hangs waiting until the thread read log till certain LSN, and it waits under FTWRL. The fix is to finish mariabackup with error message on innodb redo log read failure.

          vlad.lesin Vladislav Lesin added a comment - On mariabackup side the problem is the same as in MDEV-18611 . The workaround is to increase redo log size. But the issue is when log copying thread is finished during backup for some reason (redo log block read error in our case), mariabackup hangs waiting until the thread read log till certain LSN, and it waits under FTWRL. The fix is to finish mariabackup with error message on innodb redo log read failure.

          People

            vlad.lesin Vladislav Lesin
            valerii Valerii Kravchuk
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.