[MDEV-21255] Deadlock of parallel slave and mariabackup (with failed log copy thread) - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Fix Version/s: 10.2.31, 10.3.22, 10.4.12, 10.5.1
Component/s: mariabackup, Replication
Labels:
- deadlock

Description

Read only slave with slave_parallel_threads=10 got deadlocked when mariabackup executed FTWRL and tried to copy non-InnoDB files and remaining part of the redo log. Neither replication, no mariabackup, nor later mariabackup calls could proceed.

In the processlist we see:

...

|     10 | system user |                                    | NULL | Slave_IO     | 1626750 | Waiting for master to send event                                               | NULL                                          |    0.000 |

|     13 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     12 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     14 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     15 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     16 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     17 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for global read lock                                                   | NULL                                          |    0.000 |

|     18 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     19 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     20 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     21 | system user |                                    | NULL | Slave_worker |  184973 | Waiting for prior transaction to start commit before starting next transaction | NULL                                          |    0.000 |

|     11 | system user |                                    | NULL | Slave_SQL    |  187284 | Waiting for room in worker thread event queue                                  | NULL                                          |    0.000 |

| 271217 | dbcleaner   | some_host:51410 | sa2  | Query        |       0 | Waiting in MASTER_GTID_WAIT() (primary waiter)                                 | SELECT MASTER_GTID_WAIT('0-11-5381276620', 4) |    0.000 |

...

| 402858 | bkpuser     | localhost                          | NULL | Sleep        |  184970 |                                                                                | NULL                                          |    0.000 |

| 427762 | bkpuser     | localhost                          | NULL | Query        |   98674 | Waiting for worker threads to pause for global read lock                       | FLUSH TABLES WITH READ LOCK                   |    0.000 |

| 452639 | bkpuser     | localhost                          | NULL | Query        |   12636 | Waiting while replication worker thread pool is busy                           | FLUSH TABLES WITH READ LOCK                   |    0.000 |

| 469510 | dbcleaner   | some_host:63978 | sa2  | Sleep        |      19 |                                                                                | NULL                                          |    0.000 |

| 469541 | dbcleaner   | localhost                          | NULL | Query        |       0 | init                                                                           | show processlist                              |    0.000 |

See backtrace of all threads attached.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

backtrace1.txt
2019-12-09 11:25
100 kB
Valerii Kravchuk

Activity

Ascending order - Click to sort in descending order

Valerii Kravchuk added a comment - 2019-12-09 11:29

The oldest (3rd) maraibackup session (working as bkpuser) hangs with these last messages in the log:

...

[01] 2019-12-07 05:55:36         ...done

[01] 2019-12-07 05:55:36 Copying ./performance_schema/db.opt to /nfs/backup/2019-12-06/FULL/performance_schema/db.opt

[01] 2019-12-07 05:55:36         ...done

[00] 2019-12-07 05:55:36 Finished backing up non-InnoDB tables and files

[01] 2019-12-07 05:55:36 Copying aria_log_control to /nfs/backup/2019-12-06/FULL/aria_log_control

[01] 2019-12-07 05:55:36         ...done

[01] 2019-12-07 05:55:36 Copying aria_log.00000001 to /nfs/backup/2019-12-06/FULL/aria_log.00000001

[01] 2019-12-07 05:55:36         ...done

[00] 2019-12-07 05:55:36 Waiting for log copy thread to read lsn 78996210453049

Valerii Kravchuk added a comment - 2019-12-09 11:29 The oldest (3rd) maraibackup session (working as bkpuser) hangs with these last messages in the log: ... [01] 2019-12-07 05:55:36 ...done [01] 2019-12-07 05:55:36 Copying ./performance_schema/db.opt to /nfs/backup/2019-12-06/FULL/performance_schema/db.opt [01] 2019-12-07 05:55:36 ...done [00] 2019-12-07 05:55:36 Finished backing up non-InnoDB tables and files [01] 2019-12-07 05:55:36 Copying aria_log_control to /nfs/backup/2019-12-06/FULL/aria_log_control [01] 2019-12-07 05:55:36 ...done [01] 2019-12-07 05:55:36 Copying aria_log.00000001 to /nfs/backup/2019-12-06/FULL/aria_log.00000001 [01] 2019-12-07 05:55:36 ...done [00] 2019-12-07 05:55:36 Waiting for log copy thread to read lsn 78996210453049

Andrei Elkin added a comment - 2019-12-09 15:48

According to a stacktrace
a slave replication worker is hanging at MDL lock on
`mysql.gtid_slave_pos` table request.
It's unclear what thread is holding the lock.
There are two FLUSH-TABLES-WITH-READ-LOCK waiters at that time, but
neither seem can be suspected.

If the hanging slave server remains in that state we might try to query
out with gdb which tread is the MDL lock owner.
Such thread e.g could have run

FLUSH TABLES ... WITH READ LOCK with `gtid_slave_pos` in the ... list.

I suggest to explore this possibility.

Andrei Elkin added a comment - 2019-12-09 15:48 According to a stacktrace a slave replication worker is hanging at MDL lock on `mysql.gtid_slave_pos` table request. It's unclear what thread is holding the lock. There are two FLUSH-TABLES-WITH-READ-LOCK waiters at that time, but neither seem can be suspected. If the hanging slave server remains in that state we might try to query out with gdb which tread is the MDL lock owner. Such thread e.g could have run FLUSH TABLES ... WITH READ LOCK with `gtid_slave_pos` in the ... list. I suggest to explore this possibility.

Valerii Kravchuk added a comment - 2019-12-09 16:57

That confirms my interpretation of the backtrace. Unfortunately we had to kill the oldest running mariabackup process:

| 402858 | bkpuser     | localhost                          | NULL | Sleep        |  184970 |

after getting backtrace, so no way to verify the FTWRL owner, but it had to be that thread above, as killing it allowed all threads to run.

Now two questions remain:

1. Why mariabackup hanged while holding FTWRL and NOT released it. i am studying this elsewhere.

2. This question is to you (maybe): why one of slave threads waiting for FTWRL blocks other mariabackup thread from doing FTWRL? According to https://jira.mariadb.org/browse/MDEV-11709 this block is going to last forever as it does not depend on lock_wait_timeout. Maybe replication thread waiting for FTWRL should not prevent later FTWRL forever?

Valerii Kravchuk added a comment - 2019-12-09 16:57 That confirms my interpretation of the backtrace. Unfortunately we had to kill the oldest running mariabackup process: | 402858 | bkpuser | localhost | NULL | Sleep | 184970 | after getting backtrace, so no way to verify the FTWRL owner, but it had to be that thread above, as killing it allowed all threads to run. Now two questions remain: 1. Why mariabackup hanged while holding FTWRL and NOT released it. i am studying this elsewhere. 2. This question is to you (maybe): why one of slave threads waiting for FTWRL blocks other mariabackup thread from doing FTWRL? According to https://jira.mariadb.org/browse/MDEV-11709 this block is going to last forever as it does not depend on lock_wait_timeout. Maybe replication thread waiting for FTWRL should not prevent later FTWRL forever?

Valerii Kravchuk added a comment - 2019-12-10 08:33

mariabackup hangs like this:

Thread 1 (Thread 0x7f0d8676c8c0 (LWP 111845)):

#0  0x00007f0d86355945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1  0x0000563e32c8265b in backup_wait_for_lsn (lsn=79170560097335) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:2787

#2  0x0000563e32c9fee8 in backup_start () at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/backup_copy.cc:1508

#3  0x0000563e32c85501 in xtrabackup_backup_func () at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:4344

#4  main_low (argv=<optimized out>) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:6401

#5  0x0000563e32c62765 in main (argc=9, argv=0x7ffc57f04d18) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:6200

Valerii Kravchuk added a comment - 2019-12-10 08:33 mariabackup hangs like this: Thread 1 (Thread 0x7f0d8676c8c0 (LWP 111845)): #0 0x00007f0d86355945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000563e32c8265b in backup_wait_for_lsn (lsn=79170560097335) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:2787 #2 0x0000563e32c9fee8 in backup_start () at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/backup_copy.cc:1508 #3 0x0000563e32c85501 in xtrabackup_backup_func () at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:4344 #4 main_low (argv=<optimized out>) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:6401 #5 0x0000563e32c62765 in main (argc=9, argv=0x7ffc57f04d18) at /usr/src/debug/MariaDB-10.2.27/src_0/extra/mariabackup/xtrabackup.cc:6200

Andrei Elkin added a comment - 2019-12-10 08:59

> 2. This question is to you (maybe): why one of slave threads ... blocks other mariabackup thread

There's a decision protocol to make some of slave workers commit before FTWRL thread locks them out.
See ~~MDEV-7818~~ that was last to refine it and comment on.
In our case the slave thread 17 is apparently allowed in committing but can't 'cos of the MDL lock conflict
most probably from an earlier FTWRL thread.

Andrei Elkin added a comment - 2019-12-10 08:59 > 2. This question is to you (maybe): why one of slave threads ... blocks other mariabackup thread There's a decision protocol to make some of slave workers commit before FTWRL thread locks them out. See MDEV-7818 that was last to refine it and comment on. In our case the slave thread 17 is apparently allowed in committing but can't 'cos of the MDL lock conflict most probably from an earlier FTWRL thread.

Vladislav Lesin added a comment - 2019-12-11 10:00

On mariabackup side the problem is the same as in ~~MDEV-18611~~. The workaround is to increase redo log size. But the issue is when log copying thread is finished during backup for some reason (redo log block read error in our case), mariabackup hangs waiting until the thread read log till certain LSN, and it waits under FTWRL. The fix is to finish mariabackup with error message on innodb redo log read failure.

Vladislav Lesin added a comment - 2019-12-11 10:00 On mariabackup side the problem is the same as in MDEV-18611 . The workaround is to increase redo log size. But the issue is when log copying thread is finished during backup for some reason (redo log block read error in our case), mariabackup hangs waiting until the thread read log till certain LSN, and it waits under FTWRL. The fix is to finish mariabackup with error message on innodb redo log read failure.

People

Assignee:: Vladislav Lesin

Reporter:: Valerii Kravchuk

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2019-12-09 11:25

Updated:: 2024-09-06 10:00

Resolved:: 2019-12-12 10:38

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration