[MDEV-21255] Deadlock of parallel slave and mariabackup (with failed log copy thread) Created: 2019-12-09 Updated: 2021-09-30 Resolved: 2019-12-12 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | mariabackup, Replication |
| Affects Version/s: | 10.2, 10.3, 10.4, 10.5 |
| Fix Version/s: | 10.2.31, 10.3.22, 10.4.12, 10.5.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Valerii Kravchuk | Assignee: | Vladislav Lesin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | deadlock | ||
| Description |
|
A read-only slave with slave_parallel_threads=10 got deadlocked when mariabackup executed FTWRL and tried to copy non-InnoDB files and the remaining part of the redo log. Neither replication, nor that mariabackup run, nor later mariabackup calls could proceed. In the processlist we see:
See the attached backtrace of all threads. |
| Comments |
| Comment by Valerii Kravchuk [ 2019-12-09 ] | ||||||||||
|
The oldest (3rd) mariabackup session (working as bkpuser) hangs with these last messages in the log:
| ||||||||||
| Comment by Andrei Elkin [ 2019-12-09 ] | ||||||||||
|
According to a stacktrace:
If the hanging slave server remains in that state, we might try to run FLUSH TABLES ... WITH READ LOCK with `gtid_slave_pos` in the ... list. I suggest exploring this possibility. | ||||||||||
| Comment by Valerii Kravchuk [ 2019-12-09 ] | ||||||||||
|
That confirms my interpretation of the backtrace. Unfortunately we had to kill the oldest running mariabackup process:
after getting the backtrace, so there is no way to verify the FTWRL owner, but it had to be that thread above, as killing it allowed all threads to run. Now two questions remain:
1. Why did mariabackup hang while holding FTWRL and NOT release it? I am studying this elsewhere.
2. This question is for you (maybe): why does one of the slave threads waiting for FTWRL block another mariabackup thread from doing FTWRL? According to https://jira.mariadb.org/browse/MDEV-11709 this block is going to last forever, as it does not depend on lock_wait_timeout. Maybe a replication thread waiting for FTWRL should not prevent later FTWRL forever? | ||||||||||
| Comment by Valerii Kravchuk [ 2019-12-10 ] | ||||||||||
|
mariabackup hangs like this:
| ||||||||||
| Comment by Andrei Elkin [ 2019-12-10 ] | ||||||||||
|
> 2. This question is to you (maybe): why one of slave threads ... blocks other mariabackup thread
There's a decision protocol to make some of the slave workers commit before the FTWRL thread locks them out. | ||||||||||
| Comment by Vladislav Lesin [ 2019-12-11 ] | ||||||||||
|
On the mariabackup side the problem is the same as in |