[MDEV-31167] parallel replication gets deadlocked on v10.11.2 with innodb Created: 2023-05-02  Updated: 2023-05-02  Resolved: 2023-05-02

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB
Affects Version/s: 10.11.2
Fix Version/s: 11.1.1, 10.11.3, 11.0.2, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug
Priority: Major
Reporter: Jan Geboers
Assignee: Marko Mäkelä
Resolution: Duplicate
Votes: 0
Labels: replication
Environment: Ubuntu 22.04 LTS, upstream MariaDB packages, NVMe RAID10 storage


Attachments: mariadbd_full_bt_all_threads.txt (text file)
Issue Links:
Duplicate
duplicates MDEV-29835 Partial server freeze Closed

 Description   

For years we have been running a master-slave parallel replication setup, which always worked fine.
We previously used MariaDB v10.1, v10.4 and v10.6 and never saw this issue.

However, since upgrading to v10.11.2, the parallel replication process gets "stuck" every few days.
When this happens, the only way to recover is to `kill -9` the mariadb process.

We have 2 replicas. One runs continuously without interruption, and it does not have the issue.
The other one is used for making daily backups: we stop mariadb at midnight, make the backup (which takes about 7 to 8 hours to complete) and then start mariadb again.
This means that this server has to catch up on several hours' worth of binlogs, which is what seems to trigger the deadlock.

This is the output of "show slave status":

https://dpaste.org/5axfT

This is the output of "show processlist":

https://dpaste.org/Ub10M

This is the output of "show engine innodb status":

https://dpaste.org/KmP1b

The full backtrace of all mariadb threads is attached as a txt file to this ticket.

These are my relevant mariadb settings:

slave_parallel_threads = 16
slave_parallel_mode = optimistic
innodb_compression_default = ON

I spoke to montywi and knielsen on #maria on Libera.Chat about this, and they recommended that I file a Jira ticket here.



 Comments   
Comment by Marko Mäkelä [ 2023-05-02 ]

In mariadbd_full_bt_all_threads.txt, Thread 18 and Thread 23 each hold a shared latch on the block descriptor 0x7f7eec802e60, and both are waiting for a latch on the block 0x7f7eec8021e0. Thread 12 holds an exclusive latch on the latter block and is waiting for an exclusive latch on the former block, so none of the three threads can make progress. Thread 12 is violating the design rules, as noted in MDEV-29835. With the fix, it would have acquired an exclusive latch on the index, which would prevent other threads (such as Thread 18 and Thread 23 here) from acquiring any latches on non-leaf index pages.
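
To illustrate why the three threads cannot make progress, the latch state described above can be modelled as a wait-for graph and checked for a cycle. This is only a minimal sketch in Python, not InnoDB code: the function names `build_wait_for_graph` and `find_cycle` are hypothetical, and the model makes the simplifying assumption that every current holder of a block blocks every waiter on it (which holds here, because each wait involves an exclusive latch on one side).

```python
from collections import defaultdict

# Latch state taken from the comment above; addresses are block descriptors.
BLOCK_A = "0x7f7eec802e60"
BLOCK_B = "0x7f7eec8021e0"

holders = {
    BLOCK_A: {"Thread 18", "Thread 23"},  # shared latches
    BLOCK_B: {"Thread 12"},               # exclusive latch
}
waiting_for = {
    "Thread 18": BLOCK_B,
    "Thread 23": BLOCK_B,
    "Thread 12": BLOCK_A,  # wants an exclusive latch
}


def build_wait_for_graph(holders, waiting_for):
    """Map each waiting thread to the set of threads it is blocked by.

    Assumption (illustrative): any holder of a block blocks any waiter.
    """
    graph = defaultdict(set)
    for thread, block in waiting_for.items():
        for holder in holders.get(block, ()):
            if holder != thread:
                graph[thread].add(holder)
    return graph


def find_cycle(graph):
    """Return one wait-for cycle as a list of threads, or None."""
    visiting = []  # current depth-first search path
    done = set()   # threads proven not to lead into a cycle

    def dfs(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: the path from the first visit of node closes a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for blocker in sorted(graph.get(node, ())):
            cycle = dfs(blocker)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(graph):
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    print(find_cycle(build_wait_for_graph(holders, waiting_for)))
```

In this model, Thread 12 waits on Thread 18 (and Thread 23), which in turn wait on Thread 12, so a cycle is reported. If Thread 12 is removed from `waiting_for`, no cycle remains.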

Generated at Thu Feb 08 10:21:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.