  1. MariaDB Server
  2. MDEV-31167

parallel replication gets deadlocked on v10.11.2 with innodb

Details

    Description

      For years we have been using a master-slave parallel replication setup, which always worked fine.
      We previously ran v10.1, v10.4 and v10.6 of MariaDB and never saw this issue.

      However, since upgrading to v10.11.2, the parallel replication process gets "stuck" every few days.
      When this happens, the only way to recover is to `kill -9` the mariadb process.

      We have two replicas. One runs continuously without interruption, and that one doesn't have the issue.
      The other one is used for daily backups: we stop mariadb at midnight, make the backup (which takes about 7 to 8 hours to complete), and then start mariadb again.
      This means that this server has to catch up on several hours' worth of binlogs, which seems to be what triggers the deadlock.

      This is the output of "show slave status":

      https://dpaste.org/5axfT

      This is the output of "show processlist":

      https://dpaste.org/Ub10M

      This is the output of "show engine innodb status":

      https://dpaste.org/KmP1b

      The full backtrace of all mariadb threads is attached as a txt file to this ticket.

      These are my relevant mariadb settings:

      slave_parallel_threads = 16
      slave_parallel_mode = optimistic
      innodb_compression_default = ON

      I spoke to montywi and knielsen on #maria on Libera Chat about this, and they recommended that I file a Jira ticket here.

          Activity

            Marko Mäkelä (marko) added a comment - In mariadbd_full_bt_all_threads.txt, Thread 18 and Thread 23 each hold a shared latch on the block descriptor 0x7f7eec802e60, and both are waiting for a latch on the block 0x7f7eec8021e0. Thread 12 is waiting for an exclusive latch on the former block while holding an exclusive latch on the latter block. Thread 12 is violating the design rules, as noted in MDEV-29835. With that fix, it would have acquired an exclusive latch on the index, which would prevent other threads (such as Thread 18 and Thread 23 here) from acquiring any latches on non-leaf index pages.
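
            The latch state described in this comment can be written down as a wait-for graph, which makes the cycle explicit. The following is only an illustrative sketch in Python, not InnoDB code: the names `holds`, `waits`, `build_wait_for_graph` and `find_cycle` are hypothetical, and the sketch assumes that every held latch conflicts with every request on the same block (it does not model shared/shared compatibility).

```python
from collections import defaultdict

# Block descriptors taken from the comment above.
BLOCK_A = "0x7f7eec802e60"  # held shared by Thread 18 and Thread 23
BLOCK_B = "0x7f7eec8021e0"  # held exclusive by Thread 12

# Which blocks each thread holds a latch on, and which block it waits for.
holds = {
    "Thread 18": {BLOCK_A},
    "Thread 23": {BLOCK_A},
    "Thread 12": {BLOCK_B},
}
waits = {
    "Thread 18": BLOCK_B,
    "Thread 23": BLOCK_B,
    "Thread 12": BLOCK_A,
}


def build_wait_for_graph(holds, waits):
    """Return {thread: [threads it waits for]}.

    Assumption: any holder of a block blocks any waiter on that block.
    """
    holders = defaultdict(set)
    for thread, blocks in holds.items():
        for block in blocks:
            holders[block].add(thread)
    graph = {}
    for thread in sorted(holds.keys() | waits.keys()):
        block = waits.get(thread)
        blockers = holders[block] - {thread} if block is not None else set()
        graph[thread] = sorted(blockers)
    return graph


def find_cycle(graph):
    """Return one cycle as a list of threads, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {thread: WHITE for thread in graph}
    stack = []

    def visit(thread):
        color[thread] = GREY
        stack.append(thread)
        for blocker in graph[thread]:
            if color[blocker] == GREY:
                # Back edge: the path from blocker to here closes a cycle.
                return stack[stack.index(blocker):] + [blocker]
            if color[blocker] == WHITE:
                found = visit(blocker)
                if found:
                    return found
        stack.pop()
        color[thread] = BLACK
        return None

    for thread in sorted(graph):
        if color[thread] == WHITE:
            found = visit(thread)
            if found:
                return found
    return None


graph = build_wait_for_graph(holds, waits)
cycle = find_cycle(graph)
print(" -> ".join(cycle) if cycle else "no deadlock")
```

            For the state above, this reports the cycle Thread 12 -> Thread 18 -> Thread 12. Thread 23 forms an equivalent cycle with Thread 12. If Thread 12 did not hold its latch on 0x7f7eec8021e0 while waiting, the graph would have no cycle.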

            People

              marko Marko Mäkelä
              jgb1984 Jan Geboers