MDEV-7825 (MariaDB Server)

Parallel replication race condition on gco->flags, possibly resulting in slave hang




      There appears to be a regression in parallel replication in 10.1, caused by a
      race condition introduced with the patch for optimistic parallel replication.

      The gco->installed flag was changed to be a bit in the gco->flags field. This,
      however, is a serious error, as gco->flags is modified by the SQL driver thread
      without locking. Those modifications can race with worker threads updating
      the INSTALLED bit.

      The result is that modifications to gco->flags can be lost, corrupting the
      internal state. This could cause various problems from the user's point of
      view, depending on the exact timing.
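
      To illustrate how an update to a shared flags word can be lost, here is a
      minimal sketch that replays one unlucky interleaving step by step in a
      single thread, so the outcome is deterministic. It is not the server code:
      GcoShared, GcoSplit, DRIVER_BIT and both function names are hypothetical,
      and only the idea of an INSTALLED bit sharing a word with other flags
      comes from the description above. The second half shows one possible way
      to avoid the problem, by keeping the installed state in its own field; it
      is an assumption that this resembles the eventual fix.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit layout, for illustration only.
constexpr std::uint32_t INSTALLED  = 1u << 0;  // updated by worker threads
constexpr std::uint32_t DRIVER_BIT = 1u << 1;  // updated by the SQL driver thread

// Sketch of the problematic layout: both bits live in one word.
struct GcoShared {
  std::uint32_t flags = 0;
};

// Replays: driver loads, worker sets INSTALLED, driver stores a stale copy.
std::uint32_t replay_shared_word() {
  GcoShared gco;
  std::uint32_t driver_copy = gco.flags;  // driver: load (sees 0)
  gco.flags |= INSTALLED;                 // worker: sets its bit in between
  assert(gco.flags & INSTALLED);          // the bit is visible at this point
  driver_copy |= DRIVER_BIT;              // driver: modifies its stale copy
  gco.flags = driver_copy;                // driver: store overwrites INSTALLED
  return gco.flags;
}

// Sketch of one possible remedy: the worker-owned state gets its own field,
// so the driver's unlocked read-modify-write can no longer clobber it.
struct GcoSplit {
  std::uint32_t flags = 0;  // touched only by the driver thread
  bool installed = false;   // touched only by workers (under their own lock)
};

// Replays the same interleaving against the split layout.
GcoSplit replay_split_fields() {
  GcoSplit gco;
  std::uint32_t driver_copy = gco.flags;  // driver: load
  gco.installed = true;                   // worker: update in a separate field
  driver_copy |= DRIVER_BIT;              // driver: modify
  gco.flags = driver_copy;                // driver: store leaves installed alone
  return gco;
}
```

      In the first replay the final word has only DRIVER_BIT set, so the
      worker's update is gone even though each thread did something that looks
      correct in isolation. In the second replay both pieces of state survive.
      Protecting every access to the shared word with a single lock, or using
      atomic bit operations, would be alternative remedies.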

      One user reported seeing the maximum number of replication retries exceeded,
      followed by the slave hanging. The error log contained a message similar to
      the following, after which all slave worker threads were stuck and
      replication no longer progressed:

      150323 15:07:38 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

      I believe (though I cannot know for sure) that this was caused by this bug. It
      is also possible that the bug could manifest itself in other ways, probably
      related to transactions running in the wrong order and possibly conflicting,
      or to slave worker threads hanging waiting for the wrong transaction.




            knielsen Kristian Nielsen


