MariaDB Server / MDEV-16690

node hang due to conflicting inserts into foreign key child table

Details

    Description

      A cluster node may enter an unresolved conflict state when two inserts with the same primary key are made into a table that has a foreign key constraint referencing a parent table. The inserts must be issued on separate cluster nodes, and there must be simultaneous writes (updates or deletes) to the referenced parent row.
      As a result, the replication applier thread may end up in an unresolved conflict state, and the error log fills with messages of the form:

      "WSREP: BF lock wait long"

      followed by InnoDB monitor output.

          Activity

            seppo Seppo Jaakola added a comment -

            Submitted a pull request that includes an mtr test reproducing this issue with the 10.2 and 10.3 HEAD versions.
            The pull request fixes a race condition in row0ins.cc. Assigning this for review.

            seppo Seppo Jaakola added a comment -

            Please take a look at the fix in row0ins.cc.

            This is the earliest point in execution where the hard error code in trx::error_state can end up overwritten with the DB_LOCK_WAIT code. If 'err' still holds the value DB_LOCK_WAIT here, it is returned up through several function calls and finally blindly assigned to trx::error_state in row_ins_step() / error_handling:

            The fix is protected by the trx mutex, which may be redundant.
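
            Editorial illustration, not the actual patch: the overwrite described above can be sketched with simplified stand-in types. Everything here (DbErr, Trx, resolve_insert_error, assign_error_state) is hypothetical and only mimics the shape of the problem. The sketch assumes that a "hard" error is any state other than success or lock wait, and that the guard runs under the transaction mutex as the comment describes.

```cpp
#include <mutex>

// Hypothetical stand-ins; these are NOT the real InnoDB definitions.
enum class DbErr { Success, LockWait, Deadlock };

struct Trx {
    std::mutex mutex;                     // stands in for the trx mutex
    DbErr error_state = DbErr::Success;   // stands in for trx::error_state
};

// Guard applied at the earliest point where a soft LockWait result could
// later clobber a hard error. If the transaction already carries a hard
// error, return that instead of LockWait.
DbErr resolve_insert_error(Trx& trx, DbErr err) {
    if (err != DbErr::LockWait) {
        return err;  // only the soft lock-wait code needs the check
    }
    std::lock_guard<std::mutex> guard(trx.mutex);
    if (trx.error_state != DbErr::Success &&
        trx.error_state != DbErr::LockWait) {
        return trx.error_state;  // keep the hard error
    }
    return err;
}

// Mimics the unconditional assignment the comment attributes to the
// error handling in row_ins_step().
void assign_error_state(Trx& trx, DbErr err) {
    trx.error_state = err;
}
```

            With the guard in place, a transaction whose state is already a hard error keeps that error after the later blind assignment. Without it, the state would be reset to the lock-wait code.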

            marko Marko Mäkelä added a comment -

            I like the solution, but I think that it can be cleaned up a little.

            marko Marko Mäkelä added a comment -

            thiru, please check whether trx->error_state can be modified by threads other than the one that is executing trx (I think not), and then merge (or cherry-pick) the fix to 10.2.

            marko Marko Mäkelä added a comment -

            It looks like this has been fixed in MDEV-17541.

            marko Marko Mäkelä added a comment -

            This issue was fixed as part of MDEV-17541.

            People

              marko Marko Mäkelä
              seppo Seppo Jaakola