  1. MariaDB Server
  2. MDEV-29293

MariaDB stuck on starting commit state (waiting on commit order critical section)

Details

    Description

      In an environment running Galera Cluster with 6 MariaDB nodes, 1 arbitrator node, some replicas and a ProxySQL instance, a network issue triggered a state transfer on two nodes.
      Afterwards, for a reason we could not determine, almost all transactions hung in one of the following states:

      • "starting" state on the commit statement or on an empty statement ("").
      • "acquiring total order isolation" on the "KILL CONNECTION" statement (the "KILL CONNECTION" was requested by ProxySQL).

      We tried to restart the service, but it hung while stopping. ProxySQL detected this node as down and switched the traffic to another node.

      The backtrace suggests a kind of pthread_cond_wait() deadlock: the threads appear to be blocked in lock.wait(), called from the enter() function of the commit monitor, during the commit order critical section.

      Unfortunately, we have not found a way to reproduce the problem.
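
      To illustrate the kind of hang described above, here is a minimal sketch of an ordered commit monitor built on a condition variable. It is not Galera's actual code: the class name CommitOrderMonitor, the seqno bookkeeping and the helper gap_blocks_followers() are all hypothetical, and the timeout exists only so that the example terminates. In a monitor that waits without a timeout, a transaction whose predecessor never calls leave() would stay in the wait indefinitely.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical, simplified commit-order monitor (NOT Galera's implementation).
// Transactions must pass through in strictly increasing seqno order.
class CommitOrderMonitor {
public:
    // Block until every transaction with a lower seqno has left.
    // Returns false if the timeout expires first. The timeout is an
    // assumption made for this sketch so that the demo cannot hang.
    bool enter(std::int64_t seqno, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return cond_.wait_for(lock, timeout,
                              [&] { return last_left_ + 1 == seqno; });
    }

    // Mark seqno as committed and wake up all waiters so that the next
    // transaction in order can proceed.
    void leave(std::int64_t seqno) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(seqno == last_left_ + 1);  // leaving must follow the order
            last_left_ = seqno;
        }
        cond_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::int64_t last_left_ = 0;  // highest seqno that has left so far
};

// Returns true if seqno 2 stays blocked while seqno 1 has not left,
// and is admitted once seqno 1 leaves.
inline bool gap_blocks_followers() {
    CommitOrderMonitor monitor;
    const std::chrono::milliseconds timeout(50);
    const bool blocked = !monitor.enter(2, timeout);  // seqno 1 still pending
    monitor.leave(1);                                 // predecessor finishes
    const bool admitted = monitor.enter(2, timeout);  // now it is seqno 2's turn
    return blocked && admitted;
}
```

      Calling gap_blocks_followers() from a test or from main() shows both halves of the behaviour: a gap in the sequence blocks every later transaction, and the wait ends only when the missing seqno leaves. Whether this is what actually happened on the affected node cannot be concluded from the sketch; the attached backtraces are the real evidence.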

      Attachments

        1. backtraces.txt
          315 kB
        2. innodb_status.txt
          67 kB
        3. process_list.txt
          467 kB
        4. processlist.png
          701 kB
        5. process-list-sample.txt
          2 kB

          Activity

            marko Marko Mäkelä added a comment - I see that it has been previously claimed that this bug does not affect MariaDB Server 10.6 or later. Please clarify what should be done on merge to 10.6. If it is anything other than a null-merge (discarding the changes), we need to review and test the 10.6 version as well.

            Am I right that this is basically yet another attempt at fixing MDEV-23328?

            marko Marko Mäkelä added a comment - These changes seem to cause the test perfschema.nesting to fail.

            I reviewed the InnoDB changes of the 10.6 version of this (PR#2609), and I think that there is some room for race conditions.

            janlindstrom Jan Lindström added a comment - The latest version of the PR addresses Marko's review comments and fixes the test failure. However, Marko reviewed only the 10.6 version and the InnoDB changes, so a review of the SQL layer is still needed.

            sanja Oleksandr Byelkin added a comment - Looks OK to me.

            sysprg Julius Goryavsky added a comment - Fix merged with head revision:
            https://github.com/MariaDB/server/commit/6966d7fe4b7ccfb2b7d16ca4d7d5ab08234fa9ec
            https://github.com/MariaDB/server/commit/3f59bbeeaec751e9aabdc544324546f3c8326f0f
            https://github.com/MariaDB/server/commit/f307160218f8f9ed2528ffc685f49c4e2ae050b3

            People

              sysprg Julius Goryavsky
              williamwelter William Welter
              Votes: 5
              Watchers: 25

