  1. MariaDB Server
  2. MDEV-29293

MariaDB stuck on starting commit state (waiting on commit order critical section)

Details

    Description

      In an environment running Galera Cluster with 6 MariaDB nodes, 1 arbitrator node, some replicas, and ProxySQL, a network issue triggered a state transfer on two nodes.
      Afterwards, for reasons we could not determine, almost all transactions hung in:

      • "starting" state on the commit statement or on an empty statement ("").
      • "acquiring total order isolation" on the "KILL CONNECTION" statement (the "KILL CONNECTION" was requested by ProxySQL).

      We tried to restart the service, but it hung while stopping. ProxySQL detected this node as down and switched the traffic to another node.

      The backtrace suggests a kind of "pthread_cond_wait() deadlock": threads block in lock.wait(), called from the enter() function of the commit monitor, during the commit order critical section.
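
      To illustrate the kind of wait described here, the following is a minimal Python sketch of an ordered monitor, not Galera's actual implementation. The class name, the `enter`/`leave` signatures, and the timeout are assumptions made for the example. It only shows why every later committer blocks on a condition variable when an earlier sequence number never leaves the critical section.

```python
import threading


class CommitOrderMonitor:
    """Toy ordered monitor: seqno N may enter only after seqno N-1 has left.

    Hypothetical illustration only; this is not Galera's monitor code.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._last_left = 0  # highest seqno that has completed leave()

    def enter(self, seqno, timeout=None):
        """Block until every earlier seqno has left; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._last_left == seqno - 1, timeout=timeout
            )

    def leave(self, seqno):
        """Mark seqno as done and wake waiters so the next one can enter."""
        with self._cond:
            self._last_left = seqno
            self._cond.notify_all()


def demo():
    monitor = CommitOrderMonitor()
    # seqno 1 enters immediately because nothing precedes it.
    first = monitor.enter(1, timeout=0.1)
    # seqno 2 cannot enter while seqno 1 is still inside. Without the timeout
    # this call would block indefinitely, which is the shape of the hang.
    blocked = monitor.enter(2, timeout=0.1)
    # Once seqno 1 leaves, seqno 2 gets through.
    monitor.leave(1)
    after_leave = monitor.enter(2, timeout=0.1)
    return first, blocked, after_leave
```

      In this sketch, a transaction that enters and then never calls `leave()` stalls every later sequence number, which would be consistent with many sessions piling up in the same state.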

      Unfortunately, we have not found a way to reproduce the problem.

      Attachments

        1. backtraces.txt
          315 kB
        2. innodb_status.txt
          67 kB
        3. process_list.txt
          467 kB
        4. processlist.png
          701 kB
        5. process-list-sample.txt
          2 kB

          Activity

            williamwelter William Welter created issue -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            Assignee Jan Lindström [ jplindst ]
            carldenigma Carl Dobson made changes -
            Attachment process-list-sample.txt [ 65246 ]
            carldenigma Carl Dobson added a comment -

            I have just come across this issue while moving a DB cluster from a Percona cluster to MariaDB using logical backups.

            After the applications had been running for a while, I ended up with hundreds of processes stuck in the starting commit state. Attached is a redacted sample of the process list: process-list-sample.txt.

            I have restarted the cluster and enabled wsrep debug to try to get some additional information about what is happening when it locks up in this state.

            Version information is:
            OS: Debian 11
            Kernel: 5.10.0-16-amd64 #1 SMP Debian 5.10.127-1
            MariaDB: 10.5.15-0+deb11u1
            Galera: 26.4.11-0+deb11u1

            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Seppo Jaakola [ seppo ]
            khaiping.loh Khai Ping made changes -
            Attachment processlist.png [ 65721 ]
            khaiping.loh Khai Ping added a comment -

            I am seeing this in my cluster as well. There is a system user that is stuck in the "committing" state.

            Why could this lead to a Galera cluster getting stuck/hung?

            elenst Elena Stepanova made changes -
            Fix Version/s 10.5 [ 23123 ]
            Ali.maria Alasdair Haswell made changes -
            Attachment gdb.txt_test3_100insertQPS.gz [ 68071 ]
            Attachment gdb.txt_test1.gz [ 68072 ]
            Attachment gdb.txt_test2_200insertQPS.gz [ 68073 ]
            seppo Seppo Jaakola made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            seppo Seppo Jaakola made changes -
            Status In Progress [ 3 ] Needs Feedback [ 10501 ]
            Ali.maria Alasdair Haswell made changes -
            Attachment gdb_010.txt.gz [ 68120 ]
            Attachment gdb_008.txt.gz [ 68121 ]
            Attachment gdb_007.txt.gz [ 68122 ]
            Attachment gdb_006.txt.gz [ 68123 ]
            Ali.maria Alasdair Haswell made changes -
            Attachment mariadb_003.err.gz [ 68164 ]
            Attachment mariadb_001.err.gz [ 68165 ]
            Attachment gdb.txt_003.gz [ 68166 ]
            Attachment gdb.txt_002.gz [ 68167 ]
            ccalender Chris Calender (Inactive) made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            julien.fritsch Julien Fritsch made changes -
            Status Open [ 1 ] Needs Feedback [ 10501 ]
            Ali.maria Alasdair Haswell made changes -
            Attachment oltp_insert_nba.lua.rtf [ 68189 ]
            john.gehring John Gehring made changes -
            Attachment gdb_007.txt.gz [ 68122 ]
            john.gehring John Gehring made changes -
            Attachment gdb_008.txt.gz [ 68121 ]
            john.gehring John Gehring made changes -
            Attachment gdb_006.txt.gz [ 68123 ]
            john.gehring John Gehring made changes -
            Attachment gdb_010.txt.gz [ 68120 ]
            john.gehring John Gehring made changes -
            Attachment gdb.txt_002.gz [ 68167 ]
            john.gehring John Gehring made changes -
            Attachment gdb.txt_003.gz [ 68166 ]
            john.gehring John Gehring made changes -
            Attachment gdb.txt_test1.gz [ 68072 ]
            john.gehring John Gehring made changes -
            Attachment gdb.txt_test2_200insertQPS.gz [ 68073 ]
            john.gehring John Gehring made changes -
            Attachment gdb.txt_test3_100insertQPS.gz [ 68071 ]
            john.gehring John Gehring made changes -
            Attachment mariadb_001.err.gz [ 68165 ]
            john.gehring John Gehring made changes -
            Attachment mariadb_003.err.gz [ 68164 ]
            john.gehring John Gehring made changes -
            Attachment oltp_insert_nba.lua.rtf [ 68189 ]
            serg Sergei Golubchik made changes -
            Assignee Seppo Jaakola [ seppo ] Kwangbock Lee [ kb ]
            kb Kwangbock Lee made changes -
            Assignee Kwangbock Lee [ kb ] Seppo Jaakola [ seppo ]
            seppo Seppo Jaakola added a comment -

            A probably similar hang was reproduced using a conflicting sysbench load and DDL (TOI mode replication); no ProxySQL was involved in the test scenario.

            ralf.gebhardt Ralf Gebhardt made changes -
            seppo Seppo Jaakola added a comment -

            We can now reliably reproduce a cluster hang, which is due to a deadlock between KILL CONNECTION execution and a replication applier performing a victim abort (for the connection that is the target of the KILL command). However, the stack traces of this hang differ from the stack traces attached to this MDEV. If the attached stack traces were recorded before the problem had started, then the problems match.
            Anyway, we are now preparing a fix and a test case for the newly discovered problem. It also turns out that the KILL CONNECTION issue should not happen with 10.6 and later, as KILL execution was refactored after 10.5.
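
            The comment above does not say which locks are involved. As a generic illustration only, the sketch below shows one common way two code paths can deadlock each other: each takes a lock the other needs, in the opposite order. The names `connection_lock` and `transaction_lock`, and the roles "kill" and "applier", are hypothetical stand-ins and not MariaDB identifiers. Timeouts are used so that the example reports the stall instead of hanging.

```python
import threading


def simulate_lock_order_inversion(timeout=0.2):
    """Run two threads that take the same two locks in opposite order.

    Returns a dict saying whether each thread managed to get its second lock.
    The lock names are hypothetical stand-ins, not real MariaDB mutexes.
    """
    connection_lock = threading.Lock()
    transaction_lock = threading.Lock()
    both_hold_first = threading.Barrier(2)  # both threads hold their first lock
    both_finished = threading.Barrier(2)    # both attempts on the second lock are over
    acquired_second = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()
            # With a plain acquire() this would block forever; the timeout
            # turns the deadlock into a visible failure.
            got = second.acquire(timeout=timeout)
            if got:
                second.release()
            acquired_second[name] = got
            # Keep holding the first lock until the other thread has given up,
            # so the outcome does not depend on thread timing.
            both_finished.wait()

    threads = [
        threading.Thread(target=worker, args=("kill", connection_lock, transaction_lock)),
        threading.Thread(target=worker, args=("applier", transaction_lock, connection_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acquired_second
```

            Under these assumptions neither thread obtains its second lock. A usual remedy for this pattern is to make both paths take the locks in the same order, though whether the actual fix works that way is not stated in this issue.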

            julien.fritsch Julien Fritsch made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            julien.fritsch Julien Fritsch made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels galera galera not-10.6+
            seppo Seppo Jaakola made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            seppo Seppo Jaakola added a comment -

            Review cycle and related fixes are still ongoing. The pull request and reviews for the PR can be tracked here: https://github.com/codership/mariadb-server/pull/293

            julien.fritsch Julien Fritsch made changes -
            Assignee Seppo Jaakola [ seppo ] Julien Fritsch [ julien.fritsch ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Julien Fritsch [ julien.fritsch ] Julius Goryavsky [ sysprg ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Julius Goryavsky [ sysprg ] Seppo Jaakola [ seppo ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Julius Goryavsky [ sysprg ] Julien Fritsch [ julien.fritsch ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Julien Fritsch [ julien.fritsch ] Seppo Jaakola [ seppo ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            julien.fritsch Julien Fritsch made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            khaiping.loh Khai Ping added a comment - - edited

            @seppo, can this happen on 10.6.5 as well? My cluster is on 10.6.5

            teemu.ollakka Teemu Ollakka made changes -
            Status In Progress [ 3 ] Needs Feedback [ 10501 ]
            teemu.ollakka Teemu Ollakka added a comment - Pull request opened here https://jira.mariadb.org/browse/MDEV-29293 .
            teemu.ollakka Teemu Ollakka made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.5 [ 23123 ]
            Resolution Incomplete [ 4 ]
            Status Needs Feedback [ 10501 ] Closed [ 6 ]
            teemu.ollakka Teemu Ollakka made changes -
            Resolution Incomplete [ 4 ]
            Status Closed [ 6 ] Stalled [ 10000 ]
            teemu.ollakka Teemu Ollakka made changes -
            Assignee Seppo Jaakola [ seppo ] Teemu Ollakka [ teemu.ollakka ]
            teemu.ollakka Teemu Ollakka made changes -
            Status Stalled [ 10000 ] In Review [ 10002 ]
            janlindstrom Jan Lindström made changes -
            Assignee Teemu Ollakka [ teemu.ollakka ] Jan Lindström [ JIRAUSER53125 ]
            janlindstrom Jan Lindström made changes -
            Assignee Jan Lindström [ JIRAUSER53125 ] Marko Mäkelä [ marko ]

            marko Marko Mäkelä added a comment -

            I see that it has been previously claimed that this bug does not affect MariaDB Server 10.6 or later. Please clarify what should be done on merge to 10.6. If it is anything other than a null-merge (discarding the changes), we need to review and test the 10.6 version as well.

            Am I right that this is basically yet another attempt at fixing MDEV-23328?
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Jan Lindström [ JIRAUSER53125 ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            janlindstrom Jan Lindström made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s N/A [ 14700 ]
            seppo Seppo Jaakola made changes -
            janlindstrom Jan Lindström made changes -
            Assignee Jan Lindström [ JIRAUSER53125 ] Marko Mäkelä [ marko ]
            Status Stalled [ 10000 ] In Review [ 10002 ]
            seppo Seppo Jaakola made changes -
            seppo Seppo Jaakola made changes -
            seppo Seppo Jaakola made changes -
            janlindstrom Jan Lindström made changes -

            marko Marko Mäkelä added a comment -

            These changes seem to cause the test perfschema.nesting to fail.

            I reviewed the InnoDB changes of the 10.6 version of this (PR#2609), and I think that there is some room for race conditions.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Jan Lindström [ JIRAUSER53125 ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            sysprg Julius Goryavsky made changes -
            pramod.mahto@mariadb.com Pramod Mahto made changes -
            Affects Version/s 10.6.12 [ 28513 ]

            janlindstrom Jan Lindström added a comment -

            The latest version of the PR addresses Marko's review comments and the test failure. However, Marko reviewed only the 10.6 version and the InnoDB changes, so a review of the SQL layer is still needed.
            janlindstrom Jan Lindström made changes -
            Assignee Jan Lindström [ JIRAUSER53125 ] Oleksandr Byelkin [ sanja ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            sanja Oleksandr Byelkin added a comment -

            Looks OK to me.
            sanja Oleksandr Byelkin made changes -
            Assignee Oleksandr Byelkin [ sanja ] Julius Goryavsky [ sysprg ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            sysprg Julius Goryavsky made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            janlindstrom Jan Lindström made changes -
            sysprg Julius Goryavsky added a comment -

            Fix merged with head revision:
            https://github.com/MariaDB/server/commit/6966d7fe4b7ccfb2b7d16ca4d7d5ab08234fa9ec
            https://github.com/MariaDB/server/commit/3f59bbeeaec751e9aabdc544324546f3c8326f0f
            https://github.com/MariaDB/server/commit/f307160218f8f9ed2528ffc685f49c4e2ae050b3
            sysprg Julius Goryavsky made changes -
            issue.field.resolutiondate 2023-05-22 02:02:46.0 2023-05-22 02:02:46.119
            sysprg Julius Goryavsky made changes -
            Fix Version/s 11.0.2 [ 28706 ]
            Fix Version/s 10.4.30 [ 28912 ]
            Fix Version/s 10.5.21 [ 28913 ]
            Fix Version/s 10.6.14 [ 28914 ]
            Fix Version/s 10.9.7 [ 28916 ]
            Fix Version/s 10.10.5 [ 28917 ]
            Fix Version/s 10.11.4 [ 28918 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            dbart Daniel Bartholomew made changes -
            Fix Version/s 10.4.31 [ 29010 ]
            Fix Version/s 10.5.22 [ 29011 ]
            Fix Version/s 10.6.15 [ 29013 ]
            Fix Version/s 10.9.8 [ 29015 ]
            Fix Version/s 10.10.6 [ 29017 ]
            Fix Version/s 10.11.5 [ 29019 ]
            Fix Version/s 11.0.3 [ 28920 ]
            Fix Version/s 11.1.2 [ 28921 ]
            Fix Version/s 11.0.2 [ 28706 ]
            Fix Version/s 10.4.30 [ 28912 ]
            Fix Version/s 10.5.21 [ 28913 ]
            Fix Version/s 10.6.14 [ 28914 ]
            Fix Version/s 10.9.7 [ 28916 ]
            Fix Version/s 10.10.5 [ 28917 ]
            Fix Version/s 10.11.4 [ 28918 ]
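
            For readers checking a particular server against the fix versions in this change entry, here is a small Python sketch. It assumes that the first group of values in the entry (10.4.31, 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2) is the new, current set and that the second group is the replaced one; the flattened layout does not make this explicit. The names `FIXED_IN`, `parse_version`, and `includes_fix` are made up for the example.

```python
# Fix versions per release series, taken from the change entry above.
# Assumption: the .31/.22/.15/... group is the current set of values.
FIXED_IN = {
    (10, 4): (10, 4, 31),
    (10, 5): (10, 5, 22),
    (10, 6): (10, 6, 15),
    (10, 9): (10, 9, 8),
    (10, 10): (10, 10, 6),
    (10, 11): (10, 11, 5),
    (11, 0): (11, 0, 3),
    (11, 1): (11, 1, 2),
}


def parse_version(text):
    """Turn a string such as '10.5.15-0+deb11u1' into the tuple (10, 5, 15)."""
    core = text.split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split(".")[:3])
    return (major, minor, patch)


def includes_fix(version_text):
    """Return True/False for a listed series, or None if the series is not listed."""
    version = parse_version(version_text)
    fixed = FIXED_IN.get(version[:2])
    if fixed is None:
        return None
    return version >= fixed
```

            For example, `includes_fix("10.5.15-0+deb11u1")` and `includes_fix("10.6.5")` both return `False`, while `includes_fix("10.6.15")` returns `True`. A `False` result only means the release predates the listed fix version; it does not prove that a given hang is this bug.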
            kb Kwangbock Lee made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels galera not-10.6+ galera
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels galera
            ralf.gebhardt Ralf Gebhardt made changes -
            janlindstrom Jan Lindström made changes -
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 175526 143407 133829 136221 147010

            People

              Assignee:
              sysprg Julius Goryavsky
              Reporter:
              williamwelter William Welter
              Votes:
              5
              Watchers:
              25

