Details

    Description

This morning our newly spun-up 3-node Galera cluster crashed. It had been running in production for roughly 6-7 hours.

What happened is that 2 of the nodes went into an undefined state. The third one was OK, but for some reason MaxScale could not select a master node:

      2020-07-01 06:41:47 notice : Server changed state: server3[192.168.138.240:3306]: slave_down. [Slave, Synced, Running] -> [Down]
      2020-07-01 06:41:49 notice : Server changed state: server1[192.168.198.58:3306]: slave_down. [Slave, Synced, Running] -> [Down]
      2020-07-01 06:41:55 error : [galeramon] There are no cluster members
      2020-07-01 06:41:55 notice : Server changed state: server2[192.168.148.226:3306]: lost_master. [Master, Synced, Running] -> [Running]
      2020-07-01 06:56:35 error : (9) [readwritesplit] Couldn't find suitable Master from 3 candidates.
      2020-07-01 06:56:35 error : (9) Failed to create new router session for service 'Galera-Service'. See previous errors for more details.

After browsing the nodes themselves, I found that mysqld had actually crashed with an assertion:

      mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/wsrep-lib/include/wsrep/client_state.hpp:603: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.

      Attaching the full crash log below.

Also attaching the configs of one of the machines. Please note that this specific node also has binary logging enabled, as it is used by an external slave.
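For context, a binary-logging fragment for a Galera node that feeds an external slave typically looks like the sketch below. All values are illustrative placeholders, not the contents of the attached binlog.cfg:

```ini
[mysqld]
# Galera only supports row-based replication
binlog_format     = ROW
# Enable the binary log so an external slave can replicate from this node
log_bin           = /var/log/mysql/mariadb-bin
# Also write events replicated from other cluster nodes to the binary log
log_slave_updates = ON
# Placeholder; must differ from every other server in the topology
server_id         = 2
```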

      Is there a way to prevent this?

      Attachments

        1. 07-02-2020-node1.txt
          35 kB
        2. 07-02-2020-node2.txt
          42 kB
        3. 07-02-2020-node3.txt
          15 kB
        4. binlog.cfg
          0.2 kB
        5. galera.cfg
          0.8 kB
        6. Maria crash.txt
          7 kB
        7. server.cfg
          3 kB
        8. table structure.txt
          2 kB

        Issue Links

          Activity

            mkovachev Martin Kovachev created issue -
            mkovachev Martin Kovachev made changes -
            Field Original Value New Value
            Attachment binlog.cfg [ 52523 ]
            Attachment server.cfg [ 52524 ]
            Attachment galera.cfg [ 52525 ]
            mkovachev Martin Kovachev made changes -
            mkovachev Martin Kovachev made changes -
            elenst Elena Stepanova made changes -
            Component/s Galera [ 10124 ]
            Fix Version/s 10.4 [ 22408 ]
            Assignee Jan Lindström [ jplindst ]
            mkovachev Martin Kovachev made changes -
            Attachment table structure.txt [ 52547 ]
            mkovachev Martin Kovachev made changes -
            Attachment 07-02-2020-node3.txt [ 52548 ]
            Attachment 07-02-2020-node2.txt [ 52549 ]
            Attachment 07-02-2020-node1.txt [ 52550 ]
            jplindst Jan Lindström (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            jplindst Jan Lindström (Inactive) made changes -
Resolution date 2020-10-07 13:41:46
            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.4.15 [ 24507 ]
            Fix Version/s 10.5.6 [ 24508 ]
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4.16 [ 25020 ]
            Fix Version/s 10.5.7 [ 25019 ]
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.4.15 [ 24507 ]
            Fix Version/s 10.5.6 [ 24508 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 110738 ] MariaDB v4 [ 158043 ]

            People

Assignee: jplindst Jan Lindström (Inactive)
Reporter: mkovachev Martin Kovachev
Votes: 0
Watchers: 3

