MariaDB Server — MDEV-21111

Random crash: [ERROR] mysqld got signal 6 ;

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.4.8, 10.4.10
    • Fix Version/s: 10.4.12
    • Component/s: Galera, Server
    • Environment: Ubuntu 16

    Description

      Hi,

      Since moving to MariaDB 10.4.8 we have been hitting several issues. We have a 3-node Galera cluster, and in the last week we have had crashes every day. Yesterday we experienced a crash every 5 minutes, which usually meant two servers down.

      For this reason, we upgraded to 10.4.10 to see whether that version patched anything related to our crashes. However, we found ourselves in the same scenario again.

      Finally, we had to roll back to an older version, 10.2.29, and since yesterday we have not had any crashes.

      I'm attaching logs from 10.4.8 and 10.4.10, which were crashing continuously — the same traces every time. We cannot reproduce this again, as we are in production and need stability, so we are no longer on that version, and we are a bit worried about upgrading again in the near future.

      Some key lines in the logs:

      For 10.4.8:

      Nov 20 13:55:04 be61 mysqld[46559]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.8/wsrep-lib/src/transaction.cpp:951: void wsrep::transaction::after_replay(const wsrep::transaction&): Assertion `other.state() == s_committed || other.state() == s_aborted' failed.
      Nov 20 13:55:04 be61 mysqld[46559]: 191120 13:55:04 [ERROR] mysqld got signal 6 ;
      

      For 10.4.10:

      Nov 20 19:45:52 be60 mysqld[3322]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.10/wsrep-lib/src/transaction.cpp:1038: void wsrep::transaction::after_replay(const wsrep::transaction&): Assertion `other.state() == s_committed || other.state() == s_aborted' failed.
      Nov 20 19:45:52 be60 mysqld[3322]: 191120 19:45:52 [ERROR] mysqld got signal 6 ;
      

      Do you need any more information?

      Attachments

        Issue Links

          Activity

            insysadm Boris added a comment - edited

            I can confirm this error occurs.

            We have also assembled a 3-node cluster, but on CentOS 7, running MariaDB 10.4.10. At 4 a.m. one of the servers crashed with this error:

            Dec 4 04:07:51 mysql1 mysqld: mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.10/wsrep-lib/src/transaction.cpp:1038: void wsrep::transaction::after_replay(const wsrep::transaction&): Assertion `other.state() == s_committed || other.state() == s_aborted' failed.
            Dec 4 04:07:51 mysql1 mysqld: 191204 4:07:51 [ERROR] mysqld got signal 6 ;

            After the error, mysqld restarted itself and rejoined the cluster, but its status changed to:

            wsrep_local_state_comment Joining: receiving State Transfer

            We did not upgrade from an older version; the cluster was originally built on 10.4.10.

            PS: there are no other errors preceding this failure.

            MarkH Mark Hughes added a comment -

            We're experiencing this error, too, on MariaDB 10.4.10 with a 3-node cluster (now running as a 4-node cluster, but experienced when it was 3 nodes, too). None of the nodes have been upgraded from a previous major release of MariaDB as they were all installed as 10.4 - I believe they haven't had minor updates, either, as they're new installs [ < 3 weeks].

            This has now happened on two separate nodes. The nodes are virtualised nodes (running on the Jelastic platform) - the first node failed 2 or 3 times in the identical way, second node has failed for the first time today.

            Restarting the first failed server stalled - as Boris described, it remained in "Joining: receiving State Transfer" indefinitely (requiring me to delete everything in /var/lib/mysql and let it start again). The most recent failed server did eventually start after about 15 minutes in this state.

            There IS one thing that seems to be unique between the nodes pre-failure; when Node 1 failed on the couple of occasions it did, it had references in the MySQL log to "Created page /var/lib/mysql/gcache.page.XXXXXX of size XXXXXX" / "Deleted page /var/lib/mysql/gcache.page.XXXXXX". The new node that has just failed referenced exactly the same gcache files being created/deleted. The third node that has never crashed has never referenced the GCache file. [We now have 4 nodes, but that node has been up for less than 24 hours and is a replacement for one of the other nodes, but also has never referenced GCache nor crashed].

            All servers are running Ubuntu 16.04.

            Attached are the logs for the recent server failure.

            Mysql Log Public.log

            MarkH Mark Hughes added a comment -

            This bug has occurred a few more times in the past 24 hours. Some of the time the server is repairing itself (restarting / doing IST / coming back online) and other times it is going down, or claiming that the server is not in a state to operate (wsrep has not yet prepared node for application use). It is actually to the point where it is causing fairly regular server outages as we have to keep jumping to a different DB node - today it caused the entire cluster to fail and be bootstrapped again (from a node that wasn't in a bootstrap-ready state, but was still the last to technically leave the cluster).

            Any mitigation that can at least bypass this bug?

            MarkH Mark Hughes added a comment -

            Update to another crash today, attached another log. Restarting the server fails, so I wiped everything within /var/lib/mysql/ again and started it successfully.

            Galera Log 2.txt

            ujae7142 ujang added a comment -

            Hi,

            I'm facing this situation too; one of my nodes crashes intermittently. MariaDB 10.4.11, wsrep_provider_version 26.4.3 (r4535).
            I found https://github.com/codership/wsrep-lib/commit/76f7249b8df209a2a3cefd7d4bbf31f6c72812f1, which seems related to this issue.

            Please let me know when the fix will be released?


            People

              Assignee: Jan Lindström (Inactive)
              Reporter: Ibai Valencia
              Votes: 3
              Watchers: 6

