[MDEV-21111] Random crash: [ERROR] mysqld got signal 6 ; Created: 2019-11-21 Updated: 2020-02-11 Resolved: 2020-02-11 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Server |
| Affects Version/s: | 10.4.8, 10.4.10 |
| Fix Version/s: | 10.4.12 |
| Type: | Bug | Priority: | Major |
| Reporter: | Ibai Valencia | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Duplicate | Votes: | 3 |
| Labels: | crash |
| Environment: | Ubuntu 16 |
| Attachments: | |
| Issue Links: | |
| Description |
|
Hi,

Since we started using MariaDB 10.4.8 we have been getting several issues. We have a 3-node Galera cluster. In the last week we have had crashes every day; yesterday we experienced a crash every 5 minutes, which usually meant two servers down. For this reason we upgraded to 10.4.10 to see whether that version patched anything related to our crashes, but we ended up in the same scenario again. Finally, we had to roll back to an older version, 10.2.29, and since yesterday we have had no crashes.

I'm attaching logs from 10.4.8 and 10.4.10, which were crashing continuously, every time with the same traces. We cannot reproduce this again, as we are in production and need stability, so we are no longer on that version and are a bit worried about upgrading again in the near future.

Some key lines in the logs. For 10.4.8:
For 10.4.10:
Do you need any more information? |
| Comments |
| Comment by Boris [ 2019-12-04 ] | ||
|
I can confirm this: we have also assembled a cluster of 3 servers, but on CentOS 7.

Dec 4 04:07:51 mysql1 mysqld: mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.10/wsrep-lib/src/transaction.cpp:1038: void wsrep::transaction::after_replay(const wsrep::transaction&): Assertion `other.state() == s_committed || other.state() == s_aborted' failed.

After the error, mysqld itself restarted and connected to the cluster, but its status changed to:

We did not upgrade from old versions. PS: there are no other errors preceding this failure. | ||
| Comment by Mark Hughes [ 2019-12-07 ] | ||
|
We're experiencing this error too, on MariaDB 10.4.10 with a 3-node cluster (now running as a 4-node cluster, but it was also experienced when it was 3 nodes). None of the nodes have been upgraded from a previous major release of MariaDB, as they were all installed as 10.4 - I believe they haven't had minor updates either, as they're new installs [< 3 weeks old].

This has now happened on two separate nodes. The nodes are virtualised (running on the Jelastic platform) - the first node failed 2 or 3 times in an identical way; the second node failed for the first time today. Restarting the first failed server stalled - like Boris, it remained in "Joining: receiving State Transfer" indefinitely (requiring me to delete everything in /var/lib/mysql and let it 'start again'). The most recently failed server actually did start eventually, after about 15 minutes in this state.

There IS one thing that seems to be unique to the failing nodes pre-failure: when Node 1 failed on the couple of occasions it did, it had references in the MySQL log to "Created page /var/lib/mysql/gcache.page.XXXXXX of size XXXXXX" / "Deleted page /var/lib/mysql/gcache.page.XXXXXX". The new node that has just failed referenced exactly the same gcache files being created/deleted. The third node, which has never crashed, has never referenced the gcache files. [We now have 4 nodes, but that node has been up for less than 24 hours and is a replacement for one of the other nodes; it has also never referenced gcache nor crashed.]

All servers are running Ubuntu 16.04. Attached are the logs for the recent server failure. | ||
| Comment by Mark Hughes [ 2019-12-09 ] | ||
|
This bug has occurred a few more times in the past 24 hours. Some of the time the server repairs itself (restarting / doing IST / coming back online); other times it goes down, or reports that the server is not in a state to operate ("wsrep has not yet prepared node for application use"). It has reached the point where it is causing fairly regular server outages, as we have to keep jumping to a different DB node - today it caused the entire cluster to fail and have to be bootstrapped again (from a node that wasn't in a bootstrap-ready state, but was still technically the last to leave the cluster). Is there any mitigation that can at least bypass this bug? | ||
| Comment by Mark Hughes [ 2019-12-18 ] | ||
|
Update: another crash today; I've attached another log. Restarting the server failed, so I wiped everything within /var/lib/mysql/ again and then started it successfully. | ||
| Comment by ujang [ 2020-01-10 ] | ||
|
Hi, I am facing this situation too: one of our nodes intermittently crashes. We are on 10.4.11, wsrep_provider_version 26.4.3(r4535). Please let me know when the fix is released. |