[MDEV-14036] Primary node of MariaDB 10.1 got signal 11 suddenly Created: 2017-10-10 Updated: 2019-05-20 Resolved: 2019-05-20 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.1.22 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | y-taka | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | crash, galera |
| Environment: | Red Hat Enterprise Linux 6.8 |
| Attachments: |
|
| Description |
|
There is a MariaDB 10.1 Galera environment with three nodes. In this cluster, Node1 goes down suddenly with signal 11. Server resources (memory, disk, CPU) appear to be sufficient.
| Comments |
| Comment by Elena Stepanova [ 2017-10-10 ] |
|
From node1_info.zip:
Not much to go with. |
| Comment by Elena Stepanova [ 2017-10-10 ] |
|
y-taka, could you please enable core dump creation on the nodes? Hopefully, the next time the problem occurs, we will at least be able to get more diagnostics, a stack trace, etc. Right now we really have nothing to work with. We don't expect you to use a debug binary in the production environment, but getting the stack trace even from a non-debug version might help. |
| Comment by y-taka [ 2017-10-11 ] |
|
To Elena, thank you for the comments. We have already set the "core-file" parameter, but no core dump was created when the error happened. We don't use a debug binary. |
| Comment by Daniel Black [ 2017-10-11 ] |
|
Check that core-file-size is unlimited, or check LimitCORE if you use systemd (https://mariadb.com/kb/en/library/systemd/). |
| Comment by y-taka [ 2017-10-11 ] |
|
To Daniel, I checked that core-file-size is "unlimited" (<9284> is the PID of the mysqld process).
|
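The check described above can also be scripted against the kernel's per-process view in /proc/<pid>/limits. The following is a minimal sketch, assuming Linux and Python 3; the function names `core_limit` and `read_core_limit` are illustrative only and are not part of MariaDB.

```python
import re

def core_limit(limits_text):
    """Return (soft, hard) from the 'Max core file size' row, or None if absent.

    limits_text is the full content of /proc/<pid>/limits.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max core file size"):
            # Columns are separated by runs of spaces: name, soft, hard, units.
            fields = re.split(r"\s{2,}", line.strip())
            return fields[1], fields[2]
    return None

def read_core_limit(pid):
    """Read the core-size limits of a running process (needs read access to /proc)."""
    with open("/proc/%d/limits" % pid) as fh:
        return core_limit(fh.read())

# Example call for the PID mentioned in the comment above:
# read_core_limit(9284)
```

Even when both values are "unlimited", a core file may still not be written. Kernel settings such as kernel.core_pattern and fs.suid_dumpable, and write permission on the target directory, are other places worth checking.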
| Comment by y-taka [ 2017-10-13 ] |
|
The problem (signal 11 error) reproduced today, and I found some new things:
+ After signal 11 occurred, I could still log in to MariaDB (but it took longer than usual) |
| Comment by Andrii Nikitin (Inactive) [ 2017-10-13 ] |
|
The dmesg.log files of the nodes indicate quite a few hardware / filesystem problems, with messages like:
Node 2 had such a problem right before the crash on Node 1.
So far it looks like hardware problems caused the instability. Do you agree? |
| Comment by y-taka [ 2017-10-16 ] |
|
To Andrii, thanks for the feedback. I'll investigate the hardware layer.
|
| Comment by y-taka [ 2017-10-16 ] |
|
To Andrii, the problem node is "Node1", but the "mptscsih:" message you quoted is written only in Node2's /var/log/messages. When signal 11 happened on Node1, Node2 worked without any problem. |
| Comment by Andrii Nikitin (Inactive) [ 2017-10-18 ] |
|
As mentioned earlier, I do see quite a few of those messages in the dmesg of Node 1 as well. That is a bad sign, and it is very likely causing the problems. Please feel free to send the logs every time the issue occurs and upload a compressed core dump to the ticket or to ftp.askmonty.org/private, so we can try to find more hints. |
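To settle which nodes actually logged the "mptscsih:" messages discussed in this thread, the per-node logs can be counted side by side. This is a rough sketch, assuming Python 3 and plain-text copies of each node's dmesg or /var/log/messages output. The function names and the file names in the usage note are hypothetical, and the match is a plain substring test, not a parser for any particular driver's message format.

```python
from collections import Counter

def count_tagged_lines(log_lines, tags=("mptscsih:",)):
    """Count how many lines contain each tag (plain substring match)."""
    counts = Counter()
    for line in log_lines:
        for tag in tags:
            if tag in line:
                counts[tag] += 1
    return counts

def compare_nodes(paths_by_node, tags=("mptscsih:",)):
    """Map node name -> Counter of tag hits found in that node's log file."""
    result = {}
    for node, path in paths_by_node.items():
        # errors="replace" keeps the scan going if the log has stray bytes.
        with open(path, errors="replace") as fh:
            result[node] = count_tagged_lines(fh, tags)
    return result
```

A call such as `compare_nodes({"Node1": "node1_dmesg.log", "Node2": "node2_dmesg.log"})` (file names are placeholders) would show at a glance whether the messages appear on one node or on both.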