[MDEV-14504] galera cannot detect mysqld coredump Created: 2017-11-26  Updated: 2021-12-23  Resolved: 2021-12-23

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.2.7
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: TAO ZHOU Assignee: Jan Lindström (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

FreeBSD 11.1



 Description   

I was running MariaDB with Galera on 3 nodes (2 nodes + 1 garb).
One of the nodes often dumps core.
When a core dump happens, it calls addr2line, which takes a very long time to finish.
Meanwhile, the other nodes still think it is alive and end up with huge transaction queues.
I tried to disable core dumps so that the node would die completely, but did not succeed.

Here are the things I tried:
1. Changing the kernel parameter kern.coredump to 0
2. Adding core_file_size=0 to my.cnf
3. Adding mysql_limits_args to /etc/rc.conf

None of them worked.
It always calls addr2line, and I am not sure whether it would ever finish. I have always killed all mysql processes manually when this happens.

Do I have to build MariaDB with special flags to disable addr2line?



 Comments   
Comment by Sergey Vojtovich [ 2017-11-27 ]

Should be --stack-trace=0

Comment by Elena Stepanova [ 2017-12-27 ]

laocius, did --stack-trace=0 help?

Comment by TAO ZHOU [ 2018-01-24 ]

I haven't tried it yet since I disabled galera on this cluster.
But it just happened again on a different cluster.
Where do I add this flag?

Comment by TAO ZHOU [ 2018-01-25 ]

This issue is fatal. If one of the nodes dumps core, the whole cluster stops working.

Comment by Daniel Black [ 2018-02-07 ]

stack-trace is a command-line argument to mysqld: https://mariadb.com/kb/en/library/mysqld-options/#-stack-trace
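
On where the flag goes: as a minimal sketch, assuming the usual option-file convention that mysqld command-line options may also be written, without the leading dashes, in the [mysqld] section of my.cnf, the setting could look like the fragment below. Which my.cnf this installation actually reads is an assumption and has not been confirmed in this report.

```ini
# my.cnf (location depends on the installation; assumed here)
[mysqld]
# Intended to match starting mysqld with --stack-trace=0,
# i.e. do not print a symbolic stack trace on a fatal signal.
stack-trace=0
```

Alternatively, the option can be passed directly on the mysqld command line as --stack-trace=0.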

Comment by TAO ZHOU [ 2018-02-07 ]

core_file_size=0 worked. Not sure why it didn't work before.

[mysqld_safe]
core_file_size=0

Comment by Daniel Black [ 2018-02-08 ]

Given that evs.suspect_timeout is 5 seconds (http://galeracluster.com/documentation-webpages/galeraparameters.html#evs-suspect-timeout), it looks like the response handling is still going on.

In the MariaDB code, maybe handle_fatal_signal should call wsrep_provider->disconnect().

The MariaDB/Galera versions would be handy to know for reference.
