[MDEV-4464] MariaDB 5.5.29 + Galera on Ubuntu 12.04 Crash Created: 2013-05-01  Updated: 2013-10-01  Resolved: 2013-10-01

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.29-galera
Fix Version/s: 5.5.34-galera

Type: Bug Priority: Major
Reporter: Tim Clark Assignee: Seppo Jaakola
Resolution: Fixed Votes: 0
Labels: galera
Environment:

Ubuntu 12.04 (debs from http://ftp.heanet.ie/mirrors/mariadb/repo/5.5/ubuntu precise main)



 Description   

Three-node Galera cluster using rsync SST. It runs fine for a few days, then does this in the middle of the night, with no load on the server, loads of free RAM, and no swapping:

May 1 01:54:38 site-db2 mysqld: 130501 1:54:38 [ERROR] mysqld got signal 11 ;
May 1 01:54:38 site-db2 mysqld: This could be because you hit a bug. It is also possible that this binary
May 1 01:54:38 site-db2 mysqld: or one of the libraries it was linked against is corrupt, improperly built,
May 1 01:54:38 site-db2 mysqld: or misconfigured. This error can also be caused by malfunctioning hardware.
May 1 01:54:38 site-db2 mysqld:
May 1 01:54:38 site-db2 mysqld: To report this bug, see http://kb.askmonty.org/en/reporting-bugs
May 1 01:54:38 site-db2 mysqld:
May 1 01:54:38 site-db2 mysqld: We will try our best to scrape up some info that will hopefully help
May 1 01:54:38 site-db2 mysqld: diagnose the problem, but since we have already crashed,
May 1 01:54:38 site-db2 mysqld: something is definitely wrong and this may fail.
May 1 01:54:38 site-db2 mysqld:
May 1 01:54:38 site-db2 mysqld: Server version: 5.5.29-MariaDB-mariadb1~precise
May 1 01:54:38 site-db2 mysqld: key_buffer_size=268435456
May 1 01:54:38 site-db2 mysqld: read_buffer_size=131072
May 1 01:54:38 site-db2 mysqld: max_used_connections=6
May 1 01:54:38 site-db2 mysqld: max_threads=802
May 1 01:54:38 site-db2 mysqld: thread_count=3
May 1 01:54:38 site-db2 mysqld: It is possible that mysqld could use up to
May 1 01:54:38 site-db2 mysqld: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2018916 K bytes of memory
May 1 01:54:38 site-db2 mysqld: Hope that's ok; if not, decrease some variables in the equation.
May 1 01:54:38 site-db2 mysqld:
May 1 01:54:38 site-db2 mysqld: Thread pointer: 0x0x0
May 1 01:54:38 site-db2 mysqld: Attempting backtrace. You can use the following information to find out
May 1 01:54:38 site-db2 mysqld: where mysqld died. If you see no messages after this, something went
May 1 01:54:38 site-db2 mysqld: terribly wrong...
May 1 01:54:38 site-db2 mysqld: stack_bottom = 0x0 thread_stack 0x40000
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4fd0682fb]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4fcc8daa1]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4fb501cb0]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4f856ec65]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4f856ee39]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4f8570438]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4f8627d2e]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4f862d8b7]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4f8633181]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4fb4f9e9a]
May 1 01:54:38 site-db2 mysqld: :0()[0x7fc4fac2accd]
May 1 01:54:38 site-db2 mysqld: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
May 1 01:54:38 site-db2 mysqld: information that should help you find out what is causing the crash.
May 1 01:54:38 site-db2 mysqld_safe: Number of processes running now: 0
May 1 01:54:38 site-db2 mysqld_safe: WSREP: not restarting wsrep node automatically
May 1 01:54:38 site-db2 mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended

Any help greatly appreciated!

Tim



 Comments   
Comment by Elena Stepanova [ 2013-05-01 ]

Hi Tim,

As you can see yourself, there's not much here to look at so far, so we'll need some additional information.

Is it a production server? Does it have high load during the day?
Do all nodes crash, or is it always the same one?
Does the crashing node serve as a master, or is it a pure slave (in other words, does it receive queries from client connections, even if it's only maintenance job, or only through replication)?

If possible, please enable the general log ( SET GLOBAL general_log=1; ) on the crashing node(s) for a while (until the crash), so we will at least see what it was doing, if the crashing statement came from outside. You can see or configure the general log's location and name through the general_log_file variable.
Please also preserve binary logs of all nodes from around the time of the crash, so we could later compare them with the general log.
When you get the crash again, please pack and upload the general log, the error log (or the related fragment of syslog if that's what you're using), the binary logs and my.cnf to ftp.askmonty.org/private (please include MDEV-4464 somewhere in the archive name so we could find it), and let us know through a comment here.
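The steps above boil down to something like the following on the crashing node (a sketch only — the log and config paths are examples for a default Ubuntu install and depend on your setup):

```shell
# Enable the general query log and check where it is written
# (SET GLOBAL does not survive a restart, so re-enable it after each crash).
mysql -e "SET GLOBAL general_log=1"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'general_log_file'"

# After the next crash: pack the general log, the syslog fragment,
# the binary logs and the config, with the ticket number in the name.
# All paths below are examples, not taken from this report.
tar czf MDEV-4464-db2.tar.gz \
    /var/log/mysql/mysql.log \
    /var/log/syslog \
    /var/lib/mysql/mariadb-bin.* \
    /etc/mysql/my.cnf
```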

Thanks!

Comment by Tim Clark [ 2013-05-01 ]

Hi Elena,

It is a production server - but it isn't under load at the moment.
I've had all 3 nodes crash at once about a week back - but in this case, only node2 crashed.
All three nodes are setup as master - and most of the data processing is sent to node2 at the moment.

I've enabled the general log and will send on all the info if/when it happens again.

Thanks for your help so far!

Tim

Comment by Tim Clark [ 2013-05-13 ]

Hi Elena,

It happened again this weekend. This time all three nodes crashed at 11:20 on Friday, but I didn't get the same error message. I've grabbed the general log, syslog and config for one of the nodes and pushed them up to the FTP site; I can do the same for the others if that will help?

Tim

Comment by Elena Stepanova [ 2013-05-13 ]

It looks like it starts from some kind of environment problem (network, permissions or filesystem); here's what I see in the error log (syslog):

May 10 23:22:06 mysqld: 130510 23:22:06 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May 10 23:22:33 nrpe[18449]: Error: Network server getpeername() failure (107: Transport endpoint is not connected)
May 10 23:22:33 nrpe[18449]: Cannot remove pidfile '/var/run/nagios/nrpe.pid' - check your privileges.
May 10 23:22:33 nrpe[18449]: Daemon shutdown
May 10 23:22:33 mysqld: terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
May 10 23:22:33 mysqld: what(): Transport endpoint is not connected
May 10 23:22:33 mysqld: 130510 23:22:33 [ERROR] mysqld got signal 6 ;

That looks a bit too close to be a coincidence.
Yes, let's look at the files (especially syslog) from the other nodes; maybe we'll see some correlation.

Thanks!

Comment by Tim Clark [ 2013-05-13 ]

Hi Elena,

Two more tars uploaded for you. The really weird thing is that DB1 seems to have no logs at all over the weekend (not even a rotated blank one, although there were no other warnings on the server over the weekend).

-rw-r----- 1 mysql adm 579018336 May 13 14:27 mysql.log
-rw-r----- 1 mysql adm 0 May 12 06:43 mysql.log.1
-rw-r----- 1 mysql adm 2310904721 May 10 06:53 mysql.log.4

Another thing I noticed this morning which may or may not be relevant: when trying to resurrect the cluster (with gcomm:// on db1), there was constantly a mysql process 'in the way' (with a different PID each time I checked). I wonder if upstart is causing it to flip out? I've also had trouble in the past with an rsync process holding port 4444 open and stopping mysql from restarting, but that didn't seem to be the problem this morning.

Thanks!

Tim

Comment by Elena Stepanova [ 2013-05-13 ]

So, db2 failed with the exact same error, after a slightly different error from nrpe:

May 10 22:36:28 nrpe[32009]: Host 192.168.5.1 is not allowed to talk to us!
May 10 22:36:29 mysqld: terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
May 10 22:36:29 mysqld: what(): Transport endpoint is not connected
May 10 22:36:29 mysqld: 130510 22:36:29 [ERROR] mysqld got signal 6 ;

And db1 suffered a similar error:

May 10 23:10:38 parky-db1 nrpe[17035]: Host 192.168.5.1 is not allowed to talk to us!
May 10 23:17:01 parky-db1 CRON[17250]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 10 23:17:48 parky-db1 nrpe[17259]: Error: Network server getpeername() failure (107: Transport endpoint is not connected)
May 10 23:17:48 parky-db1 nrpe[17259]: Cannot remove pidfile '/var/run/nagios/nrpe.pid' - check your privileges.
May 10 23:17:48 parky-db1 nrpe[17259]: Daemon shutdown

According to syslog, there were continuous attempts to automatically restart the process, so if in addition to that you are also trying to start it manually, it's going to cause a conflict:

May 10 23:17:49 parky-db1 kernel: [960387.183873] init: mysql main process (26583) terminated with status 1
May 10 23:17:49 parky-db1 kernel: [960387.184024] init: mysql main process ended, respawning
May 10 23:18:19 parky-db1 kernel: [960417.576550] init: mysql post-start process (17272) terminated with status 1
May 10 23:18:21 parky-db1 kernel: [960419.353384] init: mysql main process (17271) terminated with status 1
May 10 23:18:21 parky-db1 kernel: [960419.353412] init: mysql main process ended, respawning
...

It's unclear from the log why the process on db1 couldn't be restarted; we only see it respawning every half a minute. Possibly it's the rsync issue that you mentioned – I've encountered it before as well: when a node dies for some reason during synchronization, a related rsync process is likely to stay open and needs to be killed manually (although I thought it produced more comprehensive logging). In any case it seems to be a consequence of the previous failure, which is the one that needs to be investigated.

I'll forward this issue to Seppo now, hopefully he can shed some light on this failure:

May 10 22:36:28 nrpe[32009]: Host 192.168.5.1 is not allowed to talk to us!
May 10 22:36:29 mysqld: terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
May 10 22:36:29 mysqld: what(): Transport endpoint is not connected
May 10 22:36:29 mysqld: 130510 22:36:29 [ERROR] mysqld got signal 6 ;

Comment by Vladimir Perepechin [ 2013-05-14 ]

Got the same problem on CentOS 6.4.
Version:
mysqld Ver 5.5.29-MariaDB for Linux on x86_64 (MariaDB Server, wsrep_23.7.3.rXXXX)

Cluster with two nodes died.

In the messages log I got this:
May 14 14:49:18 mariadb1 xinetd[2869]: START: rsync pid=27847 from=::ffff:172.16.0.89
May 14 14:49:18 mariadb1 xinetd[27847]: warning: can't get client address: Connection reset by peer
May 14 14:49:18 mariadb1 rsyncd[27847]: params.c:Parameter() - Ignoring badly formed line in configuration file: nimal configuration file for rsync daemon
May 14 14:49:18 mariadb1 rsyncd[27847]: rsync: getpeername on fd0 failed: Transport endpoint is not connected (107)
May 14 14:49:18 mariadb1 rsyncd[27847]: rsync error: error in socket IO (code 10) at clientname.c(171) [receiver=3.0.6]
May 14 14:49:18 mariadb1 xinetd[2869]: EXIT: rsync status=10 pid=27847 duration=0(sec)

The root cause of the problem was port scanning from 172.16.0.89 (the DB host can't connect to 172.16.0.89).
A simple nmap scan of the rsync port started rsync (via xinetd), and after it died the mysql server died a few seconds later.

Comment by Tim Clark [ 2013-05-20 ]

Hi Elena / Seppo,

Same thing happened again this weekend - have uploaded logs for DB1 + DB2 (issue occurs May 17 22:33:13).

Let me know if you need more information?

Thanks!
Tim

Comment by Seppo Jaakola [ 2013-05-20 ]

Tim, it looks like something stops network connectivity intermittently in your cluster. Could it be e.g. a DHCP renewal? Or do you have AppArmor configured? Maybe that explains it... Is Galera configured to use SSL encryption?

Vladimir, what makes you think your case is identical to Tim's issue? At first glance it looks rather like an rsync configuration issue. Do you have a hanging rsync daemon after SST? You could also try switching to some other SST method (mysqldump or xtrabackup) to rule out a potential rsync problem.

Comment by Tim Clark [ 2013-05-20 ]

Thanks for the reply Seppo. IPs are static; DB1 and Web1 are both on the same hypervisor (and therefore shouldn't lose contact with each other), while DB2 is on a different one. We do have AppArmor installed by default – have I missed a config stage? Galera isn't set to run in SSL mode AFAIK (although you can check my config files if you want to be sure).

I've been massively impressed with Galera / MariaDB when it's working – writing to any node as master is awesome. But I'm a bit confused as to why the servers end up in a split-brain state when quorum should always be possible. Even if there was a temporary network glitch, surely this shouldn't cause the servers to keel over and not get up – there should always be 2 vs 1, so quorum should be possible... or have I misunderstood how it should work?

Another problem arose today: trying to back up a database to the same DB server like this:

mysqldump database1 | mysql database2
mysqldump: Error 2013: Lost connection to MySQL server during query when dumping table `cache_form` at row: 6

It fails at a different place each time... the same operation works on the bog-standard mysql that comes with Ubuntu.
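One way to narrow this down (a hedged suggestion, not from the original thread) is to split the pipe, so it becomes clear whether the dump side or the reload side is dropping the connection:

```shell
# Split the dump-and-reload pipe into two steps so the failing side
# is identifiable; /tmp/database1.sql is just an example path.
mysqldump database1 > /tmp/database1.sql   # does the dump alone complete?
mysql database2 < /tmp/database1.sql       # does the reload alone complete?
```

If the dump alone succeeds, the piped reload putting write load on the same wsrep node would be the next suspect.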

Tim

Comment by Tim Clark [ 2013-05-20 ]

Checked AppArmor this evening; it has the standard MariaDB 'blank' policy for usr.sbin.mysqld, so I don't think that's the problem.

Tim

Comment by Vladimir Perepechin [ 2013-05-21 ]

Seppo, I thought it's the same problem because of my log entries after the crash:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
what(): Transport endpoint is not connected
130514 15:39:08 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 5.5.29-MariaDB
key_buffer_size=536870912
read_buffer_size=1048576
max_used_connections=13
max_threads=153
thread_count=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1936541 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
??:0(my_print_stacktrace)[0xa9d44e]
??:0(handle_fatal_signal)[0x6e3c2b]
:0()[0x7f9f6f778500]
:0()[0x7f9f6e02e8a5]
:0()[0x7f9f6e030085]
:0()[0x7f9f6e6d1a5d]
:0()[0x7f9f6e6cfbe6]
:0()[0x7f9f6e6cfc13]
:0()[0x7f9f6e6cfd0e]
:0()[0x7f9f6baecf8a]
:0()[0x7f9f6baed0ab]
:0()[0x7f9f6baed231]
:0()[0x7f9f6bade497]
:0()[0x7f9f6badf0a3]
:0()[0x7f9f6bae8cf0]
:0()[0x7f9f6bb092c9]
:0()[0x7f9f6bb03616]
:0()[0x7f9f6bb1c227]
:0()[0x7f9f6bb1fb19]
:0()[0x7f9f6f770851]
:0()[0x7f9f6e0e490d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

Not the best backtrace, but googling "'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'" + mariadb brings me to this report.
And looking at Elena's report:

> May 10 23:22:33 mysqld: terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
> May 10 23:22:33 mysqld: what(): Transport endpoint is not connected
> May 10 23:22:33 mysqld: 130510 23:22:33 [ERROR] mysqld got signal 6 ;

I thought that it's the same problem.

Comment by Seppo Jaakola [ 2013-05-25 ]

@Vladimir: your crash does indeed look the same.

...and this issue is probably the same as the one reported against PXC in these reports:
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1153727
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034

As a workaround, the cluster ports can be protected, e.g. with iptables.
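Such an iptables workaround could look something like this (a sketch only: the port list uses the Galera defaults – 4567 for group communication, 4568 for IST, 4444 for rsync SST – and the node addresses are placeholders, not taken from this report):

```shell
# Emit iptables commands that restrict the Galera ports to the known
# cluster nodes and drop everything else, so an external scan (e.g.
# Nessus or nmap) never reaches the wsrep listeners.
galera_fw_rules() {
    local nodes="$1" port node
    for port in 4567 4568 4444; do
        for node in $nodes; do
            echo "iptables -A INPUT -p tcp --dport $port -s $node -j ACCEPT"
        done
        # anyone not in the node list is dropped on this port
        echo "iptables -A INPUT -p tcp --dport $port -j DROP"
    done
}

# Print the rules for review; apply them as root only after checking.
galera_fw_rules "192.168.5.1 192.168.5.2 192.168.5.3"
```

The function only prints the commands rather than applying them, so the generated rules can be reviewed (and adapted to your addressing) before being run as root.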

Comment by Tim Clark [ 2013-06-05 ]

Hi Seppo,

Quick update: since protecting the ports, the issue hasn't reoccurred. I also noticed that we had a scheduled Nessus scan on Fridays (which coincides nicely with the crash), so that's a very likely cause.

Tim

Comment by Seppo Jaakola [ 2013-10-01 ]

The issue is now in Fix Released state in the Galera portal. Galera plugin 2.6 has the fix and is present in MGC 5.5.32 and later releases.
