[MDEV-4464] MariaDB 5.5.29 + Galera on Ubuntu 12.04 Crash Created: 2013-05-01 Updated: 2013-10-01 Resolved: 2013-10-01 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | None |
| Affects Version/s: | 5.5.29-galera |
| Fix Version/s: | 5.5.34-galera |
| Type: | Bug | Priority: | Major |
| Reporter: | Tim Clark | Assignee: | Seppo Jaakola |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | galera | ||
| Environment: |
Ubuntu 12.04 (debs from http://ftp.heanet.ie/mirrors/mariadb/repo/5.5/ubuntu precise main) |
||
| Description |
|
3 node Galera cluster using rsync SST. It runs fine for a few days, then crashes in the middle of the night with no load on the server, plenty of RAM, and no swapping: May 1 01:54:38 site-db2 mysqld: 130501 1:54:38 [ERROR] mysqld got signal 11 ; Any help greatly appreciated! Tim |
| Comments |
| Comment by Elena Stepanova [ 2013-05-01 ] |
|
Hi Tim, As you can see yourself, there's not much here to look at so far, so we'll need some additional information. Is it a production server? Does it have high load during the day? If possible, please enable the general log ( SET GLOBAL general_log=1; ) on the crashing node(s) for a while (until the crash), so we will at least see what it was doing, in case the crashing statement came from outside. You can see or configure the general log's location and name through the general_log_file variable. Thanks! |
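Elena's suggestion can be applied at runtime without restarting the server. A minimal sketch, assuming a MySQL/MariaDB 5.5-era node (the log file path is illustrative; the variables themselves are standard):

```sql
-- Enable the general query log on the running node (no restart needed).
SET GLOBAL general_log = 1;

-- Check whether it is on and where it is being written (server-specific path).
SHOW VARIABLES LIKE 'general_log%';

-- Optionally point it at a known file before enabling
-- (illustrative path; adjust for your datadir and permissions):
SET GLOBAL general_log_file = '/var/log/mysql/general.log';
```

Note that the general log grows quickly on a busy server, so it is usually switched off again ( SET GLOBAL general_log=0; ) once the problem has been captured.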
| Comment by Tim Clark [ 2013-05-01 ] |
|
Hi Elena, It is a production server, but it isn't under load at the moment. I've enabled the general log and will send on all the info if/when it happens again. Thanks for your help so far! Tim |
| Comment by Tim Clark [ 2013-05-13 ] |
|
Hi Elena, It happened again this weekend. This time all three nodes crashed at 11:20 on Friday, but I didn't get the same error message. I've grabbed the general log, syslog and config and pushed them up to the FTP site for one of the nodes - I can do the same for the others if that will help? Tim |
| Comment by Elena Stepanova [ 2013-05-13 ] |
|
It looks like it starts from some kind of environment problem (network, permissions or filesystem); here's what I see in the error log (syslog): May 10 23:22:06 mysqld: 130510 23:22:06 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. Looks a bit too close to be a coincidence. Thanks! |
| Comment by Tim Clark [ 2013-05-13 ] |
|
Hi Elena, Two more tars uploaded for you - the really weird thing is that DB1 seems to have no logs at all over the weekend (not even a rotated blank one, although there were no other warnings on the server over the weekend).
Another thing I noticed this morning which may or may not be relevant: when trying to resurrect the cluster (with a gcomm:// on db1) there was constantly a mysql process (with a different PID each time I checked) 'in the way'. I wonder if upstart is causing it to flip out? I've also had trouble in the past with an rsync process holding open port 4444 and stopping mysql from restarting, but that didn't seem to be the problem this morning. Thanks! Tim |
| Comment by Elena Stepanova [ 2013-05-13 ] |
|
So, db2 failed with the exact same error, after a slightly different error from nrpe: May 10 22:36:28 nrpe[32009]: Host 192.168.5.1 is not allowed to talk to us! And db1 suffered a similar error: May 10 23:10:38 parky-db1 nrpe[17035]: Host 192.168.5.1 is not allowed to talk to us! According to syslog, there were continuous attempts to automatically restart the process, so if in addition to that you are also trying to start it manually, that will cause a conflict: May 10 23:17:49 parky-db1 kernel: [960387.183873] init: mysql main process (26583) terminated with status 1 It's unclear from the log why the process on db1 couldn't be restarted; we only see it respawning every half a minute. Possibly it's the rsync issue that you mentioned - I have encountered it before as well: when a node dies for some reason during synchronization, a related rsync process is likely to stay open and needs to be killed manually (although I thought it produced more comprehensive logging). In any case it seems to be a consequence of the previous failure, which is the one that needs to be investigated. I'll forward this issue to Seppo now; hopefully he can shed some light on this failure: May 10 22:36:28 nrpe[32009]: Host 192.168.5.1 is not allowed to talk to us! |
| Comment by Vladimir Perepechin [ 2013-05-14 ] |
|
Got the same problem on CentOS 6.4. A cluster with two nodes died. In messages we got this: The cause of the problem was port scanning from 172.16.0.89 (the db host can't connect to 172.16.0.89). |
| Comment by Tim Clark [ 2013-05-20 ] |
|
Hi Elena / Seppo, The same thing happened again this weekend - I have uploaded logs for DB1 + DB2 (the issue occurs at May 17 22:33:13). Let me know if you need more information. Thanks! |
| Comment by Seppo Jaakola [ 2013-05-20 ] |
|
Tim, it looks like something stops network connectivity intermittently in your cluster. Could it be e.g. a DHCP renewal? Or do you have AppArmor configured? Maybe that explains it... Is Galera configured to use SSL encryption? Vladimir, what makes you think your case is identical to Tim's issue? At first glance it looks rather like an rsync configuration issue. Do you have a hanging rsync daemon after SST? You could also try switching to some other SST method (mysqldump or xtrabackup) to rule out a potential rsync problem. |
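Switching the SST method, as suggested above, is a my.cnf change on the joining nodes. A hedged sketch (the credential values are placeholders, and xtrabackup must be installed on all nodes for that method to work):

```ini
[mysqld]
# Use xtrabackup instead of rsync for state snapshot transfers;
# 'mysqldump' is another valid value for ruling out rsync problems.
wsrep_sst_method = xtrabackup
# Credentials used for the SST (placeholder values, format user:password):
wsrep_sst_auth   = sst_user:sst_password
```

The node needs a restart for the new SST method to take effect, and the user named in wsrep_sst_auth must exist with sufficient privileges on the donor.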
| Comment by Tim Clark [ 2013-05-20 ] |
|
Thanks for the reply Seppo. IPs are static - DB1 and Web1 are both on the same hypervisor (and therefore shouldn't lose contact with each other) while DB2 is on a different one. We do have apparmor installed by default - have I missed a config stage? Galera isn't set to run in SSL mode AFAIK (although you can check my config files if you want to be sure). I've been massively impressed with Galera / MariaDB when it's working - writing to any node as master is awesome... but I'm a bit confused as to why the servers end up in a split-brain state when quorum should always be possible. Even if there was a temporary network glitch, surely it shouldn't cause the servers to keel over and not get up, as there should always be 2 vs 1 and so quorum should be possible... or have I misunderstood how it should work? Another problem arose today - trying to copy a database to another database on the same db server like this: mysqldump database1 | mysql database2 It fails at a different place each time... the same operation works on the bog-standard mysql that comes with Ubuntu... Tim |
| Comment by Tim Clark [ 2013-05-20 ] |
|
Checked apparmor this evening - it has the standard MariaDB 'blank' policy for usr.sbin.mysqld - so I don't think that's the problem. Tim |
| Comment by Vladimir Perepechin [ 2013-05-21 ] |
|
Seppo, I thought it's the same problem because of my log entries after the crash: terminate called after throwing an instance of 'boost::exception_detail::clone_impl&lt;boost::exception_detail::error_info_injector&lt;asio::system_error&gt; &gt;' To report this bug, see http://kb.askmonty.org/en/reporting-bugs We will try our best to scrape up some info that will hopefully help Server version: 5.5.29-MariaDB Thread pointer: 0x0x0 Not the best backtrace, but googling "'boost::exception_detail::clone_impl&lt;boost::exception_detail::error_info_injector&lt;asio::system_error&gt; &gt;'" + mariadb brings me to this report, where Tim's syslog shows the same thing: > May 10 23:22:33 mysqld: terminate called after throwing an instance of 'boost::exception_detail::clone_impl&lt;boost::exception_detail::error_info_injector&lt;asio::system_error&gt; &gt;' |
| Comment by Seppo Jaakola [ 2013-05-25 ] |
|
@Vladimir: your crash does indeed look the same... and this issue is probably the same as reported with PXC in these reports: As a workaround, the cluster ports can be protected, e.g. with iptables. |
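One possible shape for that workaround, assuming the default Galera ports (4567 for group communication, 4568 for IST, 4444 for rsync/xtrabackup SST) and an illustrative cluster subnet of 192.168.5.0/24 - both assumptions to adjust for your own setup:

```shell
# Allow cluster peers to reach the Galera ports...
iptables -A INPUT -p tcp -s 192.168.5.0/24 -m multiport \
         --dports 4567,4568,4444 -j ACCEPT
# Group communication may also use UDP on 4567:
iptables -A INPUT -p udp -s 192.168.5.0/24 --dport 4567 -j ACCEPT
# ...and drop everything else (e.g. a Nessus or other port scan) aimed at them:
iptables -A INPUT -p tcp -m multiport --dports 4567,4568,4444 -j DROP
iptables -A INPUT -p udp --dport 4567 -j DROP
```

This only shields the replication ports; client traffic on 3306 is unaffected. Rules added this way are not persistent across reboots, so they would need to be saved with the distribution's usual mechanism.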
| Comment by Tim Clark [ 2013-06-05 ] |
|
Hi Seppo, Quick update - since protecting the ports, the issue hasn't recurred. I also noticed that we had a scheduled Nessus scan on Fridays (which coincides nicely with the crash), so that's a very likely cause. Tim |
| Comment by Seppo Jaakola [ 2013-10-01 ] |
|
The issue is now in Fix Released state in the Galera portal. Galera plugin 2.6 contains the fix and is included in MariaDB Galera Cluster 5.5.32 and later releases. |