[MDEV-9914] MariaDB fails to shutdown cleanly after gcomm error which brings down backend Created: 2016-04-13  Updated: 2023-04-12  Resolved: 2023-04-11

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1.13
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Markus Ueberall Assignee: Jan Lindström (Inactive)
Resolution: Won't Fix Votes: 0
Labels: galera
Environment:

Ubuntu Xenial [using Wily packages], Ubuntu Trusty



 Description   

(Sorry, this is a bit longish.)

Basic setup: Three vservers ("vserver03", "vserver04", "vserver05"), using internal IPv4 addresses 172.16.0.[123] with the following settings (excerpt):

wsrep_node_name=vserver04
wsrep_node_address=172.16.0.1
wsrep_cluster_address="gcomm://"
##other nodes have 'wsrep_cluster_address="gcomm://172.16.0.1"' or the like in place
wsrep_provider_options="gmcast.listen_addr=tcp://172.16.0.1:4567; gcache.size=128M; gcache.name=/tmp/galera.cache; gcache.page_size=128M"
##other nodes use 'gmcast.listen_addr=tcp://172.16.0.[23]:4567' here

(Note: The above works when using the vservers' public IP addresses, so the basic mariadb/galera configuration appears to be ok.)

The above obviously implies that the separate machines cannot contact each other directly. To remedy this, stunnel is used as shown below: in short, stunnel instances on vserver03/vserver05 would listen on ports 172.16.0.1:[4444,4567,4568], whereas another stunnel instance on vserver04 would listen (client=no) on 172.16.0.[23]:... and connect (client=yes) to 172.16.0.1:... in order to forward everything sent from vserver03/vserver05. All directions are covered, of course.
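The forwarding described above would correspond, roughly, to an stunnel configuration along the following lines on vserver04. This is a minimal sketch only: the service names, certificate path, and public endpoint address are assumptions based on the description, not the reporter's actual files.

```ini
; Hypothetical stunnel fragment on vserver04 (names and paths assumed).
; Server side: accept TLS from the remote nodes' stunnel instances and
; forward the decrypted traffic to the local Galera listener.
[vserver04-galera_p1-ipv6]
client = no
accept = :::4567
connect = 172.16.0.1:4567
cert = /etc/stunnel/vserver04.pem

; Client side (one such service per remote node): expose the remote node
; locally as 172.16.0.2 or 172.16.0.3 and tunnel out to its public endpoint.
[vserver04-to-peer]
client = yes
accept = 172.16.0.2:4567
connect = <peer-public-address>:4567
```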

Example: After starting MariaDB on vserver04 using "service mysql start --wsrep-new-cluster", it listens on port 4567 using IP address 172.16.0.1:

[2016-04-13T16:06:55] root@vserver04:/etc/mysql# netstat -tulpn | grep 4567
tcp        0      0 172.16.0.1:4567         0.0.0.0:*               LISTEN      32676/mysqld    
tcp        0      0 172.16.0.3:4567         0.0.0.0:*               LISTEN      20363/stunnel4  
tcp        0      0 172.16.0.2:4567         0.0.0.0:*               LISTEN      20363/stunnel4  

However, when MariaDB is started on vserver05 a few seconds later using "service mysql start", vserver04's /var/log/syslog shows the following:

Apr 13 16:06:44 vserver04 systemd[1]: Started MariaDB database server.
Apr 13 16:07:13 vserver04 stunnel: LOG5[56]: Service [vserver04-galera_p1-ipv6] accepted connection from 2a03:nnnn:nnnn:nnnn::1:52568
[...]
Apr 13 16:07:13 vserver04 stunnel: LOG6[56]: SSL accepted: new session negotiated
Apr 13 16:07:13 vserver04 stunnel: LOG6[56]: Negotiated TLSv1.2 ciphersuite ECDHE-RSA-AES256-GCM-SHA384 (256-bit encryption)
Apr 13 16:07:13 vserver04 stunnel: LOG6[56]: s_connect: connecting 172.16.0.1:4567
Apr 13 16:07:13 vserver04 stunnel: LOG5[56]: s_connect: connected 172.16.0.1:4567
Apr 13 16:07:13 vserver04 stunnel: LOG6[56]: persistence: 172.16.0.1:4567 cached
Apr 13 16:07:13 vserver04 stunnel: LOG5[56]: Service [vserver04-galera_p1-ipv6] connected remote server from 172.16.0.1:53170
Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [ERROR] WSREP: exception from gcomm, backend must be restarted: Trying to add self addr tcp://172.16.0.1:4567 to addr list (FATAL)
Apr 13 16:07:13 vserver04 mysqld[26689]: #011 at gcomm/src/gmcast.cpp:insert_address():762
Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: gcomm: terminating thread
Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: gcomm: joining thread
Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: gcomm: closing backend
Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: Forced PC close
Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: gcomm: closed
Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570898982656 [ERROR] WSREP: gcs/src/gcs_core.cpp:core_handle_comp_msg():718: Malformed component message. Ignoring
Apr 13 16:07:17 vserver04 stunnel: LOG6[56]: SSL closed (SSL_read)
Apr 13 16:07:18 vserver04 stunnel: LOG5[57]: Service [vserver04-galera_p1-ipv6] accepted connection from 2a03:nnnn:nnnn:nnnn::1:52578
Apr 13 16:07:18 vserver04 stunnel: LOG6[57]: SSL accepted: previous session reused
Apr 13 16:07:18 vserver04 stunnel: LOG6[57]: s_connect: connecting 172.16.0.1:4567
Apr 13 16:07:18 vserver04 stunnel: LOG3[57]: s_connect: connect 172.16.0.1:4567: Connection refused (111)
Apr 13 16:07:18 vserver04 stunnel: LOG5[57]: Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Apr 13 16:07:19 vserver04 stunnel: LOG5[58]: Service [vserver04-galera_p1-ipv6] accepted connection from 2a03:nnnn:nnnn:nnnn::1:52582
[...]

For some reason, the second node is unable to join. This might be related to the use of stunnel (which works elsewhere, by the way), but the issue reported here is that the local instance should not be affected by this in the following way:

At this point in time, MariaDB is still up and running on vserver04 (queries work flawlessly), but port 172.16.0.1:4567 is now closed and won't come back up (see the "Connection refused" error message near the bottom of the log excerpt above, which repeats for all subsequent connection attempts from other nodes).
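Because mysqld keeps answering queries while the replication backend is gone, this broken state is easy to miss. A minimal sketch of detecting it from the log follows; the sample line is copied from the syslog excerpt above, and a real monitor would read /var/log/syslog or journalctl output instead of a shell variable.

```shell
#!/bin/sh
# Detect the fatal gcomm error after which the wsrep port stays closed.
# LOGLINE holds a sample copied from the syslog excerpt above; a real
# check would grep /var/log/syslog or `journalctl -u mariadb` instead.
LOGLINE='[ERROR] WSREP: exception from gcomm, backend must be restarted: Trying to add self addr tcp://172.16.0.1:4567 to addr list (FATAL)'
if printf '%s\n' "$LOGLINE" | grep -q 'backend must be restarted'; then
    echo 'WSREP backend is down: node still serves queries, but port 4567 stays closed until mysqld is restarted'
fi
```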

It is also no longer possible to shut down MariaDB cleanly: "service mysql stop" takes three minutes to return on the command line, showing the following in /var/log/syslog:

Apr 13 16:08:32 vserver04 systemd[1]: Stopping MariaDB database server...
Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Note] /usr/sbin/mysqld: Normal shutdown
Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Note] WSREP: Stop replication
Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Note] WSREP: Closing send monitor...
Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Note] WSREP: Closed send monitor.
Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Warning] WSREP: gcomm: backend already closed
[...]
Apr 13 16:10:02 vserver04 systemd[1]: mariadb.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
Apr 13 16:11:33 vserver04 systemd[1]: mariadb.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Apr 13 16:11:33 vserver04 systemd[1]: Stopped MariaDB database server.
Apr 13 16:11:33 vserver04 systemd[1]: mariadb.service: Unit entered failed state.
Apr 13 16:11:33 vserver04 systemd[1]: mariadb.service: Failed with result 'timeout'.
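The "Skipping SIGKILL" messages indicate that the service runs with SendSIGKILL=no, so systemd never force-kills the stuck process on its own. As a stopgap (not a fix for the underlying hang), a drop-in override could bound the shutdown time; this is a hypothetical sketch, and forcing SIGKILL risks an unclean shutdown on a node that is merely slow:

```ini
# Hypothetical /etc/systemd/system/mariadb.service.d/stop-timeout.conf
# Stopgap only: lets systemd SIGKILL a hung mysqld after two minutes
# instead of leaving it behind holding locks.
[Service]
TimeoutStopSec=120
SendSIGKILL=yes
```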

What's worse, this also leaves behind an instance of /usr/sbin/mysqld that still holds locks until you terminate it with "kill -9":

[2016-04-13T16:15:04] root@vserver04:/etc/mysql# ps axu | grep 26689
mysql    26689  0.1  2.8 1187736 172348 ?      Ssl  16:06   0:01 /usr/sbin/mysqld

Trying to re-execute "service mysql start --wsrep-new-cluster" without killing the above yields:

Apr 13 16:04:31 vserver04 mysqld[25641]: 2016-04-13 16:04:31 140137967790336 [ERROR] mysqld: Can't lock aria control file '/var/lib/mysql/aria_log_control' for exclusive use, error: 11. Will retry for 30 seconds
[...]
Apr 13 16:05:02 vserver04 mysqld[25641]: 2016-04-13 16:05:02 140137967790336 [ERROR] Plugin 'Aria' init function returned error.
Apr 13 16:05:02 vserver04 mysqld[25641]: 2016-04-13 16:05:02 140137967790336 [ERROR] Plugin 'Aria' registration as a STORAGE ENGINE failed.
[...]
Apr 13 16:05:02 vserver04 mysqld[25641]: 2016-04-13 16:05:02 140137967790336 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11
Apr 13 16:05:02 vserver04 mysqld[25641]: 2016-04-13 16:05:02 140137967790336 [Note] InnoDB: Retrying to lock the first data file
[...]
Apr 13 16:06:07 vserver04 mysqld[25641]: 2016-04-13 16:06:07 140137967790336 [ERROR] Aria engine is not enabled or did not start. The Aria engine must be enabled to continue as mysqld was configured with --with-aria-tmp-tables
Apr 13 16:06:07 vserver04 mysqld[25641]: 2016-04-13 16:06:07 140137967790336 [ERROR] Aborting
Apr 13 16:06:12 vserver04 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Apr 13 16:06:12 vserver04 systemd[1]: Failed to start MariaDB database server.
Apr 13 16:06:12 vserver04 systemd[1]: mariadb.service: Unit entered failed state.
Apr 13 16:06:12 vserver04 systemd[1]: mariadb.service: Failed with result 'signal'.

(The above is not tied to a particular vserver or to the use of Ubuntu Xenial vs. Trusty; any variation of the setup yields the very same results.)



 Comments   
Comment by Andrew Garner [ 2016-08-11 ]

I can generate a similar case by taking a cluster, stopping one node, and restarting it with galera_new_cluster (inducing an ugly cluster split), then generating some cluster changes to cause things to fail. The restarted node fails later:

[Warning] WSREP: 8fcfe157 conflicting prims: my prim: view_id(PRIM,8fcfe157,1) other prim: view_id(PRIM,8491bc83,7)
...
[ERROR] WSREP: exception from gcomm, backend must be restarted: 8fcfe157 aborting due to conflicting prims: older overrides (FATAL) at gcomm/src/pc_proto.cpp:handle_state():982
[Note] WSREP: gcomm: terminating thread
[Note] WSREP: gcomm: joining thread
[Note] WSREP: gcomm: closing backend
[Note] WSREP: Forced PC close
[Warning] WSREP: discarding 2 messages from message index
[Note] WSREP: gcomm: closed
[ERROR] WSREP: gcs/src/gcs_core.cpp:core_handle_comp_msg():718: Malformed component message. Ignoring

Some time later, when a shutdown is induced, the whole process hangs, similar to the original report:

[Note] /usr/sbin/mysqld: Normal shutdown
[Note] WSREP: Stop replication
[Note] WSREP: Closing send monitor...
[Note] WSREP: Closed send monitor.
[Warning] WSREP: gcomm: backend already closed

The node failure isn't terribly interesting and I suspect that's related to https://github.com/codership/galera/issues/115.

The shutdown hang is the annoying part. This looks similar to this PXC bug: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1212739

Offhand, I'm not sure if the causes of the hang are the same in MariaDB, but maybe it's in the right direction.

I've reproduced this on a 3-node MariaDB 10.1.16 cluster running on Debian 8.

Comment by Jan Lindström [ 2023-04-11 ]

10.1 is EOL.

Generated at Thu Feb 08 07:38:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.