Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-9914

MariaDB fails to shutdown cleanly after gcomm error which brings down backend

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Won't Fix
    • 10.1.13
    • N/A
    • Galera
    • Ubuntu Xenial [using Wily packages], Ubuntu Trusty

    Description

      (Sorry, this is a bit longish.)

      Basic setup: Three vservers ("vserver03", "vserver04", "vserver05"), using internal IPv4 addresses 172.16.0.[123] with the following settings (excerpt):

      wsrep_node_name=vserver04
      wsrep_node_address=172.16.0.1
      wsrep_cluster_address="gcomm://"
      ##other nodes have 'wsrep_cluster_address="gcomm://172.16.0.1"' or the like in place
      wsrep_provider_options="gmcast.listen_addr=tcp://172.16.0.1:4567; gcache.size=128M; gcache.name=/tmp/galera.cache; gcache.page_size=128M"
      ##other nodes use 'gmcast.listen_addr=tcp://172.16.0.[23]:4567' here
      

      (Note: The above works when using the vservers' public IP addresses, so the basic mariadb/galera configuration appears to be ok.)

      The above obviously implies that the separate machines cannot contact each other directly; in order to remedy this, stunnel is used as shown below (in short, stunnel instances on vserver03/vserver05 would listen on ports 172.16.0.1:[4444,4567,4568] whereas another stunnel instance on vserver04 would listen (client=no) on 172.16.0.[23]:... and connect (client=yes) to 172.16.0.1:.. in order to forward everything sent from vserver03/vserver05--all directions are covered, of course).

      Example: After starting MariaDB on vserver04 using "service mysql start --wsrep-new-cluster", it listens on port 4567 using IP address 172.16.0.1:

      [2016-04-13T16:06:55] root@vserver04:/etc/mysql# netstat -tulpn | grep 4567
      tcp        0      0 172.16.0.1:4567         0.0.0.0:*               LISTEN      32676/mysqld    
      tcp        0      0 172.16.0.3:4567         0.0.0.0:*               LISTEN      20363/stunnel4  
      tcp        0      0 172.16.0.2:4567         0.0.0.0:*               LISTEN      20363/stunnel4  
      

      However, waiting a few seconds and then starting MariaDB on vserver05 using "service mysql start", vserver04's /var/log/syslog shows the following:

      Apr 13 16:06:44 vserver04 systemd[1]: Started MariaDB database server.
      Apr 13 16:07:13 vserver04 stunnel: LOG5[56]: Service [vserver04-galera_p1-ipv6] accepted connection from 2a03:nnnn:nnnn:nnnn::1:52568
      [...]
      Apr 13 16:07:13 vserver04 stunnel: LOG6[56]: SSL accepted: new session negotiated
      Apr 13 16:07:13 vserver04 stunnel: LOG6[56]: Negotiated TLSv1.2 ciphersuite ECDHE-RSA-AES256-GCM-SHA384 (256-bit encryption)
      Apr 13 16:07:13 vserver04 stunnel: LOG6[56]: s_connect: connecting 172.16.0.1:4567
      Apr 13 16:07:13 vserver04 stunnel: LOG5[56]: s_connect: connected 172.16.0.1:4567
      Apr 13 16:07:13 vserver04 stunnel: LOG6[56]: persistence: 172.16.0.1:4567 cached
      Apr 13 16:07:13 vserver04 stunnel: LOG5[56]: Service [vserver04-galera_p1-ipv6] connected remote server from 172.16.0.1:53170
      Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [ERROR] WSREP: exception from gcomm, backend must be restarted: Trying to add self addr tcp://172.16.0.1:4567 to addr list (FATAL)
      Apr 13 16:07:13 vserver04 mysqld[26689]: #011 at gcomm/src/gmcast.cpp:insert_address():762
      Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: gcomm: terminating thread
      Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: gcomm: joining thread
      Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: gcomm: closing backend
      Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: Forced PC close
      Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570907375360 [Note] WSREP: gcomm: closed
      Apr 13 16:07:13 vserver04 mysqld[26689]: 2016-04-13 16:07:13 140570898982656 [ERROR] WSREP: gcs/src/gcs_core.cpp:core_handle_comp_msg():718: Malformed component message. Ignoring
      Apr 13 16:07:17 vserver04 stunnel: LOG6[56]: SSL closed (SSL_read)
      Apr 13 16:07:18 vserver04 stunnel: LOG5[57]: Service [vserver04-galera_p1-ipv6] accepted connection from 2a03:nnnn:nnnn:nnnn::1:52578
      Apr 13 16:07:18 vserver04 stunnel: LOG6[57]: SSL accepted: previous session reused
      Apr 13 16:07:18 vserver04 stunnel: LOG6[57]: s_connect: connecting 172.16.0.1:4567
      Apr 13 16:07:18 vserver04 stunnel: LOG3[57]: s_connect: connect 172.16.0.1:4567: Connection refused (111)
      Apr 13 16:07:18 vserver04 stunnel: LOG5[57]: Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
      Apr 13 16:07:19 vserver04 stunnel: LOG5[58]: Service [vserver04-galera_p1-ipv6] accepted connection from 2a03:nnnn:nnnn:nnnn::1:52582
      [...]
      

      For some reason, the second node is unable to join. This might be related to the use of stunnel (which works elsewhere, btw), but the reported issue here is about the fact that the local instance should not be affected from this as follows:

      At this point in time, MariaDB is still up and running on vserver04 (queries work flawlessly), but port 172.16.0.1:4567 is now closed and won't come back up (see the exemplary "Connection refused" error message on the bottom for all subsequent connection attempts from other nodes).

      It is also not possible to cleanly shut down MariaDB anymore: It takes three minutes for "service mysql stop" to return on the command line, showing the following in /var/log/syslog:

      Apr 13 16:08:32 vserver04 systemd[1]: Stopping MariaDB database server...
      Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Note] /usr/sbin/mysqld: Normal shutdown
      Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Note] WSREP: Stop replication
      Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Note] WSREP: Closing send monitor...
      Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Note] WSREP: Closed send monitor.
      Apr 13 16:08:32 vserver04 mysqld[26689]: 2016-04-13 16:08:32 140571185154816 [Warning] WSREP: gcomm: backend already closed
      [...]
      Apr 13 16:10:02 vserver04 systemd[1]: mariadb.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
      Apr 13 16:11:33 vserver04 systemd[1]: mariadb.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
      Apr 13 16:11:33 vserver04 systemd[1]: Stopped MariaDB database server.
      Apr 13 16:11:33 vserver04 systemd[1]: mariadb.service: Unit entered failed state.
      Apr 13 16:11:33 vserver04 systemd[1]: mariadb.service: Failed with result 'timeout'.
      

      What's worse, this also leaves behind an instance of /usr/sbin/mysqld that still holds locks until you terminate it with "kill -9":

      [2016-04-13T16:15:04] root@vserver04:/etc/mysql# ps axu | grep 26689
      mysql    26689  0.1  2.8 1187736 172348 ?      Ssl  16:06   0:01 /usr/sbin/mysqld
      

      Trying to re-execute "service start mysql --wsrep-new-cluster" without killing the above yields

      Apr 13 16:04:31 vserver04 mysqld[25641]: 2016-04-13 16:04:31 140137967790336 [ERROR] mysqld: Can't lock aria control file '/var/lib/mysql/aria_log_control' for exclusive
       use, error: 11. Will retry for 30 seconds
      [...]
      Apr 13 16:05:02 vserver04 mysqld[25641]: 2016-04-13 16:05:02 140137967790336 [ERROR] Plugin 'Aria' init function returned error.
      Apr 13 16:05:02 vserver04 mysqld[25641]: 2016-04-13 16:05:02 140137967790336 [ERROR] Plugin 'Aria' registration as a STORAGE ENGINE failed.
      [...]
      Apr 13 16:05:02 vserver04 mysqld[25641]: 2016-04-13 16:05:02 140137967790336 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11
      Apr 13 16:05:02 vserver04 mysqld[25641]: 2016-04-13 16:05:02 140137967790336 [Note] InnoDB: Retrying to lock the first data file
      [...]
      Apr 13 16:06:07 vserver04 mysqld[25641]: 2016-04-13 16:06:07 140137967790336 [ERROR] Aria engine is not enabled or did not start. The Aria engine must be enabled to continue as mysqld was configured with --with-aria-tmp-tables
      Apr 13 16:06:07 vserver04 mysqld[25641]: 2016-04-13 16:06:07 140137967790336 [ERROR] Aborting
      Apr 13 16:06:12 vserver04 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
      Apr 13 16:06:12 vserver04 systemd[1]: Failed to start MariaDB database server.
      Apr 13 16:06:12 vserver04 systemd[1]: mariadb.service: Unit entered failed state.
      Apr 13 16:06:12 vserver04 systemd[1]: mariadb.service: Failed with result 'signal'.
      

      (The above is not tied to a certain vserver or the use of Ubuntu Xenial vs. Trusty--any variation of the above yields the very same results.)

      Attachments

        Activity

          People

            jplindst Jan Lindström (Inactive)
            m_ueberall Markus Ueberall
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.