Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33217

MariaDB Galera cluster hangs in shutdown

Details

    Description

      We have seen some cases (not reproducible yet, but not a single event) where a Galera node hung after a failed SST/IST. We could not stop the node with systemctl stop mariadb but have to kill it with -9.

      Here is the error log. Possibly you have an idea where in the code it did not behave well:

      ...
      Jan 10 15:19:44 florian rsyncd[528681]: rsync to rsync_sst/ from UNKNOWN (192.168.106.144)
      Jan 10 15:19:44 florian rsyncd[528681]: receiving file list
      Jan 10 15:19:44 florian mariadbd[526058]: 2024-01-10 15:19:44 0 [Note] WSREP: 2.0 (ben): State transfer to 0.0 (florian) complete.
      Jan 10 15:19:44 florian mariadbd[526058]: 2024-01-10 15:19:44 0 [Note] WSREP: Member 2.0 (ben) synced with group.
      Jan 10 15:19:44 florian rsyncd[528681]: sent 48 bytes received 199 bytes total size 53
      Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] Galera co-ords from recovery: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112 0 (20240110 15:19:44.989)
      Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] rsync IST completed on joiner (20240110 15:19:44.991)
      Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] Joiner cleanup: rsync PID=526574, stunnel PID=0 (20240110 15:19:44.996)
      Jan 10 15:19:45 florian rsyncd[526574]: sent 0 bytes received 0 bytes total size 0
      Jan 10 15:19:45 florian mariadbd[526413]: WSREP_SST: [INFO] Joiner cleanup done. (20240110 15:19:45.544)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST received
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Server status change joiner -> initializing
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Number of transaction pools: 1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Completed initialization of buffer pool
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: End of log at LSN=6378872379
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Opened 3 undo tablespaces
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait
      ...
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: log sequence number 6378872379; transaction id 3629654
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] Plugin 'FEEDBACK' is disabled.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] Server socket created on IP: '0.0.0.0'.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: wsrep_init_schema_and_SR (nil)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Server initialized
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Server status change initializing -> initialized
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Recovered position from storage: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Recovered view from SST:
      Jan 10 15:19:45 florian mariadbd[526058]: id: ff307812-afb7-11ee-95a1-6ad42fc54b65:458846
      Jan 10 15:19:45 florian mariadbd[526058]: status: primary
      Jan 10 15:19:45 florian mariadbd[526058]: protocol_version: 4
      Jan 10 15:19:45 florian mariadbd[526058]: capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
      Jan 10 15:19:45 florian mariadbd[526058]: final: no
      Jan 10 15:19:45 florian mariadbd[526058]: own_index: -1
      Jan 10 15:19:45 florian mariadbd[526058]: members(3):
      Jan 10 15:19:45 florian mariadbd[526058]: 0: 55720425-afbf-11ee-bc3e-536dee7ff450, burkhard
      Jan 10 15:19:45 florian mariadbd[526058]: 1: 5d3db8c1-afbf-11ee-b377-32b590213736, florian
      Jan 10 15:19:45 florian mariadbd[526058]: 2: 6160dd12-afc1-11ee-a39c-6f2680b69df6, ben
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 7 [Note] WSREP: Recovered cluster id ff307812-afb7-11ee-95a1-6ad42fc54b65
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST received: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST succeeded for position ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Installed new state from SST: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Joiner monitor thread ended with total time 10 sec
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] /usr/sbin/mariadbd: ready for connections.
      Jan 10 15:19:45 florian mariadbd[526058]: Version: '11.2.2-MariaDB-1:11.2.2+maria~ubu2204-log' socket: '/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Receiving IST: 15118 writesets, seqnos 524113-539230
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: ####### IST applying starts with 524113
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 539230 last received: -1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
      Jan 10 15:19:45 florian mariadbd[526058]: at ./galera/src/replicator_smm.hpp:pop_front():336. Null event.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing send monitor...
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closed send monitor.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: terminating thread
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: joining thread
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: closing backend
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: view(view_id(NON_PRIM,46d1bb52-b8f7,11) memb

      { Jan 10 15:19:45 florian mariadbd[526058]: 46d1bb52-b8f7,0 Jan 10 15:19:45 florian mariadbd[526058]: }

      joined

      { Jan 10 15:19:45 florian mariadbd[526058]: }

      left

      { Jan 10 15:19:45 florian mariadbd[526058]: }

      partitioned

      { Jan 10 15:19:45 florian mariadbd[526058]: 6160dd12-a39c,0 Jan 10 15:19:45 florian mariadbd[526058]: }

      )
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: PC protocol downgrade 1 -> 0
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: view((empty))
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: closed
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Flow-control interval: [16, 16]
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Received NON-PRIMARY.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Shifting JOINER -> OPEN (TO: 539279)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: New SELF-LEAVE.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Flow-control interval: [0, 0]
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 539279)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: RECV thread exiting 0: Success
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: recv_thread() joined.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing replication queue.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing slave action queue.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Draining apply monitors after IST up to -1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: last left 524112 greater than drain seqno -1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: IST received: ff307812-afb7-11ee-95a1-6ad42fc54b65:-1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [ERROR] WSREP: ./gcs/src/gcs.cpp:s_join():965: Sending JOIN failed: -103 (Software caused connection abort).
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: Failed to JOIN the cluster after SST gcs_join(ff307812-afb7-11ee-95a1-6ad42fc54b65:-1) failed: 103 (Software caused connection abort)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Min available from gcache for CC from sst: 522887
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: trx protocol version: 5 does not match certification protocol version: -1
      Jan 10 15:19:45 florian systemd[1]: Started MariaDB 11.2.2 database server.

      Subject: A start job for unit mariadb.service has finished successfully
      Defined-By: systemd
      Support: http://www.ubuntu.com/support

      A start job for unit mariadb.service has finished successfully.

      The job identifier is 242934.

      Jan 10 15:19:46 florian mariadbd[526058]: 2024-01-10 15:19:46 0 [Note] InnoDB: Buffer pool(s) load completed at 240110 15:19:46
      Jan 10 15:19:46 florian debian-start[529137]: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
      Jan 10 15:19:47 florian mariadbd[526058]: 2024-01-10 15:19:47 71 [Warning] IP address '192.168.50.82' could not be resolved: Name or service not known
      Jan 10 15:22:03 florian systemd[1]: Stopping MariaDB 11.2.2 database server...

      Subject: A stop job for unit mariadb.service has begun execution
      Defined-By: systemd
      Support: http://www.ubuntu.com/support

      A stop job for unit mariadb.service has begun execution.

      The job identifier is 245316.

      Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
      Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] WSREP: Shutdown replication
      *Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] WSREP: Server status change initialized -> disconnecting
      *Jan 10 15:26:54 florian systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL

      Subject: Unit process exited
      Defined-By: systemd
      Support: http://www.ubuntu.com/support

      An ExecStart= process belonging to unit mariadb.service has exited.

      The process' exit code is 'killed' and its exit status is 9.
      Jan 10 15:26:54 florian systemd[1]: mariadb.service: Failed with result 'signal'.

      Attachments

        Activity

          oli For some reason node did not join the cluster, can you provide error log from all nodes and configuration. There is also IP address '192.168.50.82' could not be resolved in log so that could be problem.

          janlindstrom Jan Lindström added a comment - oli For some reason node did not join the cluster, can you provide error log from all nodes and configuration. There is also IP address '192.168.50.82' could not be resolved in log so that could be problem.

          Joking? This is a training system from one month ago. All is gone in the meanwhile... We will have a similar training in 2 weeks. Let's see if we run into the same situation...
          And we wanted to STOP the node (locally) so network issues or resolving problems should not matter and also not remote error logs?
          So possibly gdb command which will help to understand the situation next time would be more useful?

          oli Oli Sennhauser added a comment - Joking? This is a training system from one month ago. All is gone in the meanwhile... We will have a similar training in 2 weeks. Let's see if we run into the same situation... And we wanted to STOP the node (locally) so network issues or resolving problems should not matter and also not remote error logs? So possibly gdb command which will help to understand the situation next time would be more useful?

          oli, if I were diagnosing this, I would want to see the output of thread apply all backtrace of all mariadbd processes at the time of the hang, as well as their log output.

          marko Marko Mäkelä added a comment - oli , if I were diagnosing this, I would want to see the output of thread apply all backtrace of all mariadbd processes at the time of the hang, as well as their log output.

          Hi Marko, I will try to consider this next time. Thanks for the hint...

          oli Oli Sennhauser added a comment - Hi Marko, I will try to consider this next time. Thanks for the hint...

          People

            janlindstrom Jan Lindström
            oli Oli Sennhauser
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.