Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33217

MariaDB Galera cluster hangs in shutdown

    XMLWordPrintable

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 11.2.2
    • None
    • Galera
    • Ubuntu 22.04, n.a.

    Description

      We have seen some cases (not reproducible yet, but not a single event) where a Galera node hung after a failed SST/IST. We could not stop the node with systemctl stop mariadb but have to kill it with -9.

      Here is the error log. Possibly you have an idea where in the code it did not behave well:

      ...
      Jan 10 15:19:44 florian rsyncd[528681]: rsync to rsync_sst/ from UNKNOWN (192.168.106.144)
      Jan 10 15:19:44 florian rsyncd[528681]: receiving file list
      Jan 10 15:19:44 florian mariadbd[526058]: 2024-01-10 15:19:44 0 [Note] WSREP: 2.0 (ben): State transfer to 0.0 (florian) complete.
      Jan 10 15:19:44 florian mariadbd[526058]: 2024-01-10 15:19:44 0 [Note] WSREP: Member 2.0 (ben) synced with group.
      Jan 10 15:19:44 florian rsyncd[528681]: sent 48 bytes received 199 bytes total size 53
      Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] Galera co-ords from recovery: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112 0 (20240110 15:19:44.989)
      Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] rsync IST completed on joiner (20240110 15:19:44.991)
      Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] Joiner cleanup: rsync PID=526574, stunnel PID=0 (20240110 15:19:44.996)
      Jan 10 15:19:45 florian rsyncd[526574]: sent 0 bytes received 0 bytes total size 0
      Jan 10 15:19:45 florian mariadbd[526413]: WSREP_SST: [INFO] Joiner cleanup done. (20240110 15:19:45.544)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST received
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Server status change joiner -> initializing
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Number of transaction pools: 1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Completed initialization of buffer pool
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: End of log at LSN=6378872379
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Opened 3 undo tablespaces
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait
      ...
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: log sequence number 6378872379; transaction id 3629654
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] Plugin 'FEEDBACK' is disabled.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] Server socket created on IP: '0.0.0.0'.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: wsrep_init_schema_and_SR (nil)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Server initialized
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Server status change initializing -> initialized
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Recovered position from storage: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Recovered view from SST:
      Jan 10 15:19:45 florian mariadbd[526058]: id: ff307812-afb7-11ee-95a1-6ad42fc54b65:458846
      Jan 10 15:19:45 florian mariadbd[526058]: status: primary
      Jan 10 15:19:45 florian mariadbd[526058]: protocol_version: 4
      Jan 10 15:19:45 florian mariadbd[526058]: capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
      Jan 10 15:19:45 florian mariadbd[526058]: final: no
      Jan 10 15:19:45 florian mariadbd[526058]: own_index: -1
      Jan 10 15:19:45 florian mariadbd[526058]: members(3):
      Jan 10 15:19:45 florian mariadbd[526058]: 0: 55720425-afbf-11ee-bc3e-536dee7ff450, burkhard
      Jan 10 15:19:45 florian mariadbd[526058]: 1: 5d3db8c1-afbf-11ee-b377-32b590213736, florian
      Jan 10 15:19:45 florian mariadbd[526058]: 2: 6160dd12-afc1-11ee-a39c-6f2680b69df6, ben
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 7 [Note] WSREP: Recovered cluster id ff307812-afb7-11ee-95a1-6ad42fc54b65
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST received: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST succeeded for position ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Installed new state from SST: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Joiner monitor thread ended with total time 10 sec
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] /usr/sbin/mariadbd: ready for connections.
      Jan 10 15:19:45 florian mariadbd[526058]: Version: '11.2.2-MariaDB-1:11.2.2+maria~ubu2204-log' socket: '/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Receiving IST: 15118 writesets, seqnos 524113-539230
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: ####### IST applying starts with 524113
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 539230 last received: -1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
      Jan 10 15:19:45 florian mariadbd[526058]: at ./galera/src/replicator_smm.hpp:pop_front():336. Null event.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing send monitor...
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closed send monitor.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: terminating thread
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: joining thread
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: closing backend
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: view(view_id(NON_PRIM,46d1bb52-b8f7,11) memb

      { Jan 10 15:19:45 florian mariadbd[526058]: 46d1bb52-b8f7,0 Jan 10 15:19:45 florian mariadbd[526058]: }

      joined

      { Jan 10 15:19:45 florian mariadbd[526058]: }

      left

      { Jan 10 15:19:45 florian mariadbd[526058]: }

      partitioned

      { Jan 10 15:19:45 florian mariadbd[526058]: 6160dd12-a39c,0 Jan 10 15:19:45 florian mariadbd[526058]: }

      )
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: PC protocol downgrade 1 -> 0
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: view((empty))
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: closed
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Flow-control interval: [16, 16]
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Received NON-PRIMARY.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Shifting JOINER -> OPEN (TO: 539279)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: New SELF-LEAVE.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Flow-control interval: [0, 0]
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 539279)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: RECV thread exiting 0: Success
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: recv_thread() joined.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing replication queue.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing slave action queue.
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Draining apply monitors after IST up to -1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: last left 524112 greater than drain seqno -1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: IST received: ff307812-afb7-11ee-95a1-6ad42fc54b65:-1
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [ERROR] WSREP: ./gcs/src/gcs.cpp:s_join():965: Sending JOIN failed: -103 (Software caused connection abort).
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: Failed to JOIN the cluster after SST gcs_join(ff307812-afb7-11ee-95a1-6ad42fc54b65:-1) failed: 103 (Software caused connection abort)
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Min available from gcache for CC from sst: 522887
      Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: trx protocol version: 5 does not match certification protocol version: -1
      Jan 10 15:19:45 florian systemd[1]: Started MariaDB 11.2.2 database server.

      Subject: A start job for unit mariadb.service has finished successfully
      Defined-By: systemd
      Support: http://www.ubuntu.com/support

      A start job for unit mariadb.service has finished successfully.

      The job identifier is 242934.

      Jan 10 15:19:46 florian mariadbd[526058]: 2024-01-10 15:19:46 0 [Note] InnoDB: Buffer pool(s) load completed at 240110 15:19:46
      Jan 10 15:19:46 florian debian-start[529137]: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
      Jan 10 15:19:47 florian mariadbd[526058]: 2024-01-10 15:19:47 71 [Warning] IP address '192.168.50.82' could not be resolved: Name or service not known
      Jan 10 15:22:03 florian systemd[1]: Stopping MariaDB 11.2.2 database server...

      Subject: A stop job for unit mariadb.service has begun execution
      Defined-By: systemd
      Support: http://www.ubuntu.com/support

      A stop job for unit mariadb.service has begun execution.

      The job identifier is 245316.

      Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
      Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] WSREP: Shutdown replication
      *Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] WSREP: Server status change initialized -> disconnecting
      *Jan 10 15:26:54 florian systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL

      Subject: Unit process exited
      Defined-By: systemd
      Support: http://www.ubuntu.com/support

      An ExecStart= process belonging to unit mariadb.service has exited.

      The process' exit code is 'killed' and its exit status is 9.
      Jan 10 15:26:54 florian systemd[1]: mariadb.service: Failed with result 'signal'.

      Attachments

        Activity

          People

            janlindstrom Jan Lindström
            oli Oli Sennhauser
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.