Details
Description
We have seen some cases (not reproducible yet, but not a single event) where a Galera node hung after a failed SST/IST. We could not stop the node with systemctl stop mariadb but have to kill it with -9.
Here is the error log. Possibly you have an idea where in the code it did not behave well:
...
Jan 10 15:19:44 florian rsyncd[528681]: rsync to rsync_sst/ from UNKNOWN (192.168.106.144)
Jan 10 15:19:44 florian rsyncd[528681]: receiving file list
Jan 10 15:19:44 florian mariadbd[526058]: 2024-01-10 15:19:44 0 [Note] WSREP: 2.0 (ben): State transfer to 0.0 (florian) complete.
Jan 10 15:19:44 florian mariadbd[526058]: 2024-01-10 15:19:44 0 [Note] WSREP: Member 2.0 (ben) synced with group.
Jan 10 15:19:44 florian rsyncd[528681]: sent 48 bytes received 199 bytes total size 53
Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] Galera co-ords from recovery: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112 0 (20240110 15:19:44.989)
Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] rsync IST completed on joiner (20240110 15:19:44.991)
Jan 10 15:19:44 florian mariadbd[526413]: WSREP_SST: [INFO] Joiner cleanup: rsync PID=526574, stunnel PID=0 (20240110 15:19:44.996)
Jan 10 15:19:45 florian rsyncd[526574]: sent 0 bytes received 0 bytes total size 0
Jan 10 15:19:45 florian mariadbd[526413]: WSREP_SST: [INFO] Joiner cleanup done. (20240110 15:19:45.544)
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST received
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Server status change joiner -> initializing
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Number of transaction pools: 1
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Completed initialization of buffer pool
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: End of log at LSN=6378872379
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Opened 3 undo tablespaces
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait
...
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: log sequence number 6378872379; transaction id 3629654
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] Plugin 'FEEDBACK' is disabled.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] Server socket created on IP: '0.0.0.0'.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: wsrep_init_schema_and_SR (nil)
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Server initialized
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Server status change initializing -> initialized
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Recovered position from storage: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: Recovered view from SST:
Jan 10 15:19:45 florian mariadbd[526058]: id: ff307812-afb7-11ee-95a1-6ad42fc54b65:458846
Jan 10 15:19:45 florian mariadbd[526058]: status: primary
Jan 10 15:19:45 florian mariadbd[526058]: protocol_version: 4
Jan 10 15:19:45 florian mariadbd[526058]: capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
Jan 10 15:19:45 florian mariadbd[526058]: final: no
Jan 10 15:19:45 florian mariadbd[526058]: own_index: -1
Jan 10 15:19:45 florian mariadbd[526058]: members(3):
Jan 10 15:19:45 florian mariadbd[526058]: 0: 55720425-afbf-11ee-bc3e-536dee7ff450, burkhard
Jan 10 15:19:45 florian mariadbd[526058]: 1: 5d3db8c1-afbf-11ee-b377-32b590213736, florian
Jan 10 15:19:45 florian mariadbd[526058]: 2: 6160dd12-afc1-11ee-a39c-6f2680b69df6, ben
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 7 [Note] WSREP: Recovered cluster id ff307812-afb7-11ee-95a1-6ad42fc54b65
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST received: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 3 [Note] WSREP: SST succeeded for position ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Installed new state from SST: ff307812-afb7-11ee-95a1-6ad42fc54b65:524112
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Joiner monitor thread ended with total time 10 sec
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] /usr/sbin/mariadbd: ready for connections.
Jan 10 15:19:45 florian mariadbd[526058]: Version: '11.2.2-MariaDB-1:11.2.2+maria~ubu2204-log' socket: '/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Receiving IST: 15118 writesets, seqnos 524113-539230
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: ####### IST applying starts with 524113
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 539230 last received: -1
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
Jan 10 15:19:45 florian mariadbd[526058]: at ./galera/src/replicator_smm.hpp:pop_front():336. Null event.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing send monitor...
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closed send monitor.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: terminating thread
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: joining thread
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: closing backend
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: view(view_id(NON_PRIM,46d1bb52-b8f7,11) memb
joined
{ Jan 10 15:19:45 florian mariadbd[526058]: }left
{ Jan 10 15:19:45 florian mariadbd[526058]: }partitioned
{ Jan 10 15:19:45 florian mariadbd[526058]: 6160dd12-a39c,0 Jan 10 15:19:45 florian mariadbd[526058]: })
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: PC protocol downgrade 1 -> 0
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: view((empty))
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: gcomm: closed
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Flow-control interval: [16, 16]
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Received NON-PRIMARY.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Shifting JOINER -> OPEN (TO: 539279)
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: New SELF-LEAVE.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Flow-control interval: [0, 0]
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 539279)
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 0 [Note] WSREP: RECV thread exiting 0: Success
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: recv_thread() joined.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing replication queue.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Closing slave action queue.
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Draining apply monitors after IST up to -1
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: last left 524112 greater than drain seqno -1
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: IST received: ff307812-afb7-11ee-95a1-6ad42fc54b65:-1
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [ERROR] WSREP: ./gcs/src/gcs.cpp:s_join():965: Sending JOIN failed: -103 (Software caused connection abort).
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: Failed to JOIN the cluster after SST gcs_join(ff307812-afb7-11ee-95a1-6ad42fc54b65:-1) failed: 103 (Software caused connection abort)
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Note] WSREP: Min available from gcache for CC from sst: 522887
Jan 10 15:19:45 florian mariadbd[526058]: 2024-01-10 15:19:45 2 [Warning] WSREP: trx protocol version: 5 does not match certification protocol version: -1
Jan 10 15:19:45 florian systemd[1]: Started MariaDB 11.2.2 database server.
Subject: A start job for unit mariadb.service has finished successfully
Defined-By: systemd
Support: http://www.ubuntu.com/support
A start job for unit mariadb.service has finished successfully.
The job identifier is 242934.
Jan 10 15:19:46 florian mariadbd[526058]: 2024-01-10 15:19:46 0 [Note] InnoDB: Buffer pool(s) load completed at 240110 15:19:46
Jan 10 15:19:46 florian debian-start[529137]: ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
Jan 10 15:19:47 florian mariadbd[526058]: 2024-01-10 15:19:47 71 [Warning] IP address '192.168.50.82' could not be resolved: Name or service not known
Jan 10 15:22:03 florian systemd[1]: Stopping MariaDB 11.2.2 database server...
Subject: A stop job for unit mariadb.service has begun execution
Defined-By: systemd
Support: http://www.ubuntu.com/support
A stop job for unit mariadb.service has begun execution.
The job identifier is 245316.
Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] WSREP: Shutdown replication
*Jan 10 15:22:03 florian mariadbd[526058]: 2024-01-10 15:22:03 0 [Note] WSREP: Server status change initialized -> disconnecting
*Jan 10 15:26:54 florian systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Subject: Unit process exited
Defined-By: systemd
Support: http://www.ubuntu.com/support
An ExecStart= process belonging to unit mariadb.service has exited.
The process' exit code is 'killed' and its exit status is 9.
Jan 10 15:26:54 florian systemd[1]: mariadb.service: Failed with result 'signal'.
oli For some reason node did not join the cluster, can you provide error log from all nodes and configuration. There is also IP address '192.168.50.82' could not be resolved in log so that could be problem.