[MDEV-31617] Galera Cluster could not recover since 2023-07-01 23:55:01 3287386 [Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected) Created: 2023-07-04  Updated: 2023-09-04  Resolved: 2023-09-04

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.9
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Min-Jen Chang Assignee: Jan Lindström
Resolution: Incomplete Votes: 0
Labels: corruption, galera


 Description   

Our Galera Cluster consists of 3 nodes.
Recently, node-0 restarted repeatedly because of a server resource shortage,
and eventually node-1 and node-2 tried to connect to node-0 but failed:

2023-07-01 23:54:52 0 [Note] WSREP: (63d23c5c-b67b, 'tcp://0.0.0.0:4567') connection to peer 45ebb9d4-a748 with addr tcp://172.24.151.92:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1473 rttvar: 2527 rto: 204000 lost: 0 last_data_recv: 3344 cwnd: 6 last_queued_since: 500032641 last_delivered_since: 3342827068 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0

After this message, node-1 and node-2 both logged the following warnings:
2023-07-01 23:55:01 3213483 [Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected)
2023-07-01 23:55:01 3320009 [Warning] WSREP: gcs_caused() returned -1 (Operation not permitted)
2023-07-01 23:55:01 3320012 [Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected)
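
The numeric values in these warnings appear to be negated POSIX errno codes: on Linux, 107 is ENOTCONN and 1 is EPERM, which matches the text in parentheses. A minimal Python sketch for pulling these codes out of an error log follows. It assumes the line layout quoted above; the regular expression and the function name `parse_gcs_caused` are illustrative and not part of Galera or MariaDB.

```python
import errno
import re

# Pattern for the warning lines quoted above; the layout is assumed from this report.
GCS_CAUSED_RE = re.compile(
    r"WSREP: gcs_caused\(\) returned (-?\d+) \((.*)\)\s*$"
)


def parse_gcs_caused(line):
    """Return (code, errno_name, text) for a gcs_caused() warning, else None."""
    match = GCS_CAUSED_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode is platform specific; on Linux 107 maps to ENOTCONN.
    name = errno.errorcode.get(-code, "UNKNOWN")
    return code, name, match.group(2)


line = (
    "2023-07-01 23:55:01 3213483 [Warning] WSREP: gcs_caused() "
    "returned -107 (Transport endpoint is not connected)"
)
print(parse_gcs_caused(line))
```

Lines that are not gcs_caused() warnings return None, so the function can be applied to every line of an error log to count how often each code occurs.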

These warnings kept repeating, and both node-1 and node-2 logged the following status changes:
2023-07-01 23:55:01 6 [Note] WSREP: Server status change synced -> connected
2023-07-01 23:55:01 6 [Note] WSREP: Server status change connected -> connected

Both nodes then switched to a non-primary view:
2023-07-01 23:55:01 6 [Note] WSREP: Non-primary view

After this, our Galera Cluster was inaccessible, because the local_state of each node had changed to Initialization.

After comparing wsrep_last_committed on node-1 and node-2, we selected node-1 to re-bootstrap the cluster (SET WSREP_PROVIDER_OPTIONS = "pc.bootstrap = 1;"). node-1 became Primary, the warning
[Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected)
no longer appeared, and node-2 joined the cluster successfully.
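
The selection step described above (bootstrap from the node with the highest wsrep_last_committed) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pick_bootstrap_node` is hypothetical, and the sequence numbers are made up because the real values are not included in this report. In practice the values would be read on each node, for example with SHOW STATUS LIKE 'wsrep_last_committed'.

```python
def pick_bootstrap_node(last_committed):
    """Return the node name with the highest wsrep_last_committed value."""
    if not last_committed:
        raise ValueError("no nodes to choose from")
    return max(last_committed, key=last_committed.get)


# Hypothetical values for illustration only; the real seqnos are not in this report.
observed = {"node-1": 1050, "node-2": 1048}
print(pick_bootstrap_node(observed))  # node-1
```

If two nodes report the same value, either one holds the most recent committed state, and the sketch simply returns the first one it encounters.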

Is there any known reason or trigger that would cause this message:
[Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected)
to keep repeating?

Did we hit a bug?

Thank you.



 Comments   
Comment by Jan Lindström [ 2023-08-07 ]

mjchangk Can you please try with a more recent version of MariaDB Server and the Galera library? If the problem reproduces, please provide the full error log, the output of SHOW PROCESSLIST, and the node configuration.
