[MDEV-26861] Galera Crashing - what(): remote_endpoint: Transport endpoint is not connected Created: 2021-10-20  Updated: 2023-06-12  Resolved: 2023-06-12

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.20, 10.5.11
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Mathew Toms Assignee: Teemu Ollakka
Resolution: Incomplete Votes: 2
Labels: crash, galera
Environment: Ubuntu 20.04.2 LTS, dedicated hosts per node

Attachments: gdb.txt (text file), log-10.4.2, log-10.5.11
Issue Links:
Relates
relates to MDEV-25068 Node crashes with Transport endpoint ... Closed

 Description   

We have been seeing Galera nodes crash within a few minutes of each other, with days between incidents.

We run two clusters of 3 nodes each, one on 10.5.11 and the other on 10.4.20. From the logs, both clusters appear to crash for the same reason:

Oct 16 19:34:41 db1-core mysqld[3629505]: terminate called after throwing an instance of 'boost::wrapexcept<std::system_error>'
Oct 16 19:34:41 db1-core mysqld[3629505]:   what():  remote_endpoint: Transport endpoint is not connected
Oct 16 19:34:41 db1-core mysqld[3629505]: 211016 19:34:41 [ERROR] mysqld got signal 6 ;
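
For context, "Transport endpoint is not connected" is the Linux message for the ENOTCONN error, which the OS reports when a program asks for the peer address of a socket that is not (or is no longer) connected. The sketch below reproduces that condition in Python with `getpeername()`. It is only an illustration of the error class, not Galera's code. The helper name `peer_or_none` is hypothetical, and the assumption that the crashing call behaves like `getpeername()` is inferred from the `remote_endpoint` prefix in the message.

```python
import errno
import os
import socket


def peer_or_none(sock):
    """Return the peer address, or None if the socket is not connected.

    Hypothetical helper: it handles ENOTCONN instead of letting the
    error propagate, which is the opposite of what the log above shows
    (an uncaught exception ending in terminate/abort).
    """
    try:
        return sock.getpeername()
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            return None
        raise  # any other error is unexpected here


if __name__ == "__main__":
    # A freshly created TCP socket has no peer yet.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(peer_or_none(sock))
    # Platform-specific text for ENOTCONN.
    print(os.strerror(errno.ENOTCONN))
    sock.close()
```

If the same situation arises when a peer disconnects between accepting a connection and querying its address, an unhandled error of this kind would be consistent with the abort (signal 6) in the log, but the report does not confirm that sequence.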

When the crash hits one node, there is a high chance that a second node will crash with the same error a few minutes later, leaving the cluster in need of a bootstrap. At other times only one node crashes, then automatically restarts and rejoins the cluster 5-10 minutes later. Incidents are typically days apart.
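
To quantify the timing pattern described above, the crash lines from each node's log can be pulled out and ordered by time. The following is a minimal sketch that assumes syslog-style lines in the same shape as the excerpt quoted in this report. The function name `crash_times` and the `year` parameter are hypothetical (syslog timestamps carry no year, so it has to be supplied).

```python
import re
from datetime import datetime

# Matches syslog-style lines such as:
#   Oct 16 19:34:41 db1-core mysqld[3629505]:   what():  remote_endpoint: ...
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) mysqld\[\d+\]: (?P<msg>.*)$"
)
CRASH_MARKER = "remote_endpoint: Transport endpoint is not connected"


def crash_times(lines, year):
    """Return (host, datetime) pairs for crash lines, sorted by time."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and CRASH_MARKER in match.group("msg"):
            when = datetime.strptime(
                f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            events.append((match.group("host"), when))
    return sorted(events, key=lambda event: event[1])


def gaps_in_seconds(events):
    """Seconds between each consecutive pair of crash events."""
    return [
        (later[1] - earlier[1]).total_seconds()
        for earlier, later in zip(events, events[1:])
    ]
```

Feeding the concatenated logs of all nodes in a cluster to `crash_times` and then to `gaps_in_seconds` would show whether the second crash consistently follows the first within a few minutes.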

I've attached logs from both clusters and a stack trace from the 10.5.11 node.



 Comments   
Comment by veast [ 2022-04-28 ]

Hello, Mathew Toms.
I've had a similar problem and I'd like to know if your problem has been resolved or if you have any conclusions.
Thanks

Comment by Théo Cerutti [ 2023-01-04 ]

Same problem here with a 3-node MariaDB cluster on 10.5.13.

Comment by Jan Lindström [ 2023-05-10 ]

mattwt, I could not find anything conclusive in the provided stack trace or error logs. The problem is that the error logs show only an assertion and nothing about what happened before the crash. Can you please provide the full unedited error log, the SHOW PROCESSLIST output, and the node configuration?

Generated at Thu Feb 08 09:48:31 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.