Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
10.6.10
-
None
-
RHEL7 on VMware
Description
Hi,
Our databases are mostly Galera clusters, each with 2 DB nodes and 1 arbitrator.
Last year we patched from MariaDB 10.6.7 (with Galera 4-26-11) to 10.6.10 (with Galera 4-26-12).
Since the patching, when we restart one of the DB nodes, there is a chance of hitting an "operation not permitted" error on both the donor and the joiner. The joiner then cannot rejoin the cluster.
We tried several things but still could not find the root cause:
- looked for an SST log, but none was found in the DB
- switched the SST method from mariabackup to rsync
- enabled wsrep_debug=SERVER
In each incident, the last resort was to restart (bootstrap) the remaining DB node first. After that, we could start the joiner and it rejoined the cluster. However, this means service downtime.
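Before bootstrapping the remaining node next time, its Galera state could be saved first, so that something is left to inspect after the restart. Below is a minimal sketch of that idea, not something we have run in production. It assumes the `mysql` client is on the PATH and can log in without a password prompt (for example via `~/.my.cnf`), and that the node still accepts connections while in the failed state. The helper names `parse_status` and `snapshot_wsrep` are only for illustration.

```python
import subprocess

# wsrep status variables that seem worth saving before the donor is restarted.
WSREP_KEYS = (
    "wsrep_local_state_comment",
    "wsrep_cluster_status",
    "wsrep_cluster_size",
    "wsrep_connected",
    "wsrep_ready",
)


def parse_status(text):
    """Parse tab-separated 'name<TAB>value' lines into a dict."""
    status = {}
    for line in text.splitlines():
        name, sep, value = line.partition("\t")
        if sep:  # skip lines that have no tab separator
            status[name.strip()] = value.strip()
    return status


def snapshot_wsrep(client="mysql"):
    """Query the local node and return the selected wsrep status values.

    Assumption: the client logs in without prompting (e.g. ~/.my.cnf).
    """
    result = subprocess.run(
        [client, "-N", "-B", "-e", "SHOW GLOBAL STATUS LIKE 'wsrep_%'"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6 as shipped for RHEL7
        check=True,
    )
    status = parse_status(result.stdout)
    # Missing variables are reported as None instead of raising.
    return {key: status.get(key) for key in WSREP_KEYS}
```

Calling `snapshot_wsrep()` on the donor and writing the returned dict to a file before the bootstrap would keep a record of the node state for later comparison.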
This problem has already happened 4 to 5 times in somewhere between 20 and 100 DB node restarts.
Is there any way to troubleshoot this at the next occurrence?
We suspect the problem is on the donor side. However, because we needed to bring the cluster back, we restarted the donor and can no longer troubleshoot it. We can only investigate at the next occurrence.
The DB logs from one incident are uploaded:
mariadb-error.log-node1-20230415
mariadb-error.log-node2-20230415
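To compare the donor and joiner logs around the failure, the lines surrounding each "operation not permitted" message can be pulled out with a small script. This is only a sketch: the function names `find_error_context` and `scan_log` and the default of 5 context lines are illustrative choices, and the match is a plain case-insensitive text search, not a parser for the MariaDB log format.

```python
import re

# Case-insensitive match for the error text seen on donor and joiner.
PATTERN = re.compile(r"operation not permitted", re.IGNORECASE)


def find_error_context(lines, before=5, after=5):
    """Return (line_number, text) for every match plus surrounding lines.

    Line numbers are 1-based; overlapping context windows are merged.
    """
    lines = list(lines)
    keep = set()
    for index, line in enumerate(lines):
        if PATTERN.search(line):
            start = max(0, index - before)
            stop = min(len(lines), index + after + 1)
            keep.update(range(start, stop))
    return [(i + 1, lines[i].rstrip("\n")) for i in sorted(keep)]


def scan_log(path, before=5, after=5):
    """Apply find_error_context to a log file on disk."""
    # errors="replace" avoids failing on stray non-UTF-8 bytes in the log.
    with open(path, errors="replace") as handle:
        return find_error_context(handle, before, after)
```

For example, `scan_log("mariadb-error.log-node1-20230415")` returns the matching lines with their context for node1, and the same call on the node2 file allows a side-by-side comparison by timestamp.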
The DB parameter files are uploaded:
mariadb.cnf.node1
mariadb.cnf.node2
Regards,
William Wong