Details
-
Bug
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
10.6.10
-
None
-
RHEL7 on VMware
Description
Hi,
Our databases are mostly Galera cluster - 2 db nodes + 1 arbitrator.
There was a patching last year from MariDB 10.6.7 (with Galera 4-26-11) to 10.6.10 (with Galera 4-26-12).
After the patching, when we restart one of DB node, there is a chance of hitting "operation not permitted" error in both donor & joiner. Then, joiner cannot join back cluster.
We tried several ways but still could not figure out the root cause.
- no SST log found in DB
- switch from SST method from mariabackup to rsync
- enable wsrep_debug=SERVER
The last action in each incident is to restart (bootstrap) the remaining DB node first. Then, we can start the joinor to join back cluster. But this is a service downtime.
This problem happened 4~5 times already out of 20 to 100 DB node restart.
Any way to troubleshoot in next occurrence?
We guess the problem is at donor node side. But since we need to resume the cluster, we restarted donor and cannot troubleshoot from donor at this moment. Only can troubleshoot in next occurrence.
DB log of one case are uploaded:
mariadb-error.log-node1-20230415
mariadb-error.log-node2-20230415
DB parameter file are uploaded:
mariadb.cnf.node1
mariadb.cnf.node2
Regards,
William Wong
The main error is in the node1 logs:
2023-04-14 12:32:57 0 [ERROR] WSREP: Failed to read from: wsrep_sst_mariabackup --role 'donor' --address 't2vdbs-itcttass-ttastdb1-2-01-db:18303/xtrabackup_sst//1' --local-port 19307 --socket '/var/run/mariadb/.mariadb.itcttass_ttastdb1.socket' --progress 0 --datadir '/u01/mydata/itcttass_ttastdb1/' --defaults-file '/u01/mycnf/itcttass_ttastdb1/mariadb.cnf' --gtid 'c1062071-70d1-11eb-a95b-db537cbe8f2e:46173018' --gtid-domain-id 0 --binlog '/u01/myarch/itcttass_ttastdb1/mariadb-bin' --bypass --mysqld-args --defaults-file=/u01/mycnf/itcttass_ttastdb1/mariadb.cnf --wsrep_start_position=c1062071-70d1-11eb-a95b-db537cbe8f2e:45650007
2023-04-14 12:32:57 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'donor' --address 't2vdbs-itcttass-ttastdb1-2-01-db:18303/xtrabackup_sst//1' --local-port 19307 --socket '/var/run/mariadb/.mariadb.itcttass_ttastdb1.socket' --progress 0 --datadir '/u01/mydata/itcttass_ttastdb1/' --defaults-file '/u01/mycnf/itcttass_ttastdb1/mariadb.cnf' --gtid 'c1062071-70d1-11eb-a95b-db537cbe8f2e:46173018' --gtid-domain-id 0 --binlog '/u01/myarch/itcttass_ttastdb1/mariadb-bin' --bypass --mysqld-args --defaults-file=/u01/mycnf/itcttass_ttastdb1/mariadb.cnf --wsrep_start_position=c1062071-70d1-11eb-a95b-db537cbe8f2e:45650007: 1 (Operation not permitted)
The underlying case is that fgets failing to read. MariaDB could be more descriptive with the cause of this error with ferror on the first error message in sql/wsrep_sst.cc:sst_donor_thread
The "Operations not permitted" is annoying galera fantasy of making up errno based on other info and pretending its the same
https://github.com/MariaDB/server/blob/10.6/sql/wsrep_utils.cc#L399 which relates to the waitpid rather than the fgets.
It seems likely the sst script started, and mariabackup too, and then terminated itself immediately.
The joiner message of:
12:32:57 0 [Warning] WSREP: 2.0 (t1vdbs-itcttass-ttastdb1-1-01): State transfer to 0.0 (t2vdbs-itcttass-ttastdb1-2-01) failed: -1 (Operation not permitted)
The joiner is just unpacking the donors message in group_unserialize_code_msg.
Are there the follow log files in the datadir?
Can you include these?