[MDEV-31155] How to troubleshoot "operation not permitted" error in Galera cluster? Created: 2023-04-30  Updated: 2023-05-02

Status: Open
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.6.10
Fix Version/s: 10.6

Type: Bug Priority: Major
Reporter: William Wong Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

RHEL7 on VMware


Attachments: File mariadb-error.log-node1-20230415     File mariadb-error.log-node2-20230415     File mariadb.cnf.node1     File mariadb.cnf.node2    

 Description   

Hi,

Our databases are mostly Galera cluster - 2 db nodes + 1 arbitrator.
There was a patching last year from MariDB 10.6.7 (with Galera 4-26-11) to 10.6.10 (with Galera 4-26-12).
After the patching, when we restart one of DB node, there is a chance of hitting "operation not permitted" error in both donor & joiner. Then, joiner cannot join back cluster.
We tried several ways but still could not figure out the root cause.

  • no SST log found in DB
  • switch from SST method from mariabackup to rsync
  • enable wsrep_debug=SERVER
    The last action in each incident is to restart (bootstrap) the remaining DB node first. Then, we can start the joinor to join back cluster. But this is a service downtime.

This problem happened 4~5 times already out of 20 to 100 DB node restart.

Any way to troubleshoot in next occurrence?

We guess the problem is at donor node side. But since we need to resume the cluster, we restarted donor and cannot troubleshoot from donor at this moment. Only can troubleshoot in next occurrence.

DB log of one case are uploaded:
mariadb-error.log-node1-20230415
mariadb-error.log-node2-20230415

DB parameter file are uploaded:
mariadb.cnf.node1
mariadb.cnf.node2

Regards,
William Wong



 Comments   
Comment by Daniel Black [ 2023-05-01 ]

The main error is in the node1 logs:

2023-04-14 12:32:57 0 [ERROR] WSREP: Failed to read from: wsrep_sst_mariabackup --role 'donor' --address 't2vdbs-itcttass-ttastdb1-2-01-db:18303/xtrabackup_sst//1' --local-port 19307 --socket '/var/run/mariadb/.mariadb.itcttass_ttastdb1.socket' --progress 0 --datadir '/u01/mydata/itcttass_ttastdb1/' --defaults-file '/u01/mycnf/itcttass_ttastdb1/mariadb.cnf' --gtid 'c1062071-70d1-11eb-a95b-db537cbe8f2e:46173018' --gtid-domain-id 0 --binlog '/u01/myarch/itcttass_ttastdb1/mariadb-bin' --bypass --mysqld-args --defaults-file=/u01/mycnf/itcttass_ttastdb1/mariadb.cnf --wsrep_start_position=c1062071-70d1-11eb-a95b-db537cbe8f2e:45650007
2023-04-14 12:32:57 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'donor' --address 't2vdbs-itcttass-ttastdb1-2-01-db:18303/xtrabackup_sst//1' --local-port 19307 --socket '/var/run/mariadb/.mariadb.itcttass_ttastdb1.socket' --progress 0 --datadir '/u01/mydata/itcttass_ttastdb1/' --defaults-file '/u01/mycnf/itcttass_ttastdb1/mariadb.cnf' --gtid 'c1062071-70d1-11eb-a95b-db537cbe8f2e:46173018' --gtid-domain-id 0 --binlog '/u01/myarch/itcttass_ttastdb1/mariadb-bin' --bypass --mysqld-args --defaults-file=/u01/mycnf/itcttass_ttastdb1/mariadb.cnf --wsrep_start_position=c1062071-70d1-11eb-a95b-db537cbe8f2e:45650007: 1 (Operation not permitted)

The underlying case is that fgets failing to read. MariaDB could be more descriptive with the cause of this error with ferror on the first error message in sql/wsrep_sst.cc:sst_donor_thread

The "Operations not permitted" is annoying galera fantasy of making up errno based on other info and pretending its the same
https://github.com/MariaDB/server/blob/10.6/sql/wsrep_utils.cc#L399 which relates to the waitpid rather than the fgets.

It seems likely the sst script started, and mariabackup too, and then terminated itself immediately.

The joiner message of:

12:32:57 0 [Warning] WSREP: 2.0 (t1vdbs-itcttass-ttastdb1-1-01): State transfer to 0.0 (t2vdbs-itcttass-ttastdb1-2-01) failed: -1 (Operation not permitted)

The joiner is just unpacking the donors message in group_unserialize_code_msg.

Are there the follow log files in the datadir?

  • mariabackup.prepare.log
  • mariabackup.move.log
  • mariabackup.backup.log

Can you include these?

Comment by William Wong [ 2023-05-01 ]

Thanks @daniel

Agree the root cause should be at donor side.

We could not find mariabackup log file in datadir of donor node at issue time in past several incidents.

Comment by Daniel Black [ 2023-05-01 ]

Alternately look at attempting a mariabackup outside of SST following similar options and see if it succeeds.

Comment by William Wong [ 2023-05-02 ]

Below is found in node 1 DB log, run wsrep_sst_mariabackup in next occurrence? Cannot find mariabackup alone command in DB log.

2023-04-14 12:32:57 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'donor' --address 't2vdbs-itcttass-ttastdb1-2-01-db:18303/xtrabackup_sst//1' --local-port 19307 --socket '/var/run/mariadb/.mariadb.itcttass_ttastdb1.socket' --progress 0 --datadir '/u01/mydata/itcttass_ttastdb1/' --defaults-file '/u01/mycnf/itcttass_ttastdb1/mariadb.cnf' --gtid 'c1062071-70d1-11eb-a95b-db537cbe8f2e:46173018' --gtid-domain-id 0 --binlog '/u01/myarch/itcttass_ttastdb1/mariadb-bin' --bypass --mysqld-args --defaults-file=/u01/mycnf/itcttass_ttastdb1/mariadb.cnf --wsrep_start_position=c1062071-70d1-11eb-a95b-db537cbe8f2e:45650007'

Generated at Thu Feb 08 10:21:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.