Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.1.20, 5.5.51-galera, 10.2.7
-
[root@eap-db01 ~]# rpm -qa | grep -i maria
MariaDB-compat-10.2.7-1.el7.centos.x86_64
MariaDB-shared-10.2.7-1.el7.centos.x86_64
MariaDB-common-10.2.7-1.el7.centos.x86_64
MariaDB-client-10.2.7-1.el7.centos.x86_64
MariaDB-server-10.2.7-1.el7.centos.x86_64
-
10.1.30
Description
Hi MariaDB Folks,
We had the same problem with MariaDB verion 5.5 and also now after upgrading to 10.2.7
Problem desc:
When node needs a full state snapshot (SST) first sync attempt always fails in our case. It does not fail at the very beginning of SST phase but only after data transfer is completed. We are waiting 45 min to transfer 180GB and eventually it always fails.
Error message is seen in the logs when data are being streamed:
2017-08-08 23:14:56 140013350135552 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.28.253.22:5444' --datadir '/data/' --parent '14018' '' ' |
WSREP_SST: [INFO] Streaming with xbstream (20170808 23:14:56.343) |
WSREP_SST: [INFO] Using socat as streamer (20170808 23:14:56.346) |
WSREP_SST: [INFO] Stale sst_in_progress file: /data//sst_in_progress (20170808 23:14:56.354) |
2017-08-08 23:14:56 140014532581120 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.28.253.22:5444/xtrabackup_sst//1 |
2017-08-08 23:14:56 140014532581120 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. |
2017-08-08 23:14:56 140014532581120 [Note] WSREP: REPL Protocols: 7 (3, 2) |
2017-08-08 23:14:56 140014532581120 [Note] WSREP: Assign initial position for certification: 436077364, protocol version: 3 |
2017-08-08 23:14:56 140014549366528 [Note] WSREP: Service thread queue flushed. |
2017-08-08 23:14:56 140014532581120 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (d01e5332-2656-11e4-ae7b-56db42595f62): 1 (Operation not permitted) |
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable. |
2017-08-08 23:14:56 140013358528256 [Note] WSREP: Member 0.0 (eap-db01) requested state transfer from 'eap-db02'. Selected 2.0 (eap-db02)(SYNCED) as donor. |
2017-08-08 23:14:56 140013358528256 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 436077364) |
2017-08-08 23:14:56 140014532581120 [Note] WSREP: Requesting state transfer: success, donor: 2 |
2017-08-08 23:14:56 140014532581120 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(d01e5332-2656-11e4-ae7b-56db42595f62:436077364) |
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.402) |
WSREP_SST: [INFO] Proceeding with SST (20170808 23:14:56.684) |
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.685) |
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20170808 23:14:56.687) |
## lot's of files removed here, output cut ###
|
removed ‘/data/ethangroup/bill_payments.ibd’
|
removed ‘/data/ethangroup/account_tokens.ibd’
|
removed ‘/data/ethangroup/identity.ibd’
|
removed ‘/data/ethangroup/sp_trans_xml.ibd’
|
removed ‘/data/phoneclub/bill_trans_to_usage.frm’
|
removed directory: ‘/data/phoneclub’
|
find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory
|
removed ‘/data/testing_logging/login_logs.ibd’
|
removed ‘/data/testing_logging/ec_dump2.frm’
|
removed ‘/data/testing_logging/ec_errorlog.frm’
|
removed ‘/data/testing_logging/queue_consumer.frm’
|
removed ‘/data/testing_logging/queue_function.frm’
|
removed directory: ‘/data/testing_logging’
|
removed ‘/data/aria_log.00000001’ |
removed ‘/data/multi-master.info’
|
removed ‘/data/mysql_upgrade_info’
|
removed ‘/data/ib_buffer_pool’
|
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373) |
|
So key thing in above snapshot is a second attempt to remove the file which already had been removed earlier:
find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory
|
two lines before:
removed ‘/data/phoneclub/bill_trans_to_usage.frm’
|
It's a random file, not always same.
As a result whole cleanup phase fails with error:
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373) |
Code responsible for this part:
/bin/wsrep_sst_xtrabackup-v2
889 |
890 wsrep_log_info "Cleaning the existing datadir and innodb-data/log directories" |
891 find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1 -regex $cpat -prune -o -exec rm -rfv {} 1>&2 \+ |
Second SST attempt is successful as all files and directories have already been removed. I wasn't able to find the root cause why find is trying to remove same files again.
We have quite a lot databases and tables in mysql data directory, but I don't think it may cause this kind of issues.
[root@eap-db03 data]# ls -lR /data/| wc -l |
7927 |
To make the first SST workable we need to change line 891 to:
find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1 -regex $cpat -prune -o -print0 | xargs -0 /bin/rm -rf |
Attachments
Issue Links
- relates to
-
MDEV-13789 mariabackup galera SST fail
- Closed