[MDEV-13478] Full SST sync fails because of the error in the cleaning part - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.1.20, 5.5.51-galera, 10.2.7
Fix Version/s: 10.1.30, 10.2.12, 5.5.59-galera, 10.0.34-galera
Component/s: Configuration
Labels:
- galera
- wsrep
Environment:

[root@eap-db01 ~]# rpm -qa | grep -i maria
MariaDB-compat-10.2.7-1.el7.centos.x86_64
MariaDB-shared-10.2.7-1.el7.centos.x86_64
MariaDB-common-10.2.7-1.el7.centos.x86_64
MariaDB-client-10.2.7-1.el7.centos.x86_64
MariaDB-server-10.2.7-1.el7.centos.x86_64


Sprint:
10.1.30

Description

Hi MariaDB Folks,

We had the same problem with MariaDB version 5.5, and we still have it after upgrading to 10.2.7.

Problem description:

When a node needs a full state snapshot transfer (SST), the first sync attempt always fails in our case. It does not fail at the very beginning of the SST phase but only after the data transfer has completed. We wait 45 minutes to transfer 180 GB, and in the end it always fails.

The error message appears in the logs while the data is being streamed:

2017-08-08 23:14:56 140013350135552 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.28.253.22:5444' --datadir '/data/'   --parent '14018'  '' '

WSREP_SST: [INFO] Streaming with xbstream (20170808 23:14:56.343)

WSREP_SST: [INFO] Using socat as streamer (20170808 23:14:56.346)

WSREP_SST: [INFO] Stale sst_in_progress file: /data//sst_in_progress (20170808 23:14:56.354)

2017-08-08 23:14:56 140014532581120 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.28.253.22:5444/xtrabackup_sst//1

2017-08-08 23:14:56 140014532581120 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

2017-08-08 23:14:56 140014532581120 [Note] WSREP: REPL Protocols: 7 (3, 2)

2017-08-08 23:14:56 140014532581120 [Note] WSREP: Assign initial position for certification: 436077364, protocol version: 3

2017-08-08 23:14:56 140014549366528 [Note] WSREP: Service thread queue flushed.

2017-08-08 23:14:56 140014532581120 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (d01e5332-2656-11e4-ae7b-56db42595f62): 1 (Operation not permitted)

         at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.

2017-08-08 23:14:56 140013358528256 [Note] WSREP: Member 0.0 (eap-db01) requested state transfer from 'eap-db02'. Selected 2.0 (eap-db02)(SYNCED) as donor.

2017-08-08 23:14:56 140013358528256 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 436077364)

2017-08-08 23:14:56 140014532581120 [Note] WSREP: Requesting state transfer: success, donor: 2

2017-08-08 23:14:56 140014532581120 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(d01e5332-2656-11e4-ae7b-56db42595f62:436077364)

WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.402)

WSREP_SST: [INFO] Proceeding with SST (20170808 23:14:56.684)

WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.685)

WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20170808 23:14:56.687)

## lots of files removed here, output cut ##

removed ‘/data/ethangroup/bill_payments.ibd’

removed ‘/data/ethangroup/account_tokens.ibd’

removed ‘/data/ethangroup/identity.ibd’

removed ‘/data/ethangroup/sp_trans_xml.ibd’

removed ‘/data/phoneclub/bill_trans_to_usage.frm’

removed directory: ‘/data/phoneclub’

find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory

removed ‘/data/testing_logging/login_logs.ibd’

removed ‘/data/testing_logging/ec_dump2.frm’

removed ‘/data/testing_logging/ec_errorlog.frm’

removed ‘/data/testing_logging/queue_consumer.frm’

removed ‘/data/testing_logging/queue_function.frm’

removed directory: ‘/data/testing_logging’

removed ‘/data/aria_log.00000001’

removed ‘/data/multi-master.info’

removed ‘/data/mysql_upgrade_info’

removed ‘/data/ib_buffer_pool’

WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373)

The key thing in the excerpt above is a second attempt to remove a file that had already been removed earlier:

find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory

Two lines earlier:

removed ‘/data/phoneclub/bill_trans_to_usage.frm’

It's a random file, not always the same one.

As a result, the whole cleanup phase fails with the error:

WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373)

The code responsible for this part, in /bin/wsrep_sst_xtrabackup-v2:

    890         wsrep_log_info "Cleaning the existing datadir and innodb-data/log directories"

    891         find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune  -o -exec rm -rfv {} 1>&2 \+

The second SST attempt succeeds because all files and directories have already been removed. I wasn't able to find the root cause of why find tries to remove the same files again.

We have quite a lot of databases and tables in the MySQL data directory, but I don't think that should cause this kind of issue.

[root@eap-db03 data]# ls -lR /data/| wc -l

To make the first SST attempt work, we need to change line 891 to:

find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune -o -print0 | xargs -0 /bin/rm -rf
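
One plausible reason this workaround succeeds, offered as a sketch rather than a confirmed diagnosis: in a shell pipeline without pipefail, the exit status is that of the last command (here xargs running rm), so an error reported by find does not fail the step, and rm -rf silently ignores paths that no longer exist. The scratch directory below is a hypothetical stand-in for the datadir; only the file name is borrowed from the log above.

```shell
#!/bin/bash
# Scratch directory standing in for the datadir (hypothetical layout).
tmp=$(mktemp -d)
mkdir -p "$tmp/phoneclub"
touch "$tmp/phoneclub/bill_trans_to_usage.frm"

# Hand rm both the directory and a file inside it, NUL-delimited, the way
# find -print0 would list them. rm -rf removes the directory first; -f then
# makes it ignore the already-missing file instead of reporting an error.
printf '%s\0' "$tmp/phoneclub" "$tmp/phoneclub/bill_trans_to_usage.frm" \
  | xargs -0 rm -rf
status=$?

echo "pipeline exit status: $status"   # prints: pipeline exit status: 0
```

The sketch does not show find itself failing; it only shows that the rm side of the pipeline tolerates a path being listed after its parent directory is gone.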

Attachments


db_structure.txt
359 kB
2017-08-21 13:47

Issue Links

relates to

MDEV-13789 mariabackup galera SST fail

Closed

Activity


View 14 older comments

Sergei Golubchik added a comment - 2017-12-18 15:08

No, -print0 has nothing to do with the problem in question, there are no spaces or other strange characters there.


Sachin Setiya (Inactive) added a comment - 2017-12-19 12:34 - edited

I think the position of -prune is slightly wrong. It should be

find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1 -prune -regex $cpat    -o -exec rm -rfv {} 1>&2 \+

(before -regex), so that it is applied in both cases.
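
The proposed ordering can be tried on a scratch directory. This is only a sketch: the directory layout and the keep-pattern assigned to cpat are hypothetical, not the values used by wsrep_sst_xtrabackup-v2.

```shell
#!/bin/bash
# Scratch directory standing in for the datadir (hypothetical layout).
tmp=$(mktemp -d)
mkdir -p "$tmp/phoneclub"
touch "$tmp/phoneclub/bill_trans_to_usage.frm" \
      "$tmp/ib_buffer_pool" \
      "$tmp/grastate.dat"

# Hypothetical keep-pattern: anything matching it must survive the cleanup.
cpat='.*grastate\.dat$'

# -prune comes first, so find never descends below the top-level entries.
# Entries matching $cpat are skipped; everything else is handed to rm -rf.
find "$tmp" -mindepth 1 -prune -regex "$cpat" -o -exec rm -rfv {} 1>&2 \+
status=$?

echo "find exit status: $status"
ls -A "$tmp"
```

Because rm only ever receives top-level entries, find has no reason to revisit a path that rm has already deleted.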


Sachin Setiya (Inactive) added a comment - 2017-12-19 14:44

The reason is that find currently evaluates the expression like this:
find (-regex && -prune) || (-exec)
So if -regex is false, -prune is not applied, and -exec receives every file below the directory as well.
What the fix does is:
find (-prune && -regex) || (-exec)
So -prune is always executed, and its effect is not undone when -regex fails: -exec then gets only the directory name, not all the file names under it.
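
The two groupings can be compared without deleting anything by swapping -exec for -print and counting what each form would hand to rm. This is a sketch under stated assumptions: the scratch layout and the keep-pattern in cpat are hypothetical, and the counts were worked out for GNU find.

```shell
#!/bin/bash
# Scratch directory standing in for the datadir (hypothetical layout).
tmp=$(mktemp -d)
mkdir -p "$tmp/phoneclub" "$tmp/testing_logging"
touch "$tmp/phoneclub/bill_trans_to_usage.frm" \
      "$tmp/testing_logging/login_logs.ibd" \
      "$tmp/testing_logging/ec_dump2.frm" \
      "$tmp/grastate.dat"

# Hypothetical keep-pattern.
cpat='.*grastate\.dat$'

# Original grouping: (-regex && -prune) || -print
# Directories that do not match are not pruned, so their contents are listed too.
orig=$(find "$tmp" -mindepth 1 -regex "$cpat" -prune -o -print | wc -l | tr -d ' ')

# Proposed grouping: (-prune && -regex) || -print
# Every top-level entry is pruned, so only top-level names are listed.
fixed=$(find "$tmp" -mindepth 1 -prune -regex "$cpat" -o -print | wc -l | tr -d ' ')

echo "original: $orig entries, proposed: $fixed entries"
# With GNU find this prints: original: 5 entries, proposed: 2 entries
```

In the original form the directory and the files inside it both reach rm, which matches the log above, where a file is removed and then looked up again.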


Sachin Setiya (Inactive) added a comment - 2017-12-19 17:07

http://lists.askmonty.org/pipermail/commits/2017-December/011744.html


Sachin Setiya (Inactive) added a comment - 2017-12-19 18:56

http://lists.askmonty.org/pipermail/commits/2017-December/011745.html


People

Assignee: Sachin Setiya (Inactive)

Reporter: Kamil

Votes: 2

Watchers: 8

Dates

Created: 2017-08-09 00:56

Updated: 2024-07-08 00:38

Resolved: 2017-12-19 18:56
