[MDEV-13478] Full SST sync fails because of the error in the cleaning part Created: 2017-08-09  Updated: 2020-08-25  Resolved: 2017-12-19

Status: Closed
Project: MariaDB Server
Component/s: Configuration
Affects Version/s: 10.1.20, 5.5.51-galera, 10.2.7
Fix Version/s: 10.1.30, 10.2.12, 5.5.59-galera, 10.0.34-galera

Type: Bug Priority: Major
Reporter: Kamil Assignee: Sachin Setiya (Inactive)
Resolution: Fixed Votes: 2
Labels: galera, wsrep
Environment:

[root@eap-db01 ~]# rpm -qa | grep -i maria
MariaDB-compat-10.2.7-1.el7.centos.x86_64
MariaDB-shared-10.2.7-1.el7.centos.x86_64
MariaDB-common-10.2.7-1.el7.centos.x86_64
MariaDB-client-10.2.7-1.el7.centos.x86_64
MariaDB-server-10.2.7-1.el7.centos.x86_64


Attachments: Text File db_structure.txt    
Issue Links:
Relates
relates to MDEV-13789 mariabackup galera SST fail Closed
Sprint: 10.1.30

 Description   

Hi MariaDB Folks,

We had the same problem with MariaDB verion 5.5 and also now after upgrading to 10.2.7

Problem desc:

When node needs a full state snapshot (SST) first sync attempt always fails in our case. It does not fail at the very beginning of SST phase but only after data transfer is completed. We are waiting 45 min to transfer 180GB and eventually it always fails.

Error message is seen in the logs when data are being streamed:

2017-08-08 23:14:56 140013350135552 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.28.253.22:5444' --datadir '/data/'   --parent '14018'  '' '
WSREP_SST: [INFO] Streaming with xbstream (20170808 23:14:56.343)
WSREP_SST: [INFO] Using socat as streamer (20170808 23:14:56.346)
WSREP_SST: [INFO] Stale sst_in_progress file: /data//sst_in_progress (20170808 23:14:56.354)
2017-08-08 23:14:56 140014532581120 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.28.253.22:5444/xtrabackup_sst//1
2017-08-08 23:14:56 140014532581120 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-08-08 23:14:56 140014532581120 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-08-08 23:14:56 140014532581120 [Note] WSREP: Assign initial position for certification: 436077364, protocol version: 3
2017-08-08 23:14:56 140014549366528 [Note] WSREP: Service thread queue flushed.
2017-08-08 23:14:56 140014532581120 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (d01e5332-2656-11e4-ae7b-56db42595f62): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2017-08-08 23:14:56 140013358528256 [Note] WSREP: Member 0.0 (eap-db01) requested state transfer from 'eap-db02'. Selected 2.0 (eap-db02)(SYNCED) as donor.
2017-08-08 23:14:56 140013358528256 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 436077364)
2017-08-08 23:14:56 140014532581120 [Note] WSREP: Requesting state transfer: success, donor: 2
2017-08-08 23:14:56 140014532581120 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(d01e5332-2656-11e4-ae7b-56db42595f62:436077364)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.402)
WSREP_SST: [INFO] Proceeding with SST (20170808 23:14:56.684)
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.685)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20170808 23:14:56.687)
## lot's of files removed here, output cut ###
removed ‘/data/ethangroup/bill_payments.ibd’
removed ‘/data/ethangroup/account_tokens.ibd’
removed ‘/data/ethangroup/identity.ibd’
removed ‘/data/ethangroup/sp_trans_xml.ibd’
removed ‘/data/phoneclub/bill_trans_to_usage.frm’
removed directory: ‘/data/phoneclub’
find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory
removed ‘/data/testing_logging/login_logs.ibd’
removed ‘/data/testing_logging/ec_dump2.frm’
removed ‘/data/testing_logging/ec_errorlog.frm’
removed ‘/data/testing_logging/queue_consumer.frm’
removed ‘/data/testing_logging/queue_function.frm’
removed directory: ‘/data/testing_logging’
removed ‘/data/aria_log.00000001
removed ‘/data/multi-master.info’
removed ‘/data/mysql_upgrade_info’
removed ‘/data/ib_buffer_pool’
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373)

So key thing in above snapshot is a second attempt to remove the file which already had been removed earlier:

find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory

two lines before:

removed ‘/data/phoneclub/bill_trans_to_usage.frm’

It's a random file, not always same.

As a result whole cleanup phase fails with error:

WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373)

Code responsible for this part:
/bin/wsrep_sst_xtrabackup-v2

    889
    890         wsrep_log_info "Cleaning the existing datadir and innodb-data/log directories"
    891         find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune  -o -exec rm -rfv {} 1>&2 \+

Second SST attempt is successful as all files and directories have already been removed. I wasn't able to find the root cause why find is trying to remove same files again.

We have quite a lot databases and tables in mysql data directory, but I don't think it may cause this kind of issues.

[root@eap-db03 data]# ls -lR /data/| wc -l
7927

To make the first SST workable we need to change line 891 to:

find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune -o -print0 | xargs -0 /bin/rm -rf



 Comments   
Comment by Andrii Nikitin (Inactive) [ 2017-08-11 ]

In my understanding with default configuration value of $cpat that command will not attempt to remove .frm files:

cpat=$(parse_cnf sst cpat '.*galera\.cache$\|.*sst_in_progress$\|.*\.sst$\|.*gvwstate\.dat$\|.*grastate\.dat$\|.*\.err$\|.*\.log$\|.*RPM_UPGRADE_MARKER$\|.*RPM_UPGRADE_HISTORY$')

Do you have custom configuration for cpat in your configuration files? (If yes - please provide it for reference).
Also, could you confirm if you have any of parameter innodb-data-home-dir , innodb-log-group-home-dir and innodb-undo-directory in cnf files ? (If yes - please provide them as well)

Comment by Kamil [ 2017-08-14 ]

Hi Andrii,

We don't have a custom configuration for that.

find execution I mentioned above does not remove files matched with cpat variable but instead it removs everything which is not matched with regex:

 
find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune  -o -exec rm -rfv {} 1>&2 \+

"-prune -o"

We don't have any of parameters like innodb-data-home-dir , innodb-log-group-home-dir and innodb-undo-directory defined:

[root@eap-db01 my.cnf.d]# grep dir *
galera.cnf:datadir=/data/

Thanks,
Kamil

Comment by Andrii Nikitin (Inactive) [ 2017-08-16 ]

Thank you for explanation.
Indeed I cannot understand how "No such file or directory" message may be shown for file which did exist before command execution.
Could you just in case provide outputs from that system just to make sure I investigate exact version:

rpm -qa | grep findutils
which find
type find
find --version

Comment by Kamil [ 2017-08-19 ]

Hi Andrii,

I'm sorry for delayed response. For some reason find is trying to remove the same file twice, it's not clear to me why it happens.

Below output for requested checks:

[root@eap-db02 ~]# rpm -qa | grep findutils
findutils-4.5.11-3.el7.x86_64
 
[root@eap-db02 ~]# which find
/bin/find
 
[root@eap-db02 ~]# type find
find is /bin/find
 
[root@eap-db02 ~]# find --version
find (GNU findutils) 4.5.11
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
 
Written by Eric B. Decker, James Youngman, and Kevin Dalley.
Features enabled: D_TYPE O_NOFOLLOW(enabled) LEAF_OPTIMISATION FTS(FTS_CWDFD) CBO(level=2)
[root@eap-db02 ~]#

Br ,Kamil

Comment by Andrii Nikitin (Inactive) [ 2017-08-21 ]

Just for reference:
http://savannah.gnu.org/bugs/index.php?42985 find reports "No such file or directory" when executing rm (not verified)
http://savannah.gnu.org/bugs/index.php?48683 "no such file or directory" error when directory is removed during execution

Comment by Andrii Nikitin (Inactive) [ 2017-08-21 ]

It doesn't look easy to reproduce the problem and I really don't want to suggest a patch without knowing exact reasons when problem occurs (because for every patch some regression is possible and there may be some environment which stops to work properly). Thus I have to ask more questions which may help in finding the best resolution here:

1. Are you sure that your mentioned fix with xargs will completely solve the problem?
2. Was it only twice when you have seen the issue or you can reliably reproduce it with some steps?
3. Do you have SELinux enabled in problem environments?
4. Do you know if `phoneclub` schema was something which was actively used by Business Applications or maybe it is some old leftover folder, e.g. from some old installation? (E.g. with wrong file permissions?) I ask this question because from provided log it looks that `phoneclub` folder did contain only single .frm, which is not really typical layout.

Comment by Kamil [ 2017-08-21 ]

Hi Andrii,

1. Since the time we replaced exec with xargs we haven't seen this problem anymore. Before the change when any node left the cluster and SST was required to fully sync the node it never worked for us. This leads us to problems during cleanup phase.
2. We had this kind of issues many times, basically every time when SST was necessary
3. SELinux is disabled on every node
4. Output was very long and I truncated it intentionally. I would say it's not related to phoneclub at all as we saw different files which were attempted to be removed twice in the past so it's purely non deterministic. I'm attaching full output for your reference. I've modified customer names to something else.

db_structure.txt

Comment by Kamil [ 2017-08-21 ]

Now I've noticed that find actually tried to remove more than one file twice:

find: ‘/data/testing_api/API_mur_data.ibd’: No such file or directory
find: ‘/data/testing_api/API_mur_costs.ibd’: No such file or directory
find: ‘/data/testing_api/API_cdr_total.ibd’: No such file or directory
find: ‘/data/testing_api/API_mur_total.ibd’: No such file or directory
find: ‘/data/testing_api/API_cdr_sessions.frm’: No such file or directory
find: ‘/data/testing_api/API_cdr_data.frm’: No such file or directory
find: ‘/data/testing_api/db.opt’: No such file or directory
find: ‘/data/testing_api/API_rbt_total.frm’: No such file or directory
find: ‘/data/testing_api/API_rbt_sessions.frm’: No such file or directory
find: ‘/data/testing_api/API_rbt_data.frm’: No such file or directory
find: ‘/data/testing_api/API_rbt_costs_invoiced.frm’: No such file or directory
find: ‘/data/testing_api/API_rbt_costs.frm’: No such file or directory
find: ‘/data/testing_api/API_mur_sessions.frm’: No such file or directory
find: ‘/data/testing_api/API_mur_data.frm’: No such file or directory
find: ‘/data/testing_api/API_mur_costs.frm’: No such file or directory
find: ‘/data/testing_api/API_cdr_total.frm’: No such file or directory
find: ‘/data/testing_api/API_mur_total.frm’: No such file or directory
find: ‘/data/testing_api/API_cdr_costs_invoiced.frm’: No such file or directory
find: ‘/data/testing_api/API_cdr_costs.frm’: No such file or directory

Comment by Andrii Nikitin (Inactive) [ 2017-09-21 ]

Working on MDEV-13789 I noticed that behavior of sst script may be affected by nullglob configuration in bash (and probably other explicit shopt options).

Could you check if line `shopt -s nullglob` is present somewhere in .bashrc or similar places and provide output of command below from problem environment:

shopt

This what may demonstrate 'No such file or directory' error from find with explicit nullglob :

$ shopt -s nullglob
$ mkdir datatest
$ cd datatest/
$ cpat='.*galera\.cache$\|.*sst_in_progress$\|.*\.sst$\|.*gvwstate\.dat$\|.*grastate\.dat$\|.*\.err$\|.*\.log$\|.*RPM_UPGRADE_MARKER$\|.*RPM_UPGRADE_HISTORY$'
$ mkdir .sst
$ mkdir 1
$ mkdir 2
$ touch 1/a
$ touch 2/a
$ find $(pwd) . -mindepth 1  -regex $cpat -prune -o -exec rm -rf {} \;
find: ‘/home/a/datatest/.sst’: No such file or directory
find: ‘/home/a/datatest/1’: No such file or directory
find: ‘/home/a/datatest/2’: No such file or directory

Comment by Kamil [ 2017-09-22 ]

Hi Andrii,

That's interesting!

I've checked my all DB nodes but haven't found nullglob anywhere.

 
[root@salt01 ~]# salt 'db0*' cmd.run "shopt | grep null"
db01:
    nullglob            off
db03:
    nullglob            off
db02:
    nullglob            off

Full output:

 
[root@eap-db01 ~]# shopt
autocd          off
cdable_vars     off
cdspell         off
checkhash       off
checkjobs       off
checkwinsize    on
cmdhist         on
compat31        off
compat32        off
compat40        off
compat41        off
direxpand       off
dirspell        off
dotglob         off
execfail        off
expand_aliases  on
extdebug        off
extglob         off
extquote        on
failglob        off
force_fignore   on
globstar        off
gnu_errfmt      off
histappend      on
histreedit      off
histverify      off
hostcomplete    on
huponexit       off
interactive_comments    on
lastpipe        off
lithist         off
login_shell     on
mailwarn        off
no_empty_cmd_completion off
nocaseglob      off
nocasematch     off
nullglob        off
progcomp        on
promptvars      on
restricted_shell        off
shift_verbose   off
sourcepath      on
xpg_echo        off

Also nothing in the scripts:

 
[root@eap-db01 ~]# grep shopt /etc/bashrc /root/.bash*
/etc/bashrc:  shopt -s histappend
/etc/bashrc:  shopt -s checkwinsize
/etc/bashrc:if ! shopt -q login_shell ; then # We're not a login shell

Comment by Andrii Nikitin (Inactive) [ 2017-09-22 ]

Thank you for feedback. While it didn't confirm my assumptions, after all I managed to prove that the command used in sst may show 'No such file or directory' with many files.
(The case puts heavy load on filesystem, so use it only in good test environment)
The error does not exactly matches provided messages , but still proves necessity of using proposed xargs command :

mkdir -p MDEV-13478
(
set -e
cd MDEV-13478
echo creating many folders
for i in {1..10000}; do n=$(echo $i | md5sum); n=${n%% *}; mkdir $n; touch $n/$n $n/${n}1 $n/${n}2 ; done > /dev/null
cpat='.sst'
if ! find $(pwd) . -mindepth 1  -regex $cpat -prune -o -exec rm -rfv {} \+ >/dev/null 2>../error.log ; then
        echo there were errors using find -exec rm:
        head ../error.log
fi
)

bash ~/bugs/MDEV-13478.sh 
creating many folders
there were errors using find -exec rm:
find: ‘./317544066787836fc93db9cf875c512c’: No such file or directory
find: ‘./a2c12f3214d29967fe1b12eed28da2a3’: No such file or directory
find: ‘./a22adffcd09a596e05faa676b3ba66d1’: No such file or directory
find: ‘./0eee941aa24e3d3160c33bec839dab56’: No such file or directory
find: ‘./8beb7317ab7b7d426db24fb3a63b4c76’: No such file or directory
find: ‘./76beeadedf0c48fa4b7fa264fbd81828’: No such file or directory
find: ‘./f25d8b289df79553ca724d02ceab5623’: No such file or directory
find: ‘./38988397ce1ac28124159af65ff8eb05’: No such file or directory
find: ‘./24d3d6ab1d5943c29bc82bddfb447c14’: No such file or directory
find: ‘./5608ed7c839405d7bdccda62bee413d0’: No such file or directory

It worth to note that your proposed solution still prints error messages for case above, just overall operation still succeeds:

mkdir -p MDEV-13478
(
set -e
cd MDEV-13478
echo creating many folders
for i in {1..10000}; do n=$(echo $i | md5sum); n=${n%% *}; mkdir $n; touch $n/$n $n/${n}1 $n/${n}2 ; done > /dev/null
cpat='.sst'
if ! find $(pwd) . -mindepth 1  -regex $cpat -prune -o -print0 | xargs -0 rm -rfv >/dev/null 2>../error.log ; then
        echo "there were errors using find \| xargs rm:"
        head ../error.log
else
        echo success
fi
)

So, unless me or reviewers would come with better solution, I will try to get your suggestion pushed into code.

Comment by Andrii Nikitin (Inactive) [ 2017-09-23 ]

could you also provide output from the problem machine(s):

getconf ARG_MAX

Comment by Kamil [ 2017-09-27 ]

HI Andrii,

I'm sorry I must have missed your query, please see the output:

[root@db01 ~]# getconf ARG_MAX
2097152

Comment by Valerii Kravchuk [ 2017-10-25 ]

It seems 10.1.20 on SLES 12 is also affected by this bug.

Comment by Sergei Golubchik [ 2017-12-18 ]

No, -print0 has nothing to do with the problem in question, there are no spaces or other strange characters there.

Comment by Sachin Setiya (Inactive) [ 2017-12-19 ]

I think prune position is slightly wrong it should be

find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1 -prune -regex $cpat    -o -exec rm -rfv {} 1>&2 \+

(before -regex) , so that it should be applied in both cases.

Comment by Sachin Setiya (Inactive) [ 2017-12-19 ]

So the reason for this is in current find works like this
find (-regex && -prune) || (-exec )
So if regex is false , -prune is not applied and we get so many files
what solution does is
find (-prune && -regex) || (-exec )
So the prune will be executed always , (and is has non reversible effect ,so even if regex fails exec will get only folder name , not the all file names. )

Comment by Sachin Setiya (Inactive) [ 2017-12-19 ]

http://lists.askmonty.org/pipermail/commits/2017-December/011744.html

Comment by Sachin Setiya (Inactive) [ 2017-12-19 ]

http://lists.askmonty.org/pipermail/commits/2017-December/011745.html

Generated at Thu Feb 08 08:05:54 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.