Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-13478

Full SST sync fails because of the error in the cleaning part

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.1.20, 5.5.51-galera, 10.2.7
    • Configuration
    • 10.1.30

    Description

      Hi MariaDB Folks,

      We had the same problem with MariaDB verion 5.5 and also now after upgrading to 10.2.7

      Problem desc:

      When node needs a full state snapshot (SST) first sync attempt always fails in our case. It does not fail at the very beginning of SST phase but only after data transfer is completed. We are waiting 45 min to transfer 180GB and eventually it always fails.

      Error message is seen in the logs when data are being streamed:

      2017-08-08 23:14:56 140013350135552 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.28.253.22:5444' --datadir '/data/'   --parent '14018'  '' '
      WSREP_SST: [INFO] Streaming with xbstream (20170808 23:14:56.343)
      WSREP_SST: [INFO] Using socat as streamer (20170808 23:14:56.346)
      WSREP_SST: [INFO] Stale sst_in_progress file: /data//sst_in_progress (20170808 23:14:56.354)
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.28.253.22:5444/xtrabackup_sst//1
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: REPL Protocols: 7 (3, 2)
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: Assign initial position for certification: 436077364, protocol version: 3
      2017-08-08 23:14:56 140014549366528 [Note] WSREP: Service thread queue flushed.
      2017-08-08 23:14:56 140014532581120 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (d01e5332-2656-11e4-ae7b-56db42595f62): 1 (Operation not permitted)
               at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
      2017-08-08 23:14:56 140013358528256 [Note] WSREP: Member 0.0 (eap-db01) requested state transfer from 'eap-db02'. Selected 2.0 (eap-db02)(SYNCED) as donor.
      2017-08-08 23:14:56 140013358528256 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 436077364)
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: Requesting state transfer: success, donor: 2
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(d01e5332-2656-11e4-ae7b-56db42595f62:436077364)
      WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.402)
      WSREP_SST: [INFO] Proceeding with SST (20170808 23:14:56.684)
      WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.685)
      WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20170808 23:14:56.687)
      ## lot's of files removed here, output cut ###
      removed ‘/data/ethangroup/bill_payments.ibd’
      removed ‘/data/ethangroup/account_tokens.ibd’
      removed ‘/data/ethangroup/identity.ibd’
      removed ‘/data/ethangroup/sp_trans_xml.ibd’
      removed ‘/data/phoneclub/bill_trans_to_usage.frm’
      removed directory: ‘/data/phoneclub’
      find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory
      removed ‘/data/testing_logging/login_logs.ibd’
      removed ‘/data/testing_logging/ec_dump2.frm’
      removed ‘/data/testing_logging/ec_errorlog.frm’
      removed ‘/data/testing_logging/queue_consumer.frm’
      removed ‘/data/testing_logging/queue_function.frm’
      removed directory: ‘/data/testing_logging’
      removed ‘/data/aria_log.00000001’
      removed ‘/data/multi-master.info’
      removed ‘/data/mysql_upgrade_info’
      removed ‘/data/ib_buffer_pool’
      WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373)
      
      

      So key thing in above snapshot is a second attempt to remove the file which already had been removed earlier:

      find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory
      

      two lines before:

      removed ‘/data/phoneclub/bill_trans_to_usage.frm’
      

      It's a random file, not always same.

      As a result whole cleanup phase fails with error:

      WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373)
      

      Code responsible for this part:
      /bin/wsrep_sst_xtrabackup-v2

          889
          890         wsrep_log_info "Cleaning the existing datadir and innodb-data/log directories"
          891         find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune  -o -exec rm -rfv {} 1>&2 \+
      

      Second SST attempt is successful as all files and directories have already been removed. I wasn't able to find the root cause why find is trying to remove same files again.

      We have quite a lot databases and tables in mysql data directory, but I don't think it may cause this kind of issues.

      [root@eap-db03 data]# ls -lR /data/| wc -l
      7927
      

      To make the first SST workable we need to change line 891 to:

      find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune -o -print0 | xargs -0 /bin/rm -rf
      

      Attachments

        Issue Links

          Activity

            No, -print0 has nothing to do with the problem in question, there are no spaces or other strange characters there.

            serg Sergei Golubchik added a comment - No, -print0 has nothing to do with the problem in question, there are no spaces or other strange characters there.
            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - - edited

            I think prune position is slightly wrong it should be

            find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1 -prune -regex $cpat    -o -exec rm -rfv {} 1>&2 \+
            

            (before -regex) , so that it should be applied in both cases.

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - - edited I think prune position is slightly wrong it should be find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1 -prune -regex $cpat -o - exec rm -rfv {} 1>&2 \+ (before -regex) , so that it should be applied in both cases.

            So the reason for this is in current find works like this
            find (-regex && -prune) || (-exec )
            So if regex is false , -prune is not applied and we get so many files
            what solution does is
            find (-prune && -regex) || (-exec )
            So the prune will be executed always , (and is has non reversible effect ,so even if regex fails exec will get only folder name , not the all file names. )

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - So the reason for this is in current find works like this find (-regex && -prune) || (-exec ) So if regex is false , -prune is not applied and we get so many files what solution does is find (-prune && -regex) || (-exec ) So the prune will be executed always , (and is has non reversible effect ,so even if regex fails exec will get only folder name , not the all file names. )
            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - http://lists.askmonty.org/pipermail/commits/2017-December/011744.html
            sachin.setiya.007 Sachin Setiya (Inactive) added a comment - http://lists.askmonty.org/pipermail/commits/2017-December/011745.html

            People

              sachin.setiya.007 Sachin Setiya (Inactive)
              Brys Kamil
              Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.