Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-13478

Full SST sync fails because of the error in the cleaning part

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.1.20, 5.5.51-galera, 10.2.7
    • Configuration
    • 10.1.30

    Description

      Hi MariaDB Folks,

      We had the same problem with MariaDB verion 5.5 and also now after upgrading to 10.2.7

      Problem desc:

      When node needs a full state snapshot (SST) first sync attempt always fails in our case. It does not fail at the very beginning of SST phase but only after data transfer is completed. We are waiting 45 min to transfer 180GB and eventually it always fails.

      Error message is seen in the logs when data are being streamed:

      2017-08-08 23:14:56 140013350135552 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.28.253.22:5444' --datadir '/data/'   --parent '14018'  '' '
      WSREP_SST: [INFO] Streaming with xbstream (20170808 23:14:56.343)
      WSREP_SST: [INFO] Using socat as streamer (20170808 23:14:56.346)
      WSREP_SST: [INFO] Stale sst_in_progress file: /data//sst_in_progress (20170808 23:14:56.354)
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.28.253.22:5444/xtrabackup_sst//1
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: REPL Protocols: 7 (3, 2)
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: Assign initial position for certification: 436077364, protocol version: 3
      2017-08-08 23:14:56 140014549366528 [Note] WSREP: Service thread queue flushed.
      2017-08-08 23:14:56 140014532581120 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (d01e5332-2656-11e4-ae7b-56db42595f62): 1 (Operation not permitted)
               at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
      2017-08-08 23:14:56 140013358528256 [Note] WSREP: Member 0.0 (eap-db01) requested state transfer from 'eap-db02'. Selected 2.0 (eap-db02)(SYNCED) as donor.
      2017-08-08 23:14:56 140013358528256 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 436077364)
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: Requesting state transfer: success, donor: 2
      2017-08-08 23:14:56 140014532581120 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(d01e5332-2656-11e4-ae7b-56db42595f62:436077364)
      WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.402)
      WSREP_SST: [INFO] Proceeding with SST (20170808 23:14:56.684)
      WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:5444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20170808 23:14:56.685)
      WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20170808 23:14:56.687)
      ## lot's of files removed here, output cut ###
      removed ‘/data/ethangroup/bill_payments.ibd’
      removed ‘/data/ethangroup/account_tokens.ibd’
      removed ‘/data/ethangroup/identity.ibd’
      removed ‘/data/ethangroup/sp_trans_xml.ibd’
      removed ‘/data/phoneclub/bill_trans_to_usage.frm’
      removed directory: ‘/data/phoneclub’
      find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory
      removed ‘/data/testing_logging/login_logs.ibd’
      removed ‘/data/testing_logging/ec_dump2.frm’
      removed ‘/data/testing_logging/ec_errorlog.frm’
      removed ‘/data/testing_logging/queue_consumer.frm’
      removed ‘/data/testing_logging/queue_function.frm’
      removed directory: ‘/data/testing_logging’
      removed ‘/data/aria_log.00000001
      removed ‘/data/multi-master.info’
      removed ‘/data/mysql_upgrade_info’
      removed ‘/data/ib_buffer_pool’
      WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373)
      
      

      So key thing in above snapshot is a second attempt to remove the file which already had been removed earlier:

      find: ‘/data/phoneclub/bill_trans_to_usage.frm’: No such file or directory
      

      two lines before:

      removed ‘/data/phoneclub/bill_trans_to_usage.frm’
      

      It's a random file, not always same.

      As a result whole cleanup phase fails with error:

      WSREP_SST: [ERROR] Cleanup after exit with status:1 (20170808 23:14:57.373)
      

      Code responsible for this part:
      /bin/wsrep_sst_xtrabackup-v2

          889
          890         wsrep_log_info "Cleaning the existing datadir and innodb-data/log directories"
          891         find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune  -o -exec rm -rfv {} 1>&2 \+
      

      Second SST attempt is successful as all files and directories have already been removed. I wasn't able to find the root cause why find is trying to remove same files again.

      We have quite a lot databases and tables in mysql data directory, but I don't think it may cause this kind of issues.

      [root@eap-db03 data]# ls -lR /data/| wc -l
      7927
      

      To make the first SST workable we need to change line 891 to:

      find $ib_home_dir $ib_log_dir $ib_undo_dir $DATA -mindepth 1  -regex $cpat  -prune -o -print0 | xargs -0 /bin/rm -rf
      

      Attachments

        Issue Links

          Activity

            People

              sachin.setiya.007 Sachin Setiya (Inactive)
              Brys Kamil
              Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.