Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-17114

Node failed to Join cluster after switch SST from Rsync to Xtrabackup using MariaDB-10.2.17

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.2.17
    • Fix Version/s: 10.2
    • Component/s: Galera SST
    • Labels:
      None
    • Environment:
      MariaDB-server-10.2.17
      percona-xtrabackup-24-2.4.12
      socat-1.7.3.2
      rsync-3.1.2
      galera-25.3.23
      CentOS 7.5

      Description

      Hello Support,

      I think I'm hitting a bug after switch the SST method from rsync to xtrabackup. Here is the story:

      I'm running a 3 cluster node using galera version 25.3. This cluster has been having issues since it went to production (Nodes constantly crashing). In order to improve the downtime for our customers I decided to change the wsrep_sst_method from rsync to xtrabackup a few days ago to avoid lockings on the Donor.

      Change went well in all three servers, I tested shutting down mariadb in one of the nodes and it was able to join the cluster using xtrabackup without issues.

      Yesterday, node3 crash with the following error:

      Duplicate entry '766974' for key 'PRIMARY', Error
      _code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 678, Internal MariaDB error code: 1062
      

      After that, it was not able to join the cluster anymore. Logs from this node were reporting the following:

      2018-08-30 14:44:03 140708028196608 [Warning] WSREP: Gap in state sequence. Need state transfer.
      2018-08-30 14:44:03 140707826231040 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.30.83.59' --datadir '/var/lib/mysql/'   --parent '16141'  '' '
      WSREP_SST: [INFO] Streaming with xbstream (20180830 14:44:03.966)
      WSREP_SST: [INFO] Using socat as streamer (20180830 14:44:03.970)
      WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20180830 14:44:03.978)
      WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180830 14:44:04.031)
      2018-08-30 14:44:04 140708028196608 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.30.83.59:4444/xtrabackup_sst//10.30.83.59
      2018-08-30 14:44:04 140708028196608 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2018-08-30 14:44:04 140708028196608 [Note] WSREP: REPL Protocols: 8 (3, 2)
      2018-08-30 14:44:04 140708028196608 [Note] WSREP: Assign initial position for certification: 210247333, protocol version: 3
      2018-08-30 14:44:04 140708072285952 [Note] WSREP: Service thread queue flushed.
      2018-08-30 14:44:04 140708028196608 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (bceea6d1-93cb-11e8-a528-769389c11935): 1 (Operation not permitted)
      2018-08-30 14:44:04 140707692013312 [Note] WSREP: Member 1.0 (sql03) requested state transfer from '*any*'. Selected 0.0 (sql02)(SYNCED) as donor.
      2018-08-30 14:44:04 140707692013312 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 210247419)
      2018-08-30 14:44:04 140708028196608 [Note] WSREP: Requesting state transfer: success, donor: 0
      2018-08-30 14:44:04 140708028196608 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> bceea6d1-93cb-11e8-a528-769389c11935:210247333
      WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /var/lib/mysql//.sst from previous state transfer. Removing (20180830 14:44:05.449)
      WSREP_SST: [INFO] Proceeding with SST (20180830 14:44:05.459)
      WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180830 14:44:05.463)
      WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20180830 14:44:05.463)
      2018-08-30 14:44:05 140708038715136 [Note] WSREP: (aa211f51, 'tcp://0.0.0.0:4567') connection to peer aa211f51 with addr tcp://10.30.83.59:4567 timed out, no messages seen in PT3S
      2018-08-30 14:44:05 140708038715136 [Note] WSREP: (aa211f51, 'tcp://0.0.0.0:4567') turning message relay requesting off
      WSREP_SST: [INFO] Waiting for SST streaming to complete! (20180830 14:44:06.112)
      2018-08-30 14:44:15 140707692013312 [Warning] WSREP: 0.0 (sql02): State transfer to 1.0 (sql03) failed: -22 (Invalid argument)
      2018-08-30 14:44:15 140707692013312 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():737: Will never receive state. Need to abort.
      2018-08-30 14:44:15 140707692013312 [Note] WSREP: gcomm: terminating thread
      2018-08-30 14:44:15 140707692013312 [Note] WSREP: gcomm: joining thread
      2018-08-30 14:44:15 140707692013312 [Note] WSREP: gcomm: closing backend
      WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20180830 14:44:15.559)
      WSREP_SST: [ERROR] Cleanup after exit with status:2 (20180830 14:44:15.564)
      2018-08-30 14:44:15 140707826231040 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.30.83.59' --datadir '/var/lib/mysql/'   --parent '16141'  '' : 2 (No such file or directory)
      2018-08-30 14:44:15 140707826231040 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
      2018-08-30 14:44:15 140708169021632 [ERROR] WSREP: SST failed: 2 (No such file or directory)
      

      Donor was reporting:

      2018-08-30 14:14:44 140694740002560 [Note] WSREP:  cleaning up 878a2501 (tcp://10.30.83.59:4567)
      2018-08-30 14:44:00 140687282530048 [Warning] Aborted connection 21855 to db: 'ftphnbuai_1234' user: 'ftphnbuai_1234' host: 'gateway' (Got an error reading communication packets)
      2018-08-30 14:44:02 140694740002560 [Note] WSREP: (70b92cd7, 'tcp://0.0.0.0:4567') connection established to aa211f51 tcp://10.30.83.59:4567
      2018-08-30 14:44:02 140694740002560 [Note] WSREP: (70b92cd7, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
      2018-08-30 14:44:02 140694740002560 [Note] WSREP: declaring aa211f51 at tcp://10.30.83.59:4567 stable
      2018-08-30 14:44:02 140694740002560 [Note] WSREP: Node 70b92cd7 state prim
      2018-08-30 14:44:02 140694740002560 [Note] WSREP: view(view_id(PRIM,70b92cd7,764) memb {
      2018-08-30 14:44:02 140694740002560 [Note] WSREP: save pc into disk
      2018-08-30 14:44:03 140694729484032 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
      2018-08-30 14:44:03 140694729484032 [Note] WSREP: STATE_EXCHANGE: sent state UUID: aa716b5e-ac84-11e8-bf66-c372901c4260
      2018-08-30 14:44:03 140694729484032 [Note] WSREP: STATE EXCHANGE: sent state msg: aa716b5e-ac84-11e8-bf66-c372901c4260
      2018-08-30 14:44:03 140694729484032 [Note] WSREP: STATE EXCHANGE: got state msg: aa716b5e-ac84-11e8-bf66-c372901c4260 from 0 (sql02)
      2018-08-30 14:44:03 140694729484032 [Note] WSREP: STATE EXCHANGE: got state msg: aa716b5e-ac84-11e8-bf66-c372901c4260 from 1 (sql03)
      2018-08-30 14:44:03 140694729484032 [Note] WSREP: Quorum results:
      2018-08-30 14:44:03 140694729484032 [Note] WSREP: Flow-control interval: [23, 23]
      2018-08-30 14:44:03 140694729484032 [Note] WSREP: Trying to continue unpaused monitor
      2018-08-30 14:44:03 140694674339584 [Note] WSREP: New cluster view: global state: bceea6d1-93cb-11e8-a528-769389c11935:210247333, view# 298: Primary, number of nodes: 2, my index: 0, protocol version 3
      2018-08-30 14:44:03 140694674339584 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2018-08-30 14:44:03 140694674339584 [Note] WSREP: REPL Protocols: 8 (3, 2)
      2018-08-30 14:44:03 140694674339584 [Note] WSREP: Assign initial position for certification: 210247333, protocol version: 3
      2018-08-30 14:44:03 140694976788224 [Note] WSREP: Service thread queue flushed.
      2018-08-30 14:44:04 140694729484032 [Note] WSREP: Member 1.0 (sql03) requested state transfer from '*any*'. Selected 0.0 (sql02)(SYNCED) as donor.
      2018-08-30 14:44:04 140694729484032 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 210247419)
      2018-08-30 14:44:04 140694674339584 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2018-08-30 14:44:04 140685730633472 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.30.83.59:4444/xtrabackup_sst//10.30.83.59' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid 'bceea6d1-93cb-11e8-a528-769389c11935:210247419' --gtid-domain-id '0''
      2018-08-30 14:44:04 140694674339584 [Note] WSREP: sst_donor_thread signaled with 0
      2018-08-30 14:44:05 140694740002560 [Note] WSREP: (70b92cd7, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2018-08-30 14:44:15 140687291516672 [Warning] Access denied for user 'galera'@'localhost' (using password: YES)
      2018-08-30 14:44:15 140685730633472 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.30.83.59:4444/xtrabackup_sst//10.30.83.59' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid 'bceea6d1-93cb-11e8-a528-769389c11935:210247419' --gtid-domain-id '0'
      2018-08-30 14:44:15 140685730633472 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.30.83.59:4444/xtrabackup_sst//10.30.83.59' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid 'bceea6d1-93cb-11e8-a528-769389c11935:210247419' --gtid-domain-id '0': 22 (Invalid argument)
      2018-08-30 14:44:15 140685730633472 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.30.83.59:4444/xtrabackup_sst//10.30.83.59' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid 'bceea6d1-93cb-11e8-a528-769389c11935:210247419' --gtid-domain-id '0'
      2018-08-30 14:44:15 140694729484032 [Warning] WSREP: 0.0 (sql02): State transfer to 1.0 (sql03) failed: -22 (Invalid argument)
      2018-08-30 14:44:15 140694729484032 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 210248565)
      2018-08-30 14:44:15 140694729484032 [Note] WSREP: Member 0.0 (sql02) synced with group.
      2018-08-30 14:44:15 140694729484032 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 210248566)
      2018-08-30 14:44:15 140694943217408 [Note] WSREP: Synchronized with group, ready for connections
      

      As you can see above the donor reported access denied for galera user. This user is defined in the /etc/my.cnf.d/server.cnf file and I am able to log into the database locally using its credentials.

      innobackup.backup.log showed the following:

      Connecting to MySQL server host: localhost, user: galera, password: set, port: not set, socket: /var/lib/mysql/mysql.sock
      Failed to connect to MySQL server: Access denied for user 'galera'@'localhost' (using password: YES).
      

      That's no the end of the story, then node01 crashed it couldn't join the cluster either due to same errors.

      At this point, I decided to bounce mariadb in node02 and it failed to restart and I had to bootstrap it and service came back up.

      The other two nodes were still having issues to join the cluster after bootstrap, so I decided to switch back to rsync in nodes 01 and 03 and they were able to join the cluster after all.

      With all three nodes online, I brought down 02 which was still configured to use xtrabackup and surprisingly it was able to join the cluster without errors.

      I did one more test, I brought down node 01 and 03 and switched method back again to rsync and same issue, nodes were not able to join cluster with the same errors posted above. At this point I gave up and rollback all nodes to use rsync.

      Here is mysql config:

      cat /etc/my.cnf.d/server.cnf
      #
      # These groups are read by MariaDB server.
      # Use it for options that only the server (but not clients) should see
      #
      # See the examples of server my.cnf files in /usr/share/mysql/
      #
       
      # this is read by the standalone daemon and embedded servers
      [server]
       
      # this is only for the mysqld standalone daemon
      [mysqld]
      datadir = /var/lib/mysql
      log_error = /var/log/mariadb.log
      max_user_connections=64
      max_connections=10240
      connect_timeout = 5
      wait_timeout = 600
      slow_query_log = 1
      slow_query_log_file = /var/lib/mysql/sql01-slow.log
      long_query_time = 600
       
      # Innodb Settings
      default_storage_engine=InnoDB
      enforce_storage_engine=InnoDB
      innodb_buffer_pool_size=4G
      innodb_buffer_pool_instances = 4
      innodb_log_file_size = 500MB
      innodb_autoinc_lock_mode=2
       
       
      #
      # * Galera-related settings
      #
      [galera]
      wsrep_on=ON
      wsrep_provider=/usr/lib64/galera/libgalera_smm.so
      wsrep_cluster_address='gcomm://sql01,sql02,sql03'
      wsrep_cluster_name='WP0'
      wsrep_node_address='sql01'
      wsrep_node_name='sql01'
      #wsrep_sst_method=xtrabackup-v2
      wsrep_sst_method=rsync
      wsrep_sst_auth=galera:PASSWORD
      wsrep_slave_threads=4
      binlog_format=row
      bind-address=0.0.0.0
       
      # this is only for embedded server
      [embedded]
       
      # This group is only read by MariaDB servers, not by MySQL.
      # If you use the same .cnf file for MySQL and MariaDB,
      # you can put MariaDB-only options here
      [mariadb]
       
      # This group is only read by MariaDB-10.1 servers.
      # If you use the same .cnf file for MariaDB of different versions,
      # use this group for options that older servers don't understand
      [mariadb-10.1]
      
      

      Configuration file is the same in all three nodes.

        Attachments

          Activity

            People

            • Assignee:
              jplindst Jan Lindström
              Reporter:
              oswgarcia Oswaldo Garcia
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: