  1. MariaDB Server
  2. MDEV-21738

SST Bug with MariaDB-backup

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.3.22
    • Fix Version/s: 10.3.23

    Description

      Situation:

      For information: a ProxySQL instance on a separate machine sends write traffic to node1 only.

      After setting up a Galera Cluster with MariaDB 10.3.22 on three CentOS 8 machines, we set out to reboot the nodes one at a time (node1, node2, node3).

      Reboot procedure on node3:

      • Stop mariadb (systemctl stop mariadb)
      • Reboot

      The firewall configuration on node3 was reset by the reboot, which blocked access to node2. After the firewall was fixed, the MariaDB server was started again; this triggered an SST, which failed.
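
Because the first failure involved a firewall blocking traffic between node3 and node2, a quick reachability check before starting the server can rule that cause out. The following is a minimal sketch, not part of the original report. Ports 4567 and 4444 appear in the logs in this report; 4568 is Galera's default IST port and is an assumption here, and the function names are hypothetical.

```python
import socket

# 4567 (group communication) and 4444 (SST) are taken from the logs in this
# report; 4568 is Galera's default IST port and is an assumption here.
GALERA_PORTS = (4567, 4568, 4444)


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts alike.
        return False


def check_peer(host, ports=GALERA_PORTS, timeout=3.0):
    """Map each port to its reachability from this machine."""
    return {port: check_port(host, port, timeout) for port in ports}
```

Calling `check_peer("<ip2>")` from node3 would show whether the group communication port on node2 is reachable. Note that the SST and IST ports are normally only open on the joiner while a transfer is in progress, so a negative result for 4444 or 4568 outside a transfer is inconclusive.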

      On node3:

      [...]
      removed '/mnt/galera/mariadb/db1/table1.frm'
      removed directory '/mnt/galera/mariadb/db2'
      WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 08:34:54.885)
      2020-02-14  9:14:27 0 [Warning] WSREP: 0.0 (node2): State transfer to 2.0 (node3) failed: -22 (Invalid argument)
      2020-02-14  9:14:27 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():775: Will never receive state. Need to abort.
      2020-02-14  9:14:27 0 [Note] WSREP: gcomm: terminating thread
      2020-02-14  9:14:27 0 [Note] WSREP: gcomm: joining thread
      2020-02-14  9:14:27 0 [Note] WSREP: gcomm: closing backend
      WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20200214 09:14:27.683)
      WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:14:27.687)
      2020-02-14  9:14:27 0 [Note] WSREP: view(view_id(NON_PRIM,96d7b0f1,243) memb {
      	dc590f74,0
      } joined {
      } left {
      } partitioned {
      	96d7b0f1,0
      	97dd31b7,0
      })
      2020-02-14  9:14:27 0 [Note] WSREP: view((empty))
      2020-02-14  9:14:27 0 [Note] WSREP: gcomm: closed
      2020-02-14  9:14:27 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
      Signal 15 (TERM) caught by ps (3.3.15).
      ps:ps/display.c:66: please report this bug
      WSREP_SST: [ERROR] Removing /mnt/galera/mariadb//.sst/xtrabackup_galera_info file due to signal (20200214 09:14:27.701)
      2020-02-14  9:14:36 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
      2020-02-14  9:14:36 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
      2020-02-14  9:14:36 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
      2020-02-14  9:14:36 0 [Note] WSREP: CRC-32C: using hardware acceleration.
      2020-02-14  9:14:36 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
      2020-02-14  9:14:36 0 [Note] WSREP: Passing config to GCS: base_dir = /mnt/galera/mariadb/; base_host = ip3; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /mnt/galera/mariadb/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/galera/mariadb//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 16G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce
      2020-02-14  9:14:36 0 [Note] WSREP: GCache history reset: efea29fc-4d5e-11ea-93d9-82acc3ade856:0 -> 00000000-0000-0000-0000-000000000000:-1
      2020-02-14  9:14:36 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
      2020-02-14  9:14:36 0 [Note] WSREP: wsrep_sst_grab()
      2020-02-14  9:14:36 0 [Note] WSREP: Start replication
      2020-02-14  9:14:36 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
      2020-02-14  9:14:36 0 [Note] WSREP: protonet asio version 0
      2020-02-14  9:14:36 0 [Note] WSREP: Using CRC-32C for message checksums.
      2020-02-14  9:14:36 0 [Note] WSREP: backend: asio
      2020-02-14  9:14:36 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
      2020-02-14  9:14:36 0 [Warning] WSREP: access file(/mnt/galera/mariadb//gvwstate.dat) failed(No such file or directory)
      2020-02-14  9:14:36 0 [Note] WSREP: restore pc from disk failed
      2020-02-14  9:14:36 0 [Note] WSREP: GMCast version 0
      2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
      2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
      2020-02-14  9:14:36 0 [Note] WSREP: EVS version 0
      2020-02-14  9:14:36 0 [Note] WSREP: gcomm: connecting to group 'prod', peer 'ip3:,ip2:,ip1:'
      2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://<ip3>:4567
      2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') connection established to 96d7b0f1 tcp://<ip2>:4567
      2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
      2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') connection established to 97dd31b7 tcp://<ip1>:4567
      2020-02-14  9:14:36 0 [Note] WSREP: declaring 96d7b0f1 at tcp://<ip2>:4567 stable
      2020-02-14  9:14:36 0 [Note] WSREP: declaring 97dd31b7 at tcp://<ip1>:4567 stable
      2020-02-14  9:14:36 0 [Note] WSREP: Node 96d7b0f1 state prim
      2020-02-14  9:14:36 0 [Note] WSREP: view(view_id(PRIM,6ba79e4c,245) memb {
      	6ba79e4c,0
      	96d7b0f1,0
      	97dd31b7,0
      } joined {
      } left {
      } partitioned {
      })
      2020-02-14  9:14:36 0 [Note] WSREP: save pc into disk
      2020-02-14  9:14:36 0 [Note] WSREP: gcomm: connected
      2020-02-14  9:14:36 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
      2020-02-14  9:14:36 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
      2020-02-14  9:14:36 0 [Note] WSREP: Opened channel 'prod'
      2020-02-14  9:14:36 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
      2020-02-14  9:14:36 0 [Note] WSREP: Waiting for SST to complete.
      2020-02-14  9:14:36 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b
      2020-02-14  9:14:36 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b
      2020-02-14  9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 0 (node3)
      2020-02-14  9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 1 (node2)
      2020-02-14  9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 2 (node1)
      2020-02-14  9:14:36 0 [Note] WSREP: Quorum results:
      	version    = 6,
      	component  = PRIMARY,
      	conf_id    = 244,
      	members    = 2/3 (joined/total),
      	act_id     = 1562227,
      	last_appl. = -1,
      	protocols  = 0/9/3 (gcs/repl/appl),
      	group UUID = efea29fc-4d5e-11ea-93d9-82acc3ade856
      2020-02-14  9:14:36 0 [Note] WSREP: Flow-control interval: [28, 28]
      2020-02-14  9:14:36 0 [Note] WSREP: Trying to continue unpaused monitor
      2020-02-14  9:14:36 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1562227)
      2020-02-14  9:14:36 2 [Note] WSREP: State transfer required: 
      	Group state: efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227
      	Local state: 00000000-0000-0000-0000-000000000000:-1
      2020-02-14  9:14:36 2 [Note] WSREP: New cluster view: global state: efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227, view# 245: Primary, number of nodes: 3, my index: 0, protocol version 3
      2020-02-14  9:14:36 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
      2020-02-14  9:14:36 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '<ip3>' --datadir '/mnt/galera/mariadb/' --parent '6030' --mysqld-args --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
      WSREP_SST: [INFO] Streaming with xbstream (20200214 09:14:36.942)
      WSREP_SST: [INFO] Using socat as streamer (20200214 09:14:36.945)
      WSREP_SST: [INFO] Stale sst_in_progress file: /mnt/galera/mariadb//sst_in_progress (20200214 09:14:36.949)
      WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio |  pigz -d | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200214 09:14:36.982)
      2020-02-14  9:14:37 2 [Note] WSREP: Prepared SST request: mariabackup|<ip3>:4444/xtrabackup_sst//1
      2020-02-14  9:14:37 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2020-02-14  9:14:37 2 [Note] WSREP: REPL Protocols: 9 (4, 2)
      2020-02-14  9:14:37 2 [Note] WSREP: Assign initial position for certification: 1562227, protocol version: 4
      2020-02-14  9:14:37 0 [Note] WSREP: Service thread queue flushed.
      2020-02-14  9:14:37 2 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (efea29fc-4d5e-11ea-93d9-82acc3ade856): 1 (Operation not permitted)
      	 at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
      2020-02-14  9:14:37 0 [Note] WSREP: Member 0.0 (node3) requested state transfer from '*any*'. Selected 1.0 (node2)(SYNCED) as donor.
      2020-02-14  9:14:37 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1562227)
      2020-02-14  9:14:37 2 [Note] WSREP: Requesting state transfer: success, donor: 1
      2020-02-14  9:14:37 2 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227
      WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /mnt/galera/mariadb//.sst from previous state transfer. Removing (20200214 09:14:38.011)
      2020-02-14  9:14:39 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') turning message relay requesting off
      WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio |  pigz -d | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200214 09:14:41.009)
      WSREP_SST: [INFO] Proceeding with SST (20200214 09:14:41.010)
      WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20200214 09:14:41.014)
      removed '/mnt/galera/mariadb/aria_log.00000001'
      removed '/mnt/galera/mariadb/ib_logfile1'
      removed '/mnt/galera/mariadb/aria_log_control'
      removed '/mnt/galera/mariadb/ibdata1'
      WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 09:14:41.034)
      2020-02-14  9:53:37 0 [Warning] WSREP: 1.0 (node2): State transfer to 0.0 (node3) failed: -22 (Invalid argument)
      2020-02-14  9:53:37 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():775: Will never receive state. Need to abort.
      2020-02-14  9:53:37 0 [Note] WSREP: gcomm: terminating thread
      2020-02-14  9:53:37 0 [Note] WSREP: gcomm: joining thread
      2020-02-14  9:53:37 0 [Note] WSREP: gcomm: closing backend
      2020-02-14  9:53:37 0 [Note] WSREP: view(view_id(NON_PRIM,6ba79e4c,245) memb {
      	6ba79e4c,0
      } joined {
      } left {
      } partitioned {
      	96d7b0f1,0
      	97dd31b7,0
      })
      2020-02-14  9:53:37 0 [Note] WSREP: view((empty))
      2020-02-14  9:53:37 0 [Note] WSREP: gcomm: closed
      2020-02-14  9:53:37 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
      WSREP_SST: [ERROR] Removing /mnt/galera/mariadb//.sst/xtrabackup_galera_info file due to signal (20200214 09:53:37.119)
      WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20200214 09:53:37.126)
      WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:53:37.128)
      [...]
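
The joiner log above shows the SST script waiting for a long time before it fails. A small parser makes that gap easy to measure. This is a sketch under the assumption that the `WSREP_SST` lines always end with a timestamp in the `(YYYYMMDD HH:MM:SS.mmm)` form seen in this report; the function names are hypothetical.

```python
import re
from datetime import datetime

# Matches script lines such as:
#   WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:53:37.128)
SST_LINE = re.compile(
    r"WSREP_SST: \[(?P<level>\w+)\] (?P<msg>.*) "
    r"\((?P<ts>\d{8} \d{2}:\d{2}:\d{2}\.\d{3})\)\s*$"
)


def parse_sst_events(lines):
    """Return (timestamp, level, message) for every WSREP_SST line."""
    events = []
    for line in lines:
        m = SST_LINE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y%m%d %H:%M:%S.%f")
            events.append((ts, m.group("level"), m.group("msg")))
    return events


def stalled_waits(events):
    """Pair each 'Waiting for SST streaming' event with the next ERROR event.

    Returns a list of (wait_start, error_time, duration) tuples.
    """
    results = []
    wait_start = None
    for ts, level, msg in events:
        if msg.startswith("Waiting for SST streaming"):
            wait_start = ts
        elif level == "ERROR" and wait_start is not None:
            results.append((wait_start, ts, ts - wait_start))
            wait_start = None
    return results
```

Applied to the excerpt above, the second attempt waits from 09:14:41 until the first error at 09:53:37, roughly 39 minutes, even though the donor-side backup had already crashed.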
      

      Mariabackup was launched on node2 (the donor), where it failed with an assertion:

      [...]
      [01] 2020-02-14 09:14:27 Streaming ./db1/table1.frm to <STDOUT>
      [01] 2020-02-14 09:14:27         ...done
      [00] 2020-02-14 09:14:27 Finished backing up non-InnoDB tables and files
      [01] 2020-02-14 09:14:27 Streaming ./aria_log.00000001 to <STDOUT>
      [01] 2020-02-14 09:14:27         ...done
      [01] 2020-02-14 09:14:27 Streaming ./aria_log_control to <STDOUT>
      [01] 2020-02-14 09:14:27         ...done
      [00] 2020-02-14 09:14:27 Waiting for log copy thread to read lsn 678163080833
      [00] 2020-02-14 09:14:27 >> log scanned up to (678163080842)
      [00] 2020-02-14 09:14:27 Unexpected tablespace innodb_system filename innodb_system.isl
      2020-02-14 09:14:27 0x7f5beb6a4900  InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.22/extra/mariabackup/xtrabackup.cc line 4492
      InnoDB: Failing assertion: 0
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mysqld startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
      InnoDB: about forcing recovery.
      200214  9:14:27 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.3.22-MariaDB
      key_buffer_size=0
      read_buffer_size=131072
      max_used_connections=0
      max_threads=1
      thread_count=0
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5599 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/bin/mariabackup(my_print_stacktrace+0x2e)[0x55bbddb3614e]
      /usr/bin/mariabackup(handle_fatal_signal+0x54d)[0x55bbdd6adadd]
      sigaction.c:0(__restore_rt)[0x7f5beb284dd0]
      :0(__GI_raise)[0x7f5be934b99f]
      :0(__GI_abort)[0x7f5be9335cf5]
      /usr/bin/mariabackup(+0x4fe1ed)[0x55bbdd3701ed]
      /usr/bin/mariabackup(+0x4b2ea2)[0x55bbdd324ea2]
      /usr/bin/mariabackup(_Z12backup_startv+0x17a)[0x55bbdd3aec5a]
      /usr/bin/mariabackup(+0x5227ef)[0x55bbdd3947ef]
      /usr/bin/mariabackup(main+0x185)[0x55bbdd378995]
      /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f5be9337873]
      /usr/bin/mariabackup(_start+0x2e)[0x55bbdd38c7fe]
      The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      Working directory at /mnt/galera/mariadb
      Resource Limits:
      Limit                     Soft Limit           Hard Limit           Units     
      Max cpu time              unlimited            unlimited            seconds   
      Max file size             unlimited            unlimited            bytes     
      Max data size             unlimited            unlimited            bytes     
      Max stack size            8388608              unlimited            bytes     
      Max core file size        unlimited            unlimited            bytes     
      Max resident set          unlimited            unlimited            bytes     
      Max processes             127892               127892               processes 
      Max open files            262140               262140               files     
      Max locked memory         16777216             16777216             bytes     
      Max address space         unlimited            unlimited            bytes     
      Max file locks            unlimited            unlimited            locks     
      Max pending signals       127892               127892               signals   
      Max msgqueue size         819200               819200               bytes     
      Max nice priority         0                    0                    
      Max realtime priority     0                    0                    
      Max realtime timeout      unlimited            unlimited            us        
      Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
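
The line just before the assertion names a file, `innodb_system.isl`. One way to investigate is to look for `.isl` files in the donor's datadir. The sketch below only lists such files; it assumes the file named in the message lives somewhere under the datadir, which the report does not confirm, and the function names are hypothetical.

```python
from pathlib import Path

# File stem taken from the mariabackup message
# "Unexpected tablespace innodb_system filename innodb_system.isl".
SUSPECT_STEM = "innodb_system"


def find_isl_files(datadir):
    """Return all *.isl files found anywhere under datadir."""
    return sorted(Path(datadir).rglob("*.isl"))


def suspect_isl_files(datadir):
    """Return only the .isl files whose name matches the system tablespace."""
    return [p for p in find_isl_files(datadir) if p.stem == SUSPECT_STEM]
```

For example, `suspect_isl_files("/mnt/galera/mariadb")` on node2 would return the matching paths, or an empty list if no such file exists. Other `.isl` files can be legitimate, so only the system tablespace name is flagged.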
      

      Note: no core dump file was recorded, because core dumps are not enabled on this machine.

      After the crash, the SST restarted automatically (which is expected server behaviour). In total, Mariabackup failed three times and the SST restarted automatically three times.

      What worked:

      • The SST was cancelled manually.
      • The contents of the datadir were removed (rm -rf /datadir).
      • The MariaDB server was started again.
      • This time, it started successfully.
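
The datadir cleanup step above is destructive, so a guarded helper with a dry-run default may be safer than a bare `rm -rf`. This is a minimal sketch, not the reporter's procedure; the function name and the depth guard are assumptions, and the server is expected to be stopped before it runs.

```python
import shutil
from pathlib import Path


def clear_datadir(datadir, dry_run=True):
    """Remove everything inside datadir but keep the directory itself.

    With dry_run=True (the default) nothing is deleted; the function only
    returns the entries that would be removed.
    """
    root = Path(datadir).resolve()
    # Guard against obviously dangerous targets such as "/" or "/mnt".
    if len(root.parts) < 3:
        raise ValueError(f"refusing to clear shallow path: {root}")
    entries = sorted(root.iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return entries
```

Running `clear_datadir("/mnt/galera/mariadb")` first lists what would be removed; passing `dry_run=False` performs the removal, after which the server can be started so that it requests a fresh SST.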

      Is this normal behaviour for Mariabackup?

      Sarvesh.

          People

            Assignee: sysprg Julius Goryavsky
            Reporter: Sarvesh Goburdhun Goburdhun Sarvesh Sharma
            Votes: 2
            Watchers: 6
