[MDEV-21738] SST Bug with MariaDB-backup Created: 2020-02-14  Updated: 2021-05-11  Resolved: 2021-05-11

Status: Closed
Project: MariaDB Server
Component/s: Galera SST, mariabackup, Platform RedHat
Affects Version/s: 10.3.22
Fix Version/s: 10.3.23

Type: Bug
Priority: Critical
Reporter: Goburdhun Sarvesh Sharma
Assignee: Julius Goryavsky
Resolution: Fixed
Votes: 2
Labels: mariabackup
Environment:

OS: CentOS 8
OS Resources: 16 CPU (AMD EPYC 7501 32-Core Processor), 32GB RAM, 500GB SSD Disk
Type: Virtual Machine
Kernel: Linux 4.18.0-147.3.1.el8_1.x86_64
Architecture: x86-64
MariaDB version: 10.3.22-1.el8.x86_64 (installed from repo: http://yum.mariadb.org/10.3/centos8-amd64)
ProxySQL: proxysql-2.0.8-1.x86_64 (another VM)
Database datadir size: 350GB


Attachments: File cluster.cnf    

 Description   

The situation is as follows.

For information: a ProxySQL instance on a separate machine sends WRITE traffic to node1 only.

After setting up a Galera Cluster with MariaDB 10.3.22 on three CentOS 8 machines, we attempted to reboot the nodes one at a time (node1, node2, node3).

Reboot procedure on node3:

  • Stop mariadb (systemctl stop mariadb)
  • Reboot

After the reboot, the firewall on node3 was reset, which blocked access to node2. Once the firewall was fixed, the MariaDB server was started again, which triggered an SST, and the SST failed.
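
Since a firewall rule was part of the story here, a quick reachability check between nodes can rule that out before an SST is retried. The following is only a minimal sketch in Python: the two ports are the ones that appear in the node3 log in this report (base_port = 4567 and TCP-LISTEN:4444), while the function names are made up for illustration. Note that the SST port is only listening on the joiner while it waits for the stream, so a "closed" result for 4444 outside an SST is expected.

```python
import socket

# Ports taken from the node3 log in this report: 4567 is the Galera group
# communication port (base_port), 4444 is where the joiner listens for SST.
GALERA_PORTS = (4567, 4444)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_peer(host, ports=GALERA_PORTS):
    """Map each port to its reachability as seen from this machine."""
    return {port: port_open(host, port) for port in ports}
```

Running check_peer() from node3 against the address of node2 (and the reverse) would show whether the group communication port is reachable in both directions.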

On node3:

[...]
removed '/mnt/galera/mariadb/db1/table1.frm'
removed directory '/mnt/galera/mariadb/db2'
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 08:34:54.885)
2020-02-14  9:14:27 0 [Warning] WSREP: 0.0 (node2): State transfer to 2.0 (node3) failed: -22 (Invalid argument)
2020-02-14  9:14:27 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():775: Will never receive state. Need to abort.
2020-02-14  9:14:27 0 [Note] WSREP: gcomm: terminating thread
2020-02-14  9:14:27 0 [Note] WSREP: gcomm: joining thread
2020-02-14  9:14:27 0 [Note] WSREP: gcomm: closing backend
WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20200214 09:14:27.683)
WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:14:27.687)
2020-02-14  9:14:27 0 [Note] WSREP: view(view_id(NON_PRIM,96d7b0f1,243) memb {
	dc590f74,0
} joined {
} left {
} partitioned {
	96d7b0f1,0
	97dd31b7,0
})
2020-02-14  9:14:27 0 [Note] WSREP: view((empty))
2020-02-14  9:14:27 0 [Note] WSREP: gcomm: closed
2020-02-14  9:14:27 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Signal 15 (TERM) caught by ps (3.3.15).
ps:ps/display.c:66: please report this bug
WSREP_SST: [ERROR] Removing /mnt/galera/mariadb//.sst/xtrabackup_galera_info file due to signal (20200214 09:14:27.701)
2020-02-14  9:14:36 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2020-02-14  9:14:36 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2020-02-14  9:14:36 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
2020-02-14  9:14:36 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-02-14  9:14:36 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
2020-02-14  9:14:36 0 [Note] WSREP: Passing config to GCS: base_dir = /mnt/galera/mariadb/; base_host = ip3; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /mnt/galera/mariadb/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/galera/mariadb//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 16G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce
2020-02-14  9:14:36 0 [Note] WSREP: GCache history reset: efea29fc-4d5e-11ea-93d9-82acc3ade856:0 -> 00000000-0000-0000-0000-000000000000:-1
2020-02-14  9:14:36 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2020-02-14  9:14:36 0 [Note] WSREP: wsrep_sst_grab()
2020-02-14  9:14:36 0 [Note] WSREP: Start replication
2020-02-14  9:14:36 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2020-02-14  9:14:36 0 [Note] WSREP: protonet asio version 0
2020-02-14  9:14:36 0 [Note] WSREP: Using CRC-32C for message checksums.
2020-02-14  9:14:36 0 [Note] WSREP: backend: asio
2020-02-14  9:14:36 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2020-02-14  9:14:36 0 [Warning] WSREP: access file(/mnt/galera/mariadb//gvwstate.dat) failed(No such file or directory)
2020-02-14  9:14:36 0 [Note] WSREP: restore pc from disk failed
2020-02-14  9:14:36 0 [Note] WSREP: GMCast version 0
2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2020-02-14  9:14:36 0 [Note] WSREP: EVS version 0
2020-02-14  9:14:36 0 [Note] WSREP: gcomm: connecting to group 'prod', peer 'ip3:,ip2:,ip1:'
2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://<ip3>:4567
2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') connection established to 96d7b0f1 tcp://<ip2>:4567
2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2020-02-14  9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') connection established to 97dd31b7 tcp://<ip1>:4567
2020-02-14  9:14:36 0 [Note] WSREP: declaring 96d7b0f1 at tcp://<ip2>:4567 stable
2020-02-14  9:14:36 0 [Note] WSREP: declaring 97dd31b7 at tcp://<ip1>:4567 stable
2020-02-14  9:14:36 0 [Note] WSREP: Node 96d7b0f1 state prim
2020-02-14  9:14:36 0 [Note] WSREP: view(view_id(PRIM,6ba79e4c,245) memb {
	6ba79e4c,0
	96d7b0f1,0
	97dd31b7,0
} joined {
} left {
} partitioned {
})
2020-02-14  9:14:36 0 [Note] WSREP: save pc into disk
2020-02-14  9:14:36 0 [Note] WSREP: gcomm: connected
2020-02-14  9:14:36 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2020-02-14  9:14:36 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2020-02-14  9:14:36 0 [Note] WSREP: Opened channel 'prod'
2020-02-14  9:14:36 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2020-02-14  9:14:36 0 [Note] WSREP: Waiting for SST to complete.
2020-02-14  9:14:36 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b
2020-02-14  9:14:36 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b
2020-02-14  9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 0 (node3)
2020-02-14  9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 1 (node2)
2020-02-14  9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 2 (node1)
2020-02-14  9:14:36 0 [Note] WSREP: Quorum results:
	version    = 6,
	component  = PRIMARY,
	conf_id    = 244,
	members    = 2/3 (joined/total),
	act_id     = 1562227,
	last_appl. = -1,
	protocols  = 0/9/3 (gcs/repl/appl),
	group UUID = efea29fc-4d5e-11ea-93d9-82acc3ade856
2020-02-14  9:14:36 0 [Note] WSREP: Flow-control interval: [28, 28]
2020-02-14  9:14:36 0 [Note] WSREP: Trying to continue unpaused monitor
2020-02-14  9:14:36 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1562227)
2020-02-14  9:14:36 2 [Note] WSREP: State transfer required: 
	Group state: efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227
	Local state: 00000000-0000-0000-0000-000000000000:-1
2020-02-14  9:14:36 2 [Note] WSREP: New cluster view: global state: efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227, view# 245: Primary, number of nodes: 3, my index: 0, protocol version 3
2020-02-14  9:14:36 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2020-02-14  9:14:36 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '<ip3>' --datadir '/mnt/galera/mariadb/' --parent '6030' --mysqld-args --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
WSREP_SST: [INFO] Streaming with xbstream (20200214 09:14:36.942)
WSREP_SST: [INFO] Using socat as streamer (20200214 09:14:36.945)
WSREP_SST: [INFO] Stale sst_in_progress file: /mnt/galera/mariadb//sst_in_progress (20200214 09:14:36.949)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio |  pigz -d | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200214 09:14:36.982)
2020-02-14  9:14:37 2 [Note] WSREP: Prepared SST request: mariabackup|<ip3>:4444/xtrabackup_sst//1
2020-02-14  9:14:37 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-02-14  9:14:37 2 [Note] WSREP: REPL Protocols: 9 (4, 2)
2020-02-14  9:14:37 2 [Note] WSREP: Assign initial position for certification: 1562227, protocol version: 4
2020-02-14  9:14:37 0 [Note] WSREP: Service thread queue flushed.
2020-02-14  9:14:37 2 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (efea29fc-4d5e-11ea-93d9-82acc3ade856): 1 (Operation not permitted)
	 at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2020-02-14  9:14:37 0 [Note] WSREP: Member 0.0 (node3) requested state transfer from '*any*'. Selected 1.0 (node2)(SYNCED) as donor.
2020-02-14  9:14:37 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1562227)
2020-02-14  9:14:37 2 [Note] WSREP: Requesting state transfer: success, donor: 1
2020-02-14  9:14:37 2 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227
WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /mnt/galera/mariadb//.sst from previous state transfer. Removing (20200214 09:14:38.011)
2020-02-14  9:14:39 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio |  pigz -d | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200214 09:14:41.009)
WSREP_SST: [INFO] Proceeding with SST (20200214 09:14:41.010)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20200214 09:14:41.014)
removed '/mnt/galera/mariadb/aria_log.00000001'
removed '/mnt/galera/mariadb/ib_logfile1'
removed '/mnt/galera/mariadb/aria_log_control'
removed '/mnt/galera/mariadb/ibdata1'
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 09:14:41.034)
2020-02-14  9:53:37 0 [Warning] WSREP: 1.0 (node2): State transfer to 0.0 (node3) failed: -22 (Invalid argument)
2020-02-14  9:53:37 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():775: Will never receive state. Need to abort.
2020-02-14  9:53:37 0 [Note] WSREP: gcomm: terminating thread
2020-02-14  9:53:37 0 [Note] WSREP: gcomm: joining thread
2020-02-14  9:53:37 0 [Note] WSREP: gcomm: closing backend
2020-02-14  9:53:37 0 [Note] WSREP: view(view_id(NON_PRIM,6ba79e4c,245) memb {
	6ba79e4c,0
} joined {
} left {
} partitioned {
	96d7b0f1,0
	97dd31b7,0
})
2020-02-14  9:53:37 0 [Note] WSREP: view((empty))
2020-02-14  9:53:37 0 [Note] WSREP: gcomm: closed
2020-02-14  9:53:37 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
WSREP_SST: [ERROR] Removing /mnt/galera/mariadb//.sst/xtrabackup_galera_info file due to signal (20200214 09:53:37.119)
WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20200214 09:53:37.126)
WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:53:37.128)
[...]
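
To see how long each SST attempt ran before it was aborted, the joiner log can be scanned for the "Waiting for SST streaming to complete!" line and the following "State transfer ... failed" line. This is a rough sketch in Python that assumes the two line formats shown in the log above; the function name and the regular expressions are illustrative, not part of any MariaDB tooling.

```python
import re
from datetime import datetime

# SST script line, e.g.
# WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 09:14:41.034)
WAIT_RE = re.compile(
    r"Waiting for SST streaming to complete! \((\d{8} \d{2}:\d{2}:\d{2})\.\d+\)"
)
# Server line, e.g.
# 2020-02-14  9:53:37 0 [Warning] WSREP: 1.0 (node2): State transfer to 0.0 (node3) failed: -22 (Invalid argument)
FAIL_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s+(\d{1,2}:\d{2}:\d{2}) .*State transfer to .* failed: (-?\d+)"
)


def sst_attempts(lines):
    """Pair each 'waiting for SST' line with the next 'State transfer failed' line.

    Returns a list of (started, failed, error_code) tuples.
    """
    attempts = []
    started = None
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            started = datetime.strptime(m.group(1), "%Y%m%d %H:%M:%S")
            continue
        m = FAIL_RE.match(line)
        if m and started is not None:
            failed = datetime.strptime(
                f"{m.group(1)} {m.group(2)}", "%Y-%m-%d %H:%M:%S"
            )
            attempts.append((started, failed, int(m.group(3))))
            started = None
    return attempts


# Two lines copied from the node3 log above.
sample = [
    "WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 09:14:41.034)",
    "2020-02-14  9:53:37 0 [Warning] WSREP: 1.0 (node2): State transfer to 0.0 (node3) failed: -22 (Invalid argument)",
]
for started, failed, code in sst_attempts(sample):
    # prints "SST attempt failed after 0:38:56 with code -22"
    print(f"SST attempt failed after {failed - started} with code {code}")
```

For the attempt shown, the joiner waited almost 39 minutes before the donor-side failure was reported.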

Mariabackup was launched on node2 (the donor) and failed with an assertion:

[...]
[01] 2020-02-14 09:14:27 Streaming ./db1/table1.frm to <STDOUT>
[01] 2020-02-14 09:14:27         ...done
[00] 2020-02-14 09:14:27 Finished backing up non-InnoDB tables and files
[01] 2020-02-14 09:14:27 Streaming ./aria_log.00000001 to <STDOUT>
[01] 2020-02-14 09:14:27         ...done
[01] 2020-02-14 09:14:27 Streaming ./aria_log_control to <STDOUT>
[01] 2020-02-14 09:14:27         ...done
[00] 2020-02-14 09:14:27 Waiting for log copy thread to read lsn 678163080833
[00] 2020-02-14 09:14:27 >> log scanned up to (678163080842)
[00] 2020-02-14 09:14:27 Unexpected tablespace innodb_system filename innodb_system.isl
2020-02-14 09:14:27 0x7f5beb6a4900  InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.22/extra/mariabackup/xtrabackup.cc line 4492
InnoDB: Failing assertion: 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
200214  9:14:27 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.3.22-MariaDB
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=1
thread_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5599 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/bin/mariabackup(my_print_stacktrace+0x2e)[0x55bbddb3614e]
/usr/bin/mariabackup(handle_fatal_signal+0x54d)[0x55bbdd6adadd]
sigaction.c:0(__restore_rt)[0x7f5beb284dd0]
:0(__GI_raise)[0x7f5be934b99f]
:0(__GI_abort)[0x7f5be9335cf5]
/usr/bin/mariabackup(+0x4fe1ed)[0x55bbdd3701ed]
/usr/bin/mariabackup(+0x4b2ea2)[0x55bbdd324ea2]
/usr/bin/mariabackup(_Z12backup_startv+0x17a)[0x55bbdd3aec5a]
/usr/bin/mariabackup(+0x5227ef)[0x55bbdd3947ef]
/usr/bin/mariabackup(main+0x185)[0x55bbdd378995]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f5be9337873]
/usr/bin/mariabackup(_start+0x2e)[0x55bbdd38c7fe]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /mnt/galera/mariadb
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             127892               127892               processes 
Max open files            262140               262140               files     
Max locked memory         16777216             16777216             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       127892               127892               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e

For information, no core dump file was written because core dumps are not enabled.

After this failure, the SST restarted automatically (normal, due to the database code). Mariabackup failed three times, and the SST restarted automatically three times.

What worked:

  • The SST was cancelled manually.
  • The datadir contents were removed (rm -rf /datadir)
  • The MariaDB server was restarted again.
  • This time, it started successfully.
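
The manual workaround above can be written down as a dry-run plan. This is a sketch only, in Python, and it prints nothing and executes nothing: it just returns the commands. Several things are assumptions: the report does not say how the SST was cancelled, so stopping the service with systemctl (as in the reboot procedure) is assumed; `find ... -mindepth 1 -delete` is used instead of the `rm -rf` from the report so that the datadir itself is kept; and the function name is hypothetical.

```python
import shlex
from pathlib import Path


def recovery_plan(datadir, service="mariadb"):
    """Return the shell commands for the manual workaround, without running them."""
    datadir = Path(datadir)
    # Guard against relative paths and against wiping the filesystem root.
    if not datadir.is_absolute() or datadir == Path(datadir.anchor):
        raise ValueError(f"refusing to plan a wipe of {datadir!s}")
    return [
        # Assumption: stopping the service is how the running SST is cancelled.
        f"systemctl stop {shlex.quote(service)}",
        # Remove the contents but keep the directory itself.
        f"find {shlex.quote(str(datadir))} -mindepth 1 -delete",
        # On start, the empty node requests a full SST from a donor.
        f"systemctl start {shlex.quote(service)}",
    ]
```

With the datadir from the log, recovery_plan("/mnt/galera/mariadb/") returns the three commands in order, to be reviewed before anything is run by hand.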

Is this normal behaviour for Mariabackup?

Sarvesh.



 Comments   
Comment by Goburdhun Sarvesh Sharma [ 2021-04-02 ]

Since the minor upgrade from MariaDB 10.3.22 to 10.3.23, the issue no longer occurs.

Comment by Julius Goryavsky [ 2021-05-11 ]

Although this particular error had other causes not directly related to Galera or SST, and was fixed by upgrading from 10.3.22 to 10.3.23 (see comments), flaws were nevertheless identified in the SST scripts that can lead to problems with SST. Those problems have been fixed in separate tasks, MDEV-24962 and MDEV-23580. I am closing this task because its immediate cause has already been removed in version 10.3.23.
