Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version: 10.3.22

Environment
OS: CentOS 8
OS Resources: 16 CPU (AMD EPYC 7501 32-Core Processor), 32GB RAM, 500GB SSD Disk
Type: Virtual Machine
Kernel: Linux 4.18.0-147.3.1.el8_1.x86_64
Architecture: x86-64
MariaDB version: 10.3.22-1.el8.x86_64 (installed from repo: http://yum.mariadb.org/10.3/centos8-amd64)
ProxySQL: proxysql-2.0.8-1.x86_64 (another VM)
Database datadir size: 350GB
Description
Situation:
For info: a ProxySQL instance on another machine sends WRITE traffic to node1 only.
After setting up a Galera Cluster with MariaDB 10.3.22 on three CentOS 8 machines, we attempted to reboot each node one by one (node1, node2, node3).
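For reference, the wsrep-related settings on each node look roughly like this. This is a sketch reconstructed for illustration: the cluster name 'prod', the datadir /mnt/galera/mariadb, gcache.size=16G and wsrep_sst_method=mariabackup appear in the logs below; the node addresses (ip1/ip2/ip3) are placeholders and the SST credentials are assumed, not the real ones.

# Sketch of the per-node Galera configuration (NOT a verbatim copy of the
# production file; see the assumptions listed above).
cat > /etc/my.cnf.d/galera.cnf <<'EOF'
[mysqld]
datadir = /mnt/galera/mariadb
binlog_format = ROW

[galera]
wsrep_on               = ON
wsrep_provider         = /usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name     = prod
wsrep_cluster_address  = gcomm://ip1,ip2,ip3
wsrep_node_address     = ip3    # set per node
wsrep_sst_method       = mariabackup
wsrep_sst_auth         = sst_user:sst_password    # assumed credentials
wsrep_provider_options = "gcache.size=16G"
EOF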
Reboot procedure on node3:
- Stop mariadb (systemctl stop mariadb)
- Reboot
The firewall was reset on node3 after the reboot, which blocked access to node2. After fixing the firewall, the MariaDB server was started again, which triggered an SST; that SST failed.
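The firewall fix amounted to re-opening the Galera-related ports on node3, roughly as follows (a sketch assuming firewalld, the CentOS 8 default):

# Re-open the ports a Galera node needs: 3306 (MySQL client), 4567 (group
# communication), 4568 (IST) and 4444 (SST), then reload firewalld.
firewall-cmd --permanent --add-port=3306/tcp
firewall-cmd --permanent --add-port=4567/tcp
firewall-cmd --permanent --add-port=4567/udp
firewall-cmd --permanent --add-port=4568/tcp
firewall-cmd --permanent --add-port=4444/tcp
firewall-cmd --reload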
On node3:
[...]
removed '/mnt/galera/mariadb/db1/table1.frm'
removed directory '/mnt/galera/mariadb/db2'
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 08:34:54.885)
2020-02-14 9:14:27 0 [Warning] WSREP: 0.0 (node2): State transfer to 2.0 (node3) failed: -22 (Invalid argument)
2020-02-14 9:14:27 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():775: Will never receive state. Need to abort.
2020-02-14 9:14:27 0 [Note] WSREP: gcomm: terminating thread
2020-02-14 9:14:27 0 [Note] WSREP: gcomm: joining thread
2020-02-14 9:14:27 0 [Note] WSREP: gcomm: closing backend
WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20200214 09:14:27.683)
WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:14:27.687)
2020-02-14 9:14:27 0 [Note] WSREP: view(view_id(NON_PRIM,96d7b0f1,243) memb {
    dc590f74,0
    } joined {
    } left {
    } partitioned {
    96d7b0f1,0
    97dd31b7,0
    })
2020-02-14 9:14:27 0 [Note] WSREP: view((empty))
2020-02-14 9:14:27 0 [Note] WSREP: gcomm: closed
2020-02-14 9:14:27 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Signal 15 (TERM) caught by ps (3.3.15).
ps:ps/display.c:66: please report this bug
WSREP_SST: [ERROR] Removing /mnt/galera/mariadb//.sst/xtrabackup_galera_info file due to signal (20200214 09:14:27.701)
2020-02-14 9:14:36 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2020-02-14 9:14:36 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2020-02-14 9:14:36 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
2020-02-14 9:14:36 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-02-14 9:14:36 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
2020-02-14 9:14:36 0 [Note] WSREP: Passing config to GCS: base_dir = /mnt/galera/mariadb/; base_host = ip3; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /mnt/galera/mariadb/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/galera/mariadb//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 16G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce
2020-02-14 9:14:36 0 [Note] WSREP: GCache history reset: efea29fc-4d5e-11ea-93d9-82acc3ade856:0 -> 00000000-0000-0000-0000-000000000000:-1
2020-02-14 9:14:36 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2020-02-14 9:14:36 0 [Note] WSREP: wsrep_sst_grab()
2020-02-14 9:14:36 0 [Note] WSREP: Start replication
2020-02-14 9:14:36 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2020-02-14 9:14:36 0 [Note] WSREP: protonet asio version 0
2020-02-14 9:14:36 0 [Note] WSREP: Using CRC-32C for message checksums.
2020-02-14 9:14:36 0 [Note] WSREP: backend: asio
2020-02-14 9:14:36 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2020-02-14 9:14:36 0 [Warning] WSREP: access file(/mnt/galera/mariadb//gvwstate.dat) failed(No such file or directory)
2020-02-14 9:14:36 0 [Note] WSREP: restore pc from disk failed
2020-02-14 9:14:36 0 [Note] WSREP: GMCast version 0
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2020-02-14 9:14:36 0 [Note] WSREP: EVS version 0
2020-02-14 9:14:36 0 [Note] WSREP: gcomm: connecting to group 'prod', peer 'ip3:,ip2:,ip1:'
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://<ip3>:4567
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') connection established to 96d7b0f1 tcp://<ip2>:4567
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') connection established to 97dd31b7 tcp://<ip1>:4567
2020-02-14 9:14:36 0 [Note] WSREP: declaring 96d7b0f1 at tcp://<ip2>:4567 stable
2020-02-14 9:14:36 0 [Note] WSREP: declaring 97dd31b7 at tcp://<ip1>:4567 stable
2020-02-14 9:14:36 0 [Note] WSREP: Node 96d7b0f1 state prim
2020-02-14 9:14:36 0 [Note] WSREP: view(view_id(PRIM,6ba79e4c,245) memb {
    6ba79e4c,0
    96d7b0f1,0
    97dd31b7,0
    } joined {
    } left {
    } partitioned {
    })
2020-02-14 9:14:36 0 [Note] WSREP: save pc into disk
2020-02-14 9:14:36 0 [Note] WSREP: gcomm: connected
2020-02-14 9:14:36 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2020-02-14 9:14:36 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2020-02-14 9:14:36 0 [Note] WSREP: Opened channel 'prod'
2020-02-14 9:14:36 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2020-02-14 9:14:36 0 [Note] WSREP: Waiting for SST to complete.
2020-02-14 9:14:36 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b
2020-02-14 9:14:36 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b
2020-02-14 9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 0 (node3)
2020-02-14 9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 1 (node2)
2020-02-14 9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 2 (node1)
2020-02-14 9:14:36 0 [Note] WSREP: Quorum results:
    version    = 6,
    component  = PRIMARY,
    conf_id    = 244,
    members    = 2/3 (joined/total),
    act_id     = 1562227,
    last_appl. = -1,
    protocols  = 0/9/3 (gcs/repl/appl),
    group UUID = efea29fc-4d5e-11ea-93d9-82acc3ade856
2020-02-14 9:14:36 0 [Note] WSREP: Flow-control interval: [28, 28]
2020-02-14 9:14:36 0 [Note] WSREP: Trying to continue unpaused monitor
2020-02-14 9:14:36 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1562227)
2020-02-14 9:14:36 2 [Note] WSREP: State transfer required:
    Group state: efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227
    Local state: 00000000-0000-0000-0000-000000000000:-1
2020-02-14 9:14:36 2 [Note] WSREP: New cluster view: global state: efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227, view# 245: Primary, number of nodes: 3, my index: 0, protocol version 3
2020-02-14 9:14:36 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2020-02-14 9:14:36 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '<ip3>' --datadir '/mnt/galera/mariadb/' --parent '6030' --mysqld-args --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
WSREP_SST: [INFO] Streaming with xbstream (20200214 09:14:36.942)
WSREP_SST: [INFO] Using socat as streamer (20200214 09:14:36.945)
WSREP_SST: [INFO] Stale sst_in_progress file: /mnt/galera/mariadb//sst_in_progress (20200214 09:14:36.949)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | pigz -d | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200214 09:14:36.982)
2020-02-14 9:14:37 2 [Note] WSREP: Prepared SST request: mariabackup|<ip3>:4444/xtrabackup_sst//1
2020-02-14 9:14:37 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-02-14 9:14:37 2 [Note] WSREP: REPL Protocols: 9 (4, 2)
2020-02-14 9:14:37 2 [Note] WSREP: Assign initial position for certification: 1562227, protocol version: 4
2020-02-14 9:14:37 0 [Note] WSREP: Service thread queue flushed.
2020-02-14 9:14:37 2 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (efea29fc-4d5e-11ea-93d9-82acc3ade856): 1 (Operation not permitted)
    at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2020-02-14 9:14:37 0 [Note] WSREP: Member 0.0 (node3) requested state transfer from '*any*'. Selected 1.0 (node2)(SYNCED) as donor.
2020-02-14 9:14:37 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1562227)
2020-02-14 9:14:37 2 [Note] WSREP: Requesting state transfer: success, donor: 1
2020-02-14 9:14:37 2 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227
WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /mnt/galera/mariadb//.sst from previous state transfer. Removing (20200214 09:14:38.011)
2020-02-14 9:14:39 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | pigz -d | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200214 09:14:41.009)
WSREP_SST: [INFO] Proceeding with SST (20200214 09:14:41.010)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20200214 09:14:41.014)
removed '/mnt/galera/mariadb/aria_log.00000001'
removed '/mnt/galera/mariadb/ib_logfile1'
removed '/mnt/galera/mariadb/aria_log_control'
removed '/mnt/galera/mariadb/ibdata1'
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 09:14:41.034)
2020-02-14 9:53:37 0 [Warning] WSREP: 1.0 (node2): State transfer to 0.0 (node3) failed: -22 (Invalid argument)
2020-02-14 9:53:37 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():775: Will never receive state. Need to abort.
2020-02-14 9:53:37 0 [Note] WSREP: gcomm: terminating thread
2020-02-14 9:53:37 0 [Note] WSREP: gcomm: joining thread
2020-02-14 9:53:37 0 [Note] WSREP: gcomm: closing backend
2020-02-14 9:53:37 0 [Note] WSREP: view(view_id(NON_PRIM,6ba79e4c,245) memb {
    6ba79e4c,0
    } joined {
    } left {
    } partitioned {
    96d7b0f1,0
    97dd31b7,0
    })
2020-02-14 9:53:37 0 [Note] WSREP: view((empty))
2020-02-14 9:53:37 0 [Note] WSREP: gcomm: closed
2020-02-14 9:53:37 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
WSREP_SST: [ERROR] Removing /mnt/galera/mariadb//.sst/xtrabackup_galera_info file due to signal (20200214 09:53:37.119)
WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20200214 09:53:37.126)
WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:53:37.128)
[...]
Mariabackup was launched from node2 (the donor) and failed with the following bug message:
[...]
[01] 2020-02-14 09:14:27 Streaming ./db1/table1.frm to <STDOUT>
[01] 2020-02-14 09:14:27 ...done
[00] 2020-02-14 09:14:27 Finished backing up non-InnoDB tables and files
[01] 2020-02-14 09:14:27 Streaming ./aria_log.00000001 to <STDOUT>
[01] 2020-02-14 09:14:27 ...done
[01] 2020-02-14 09:14:27 Streaming ./aria_log_control to <STDOUT>
[01] 2020-02-14 09:14:27 ...done
[00] 2020-02-14 09:14:27 Waiting for log copy thread to read lsn 678163080833
[00] 2020-02-14 09:14:27 >> log scanned up to (678163080842)
[00] 2020-02-14 09:14:27 Unexpected tablespace innodb_system filename innodb_system.isl
2020-02-14 09:14:27 0x7f5beb6a4900 InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.22/extra/mariabackup/xtrabackup.cc line 4492
InnoDB: Failing assertion: 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
200214 9:14:27 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.3.22-MariaDB
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=1
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5599 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/bin/mariabackup(my_print_stacktrace+0x2e)[0x55bbddb3614e]
/usr/bin/mariabackup(handle_fatal_signal+0x54d)[0x55bbdd6adadd]
sigaction.c:0(__restore_rt)[0x7f5beb284dd0]
:0(__GI_raise)[0x7f5be934b99f]
:0(__GI_abort)[0x7f5be9335cf5]
/usr/bin/mariabackup(+0x4fe1ed)[0x55bbdd3701ed]
/usr/bin/mariabackup(+0x4b2ea2)[0x55bbdd324ea2]
/usr/bin/mariabackup(_Z12backup_startv+0x17a)[0x55bbdd3aec5a]
/usr/bin/mariabackup(+0x5227ef)[0x55bbdd3947ef]
/usr/bin/mariabackup(main+0x185)[0x55bbdd378995]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f5be9337873]
/usr/bin/mariabackup(_start+0x2e)[0x55bbdd38c7fe]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /mnt/galera/mariadb
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             127892               127892               processes
Max open files            262140               262140               files
Max locked memory         16777216             16777216             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       127892               127892               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
For information: no core dump file was recorded, because core dump capture is not enabled on these machines.
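For reference, core dumps could be captured for a future occurrence roughly as follows. This is a sketch, not what is configured here: the crash report above already shows an unlimited core size limit and systemd-coredump as the core_pattern handler, so capture mainly depends on systemd-coredump's own configuration; the size values below are assumptions.

# Enable on-disk storage of cores handled by systemd-coredump (CentOS 8).
mkdir -p /etc/systemd/coredump.conf.d
cat > /etc/systemd/coredump.conf.d/50-mariadb.conf <<'EOF'
[Coredump]
Storage=external
ProcessSizeMax=8G
ExternalSizeMax=8G
EOF
# A subsequent crash can then be inspected with:
coredumpctl list
coredumpctl info mariabackup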
After this crash, the SST restarted automatically (expected behaviour from the server side). Mariabackup failed three times, and the SST was restarted automatically three times.
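To check whether mariabackup alone reproduces the assertion outside of an SST, a manual backup could also be run on the donor (node2). A sketch only: the credentials and the target directory are assumptions, not values from our setup.

# Run mariabackup directly on node2; --galera-info mirrors what
# wsrep_sst_mariabackup passes during a real SST.
mariabackup --backup --galera-info \
  --user=sst_user --password=sst_password \
  --datadir=/mnt/galera/mariadb \
  --target-dir=/tmp/mariabackup-test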
What worked (approximate commands below):
- The SST was cancelled manually.
- The datadir contents were removed (rm -rf /datadir).
- The MariaDB server was started again.
- This time, it started successfully.
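The steps above correspond roughly to the following commands (a sketch, not a verbatim transcript; the datadir path is taken from the environment section above):

# Stop the server to cancel the looping SST, wipe the datadir, then start
# again so a fresh SST runs from scratch.
systemctl stop mariadb
rm -rf /mnt/galera/mariadb/*
systemctl start mariadb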
Is this normal behaviour for Mariabackup?
Sarvesh.