Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version: 10.3.22

Environment
OS: CentOS 8
OS Resources: 16 CPU (AMD EPYC 7501 32-Core Processor), 32GB RAM, 500GB SSD Disk
Type: Virtual Machine
Kernel: Linux 4.18.0-147.3.1.el8_1.x86_64
Architecture: x86-64
MariaDB version: 10.3.22-1.el8.x86_64 (installed from repo: http://yum.mariadb.org/10.3/centos8-amd64)
ProxySQL: proxysql-2.0.8-1.x86_64 (another VM)
Database datadir size: 350GB
Description
Situation:
For info: a ProxySQL instance on another machine sends WRITE traffic to node1 only.
After setting up a Galera Cluster with MariaDB 10.3.22 on three CentOS 8 machines, we attempted to reboot each node one by one (node1, node2, node3).
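For reference, the wsrep-related settings on each node look roughly like this. This is a sketch reconstructed for illustration: the cluster name 'prod', the datadir /mnt/galera/mariadb, gcache.size=16G and wsrep_sst_method=mariabackup appear in the logs below; the node addresses (ip1/ip2/ip3) are placeholders and the SST credentials are assumed, not the real ones.

# Sketch of the per-node Galera configuration (NOT a verbatim copy of the
# production file; see the assumptions listed above).
cat > /etc/my.cnf.d/galera.cnf <<'EOF'
[mysqld]
datadir = /mnt/galera/mariadb
binlog_format = ROW

[galera]
wsrep_on               = ON
wsrep_provider         = /usr/lib64/galera/libgalera_smm.so
wsrep_cluster_name     = prod
wsrep_cluster_address  = gcomm://ip1,ip2,ip3
wsrep_node_address     = ip3    # set per node
wsrep_sst_method       = mariabackup
wsrep_sst_auth         = sst_user:sst_password    # assumed credentials
wsrep_provider_options = "gcache.size=16G"
EOF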
Reboot procedure on node3:
- Stop mariadb (systemctl stop mariadb)
- Reboot
The firewall was reset on node3 after the reboot, which blocked access to node2. After fixing the firewall, the MariaDB server was started again, which triggered an SST; that SST failed.
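The firewall fix amounted to re-opening the Galera-related ports on node3, roughly as follows (a sketch assuming firewalld, the CentOS 8 default):

# Re-open the ports a Galera node needs: 3306 (MySQL client), 4567 (group
# communication), 4568 (IST) and 4444 (SST), then reload firewalld.
firewall-cmd --permanent --add-port=3306/tcp
firewall-cmd --permanent --add-port=4567/tcp
firewall-cmd --permanent --add-port=4567/udp
firewall-cmd --permanent --add-port=4568/tcp
firewall-cmd --permanent --add-port=4444/tcp
firewall-cmd --reload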
On node3:
[...]
removed '/mnt/galera/mariadb/db1/table1.frm'
removed directory '/mnt/galera/mariadb/db2'
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 08:34:54.885)
2020-02-14 9:14:27 0 [Warning] WSREP: 0.0 (node2): State transfer to 2.0 (node3) failed: -22 (Invalid argument)
2020-02-14 9:14:27 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():775: Will never receive state. Need to abort.
2020-02-14 9:14:27 0 [Note] WSREP: gcomm: terminating thread
2020-02-14 9:14:27 0 [Note] WSREP: gcomm: joining thread
2020-02-14 9:14:27 0 [Note] WSREP: gcomm: closing backend
WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20200214 09:14:27.683)
WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:14:27.687)
2020-02-14 9:14:27 0 [Note] WSREP: view(view_id(NON_PRIM,96d7b0f1,243) memb {
    dc590f74,0
    } joined {
    } left {
    } partitioned {
    96d7b0f1,0
    97dd31b7,0
    })
2020-02-14 9:14:27 0 [Note] WSREP: view((empty))
2020-02-14 9:14:27 0 [Note] WSREP: gcomm: closed
2020-02-14 9:14:27 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Signal 15 (TERM) caught by ps (3.3.15).
ps:ps/display.c:66: please report this bug
WSREP_SST: [ERROR] Removing /mnt/galera/mariadb//.sst/xtrabackup_galera_info file due to signal (20200214 09:14:27.701)
2020-02-14 9:14:36 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2020-02-14 9:14:36 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2020-02-14 9:14:36 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
2020-02-14 9:14:36 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-02-14 9:14:36 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
2020-02-14 9:14:36 0 [Note] WSREP: Passing config to GCS: base_dir = /mnt/galera/mariadb/; base_host = ip3; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /mnt/galera/mariadb/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/galera/mariadb//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 16G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce
2020-02-14 9:14:36 0 [Note] WSREP: GCache history reset: efea29fc-4d5e-11ea-93d9-82acc3ade856:0 -> 00000000-0000-0000-0000-000000000000:-1
2020-02-14 9:14:36 0 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2020-02-14 9:14:36 0 [Note] WSREP: wsrep_sst_grab()
2020-02-14 9:14:36 0 [Note] WSREP: Start replication
2020-02-14 9:14:36 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2020-02-14 9:14:36 0 [Note] WSREP: protonet asio version 0
2020-02-14 9:14:36 0 [Note] WSREP: Using CRC-32C for message checksums.
2020-02-14 9:14:36 0 [Note] WSREP: backend: asio
2020-02-14 9:14:36 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2020-02-14 9:14:36 0 [Warning] WSREP: access file(/mnt/galera/mariadb//gvwstate.dat) failed(No such file or directory)
2020-02-14 9:14:36 0 [Note] WSREP: restore pc from disk failed
2020-02-14 9:14:36 0 [Note] WSREP: GMCast version 0
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2020-02-14 9:14:36 0 [Note] WSREP: EVS version 0
2020-02-14 9:14:36 0 [Note] WSREP: gcomm: connecting to group 'prod', peer 'ip3:,ip2:,ip1:'
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://<ip3>:4567
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') connection established to 96d7b0f1 tcp://<ip2>:4567
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2020-02-14 9:14:36 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') connection established to 97dd31b7 tcp://<ip1>:4567
2020-02-14 9:14:36 0 [Note] WSREP: declaring 96d7b0f1 at tcp://<ip2>:4567 stable
2020-02-14 9:14:36 0 [Note] WSREP: declaring 97dd31b7 at tcp://<ip1>:4567 stable
2020-02-14 9:14:36 0 [Note] WSREP: Node 96d7b0f1 state prim
2020-02-14 9:14:36 0 [Note] WSREP: view(view_id(PRIM,6ba79e4c,245) memb {
    6ba79e4c,0
    96d7b0f1,0
    97dd31b7,0
    } joined {
    } left {
    } partitioned {
    })
2020-02-14 9:14:36 0 [Note] WSREP: save pc into disk
2020-02-14 9:14:36 0 [Note] WSREP: gcomm: connected
2020-02-14 9:14:36 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2020-02-14 9:14:36 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2020-02-14 9:14:36 0 [Note] WSREP: Opened channel 'prod'
2020-02-14 9:14:36 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2020-02-14 9:14:36 0 [Note] WSREP: Waiting for SST to complete.
2020-02-14 9:14:36 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b
2020-02-14 9:14:36 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b
2020-02-14 9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 0 (node3)
2020-02-14 9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 1 (node2)
2020-02-14 9:14:36 0 [Note] WSREP: STATE EXCHANGE: got state msg: 6bf432b1-4f0a-11ea-884d-7bcef913ea9b from 2 (node1)
2020-02-14 9:14:36 0 [Note] WSREP: Quorum results:
    version    = 6,
    component  = PRIMARY,
    conf_id    = 244,
    members    = 2/3 (joined/total),
    act_id     = 1562227,
    last_appl. = -1,
    protocols  = 0/9/3 (gcs/repl/appl),
    group UUID = efea29fc-4d5e-11ea-93d9-82acc3ade856
2020-02-14 9:14:36 0 [Note] WSREP: Flow-control interval: [28, 28]
2020-02-14 9:14:36 0 [Note] WSREP: Trying to continue unpaused monitor
2020-02-14 9:14:36 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1562227)
2020-02-14 9:14:36 2 [Note] WSREP: State transfer required:
    Group state: efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227
    Local state: 00000000-0000-0000-0000-000000000000:-1
2020-02-14 9:14:36 2 [Note] WSREP: New cluster view: global state: efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227, view# 245: Primary, number of nodes: 3, my index: 0, protocol version 3
2020-02-14 9:14:36 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2020-02-14 9:14:36 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '<ip3>' --datadir '/mnt/galera/mariadb/' --parent '6030' --mysqld-args --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1'
WSREP_SST: [INFO] Streaming with xbstream (20200214 09:14:36.942)
WSREP_SST: [INFO] Using socat as streamer (20200214 09:14:36.945)
WSREP_SST: [INFO] Stale sst_in_progress file: /mnt/galera/mariadb//sst_in_progress (20200214 09:14:36.949)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | pigz -d | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200214 09:14:36.982)
2020-02-14 9:14:37 2 [Note] WSREP: Prepared SST request: mariabackup|<ip3>:4444/xtrabackup_sst//1
2020-02-14 9:14:37 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-02-14 9:14:37 2 [Note] WSREP: REPL Protocols: 9 (4, 2)
2020-02-14 9:14:37 2 [Note] WSREP: Assign initial position for certification: 1562227, protocol version: 4
2020-02-14 9:14:37 0 [Note] WSREP: Service thread queue flushed.
2020-02-14 9:14:37 2 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (efea29fc-4d5e-11ea-93d9-82acc3ade856): 1 (Operation not permitted)
    at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2020-02-14 9:14:37 0 [Note] WSREP: Member 0.0 (node3) requested state transfer from '*any*'. Selected 1.0 (node2)(SYNCED) as donor.
2020-02-14 9:14:37 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1562227)
2020-02-14 9:14:37 2 [Note] WSREP: Requesting state transfer: success, donor: 1
2020-02-14 9:14:37 2 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> efea29fc-4d5e-11ea-93d9-82acc3ade856:1562227
WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /mnt/galera/mariadb//.sst from previous state transfer. Removing (20200214 09:14:38.011)
2020-02-14 9:14:39 0 [Note] WSREP: (6ba79e4c, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | pigz -d | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20200214 09:14:41.009)
WSREP_SST: [INFO] Proceeding with SST (20200214 09:14:41.010)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20200214 09:14:41.014)
removed '/mnt/galera/mariadb/aria_log.00000001'
removed '/mnt/galera/mariadb/ib_logfile1'
removed '/mnt/galera/mariadb/aria_log_control'
removed '/mnt/galera/mariadb/ibdata1'
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20200214 09:14:41.034)
2020-02-14 9:53:37 0 [Warning] WSREP: 1.0 (node2): State transfer to 0.0 (node3) failed: -22 (Invalid argument)
2020-02-14 9:53:37 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():775: Will never receive state. Need to abort.
2020-02-14 9:53:37 0 [Note] WSREP: gcomm: terminating thread
2020-02-14 9:53:37 0 [Note] WSREP: gcomm: joining thread
2020-02-14 9:53:37 0 [Note] WSREP: gcomm: closing backend
2020-02-14 9:53:37 0 [Note] WSREP: view(view_id(NON_PRIM,6ba79e4c,245) memb {
    6ba79e4c,0
    } joined {
    } left {
    } partitioned {
    96d7b0f1,0
    97dd31b7,0
    })
2020-02-14 9:53:37 0 [Note] WSREP: view((empty))
2020-02-14 9:53:37 0 [Note] WSREP: gcomm: closed
2020-02-14 9:53:37 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
WSREP_SST: [ERROR] Removing /mnt/galera/mariadb//.sst/xtrabackup_galera_info file due to signal (20200214 09:53:37.119)
WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20200214 09:53:37.126)
WSREP_SST: [ERROR] Cleanup after exit with status:2 (20200214 09:53:37.128)
[...]
Mariabackup was launched from node2 (the donor) and failed with the following bug message:
[...]
[01] 2020-02-14 09:14:27 Streaming ./db1/table1.frm to <STDOUT>
[01] 2020-02-14 09:14:27 ...done
[00] 2020-02-14 09:14:27 Finished backing up non-InnoDB tables and files
[01] 2020-02-14 09:14:27 Streaming ./aria_log.00000001 to <STDOUT>
[01] 2020-02-14 09:14:27 ...done
[01] 2020-02-14 09:14:27 Streaming ./aria_log_control to <STDOUT>
[01] 2020-02-14 09:14:27 ...done
[00] 2020-02-14 09:14:27 Waiting for log copy thread to read lsn 678163080833
[00] 2020-02-14 09:14:27 >> log scanned up to (678163080842)
[00] 2020-02-14 09:14:27 Unexpected tablespace innodb_system filename innodb_system.isl
2020-02-14 09:14:27 0x7f5beb6a4900 InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.22/extra/mariabackup/xtrabackup.cc line 4492
InnoDB: Failing assertion: 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
200214 9:14:27 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.3.22-MariaDB
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=1
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5599 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/bin/mariabackup(my_print_stacktrace+0x2e)[0x55bbddb3614e]
/usr/bin/mariabackup(handle_fatal_signal+0x54d)[0x55bbdd6adadd]
sigaction.c:0(__restore_rt)[0x7f5beb284dd0]
:0(__GI_raise)[0x7f5be934b99f]
:0(__GI_abort)[0x7f5be9335cf5]
/usr/bin/mariabackup(+0x4fe1ed)[0x55bbdd3701ed]
/usr/bin/mariabackup(+0x4b2ea2)[0x55bbdd324ea2]
/usr/bin/mariabackup(_Z12backup_startv+0x17a)[0x55bbdd3aec5a]
/usr/bin/mariabackup(+0x5227ef)[0x55bbdd3947ef]
/usr/bin/mariabackup(main+0x185)[0x55bbdd378995]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f5be9337873]
/usr/bin/mariabackup(_start+0x2e)[0x55bbdd38c7fe]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /mnt/galera/mariadb
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             127892               127892               processes
Max open files            262140               262140               files
Max locked memory         16777216             16777216             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       127892               127892               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
For information: no core dump file was recorded, because core dump capture is not enabled on these machines.
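For reference, core dumps could be captured for a future occurrence roughly as follows. This is a sketch, not what is configured here: the crash report above already shows an unlimited core size limit and systemd-coredump as the core_pattern handler, so capture mainly depends on systemd-coredump's own configuration; the size values below are assumptions.

# Enable on-disk storage of cores handled by systemd-coredump (CentOS 8).
mkdir -p /etc/systemd/coredump.conf.d
cat > /etc/systemd/coredump.conf.d/50-mariadb.conf <<'EOF'
[Coredump]
Storage=external
ProcessSizeMax=8G
ExternalSizeMax=8G
EOF
# A subsequent crash can then be inspected with:
coredumpctl list
coredumpctl info mariabackup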
After this crash, the SST restarted automatically (expected behaviour from the server side). Mariabackup failed three times, and the SST was restarted automatically three times.
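To check whether mariabackup alone reproduces the assertion outside of an SST, a manual backup could also be run on the donor (node2). A sketch only: the credentials and the target directory are assumptions, not values from our setup.

# Run mariabackup directly on node2; --galera-info mirrors what
# wsrep_sst_mariabackup passes during a real SST.
mariabackup --backup --galera-info \
  --user=sst_user --password=sst_password \
  --datadir=/mnt/galera/mariadb \
  --target-dir=/tmp/mariabackup-test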
What worked (approximate commands below):
- The SST was cancelled manually.
- The datadir contents were removed (rm -rf /datadir).
- The MariaDB server was started again.
- This time, it started successfully.
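The steps above correspond roughly to the following commands (a sketch, not a verbatim transcript; the datadir path is taken from the environment section above):

# Stop the server to cancel the looping SST, wipe the datadir, then start
# again so a fresh SST runs from scratch.
systemctl stop mariadb
rm -rf /mnt/galera/mariadb/*
systemctl start mariadb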
Is this normal behaviour for Mariabackup?
Sarvesh.