[MDEV-29444] WSREP donor crashes after changing status from donor to joined
Created: 2022-09-02  Updated: 2023-10-25

Status: Open
Project: MariaDB Server
Component/s: wsrep
Affects Version/s: 10.5.9, 10.5.16
Fix Version/s: 10.5

Type: Bug
Priority: Critical
Reporter: cc lin
Assignee: Alexey
Resolution: Unresolved
Votes: 2
Labels: crash, galera
Environment:

OS: Debian GNU/Linux 10 (buster)
MariaDB: 10.5.9 (wrapped in Docker Image: bitnami-docker-mariadb-galera (release: 10.5.9-debian-10-r57))
Galera Cluster: 3 nodes


Attachments: File mariadb-issue-logs-csv-format.csv    
Issue Links:
Duplicate
is duplicated by MDEV-29482 Node crashes with Error: Attempt to m... Closed

 Description   

The crash is intermittent. When it occurs, the donor node logs the following:

[Note] WSREP: Server status change donor -> joined

[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

[ERROR] WSREP: Certification exception: Attempt to match against an empty key (1,0): 22 (Invalid argument) at /bitnami/blacksmith-sandbox/libgalera-26.4.7/galera/src/key_set.cpp:throw_match_empty_key():174

[Note] WSREP: ReplicatorSMM::abort()

[Note] WSREP: Closing send monitor...

[Note] WSREP: Closed send monitor.

[Note] WSREP: gcomm: terminating thread

[Note] WSREP: gcomm: joining thread

[Note] WSREP: gcomm: closing backend

...

[Note] WSREP: New SELF-LEAVE.

[Note] WSREP: Flow-control interval: [0, 0]

[Note] WSREP: Received SELF-LEAVE. Closing connection.

[Note] WSREP: Shifting OPEN -> CLOSED (TO: 1585314378)

[Note] WSREP: RECV thread exiting 0: Success

...

[Note] WSREP: recv_thread() joined.

[Note] WSREP: Closing replication queue.

[Note] WSREP: Closing slave action queue.

[Note] WSREP: /opt/bitnami/mariadb/sbin/mysqld: Terminated.

...



 Comments   
Comment by cc lin [ 2022-09-06 ]

Timeline of key events in the attached log file

15:46:00 mariadb cluster member "conductor-mariadb-fz1-0" was starting mariadb

15:46:05 WSREP: Member 2.0 (conductor-mariadb-fz1-0) requested state transfer from 'any'. Selected 0.0 (conductor-mariadb-fz1-2)(SYNCED) as donor.

15:48:52 conductor-mariadb-fz1-2 [Note] WSREP: resume
15:48:52 conductor-mariadb-fz1-2 [Note] WSREP: resuming provider at 978364142
15:48:52 conductor-mariadb-fz1-2 [Note] WSREP: Provider resumed.
15:48:52 conductor-mariadb-fz1-2 [Note] WSREP: SST sent: d85dc8c3-ec51-11ec-b250-8fff4d8e5486:1585256583
15:48:52 conductor-mariadb-fz1-2 [Note] WSREP: Server status change donor -> joined
15:48:52 conductor-mariadb-fz1-2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
15:48:52 conductor-mariadb-fz1-2 [ERROR] WSREP: Certification exception: Attempt to match against an empty key (1,0): 22 (Invalid argument) at /bitnami/blacksmith-sandbox/libgalera-26.4.7/galera/src/key_set.cpp:throw_match_empty_key():174

15:48:53 conductor-mariadb-fz1-2 [Note] WSREP: New SELF-LEAVE.
15:48:53 conductor-mariadb-fz1-2 [Note] WSREP: /opt/bitnami/mariadb/sbin/mysqld: Terminated.
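
The sequence above (donor -> joined, then the certification exception on the same node) can be searched for in other logs. What follows is only a minimal sketch, assuming lines laid out as in this timeline (time, node name, level, WSREP message); the function name and the sample list are hypothetical and not part of any MariaDB or Galera tooling.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed layout: "HH:MM:SS <node> [Level] WSREP: <message>", as in the timeline above.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<node>\S+)\s+"
    r"\[\s*(?P<level>\w+)\]\s+WSREP:\s+(?P<msg>.*)$"
)


def find_donor_crashes(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (node, joined_time, error_time) for each empty-key
    certification exception that follows a donor -> joined change
    on the same node."""
    joined_at: Dict[str, str] = {}
    hits: List[Tuple[str, str, str]] = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a WSREP line in the assumed layout
        node, time, msg = m["node"], m["time"], m["msg"]
        if "Server status change donor -> joined" in msg:
            joined_at[node] = time
        elif m["level"] == "ERROR" and "Attempt to match against an empty key" in msg:
            if node in joined_at:
                hits.append((node, joined_at.pop(node), time))
    return hits


# Sample lines copied from the timeline in this comment.
SAMPLE_LOG = [
    "15:48:52 conductor-mariadb-fz1-2 [Note] WSREP: SST sent: d85dc8c3-ec51-11ec-b250-8fff4d8e5486:1585256583",
    "15:48:52 conductor-mariadb-fz1-2 [Note] WSREP: Server status change donor -> joined",
    "15:48:52 conductor-mariadb-fz1-2 [ERROR] WSREP: Certification exception: "
    "Attempt to match against an empty key (1,0): 22 (Invalid argument)",
    "15:48:53 conductor-mariadb-fz1-2 [Note] WSREP: New SELF-LEAVE.",
]

if __name__ == "__main__":
    for node, joined, failed in find_donor_crashes(SAMPLE_LOG):
        print(f"{node}: joined at {joined}, certification exception at {failed}")
```

Matching is per node, so interleaved lines from other cluster members should not produce false hits, provided each line carries the node name.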

Comment by cc lin [ 2022-09-12 ]

fz1-0 Core Dump

This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.5.9-MariaDB-log
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=502
thread_count=3
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1105088 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
??:0(my_print_stacktrace)[0x560e71fb7d2e]
??:0(handle_fatal_signal)[0x560e71a79255]
??:0(__restore_rt)[0x7f502ccd9730]
stdlib/abort.c:107(__GI_abort)[0x7f502c429611]
??:0(wsrep_loader)[0x7f501c269e8c]
??:0(wsrep_loader)[0x7f501c16cbbc]
??:0(wsrep_loader)[0x7f501c162218]
??:0(wsrep_loader)[0x7f501c15df6e]
nptl/pthread_create.c:487(start_thread)[0x7f502cccefa3]
x86_64/clone.S:97(clone)[0x7f502c4ffeff]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /bitnami/mariadb/data
Resource Limits:

fz1-2 Core Dump

??:0(my_print_stacktrace)[0x556b6012dd2e]
??:0(handle_fatal_signal)[0x556b5fbef255]
2022-08-31 15:48:54 0 [ERROR] WSREP: async IST sender failed to serve tcp://172.24.151.239:4568: ist send failed: asio.system:104', asio error 'write: Connection reset by peer': 104 (Connection reset by peer)
??:0(wsrep_loader)[0x7f0a84f54e8c]
??:0(wsrep_loader)[0x7f0a84deaa63]
/opt/bitnami/mariadb/lib/libgalera_smm.so(+0x18c44)[0x7f0a84da7c44]
??:0(wsrep_loader)[0x7f0a84df40f7]
??:0(wsrep_loader)[0x7f0a84e1db4c]
??:0(wsrep_loader)[0x7f0a84e1e402]
??:0(wsrep_loader)[0x7f0a84e1e6f1]
??:0(wsrep_loader)[0x7f0a84df48e0]
/opt/bitnami/mariadb/lib/libgalera_smm.so(+0x42bb8)[0x7f0a84dd1bb8]
??:0(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x556b601bb94e]
??:0(wsrep_reset_threadvars(THD*))[0x556b5fe8f363]
??:0(start_wsrep_THD(void*))[0x556b5fe8034f]
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x556b5fe1f31b]
nptl/pthread_create.c:487(start_thread)[0x7f0a959b9fa3]
x86_64/clone.S:97(clone)[0x7f0a951eaeff]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 7
Status: NOT_KILLED
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
We think the query pointer is invalid, but we will try to print it anyway.
Query:
Writing a core file...
Working directory at /bitnami/mariadb/data
Resource Limits:
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 0 bytes
Max resident set unlimited unlimited bytes
Max processes 4194304 4194304 processes
Max open files 1048576 1048576 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 4126718 4126718 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Core pattern: |/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %h
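
Only two frames in the fz1-2 trace above carry a module path and offset (libgalera_smm.so(+0x...)); the rest resolve to ??:0. Offsets of this kind are what a symbolizer such as addr2line typically takes once matching debug symbols are available. The snippet below is a rough sketch for pulling them out of such a trace; the class and function names are hypothetical, and the frame layout is assumed to be exactly the one shown in this comment.

```python
import re
from typing import Iterable, List, NamedTuple, Set


class ModuleFrame(NamedTuple):
    module: str   # path of the shared object
    offset: int   # offset inside the shared object
    address: int  # absolute address at crash time


# Assumed frame layout: "/path/to/lib.so(+0xOFFSET)[0xADDRESS]"
FRAME_RE = re.compile(
    r"^(?P<module>/\S+)\(\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def module_frames(trace: Iterable[str]) -> List[ModuleFrame]:
    """Keep only frames that name a module and an offset."""
    frames: List[ModuleFrame] = []
    for line in trace:
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(
                ModuleFrame(m["module"], int(m["offset"], 16), int(m["address"], 16))
            )
    return frames


def load_bases(frames: Iterable[ModuleFrame], module: str) -> Set[int]:
    """Load address implied by each frame of `module` (address - offset)."""
    return {f.address - f.offset for f in frames if f.module == module}


GALERA = "/opt/bitnami/mariadb/lib/libgalera_smm.so"

# Frames copied from the fz1-2 trace in this comment.
TRACE = [
    "??:0(wsrep_loader)[0x7f0a84deaa63]",
    "/opt/bitnami/mariadb/lib/libgalera_smm.so(+0x18c44)[0x7f0a84da7c44]",
    "??:0(wsrep_loader)[0x7f0a84df40f7]",
    "/opt/bitnami/mariadb/lib/libgalera_smm.so(+0x42bb8)[0x7f0a84dd1bb8]",
    "nptl/pthread_create.c:487(start_thread)[0x7f0a959b9fa3]",
]

if __name__ == "__main__":
    frames = module_frames(TRACE)
    for f in frames:
        print(f"{f.module} +{f.offset:#x}")
    print("load base(s):", sorted(hex(b) for b in load_bases(frames, GALERA)))
```

Both frames imply the same load base (0x7f0a84d8f000), which is a quick consistency check on the trace before spending time on symbol resolution.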

Comment by Daniel Black [ 2022-09-12 ]

On the Bitnami side, I'm stuck on debug symbol resolution until Bitnami provides debug symbol information.

Comment by Brad [ 2022-09-30 ]

We hit this issue again on 2022-09-30, and it is critical for our system. Daniel, could you help clarify? Thanks.
(Certification exception: Attempt to match against an empty key)

Comment by Brad [ 2022-09-30 ]

What can we do about the Bitnami "debug symbols information"?

Comment by Brad [ 2022-10-01 ]

Our critical systems are affected by this issue. Please help.
What can we do about the Bitnami "debug symbols information"? Thanks.

Comment by Brad [ 2022-10-03 ]

Your help would be highly appreciated. Thanks.

Comment by Brad [ 2022-10-04 ]

Does the issue affect SST only, or both SST and IST?

Comment by Daniel Black [ 2022-10-18 ]

Sorry for the delay. I was on leave and was hoping Bitnami would provide debug info.

As an alternative, the Docker Library image (Ubuntu focal) seems compatible with Bitnami's Debian 11 base, so build the following:

Dockerfile

ARG VERSION=10.5

# Stage 1: Docker Library image, used only as a source of binaries and debug symbols
FROM docker.io/library/mariadb:$VERSION AS dockerlibrarymariadb

# Enable the debug component of the MariaDB apt repository
RUN sed -i -e 's:main:main main/debug:' /etc/apt/sources.list.d/mariadb.list
RUN apt-get update && apt-get install -y mariadb-server-core-10.5-dbgsym galera-4-dbg && rm -rf /var/lib/apt/lists/*

ARG VERSION

# Stage 2: Bitnami Galera image with mariadbd started under gdb
FROM docker.io/bitnami/mariadb-galera:$VERSION

USER root
RUN apt-get update && apt-get install -y gdb-minimal && rm -rf /var/lib/apt/lists/*
# Make run.sh launch the server through gdb, and unquote $EXEC so the gdb command line is word-split
RUN sed -i -e '/^EXEC=/c\EXEC="gdb --command=/gdb.script --args ${DB_SBIN_DIR}/mariadbd"' \
	-e 's/"\$EXEC"/\$EXEC/'  /opt/bitnami/scripts/mariadb-galera/run.sh
COPY gdb.script /
# Replace the Bitnami server and Galera provider with the Docker Library builds and their debug symbols
COPY --from=dockerlibrarymariadb /usr/sbin/mariadbd /opt/bitnami/mariadb/sbin/
COPY --from=dockerlibrarymariadb /usr/lib/libgalera_smm.so /opt/bitnami/mariadb/lib/
COPY --from=dockerlibrarymariadb /usr/lib/debug  /usr/lib/debug
USER 1001

gdb.script

run
set print frame-arguments all
thread apply all bt full

Building and running this image will print a full gdb backtrace to the container logs when the server crashes.
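
To illustrate what the two sed expressions in the Dockerfile do to run.sh, here is a small sketch that mimics them in Python. The run.sh content is a hypothetical two-line stand-in (the real Bitnami script is not quoted in this issue), and the function name is made up for illustration.

```python
GDB_EXEC_LINE = 'EXEC="gdb --command=/gdb.script --args ${DB_SBIN_DIR}/mariadbd"'


def wrap_exec_in_gdb(script: str) -> str:
    """Mimic: sed -e '/^EXEC=/c\\...' -e 's/"\\$EXEC"/\\$EXEC/'."""
    out = []
    for line in script.splitlines():
        if line.startswith("EXEC="):
            # 'c' replaces the whole line; sed then moves on to the next
            # line, so the second expression is not applied to it.
            line = GDB_EXEC_LINE
        else:
            # 's' without the g flag rewrites only the first match per line.
            line = line.replace('"$EXEC"', "$EXEC", 1)
        out.append(line)
    return "\n".join(out)


# Hypothetical stand-in for the relevant lines of run.sh.
SAMPLE_RUN_SH = 'EXEC="${DB_SBIN_DIR}/mysqld"\nexec "$EXEC" "${args[@]}"'

if __name__ == "__main__":
    print(wrap_exec_in_gdb(SAMPLE_RUN_SH))
```

Dropping the quotes around $EXEC matters because the new value holds a whole command line (gdb plus its arguments); quoted, the shell would treat it as a single program name.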

Comment by Brad [ 2022-10-21 ]

Thanks for your reply. We will follow these steps and try to reproduce the issue.

Comment by Jan Lindström (Inactive) [ 2023-01-02 ]

cclin, can you provide the full error log and a proper stack trace as instructed above?

Comment by Seppo Jaakola [ 2023-06-13 ]

This could be due to gcache corruption; assigning to Yurchenko for analysis.

Generated at Thu Feb 08 10:08:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.