[MDEV-29482] Node crashes with Error: Attempt to match against an empty key Created: 2022-09-07  Updated: 2022-09-19  Resolved: 2022-09-12

Status: Closed
Project: MariaDB Server
Component/s: Galera, Platform FreeBSD
Affects Version/s: None
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Heiko Dunse Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: crash, galera
Environment:

OS Version: FreeBSD 13.0
MariaDB Version: mariadb105-server-10.5.16
Galera Version: galera26-26.4.11_1


Issue Links:
Duplicate
duplicates MDEV-29444 WSREP donor crashes after changing st... Open

 Description   

I'm using two load balancers and three MariaDB nodes as a Galera cluster.

Only my third node makes backups, every night. A few hours after the backup finishes (or, rarely, immediately afterwards), the node crashes with the following message in error.log:

2022-09-07 1:18:36 154740 [ERROR] WSREP: Certification exception: Attempt to match against an empty key (0,1): 22 (Invalid argument)
at /wrkdirs/usr/ports/databases/galera26/work/galera-release_26.4.11/galera/src/key_set.cpp:throw_match_empty_key():194
2022-09-07 1:18:36 154740 [Note] WSREP: ReplicatorSMM::abort()
2022-09-07 1:18:36 154740 [Note] WSREP: Closing send monitor...
2022-09-07 1:18:36 154740 [Note] WSREP: Closed send monitor.
2022-09-07 1:18:36 154740 [Note] WSREP: gcomm: terminating thread
2022-09-07 1:18:36 154740 [Note] WSREP: gcomm: joining thread
2022-09-07 1:18:36 154740 [Note] WSREP: gcomm: closing backend
2022-09-07 1:18:36 154740 [Note] WSREP: view(view_id(NON_PRIM,3284c172-ab8d,9) memb {
4f15ba57-8362,0
} joined {
} left {
} partitioned {
3284c172-ab8d,0
4cee3e67-b299,0
})
2022-09-07 1:18:36 154740 [Note] WSREP: PC protocol downgrade 1 -> 0
2022-09-07 1:18:36 154740 [Note] WSREP: view((empty))
2022-09-07 1:18:36 154740 [Note] WSREP: gcomm: closed
2022-09-07 1:18:36 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2022-09-07 1:18:36 0 [Note] WSREP: Flow-control interval: [16, 16]
2022-09-07 1:18:36 0 [Note] WSREP: Received NON-PRIMARY.
2022-09-07 1:18:36 0 [Note] WSREP: Shifting DONOR/DESYNCED -> OPEN (TO: 39667715)
2022-09-07 1:18:36 0 [Note] WSREP: New SELF-LEAVE.
2022-09-07 1:18:36 0 [Note] WSREP: Flow-control interval: [0, 0]
2022-09-07 1:18:36 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2022-09-07 1:18:36 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 39667715)
2022-09-07 1:18:36 0 [Note] WSREP: RECV thread exiting 0: No error: 0
2022-09-07 1:18:36 154740 [Note] WSREP: recv_thread() joined.
2022-09-07 1:18:36 154740 [Note] WSREP: Closing replication queue.
2022-09-07 1:18:36 154740 [Note] WSREP: Closing slave action queue.
2022-09-07 1:18:36 154740 [Note] WSREP: mariadbd: Terminated.

I use mysqldump in the backup script. To make the dump consistent, the script starts with:
mysql ... -e "SET GLOBAL wsrep_desync = ON;flush tables;flush logs;"
and ends with:
mysql ... -e "SET GLOBAL wsrep_desync = OFF;"
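
The desync wrapper described above could be sketched as a small shell function, so that wsrep_desync is switched off again even when the dump command fails. This is only a sketch: run_desynced and the MYSQL variable are hypothetical names, and the connection options that the report elides are left out here as well.

```shell
#!/bin/sh
# MYSQL is a hypothetical override for the client command (connection
# options omitted, as in the report).
MYSQL="${MYSQL:-mysql}"

# run_desynced CMD [ARGS...]
# Desync the node, run the given backup command, then always resync.
run_desynced() {
    $MYSQL -e "SET GLOBAL wsrep_desync = ON; FLUSH TABLES; FLUSH LOGS;" || return 1

    # Remember the backup command's exit status without aborting early.
    rc=0
    "$@" || rc=$?

    # Runs whether or not the backup command succeeded.
    $MYSQL -e "SET GLOBAL wsrep_desync = OFF;" || return 1
    return "$rc"
}
```

It would be called as run_desynced followed by the existing mysqldump command line. The function returns the exit status of that command, so a failed dump is still reported to the caller.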

The certificate and key files should be okay. After I restarted the MySQL server, everything was fine again and the node synced from the others. I'm not sure whether the crash really has anything to do with the backup.



 Comments   
Comment by Daniel Black [ 2022-09-12 ]

Duplicate per linked issue. Thanks for the bug report, it provided extra information which is useful.

Comment by Daniel Black [ 2022-09-13 ]

As this is reproducible, are you able to obtain a core dump and backtrace? In lldb, if you have that installed, the command is bt all (I can't find an equivalent of "frame-arguments all", but it would be useful).

The backtrace alone won't be enough. What would be ideal is printing the contents of the Galera writeset information that appears in the backtrace. This can be displayed with p variable, where variable is something that looks like a writeset; frame can navigate to the right location. I realize this is probably outside your comfort zone, but the more information we have, the quicker this can be resolved.

If you find any table information, the structure of that table (show create table) and a general description of how it's updated might be of assistance.

Comment by Heiko Dunse [ 2022-09-14 ]

Thanks for your answer. Do you still need a trace or something similar, given that a core dump was already posted in MDEV-29444?
I've tried the truss command, pointed at the mariadb process, but it generates a lot of data, the backup takes a while and I can't trigger the bug during the day. I'm not familiar with tracing, but I would really like to help narrow down the problem.

Comment by Daniel Black [ 2022-09-19 ]

Yes please. The MDEV-29444 trace wasn't particularly useful. truss will be a bit noisy.

Configure the core settings to support a core dump, and set the coredumpsize limit to unlimited (or to the amount of memory that FreeBSD normally uses). Then check that the running mariadbd process has these limits raised.
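
On FreeBSD, the kernel side of this is controlled by the sysctl settings described in core(5). The following is a sketch under stated assumptions: the /var/coredumps directory, the RUN dry-run prefix and the function names are hypothetical, and kern.sugid_coredump is included on the assumption that mariadbd changes its user ID after start-up. The directory must be writable by the user that mariadbd runs as.

```shell
#!/bin/sh
# RUN is a hypothetical prefix: set RUN=echo to print the commands
# instead of executing them.
RUN="${RUN:-}"
# Assumed location for core files; adjust to a filesystem with enough space.
CORE_DIR="${CORE_DIR:-/var/coredumps}"

enable_core_dumps() {
    $RUN mkdir -p "$CORE_DIR" &&
    # Allow core dumps at all.
    $RUN sysctl kern.coredump=1 &&
    # Allow dumps from processes that changed credentials.
    $RUN sysctl kern.sugid_coredump=1 &&
    # %N is the process name, %P the process ID.
    $RUN sysctl "kern.corefile=$CORE_DIR/%N.%P.core"
}

# show_core_limit PID
# List the resource limits of a running process; look for the
# coredumpsize row in the output.
show_core_limit() {
    $RUN procstat -l "$1"
}
```

After restarting the server, show_core_limit "$(pgrep -o mariadbd)" should then show whether the coredumpsize limit of the live process was actually raised.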

Then when the assert happens, after getting your system operational again, attempt a backtrace using the generated core file.
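
Once a core file exists, the backtrace requested earlier in this thread can be collected non-interactively. This is a sketch only: core_backtrace and RUN are hypothetical names, and both paths in the usage line below are assumptions that need to be replaced with the actual mariadbd binary and core file.

```shell
#!/bin/sh
# RUN is a hypothetical prefix: set RUN=echo to print the command
# instead of executing it.
RUN="${RUN:-}"

# core_backtrace BINARY COREFILE
# Print the backtrace of every thread recorded in the core file.
core_backtrace() {
    binary="$1"
    core="$2"
    # --batch exits once the -o commands have run; "bt all" covers all threads.
    $RUN lldb --batch -o "bt all" -c "$core" "$binary"
}
```

For example: core_backtrace /usr/local/libexec/mariadbd /var/coredumps/mariadbd.1234.core > backtrace.txt. To print writeset variables, open the same core interactively in lldb, choose the frame with frame select N, and use p on the variable of interest.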
