[MDEV-25368] Galera cluster hangs on Freeing items Created: 2021-04-08  Updated: 2022-12-16  Resolved: 2021-10-30

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.18, 10.5.9, 10.5.11
Fix Version/s: 10.4.22

Type: Bug Priority: Critical
Reporter: Kóczán Ákos Assignee: Seppo Jaakola
Resolution: Fixed Votes: 17
Labels: None

Attachments: Command counters 10.5.9 vs 10.5.6.png, gdb2.log, my.cnf
Issue Links:
Blocks
is blocked by MDEV-25114 Crash: WSREP: invalid state ROLLED_BA... Closed
Relates
relates to MDEV-23328 Server hang due to Galera lock confli... Closed
relates to MDEV-29512 WSREP: cluster conflict causes cluste... Closed
relates to MDEV-24294 MariaDB - Cluster freezes if node hangs Closed

 Description   

Hi All, I have a MariaDB 10.4.18 Galera cluster with 5 nodes. Randomly, a request hangs on one node in the "Freeing items" state with the "Busy" command. This blocks all write queries on all nodes indefinitely. I have to kill MariaDB on the hanging node because I can't stop it the normal way. After the kill, the other nodes start working correctly. After restart, a full SST is required. The error log is empty. The query cache is off. my.cnf is attached.

Process list:
----------------------------------------------------------
ID: 19717900
USER: xxx
HOST: xxx:44464
DB:
COMMAND: Busy
TIME: 28
STATE: Freeing items
INFO:
TIME_MS: 28497.852
STAGE: 0
MAX_STAGE: 0
PROGRESS: 0.000
MEMORY_USED: 82232
MAX_MEMORY_USED: 178320
EXAMINED_ROWS: 1
QUERY_ID: 1328408265
INFO_BINARY:
TID: 36964
----------------------------------------------------------
ID: 23
USER: system user
HOST:
DB:
COMMAND: Busy
TIME: 28
STATE: committing
INFO:
TIME_MS: 28496.583
STAGE: 0
MAX_STAGE: 0
PROGRESS: 0.000
MEMORY_USED: 65136
MAX_MEMORY_USED: 81584
EXAMINED_ROWS: 0
QUERY_ID: 1328408270
INFO_BINARY:
TID: 12877
----------------------------------------------------------

Please help.



 Comments   
Comment by Matt Le Fevre [ 2021-04-13 ]

Also experiencing this on v10.5.9; it might be related to https://jira.mariadb.org/browse/MDEV-24294

Comment by Luke Cousins [ 2021-04-13 ]

We've started experiencing this issue too, multiple times per week (sometimes multiple times per day), but only at "busy" times of day.

We recently upgraded from 10.5.4 to 10.5.9 and the problem has only occurred since then.

Using a 4-node Galera cluster on Ubuntu (installed from the apt repo) with the following config in addition to the defaults. (One node is permanently kept out of the load balancer for backups, via Mariabackup.) Each node has 377G of memory and 40 cores (80 with hyperthreading).

Here's an extract of SHOW PROCESSLIST; at the time:

162 system user   NULL Sleep 116 committing NULL 0.000
161 system user   NULL Sleep 123 wsrep applier committed NULL 0.000
163 system user   NULL Sleep 120 wsrep applier committed NULL 0.000
164 system user   NULL Sleep 112 committing NULL 0.000
179 REDACTED REDACTED:43766 REDACTED Query 118 starting COMMIT 0.000
180 REDACTED REDACTED:43768 REDACTED Query 116 Commit UPDATE `front_end`.`session_data` SET `data` = 'a:7:{s:16:\"is_mobile_device\";a:2:{s:5:\"value\";b: 0.000
269 REDACTED REDACTED:43866 REDACTED Query 115 Commit UPDATE `front_end`.`session_data` SET `data` = 'a:6:{s:11:\"user_source\";a:1:{i:0;s:23:\"https://du 0.000
270 REDACTED REDACTED:43868 REDACTED Query 117 Commit INSERT INTO front_end.session_data (`customers_id`, `data`, `last_used`, `created`, `shopping_basket 0.000
837 REDACTED REDACTED:44506 REDACTED Query 118 Commit INSERT INTO front_end.session_data (`customers_id`, `data`, `last_used`, `created`, `shopping_basket 0.000
846 REDACTED REDACTED:44522 REDACTED Query 118 Commit UPDATE `warehouse`.`bulk_picks` SET `last_action_ts` = '1618315874' WHERE `bulk_pick_id` = '39400991 0.000
3926 REDACTED REDACTED:25504   Busy 118 Freeing items NULL 0.000
3927 REDACTED REDACTED:16145 REDACTED Sleep 88   NULL 0.000

The session_data table is quite contentious, but this shouldn't be a problem; it wasn't before the upgrade, and nothing else material that we know of has changed.

It's consistent to see a `Busy` `Freeing items` row with a higher process ID than the other "stuck" queries, but with a running time the same as (or within a second of) that of the other stuck queries (which are typically on our `session_data` table).
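For what it's worth, that signature can be checked mechanically. Here is a minimal sketch (not an official tool; the row layout `(ID, COMMAND, TIME, STATE)` and the 30-second threshold are assumptions for illustration) that flags a `Busy`/`Freeing items` row whose runtime is within a second of other long-running commit rows:

```python
# Sketch only (not an official tool): flag the hang signature described
# above -- a Busy/"Freeing items" row whose TIME is within one second of
# other long-running commit rows. The row layout (ID, COMMAND, TIME, STATE)
# and the 30s threshold are assumptions for illustration.

def find_hang_signature(rows, min_time=30):
    freeing = [r for r in rows
               if r[1] == "Busy" and r[3] == "Freeing items" and r[2] >= min_time]
    stuck = [r for r in rows
             if r[3] in ("committing", "Commit") and r[2] >= min_time]
    return [f[0] for f in freeing
            if any(abs(f[2] - s[2]) <= 1 for s in stuck)]

rows = [
    (162, "Sleep", 116, "committing"),    # system user, applying
    (846, "Query", 118, "Commit"),        # client commit, stuck
    (3926, "Busy", 118, "Freeing items"),
    (3927, "Sleep", 88, ""),
]
print(find_hang_signature(rows))  # -> [3926]
```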

What further information would be useful? This bug is causing us enormous headaches at the moment so anything we can do to help get it resolved, we'd be happy to do.

Thanks for your help.

Luke

Config:

[mysqld]
performance_schema
max_connections=1000
wait_timeout=1800
default_storage_engine=InnoDB
binlog_format=ROW
 
query_cache_size=0
query_cache_type=0
 
# 70% of available RAM, 377G server, assume 5GB for other processes so 70% of 372GB is 260GB https://mariadb.com/kb/en/mariadb-memory-allocation/
innodb_buffer_pool_size=260G
 
# Log file should be 25% of the buffer pool size
innodb_log_file_size=65G
 
sql_mode = ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE
innodb_file_per_table=1
interactive_timeout=28800
table_definition_cache=800
 
wsrep_on=on
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="[REDACTED]"
wsrep_cluster_address="gcomm://[REDACTED]"
wsrep_sst_method=mariabackup
wsrep_sst_auth=[REDACTED]
 
# gcs.fc_limit (in wsrep_provider_options) should be wsrep_slave_threads * 5
wsrep_slave_threads=160
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=2
wsrep_sync_wait=1
wsrep_provider_options=gcs.fc_limit=800;gcs.fc_factor=0.8;gcache.size=2G;gcache.recover=yes
 
# The below varies by node
wsrep_node_name=[REDACTED]
wsrep_node_address=[REDACTED]
bind-address=[REDACTED]

Comment by Luke Cousins [ 2021-04-14 ]

Possibly this was introduced by the changes in MDEV-23328?

Comment by Luke Cousins [ 2021-04-14 ]

We have just downgraded from 10.5.9 to 10.5.8. Will report back if this has an effect.

Comment by Luke Cousins [ 2021-04-16 ]

So far, so good. To us, this strongly suggests that the bug was introduced in 10.5.9. Downgrading via apt (on Ubuntu) was as simple as:

sudo systemctl stop mariadb
 
sudo apt remove --purge mariadb-server* mariadb-backup* mariadb-client*
 
sudo apt install mariadb-server=1:10.5.8+maria~focal mariadb-server-10.5=1:10.5.8+maria~focal mariadb-backup=1:10.5.8+maria~focal mariadb-client=1:10.5.8+maria~focal mariadb-client-core-10.5=1:10.5.8+maria~focal mariadb-client-10.5=1:10.5.8+maria~focal mariadb-server-core-10.5=1:10.5.8+maria~focal

Comment by Thomas Lutz [ 2021-04-22 ]

We're affected by this as well in a 10.4.18 3-node cluster. Will try to downgrade to 10.4.17.

Comment by Matt Le Fevre [ 2021-04-23 ]

Have downgraded from 10.5.9 to 10.5.8 and have stopped seeing this issue. A warning to others that might do the same: we believe we've now started encountering https://jira.mariadb.org/browse/MDEV-23851 in 10.5.8, and so are considering downgrading further to 10.5.6

edit (28/04/2021)

We have since downgraded to 10.5.6 on Friday last week and have not seen either issue since, so this seems like a 'safe' version to wait out a resolution on for now if you're on >10.5.6.

edit (23/09/2021)

This issue seems to be sticking around for some time, so I'm adding this further edit as reassurance for anyone else unfortunate enough to come across it. Since downgrading our cluster to 10.5.6 ~5 months ago we've had 0 problems. Hoping a successful fix does come along at some point, though.

Comment by Thomas Lutz [ 2021-04-23 ]

Thanks for the heads-up! We will consider downgrading to 10.4.15 accordingly.

Comment by Paolo Iannelli [ 2021-04-24 ]

We had a cluster of 3 nodes running 10.5.9 and were experiencing the same issues described above, with 3 to 5 crashes a day.
We downgraded tonight to 10.5.6, and so far it looks good after 15 hours.

The one thing that changed in our Munin graphs was the command counters, so I thought it might be worth sharing in case it is relevant for anyone:

The counters look "bursty" on the left side (10.5.9) and more "steady" on the right side (10.5.6).
I will report here if the cluster goes down again.

Meanwhile, thanks Matt for the tips!

Update:
After 10 days on 10.5.6 everything has been very smooth and we've had no issues so far.

Comment by Florian Bezdeka [ 2021-04-26 ]

I can confirm that downgrading from 10.4.18 to 10.4.17 helps. A 3-node cluster setup that regularly froze when running 10.4.18.

Comment by Thomas Lutz [ 2021-04-26 ]

Florian, consider downgrading to 10.4.15; we experienced another problem with 10.4.17 and have been running stably since Saturday around noon.

Comment by Kóczán Ákos [ 2021-05-28 ]

Is this problem resolved in 10.4.19? Any experience?

Comment by Ionut Dumitru [ 2021-06-01 ]

I'm seeing the same issue on 10.5.10

8224328 xxxxx21 10.11.12.15:47006   Busy 368 Freeing items NULL

All writes across all 3 Galera nodes remain hanging. It has happened twice in one week so far. Not sure what data to gather; if you let me know, I will set up monitoring. All the things we monitor seem normal compared to regular operation.

Comment by Pasi Pentikäinen [ 2021-06-01 ]

Here is one stack trace of an instance paused with "Freeing items", from 10.5.9, if it helps: gdb2.log

Comment by Walter Doekes [ 2021-06-01 ]

And here's another one.

Running 1:10.4.19+maria~focal with Galera-4 26.4.8.

In state: Busy "Freeing Items".

1 system user   NULL Sleep 7762 wsrep aborter idle NULL 0.000
2 system user   NULL Sleep 664 wsrep applier committed NULL 0.000
4 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
5 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
9 system user   NULL Sleep 664 wsrep applier committed NULL 0.000
8 system user   NULL Sleep 664 wsrep applier committed NULL 0.000
10 system user   NULL Sleep 663 committing NULL 0.000
11 system user   NULL Sleep 664 wsrep applier committed NULL 0.000
13 system user   NULL Sleep 664 wsrep applier committed NULL 0.000
14 system user     Busy 663 committing NULL 0.000
12 system user   NULL Sleep 665 wsrep applier committed NULL 0.000
87152 my_db my_ip   Busy 663 Freeing items NULL 0.000
87165 my_db my_db my_user Query 663 Commit INSERT INTO `soll_statistieken` (... 0.000
88643 root localhost NULL Query 0 Init show full processlist 0.000

(All non-system Sleep commands elided.)

My guess would be that IDs 10, 14 and 87152 are threads 40, 41 and 42, but I cannot tell which is which.


Thread 42

I have thread 42, holding lock X (0x7fd73c002758):

(gdb) bt
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fd7f427adf0, clockid=<optimized out>, expected=0, futex_word=0x7fe108d7c700) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7fd7f427adf0, clockid=<optimized out>, mutex=0x55c26646abf8, cond=0x7fe108d7c6d8) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x7fe108d7c6d8, mutex=mutex@entry=0x55c26646abf8, abstime=abstime@entry=0x7fd7f427adf0) at pthread_cond_wait.c:656
#3  0x00007fe10a0b9716 in gu::Lock::wait (this=<synthetic pointer>, date=<synthetic pointer>..., cond=...) at ./galerautils/src/gu_mutex.hpp:86
#4  galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::wait (wait_until=<synthetic pointer>..., gtid=..., this=0x55c26646abf8) at ./galera/src/monitor.hpp:360
#5  galera::ReplicatorSMM::sync_wait (this=0x55c266469ca0, upto=<optimized out>, tout=<optimized out>, gtid=0x0) at ./galera/src/replicator_smm.cpp:1595
#6  0x00007fe10a09acb4 in galera_sync_wait (wsrep=<optimized out>, upto=<optimized out>, tout=<optimized out>, gtid=<optimized out>) at ./galera/src/wsrep_provider.cpp:475
#7  0x000055c2633a846e in wsrep::wsrep_provider_v26::wait_for_gtid (this=<optimized out>, gtid=..., timeout=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:994
#8  0x000055c262f7d842 in Wsrep_server_service::set_position (this=<optimized out>, c=..., gtid=...) at ./wsrep-lib/include/wsrep/server_state.hpp:308
#9  0x000055c26339c10a in wsrep::transaction::release_commit_order (this=0x7fd73c006c48, lock=...) at ./wsrep-lib/include/wsrep/provider.hpp:141
#10 0x000055c2633a45bd in wsrep::transaction::after_statement (this=this@entry=0x7fd73c006c48) at ./wsrep-lib/src/transaction.cpp:884
#11 0x000055c26338c372 in wsrep::client_state::after_statement (this=this@entry=0x7fd73c006be0) at ./wsrep-lib/src/client_state.cpp:276
#12 0x000055c262bbaa85 in wsrep_after_statement (thd=0x7fd73c000c48) at ./sql/sql_class.h:4939
#13 wsrep_mysql_parse (thd=0x7fd73c000c48, 
    rawbuf=0x7fd73c010120 "REPLACE INTO sessions VALUES('xxxxxxxxxxxx', 'browser|O:8:\\\"stdClass\\\":50:{s:18:\\\"browser_name_regex\\\";s:121:\\\"/^mozilla\\\\/5\\\\.0 \\\\(.*windows nt 10\\\\.0.*win64. x64.*\\\\) applewebkit"..., length=5904, parser_state=0x7fd7f427b490, is_com_multi=<optimized out>, is_next_command=<optimized out>) at ./sql/sql_parse.cc:7829
#14 0x000055c262bbdee3 in dispatch_command (command=COM_QUERY, thd=0x7fd73c000c48, packet=<optimized out>, packet_length=<optimized out>, is_com_multi=<optimized out>, is_next_command=<optimized out>)
    at ./sql/sql_class.h:1184
#15 0x000055c262bbe6a0 in do_command (thd=0x7fd73c000c48) at ./sql/sql_parse.cc:1373
#16 0x000055c262cb276e in do_handle_one_connection (connect=connect@entry=0x55c2834791b8) at ./sql/sql_connect.cc:1412
#17 0x000055c262cb288f in handle_one_connection (arg=arg@entry=0x55c2834791b8) at ./sql/sql_connect.cc:1316
#18 0x000055c2632d5f56 in pfs_spawn_thread (arg=0x55c2836322f8) at ./storage/perfschema/pfs.cc:1869
#19 0x00007fe132286609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#20 0x00007fe131e75293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) print *(pthread_mutex_t*)0x7fd73c002758
$1 = pthread_mutex_t = {Type = Adaptive, Status = Acquired, possibly with waiters, Owner ID = 2181645, Robust = No, Shared = No, Protocol = None}
(gdb) thread find 2181645
Thread 42 has target id 'Thread 0x7fd7f427c700 (LWP 2181645)'


Thread 41

There's thread 41, which wants to get lock X but is holding on to lock Y (0x7fd778002758):

(gdb) print *(pthread_mutex_t*)0x7fd778002758
$2 = pthread_mutex_t = {Type = Adaptive, Status = Acquired, possibly with waiters, Owner ID = 2181252, Robust = No, Shared = No, Protocol = None}
(gdb) thread find 2181252
Thread 41 has target id 'Thread 0x7fdd0c207700 (LWP 2181252)'

(gdb) bt
#0  __lll_lock_wait (futex=futex@entry=0x7fd73c002758, private=0) at lowlevellock.c:52
#1  0x00007fe132289235 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fd73c002758) at ../nptl/pthread_mutex_lock.c:135
#2  0x000055c262d5ddbc in inline_mysql_mutex_lock (src_file=0x55c2633f51e8 "/home/buildbot/buildbot/build/mariadb-10.4.19/sql/service_wsrep.cc", src_line=32, that=0x7fd73c002758)
    at ./include/mysql/psi/mysql_thread.h:719
#3  wsrep_thd_LOCK (thd=thd@entry=0x7fd73c000c48) at ./sql/service_wsrep.cc:32
#4  0x000055c262fa9639 in bg_wsrep_kill_trx (void_arg=0x7fd7783222c0) at ./storage/innobase/handler/ha_innodb.cc:18743
#5  0x000055c262ba882e in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
#6  0x000055c2632d5f56 in pfs_spawn_thread (arg=0x55c283628678) at ./storage/perfschema/pfs.cc:1869
#7  0x00007fe132286609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8  0x00007fe131e75293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


Thread 40

And lastly, I have thread 40, who wants lock Y, but is stuck waiting forever.

(gdb) bt
#0  __lll_lock_wait (futex=futex@entry=0x7fd778002758, private=0) at lowlevellock.c:52
#1  0x00007fe132289235 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fd778002758) at ../nptl/pthread_mutex_lock.c:135
#2  0x000055c262b6e027 in inline_mysql_mutex_lock (src_line=34, src_file=0x55c2633c5af8 "/home/buildbot/buildbot/build/mariadb-10.4.19/sql/wsrep_mutex.h", that=0x7fd778002758)
    at ./include/mysql/psi/mysql_thread.h:719
#3  Wsrep_mutex::lock (this=<optimized out>) at ./sql/wsrep_mutex.h:34
#4  0x000055c26339f012 in std::unique_lock<wsrep::mutex>::lock (this=0x7fdd0c24fe20, this=0x7fdd0c24fe20) at ./wsrep-lib/include/wsrep/mutex.hpp:76
#5  std::unique_lock<wsrep::mutex>::unique_lock (__m=..., this=0x7fdd0c24fe20) at /usr/include/c++/9/bits/unique_lock.h:71
#6  wsrep::transaction::after_commit (this=this@entry=0x7fd778006c48) at ./wsrep-lib/src/transaction.cpp:600
#7  0x000055c262ddfa50 in wsrep::client_state::after_commit (this=0x7fd778006be0) at ./wsrep-lib/include/wsrep/client_state.hpp:517
#8  wsrep_after_commit (all=true, thd=0x7fd778000c48) at ./sql/wsrep_trans_observer.h:326
#9  ha_commit_trans (thd=thd@entry=0x7fd778000c48, all=all@entry=true) at ./sql/handler.cc:1659
#10 0x000055c262cc13a9 in trans_commit (thd=thd@entry=0x7fd778000c48) at ./sql/transaction.cc:236
#11 0x000055c262d3d558 in Wsrep_high_priority_service::commit (this=0x7fdd0c251cf0, ws_handle=..., ws_meta=...) at ./sql/wsrep_high_priority_service.cc:313
#12 0x000055c26339573b in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:332
#13 0x000055c263396771 in wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:1136
#14 0x000055c2633a7a6c in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x7fdd0c251cf0) at ./wsrep-lib/include/wsrep/high_priority_service.hpp:47
#15 (anonymous namespace)::apply_cb (ctx=ctx@entry=0x7fdd0c251cf0, wsh=wsh@entry=0x7fdd0c250d80, flags=<optimized out>, flags@entry=65, buf=buf@entry=0x7fdd0c250d90, meta=meta@entry=0x7fdd0c251040, 
    exit_loop=exit_loop@entry=0x7fdd0c250fff) at ./wsrep-lib/src/wsrep_provider_v26.cpp:504
#16 0x00007fe10a0af64b in galera::TrxHandleSlave::apply (this=this@entry=0x7fd78823fc20, recv_ctx=recv_ctx@entry=0x7fdd0c251cf0, 
    apply_cb=0x55c2633a7940 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., 
    exit_loop=exit_loop@entry=@0x7fdd0c250fff: false) at ./galera/src/trx_handle.cpp:391
#17 0x00007fe10a0c149e in galera::ReplicatorSMM::apply_trx (this=0x55c266469ca0, recv_ctx=0x7fdd0c251cf0, ts=...) at ./galera/src/replicator_smm.cpp:504
#18 0x00007fe10a0c4008 in galera::ReplicatorSMM::process_trx (this=0x55c266469ca0, recv_ctx=0x7fdd0c251cf0, ts_ptr=...) at ./galera/src/replicator_smm.cpp:2127
#19 0x00007fe10a0f7902 in galera::GcsActionSource::process_writeset (this=0x55c266472c30, recv_ctx=0x7fdd0c251cf0, act=..., exit_loop=@0x7fdd0c2518ef: false) at ./galera/src/gcs_action_source.cpp:62
#20 0x00007fe10a0f7a8b in galera::GcsActionSource::dispatch (this=<optimized out>, recv_ctx=<optimized out>, act=..., exit_loop=<optimized out>) at ./galera/src/gcs_action_source.cpp:110
#21 0x00007fe10a0f7c82 in galera::GcsActionSource::process (this=0x55c266472c30, recv_ctx=0x7fdd0c251cf0, exit_loop=@0x7fdd0c2518ef: false) at ./galera/src/gcs_action_source.cpp:183
#22 0x00007fe10a0c4a00 in galera::ReplicatorSMM::async_recv (this=0x55c266469ca0, recv_ctx=0x7fdd0c251cf0) at ./galera/src/replicator_smm.cpp:390
#23 0x00007fe10a09ac41 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at ./galera/src/wsrep_provider.cpp:236
#24 0x000055c2633a8082 in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:796
#25 0x000055c262d574b3 in wsrep_replication_process (thd=0x7fd778000c48, arg=<optimized out>) at ./wsrep-lib/include/wsrep/server_state.hpp:308
#26 0x000055c262d4833c in start_wsrep_THD (arg=arg@entry=0x55c283628300) at ./sql/wsrep_mysqld.h:430
#27 0x000055c2632d5f56 in pfs_spawn_thread (arg=0x55c283628328) at ./storage/perfschema/pfs.cc:1869
#28 0x00007fe132286609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#29 0x00007fe131e75293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


Thread 73

Additionally, I have thread 73, which is also waiting for something:

Thread 73 (Thread 0x7fd7e4e01700 (LWP 2185314)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fdd43e747c8) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55c26646aca8, cond=0x7fdd43e747a0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x7fdd43e747a0, mutex=mutex@entry=0x55c26646aca8) at pthread_cond_wait.c:638
#3  0x00007fe10a0cea52 in gu::Lock::wait (this=<synthetic pointer>, cond=...) at ./galerautils/src/gu_mutex.hpp:86
#4  galera::Monitor<galera::ReplicatorSMM::CommitOrder>::enter (this=this@entry=0x55c26646aca8, obj=...) at ./galera/src/monitor.hpp:161
#5  0x00007fe10a0bc153 in galera::ReplicatorSMM::commit_order_enter_local (this=this@entry=0x55c266469ca0, trx=...) at ./galera/src/replicator_smm.cpp:1246
#6  0x00007fe10a09d081 in galera_commit_order_enter (gh=<optimized out>, ws_handle=<optimized out>, meta=<optimized out>) at ./galera/src/wsrep_provider.cpp:730
#7  0x000055c2633a82ea in wsrep::wsrep_provider_v26::commit_order_enter (this=<optimized out>, ws_handle=..., ws_meta=...) at ./wsrep-lib/src/wsrep_provider_v26.cpp:273
#8  0x000055c2633a2fc7 in wsrep::transaction::before_commit (this=this@entry=0x7fd7101a4fe8) at ./wsrep-lib/include/wsrep/server_state.hpp:308
#9  0x000055c262ddfe41 in wsrep::client_state::before_commit (this=0x7fd7101a4f80) at ./wsrep-lib/include/wsrep/client_state.hpp:503
#10 wsrep_before_commit (all=false, thd=0x7fd71019efe8) at ./sql/wsrep_trans_observer.h:273
#11 ha_commit_trans (thd=thd@entry=0x7fd71019efe8, all=all@entry=false) at ./sql/handler.cc:1643
#12 0x000055c262cc16c1 in trans_commit_stmt (thd=thd@entry=0x7fd71019efe8) at ./sql/transaction.cc:436
#13 0x000055c262bb2932 in mysql_execute_command (thd=<optimized out>) at ./sql/sql_parse.cc:6250
#14 0x000055c262bb9f43 in mysql_parse (thd=thd@entry=0x7fd71019efe8, rawbuf=rawbuf@entry=0x7fd71012ae50 "INSERT INTO `soll_statistieken` ("..., length=length@entry=239, parser_state=parser_state@entry=0x7fd7e4e00490, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at ./sql/sql_parse.cc:7995
#15 0x000055c262bba8a8 in wsrep_mysql_parse (thd=0x7fd71019efe8, rawbuf=0x7fd71012ae50 "INSERT INTO `soll_statistieken` ("..., length=239, parser_state=0x7fd7e4e00490, is_com_multi=<optimized out>, is_next_command=<optimized out>) at ./sql/sql_parse.cc:7799
#16 0x000055c262bbdee3 in dispatch_command (command=COM_QUERY, thd=0x7fd71019efe8, packet=<optimized out>, packet_length=<optimized out>, is_com_multi=<optimized out>, is_next_command=<optimized out>) at ./sql/sql_class.h:1184
#17 0x000055c262bbe6a0 in do_command (thd=0x7fd71019efe8) at ./sql/sql_parse.cc:1373
#18 0x000055c262cb276e in do_handle_one_connection (connect=connect@entry=0x55c28376aa18) at ./sql/sql_connect.cc:1412
#19 0x000055c262cb288f in handle_one_connection (arg=arg@entry=0x55c28376aa18) at ./sql/sql_connect.cc:1316
#20 0x000055c2632d5f56 in pfs_spawn_thread (arg=0x55c283733de8) at ./storage/perfschema/pfs.cc:1869
#21 0x00007fe132286609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fe131e75293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This one should not be the one in stage_freeing_items, as it's still in mysql_execute_command(thd). Probably 87165.


I've salvaged the generated core file while we downgraded. If you want, I can extract additional info from it.

If this belongs in a different bug report, I'll gladly move the info.

Comment by Roy Lei [ 2021-06-05 ]

I have the same issue running a Galera MariaDB cluster with MariaDB versions 10.5.9 and 10.4.18. Is there any update on the issue?

Comment by Roy Lei [ 2021-06-13 ]

Hi. Just a bit of an update. Our 10.4.18 Galera MariaDB cluster sits behind HAProxy. We changed HAProxy from round-robin to active/backup, so that read and write requests are only sent to a single MariaDB node, and no MariaDB process has been stuck in "Freeing items" since.
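For anyone wanting to replicate this workaround, a minimal HAProxy backend sketch (node names and addresses are placeholders; the `backup` keyword keeps a server out of rotation until the active one fails):

```
backend galera_cluster
    mode tcp
    # Only node1 receives traffic; node2/node3 take over only on failure
    server node1 192.0.2.1:3306 check
    server node2 192.0.2.2:3306 check backup
    server node3 192.0.2.3:3306 check backup
```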

Comment by Derk-Jan Hartman [ 2021-06-25 ]

We've also been struggling with paused/locked clusters ever since a few upgrades. We have since dealt with MDEV-25405 and MDEV-25378, but I suspect those were just fallout / side effects of this underlying, still unresolved issue.
When 10.5.10 still didn't seem to fix our problems, we gave in for a bit and have been running without Galera, as we didn't have time to investigate all these problems further at the time.

Comment by Stephan Vos [ 2021-06-25 ]

This is worrying, as we are planning to upgrade from 10.4.13 to the latest version of 10.5, which is currently 10.5.11.
Is this problem still present in the 10.5 and 10.6 streams?

Comment by Derk-Jan Hartman [ 2021-06-28 ]

@Stephan Vos It's hard to say. We didn't have the time last time to investigate what our 10.5.10 issue was exactly, but the symptoms seem very comparable to what is described here, including by Ionut Dumitru (also on 10.5.10), and I can't find confirmation that anyone has definitively determined what this problem is and how to fix it, so... I wouldn't update.

We've been pondering switching back to 10.5.6, as described by others in this ticket, by pinning to that version from the archives.
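For what it's worth, pinning via an apt preferences file might look something like this (a sketch; the package glob and version string are assumptions for the focal repo):

```
# /etc/apt/preferences.d/mariadb-pin (sketch)
Package: mariadb-*
Pin: version 1:10.5.6+maria~focal
Pin-Priority: 1001
```

A priority above 1000 allows apt to downgrade to the pinned version and hold it there across `apt upgrade`.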

Comment by Stephan Vos [ 2021-06-28 ]

Thanks @Derk-Jan Hartman.

It seems that most 10.5.x minor versions after 10.5.6 are plagued by some sort of issue, and any cluster hang is absolutely unacceptable for us, so we will consider upgrading to 10.5.6 or staying on 10.4.13.
It is a bit concerning that there is so little feedback from the developers regarding this issue and other related "Closed" issues.
I would expect it to be high priority, as it cancels out the high-availability feature of a Galera cluster.

Comment by MikaH [ 2021-06-28 ]

We stay on 10.5.6 since it is stable with Galera. It is also faster than, for example, 10.5.8 as a standalone server. MariaDB and Codership have failed with the wsrep libraries multiple times. Now it is time to put some serious effort into getting this problem permanently fixed.

Comment by Derk-Jan Hartman [ 2021-06-29 ]

We have now switched back to 10.5.6 as well and are back in galera cluster mode (6 nodes). We'll keep a close eye and I hope this problem can be fully identified soon.

Comment by Seppo Jaakola [ 2021-06-30 ]

Thanks for the stack traces, pp, and the stack trace analysis, wdoekes; this is a helpful exercise.
Both traces show a similar hang pattern. In Walter's trace, thread 42 has failed certification and has already rolled back; it now has some post-mortem bookkeeping to do and is waiting for the correct order in which to update the last committed GTID in InnoDB. The actual troublemaker is thread 41, which is executing bg_wsrep_kill_trx(): it blocks the replication applier (thread 40) from progressing while itself being blocked by thread 42.

The reason for the hang appears to be in bg_wsrep_kill_trx(). A refactoring of this function is suggested in the pull request for MDEV-25114, which is currently in review.
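The cycle described above can be modelled as a wait-for graph; here is a minimal sketch (thread roles simplified from the traces, not MariaDB code) that makes the deadlock explicit:

```python
# Sketch of the wait-for cycle from the stack traces above:
#   thread 42 holds mutex X and waits for the applier to advance commit order,
#   thread 41 (bg_wsrep_kill_trx) holds mutex Y and wants X,
#   thread 40 (applier) wants Y.
# So 42 -> 40 -> 41 -> 42: a cycle, i.e. a deadlock.

def find_cycle(waits_for):
    """Return one cycle in a wait-for graph {waiter: blocker}, or None."""
    for start in waits_for:
        seen, node = [start], waits_for.get(start)
        while node is not None:
            if node in seen:
                return seen[seen.index(node):] + [node]
            seen.append(node)
            node = waits_for.get(node)
    return None

waits_for = {
    42: 40,  # 42 waits for commit-order progress driven by applier 40
    40: 41,  # 40 wants mutex Y, held by 41
    41: 42,  # 41 wants mutex X, held by 42
}
print(find_cycle(waits_for))  # -> [42, 40, 41, 42]
```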

Comment by Seppo Jaakola [ 2021-06-30 ]

different symptom caused by bg_wsrep_kill_trx()

Comment by Stephan Vos [ 2021-07-01 ]

@Seppo does this mean you know where the issue lies and that it can be resolved?

Comment by Seppo Jaakola [ 2021-07-02 ]

stephanvos yes, the hang is quite obvious from the attached stack traces.
However, the fix suggested for MDEV-25114 is a large refactoring that removes bg_wsrep_kill_trx() altogether, and the review appears to be taking some time.

Comment by Marc Bachmann [ 2021-09-09 ]

Facing the same behaviour in 10.5.12. Is the bug affecting this version, too? Or should I search/create another bug?

Comment by MikaH [ 2021-09-10 ]

10.5.6 is the only version we use; any other 10.5.x is unstable.

Comment by Thomas Lutz [ 2021-10-13 ]

Does anybody know if 10.6 is affected by this as well?

Comment by Brian Minton [ 2021-10-21 ]

I can confirm that this is happening in 10.5.12 on Centos 7 x86_64.

Comment by Stephan Vos [ 2021-10-25 ]

We are now also on 10.5.6, but any future upgrade will prove risky with this issue.
How is the review progressing?

Comment by Tommi [ 2021-12-02 ]

Is this now fixed in the 10.5 branch as well?

Comment by La Cancellera Yoann [ 2022-07-12 ]

Hi, I reproduced it in 10.5.15.
I believed MDEV-25114 fixed it; I guess not.

It appears to happen after a TOI replication in my case (due to MyISAM replication).

An interesting note: I even reproduced it on a single Galera node that was just bootstrapped.

Comment by La Cancellera Yoann [ 2022-07-20 ]

Hi again
In case anyone needs confirmation: I downgraded to 10.5.6 too, and it actually fixed it for me as well.

Comment by Stephan Vos [ 2022-09-04 ]

@Seppo has this been fixed in 10.5.x, or is this still an issue? We are planning on upgrading from 10.5.6 to 10.5.17 and worried this will be a problem.

Comment by Matt Le Fevre [ 2022-09-05 ]

We have (anecdotally) heard from other users who have upgraded to later versions that either a similar issue exists, or a different issue was present that caused them to roll back to 10.5.6 again.

It would be fantastic to hear definitively whether this issue is actually resolved and the usual high stability restored; we have also been very nervous about upgrading from 10.5.6 since my original comment back in April 2021 :/

Comment by Stephan Vos [ 2022-09-26 ]

This was supposedly fixed in 10.5.13: MDEV-25114
Can anyone confirm that this is the case?
It is not realistic to be stuck on 10.5.6 or the 10.6 equivalent.

Comment by La Cancellera Yoann [ 2022-09-26 ]

I can confirm this was NOT the case on my side for 10.5.15

Unfortunately I cannot contribute more to this; I no longer have access to my related case.

Comment by Stephan Vos [ 2022-09-27 ]

Wondering if it could be related to this: https://rewardgateway.engineering/2021/12/11/a-galera-replication-race-condition/
https://jira.percona.com/browse/PXC-2959

From trawling all these cluster-freeze issues, it seems that more often than not foreign keys and DELETE or UPDATE statements are involved.

Comment by Jean-Louis Dupond [ 2022-09-27 ]

I had a similar lock, see MDEV-29512

Generated at Thu Feb 08 09:37:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.