Also experiencing this on v10.5.9; it might be related to https://jira.mariadb.org/browse/MDEV-24294
---
We've started experiencing this issue too, multiple times per week (sometimes multiple times per day), but only at "busy" times of day.
We recently upgraded from 10.5.4 to 10.5.9 and the problem has only occurred since then.
We're using a 4-node Galera cluster on Ubuntu (installed from the apt repo) with the following config in addition to the defaults. (One node is permanently kept out of the load balancer for backups, via Mariabackup.) Each node has 377G of memory and 40 cores (80 threads with hyperthreading).
Here's an extract of SHOW PROCESSLIST; at the time:
| Id | User | Host | db | Command | Time | State | Info | Progress |
| 162 | system user | | NULL | Sleep | 116 | committing | NULL | 0.000 |
| 161 | system user | | NULL | Sleep | 123 | wsrep applier committed | NULL | 0.000 |
| 163 | system user | | NULL | Sleep | 120 | wsrep applier committed | NULL | 0.000 |
| 164 | system user | | NULL | Sleep | 112 | committing | NULL | 0.000 |
| 179 | REDACTED | REDACTED:43766 | REDACTED | Query | 118 | starting | COMMIT | 0.000 |
| 180 | REDACTED | REDACTED:43768 | REDACTED | Query | 116 | Commit | UPDATE `front_end`.`session_data` SET `data` = 'a:7:{s:16:\"is_mobile_device\";a:2:{s:5:\"value\";b: | 0.000 |
| 269 | REDACTED | REDACTED:43866 | REDACTED | Query | 115 | Commit | UPDATE `front_end`.`session_data` SET `data` = 'a:6:{s:11:\"user_source\";a:1:{i:0;s:23:\"https://du | 0.000 |
| 270 | REDACTED | REDACTED:43868 | REDACTED | Query | 117 | Commit | INSERT INTO front_end.session_data (`customers_id`, `data`, `last_used`, `created`, `shopping_basket | 0.000 |
| 837 | REDACTED | REDACTED:44506 | REDACTED | Query | 118 | Commit | INSERT INTO front_end.session_data (`customers_id`, `data`, `last_used`, `created`, `shopping_basket | 0.000 |
| 846 | REDACTED | REDACTED:44522 | REDACTED | Query | 118 | Commit | UPDATE `warehouse`.`bulk_picks` SET `last_action_ts` = '1618315874' WHERE `bulk_pick_id` = '39400991 | 0.000 |
| 3926 | REDACTED | REDACTED:25504 | | Busy | 118 | Freeing items | NULL | 0.000 |
| 3927 | REDACTED | REDACTED:16145 | REDACTED | Sleep | 88 | | NULL | 0.000 |
The session_data table is quite contentious, but that shouldn't be a problem; it wasn't before the upgrade, and nothing else material that we know of has changed.
We consistently see a "Busy" / "Freeing items" row with a higher process ID than the other "stuck" queries, but with a running time in seconds the same as (or within a second of) the other stuck queries, which are typically on our `session_data` table.
What further information would be useful? This bug is causing us enormous headaches at the moment so anything we can do to help get it resolved, we'd be happy to do.
Thanks for your help.
Luke
Config:
[mysqld]
performance_schema
max_connections=1000
wait_timeout=1800
default_storage_engine=InnoDB
binlog_format=ROW

query_cache_size=0
query_cache_type=0

# 70% of available RAM, 377G server, assume 5GB for other processes so 70% of 372GB is 260GB https://mariadb.com/kb/en/mariadb-memory-allocation/
innodb_buffer_pool_size=260G

# Log file should be 25% of the buffer pool size
innodb_log_file_size=65G

sql_mode = ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE
innodb_file_per_table=1
interactive_timeout=28800
table_definition_cache=800

wsrep_on=on
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="[REDACTED]"
wsrep_cluster_address="gcomm://[REDACTED]"
wsrep_sst_method=mariabackup
wsrep_sst_auth=[REDACTED]

# gcs.fc_limit (in wsrep_provider_options) should be wsrep_slave_threads * 5
wsrep_slave_threads=160
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=2
wsrep_sync_wait=1
wsrep_provider_options=gcs.fc_limit=800;gcs.fc_factor=0.8;gcache.size=2G;gcache.recover=yes

# The below varies by node
wsrep_node_name=[REDACTED]
wsrep_node_address=[REDACTED]
bind-address=[REDACTED]
---
Possibly this was introduced by the changes in MDEV-23328?
---
We have just downgraded from 10.5.9 to 10.5.8. Will report back if this has an effect.
---
So far, so good. To us, this strongly suggests that this bug was introduced in 10.5.9. Downgrading via apt (on Ubuntu) was as simple as:
sudo systemctl stop mariadb
sudo apt remove --purge mariadb-server* mariadb-backup* mariadb-client*
sudo apt install mariadb-server=1:10.5.8+maria~focal mariadb-server-10.5=1:10.5.8+maria~focal mariadb-backup=1:10.5.8+maria~focal mariadb-client=1:10.5.8+maria~focal mariadb-client-core-10.5=1:10.5.8+maria~focal mariadb-client-10.5=1:10.5.8+maria~focal mariadb-server-core-10.5=1:10.5.8+maria~focal
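After downgrading, it can also help to pin the packages so a routine `apt upgrade` doesn't pull 10.5.9 back in. A minimal sketch using an APT preferences file (the directory is the conventional one; the file name, glob, and version string are assumptions based on the packages above):

```
# /etc/apt/preferences.d/mariadb-pin (hypothetical file name)
# Pin all MariaDB packages to the downgraded version; a priority
# above 1000 allows apt to keep (or downgrade to) this version.
Package: mariadb-* libmariadb*
Pin: version 1:10.5.8+maria~focal
Pin-Priority: 1001
```

`apt-cache policy mariadb-server` should then show the pin taking effect.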
---
We're affected by this as well in a 10.4.18 3-node cluster. Will try to downgrade to 10.4.17.
---
Having downgraded from 10.5.9 to 10.5.8, we have stopped seeing this issue. A warning to others who might do the same, though: we believe we've now started encountering this 10.5.8 issue, https://jira.mariadb.org/browse/MDEV-23851, and so are considering downgrading further, to 10.5.6.
edit (28/04/2021)
We have since downgraded to 10.5.6 on Friday last week and have not seen either issue since, so this seems like a 'safe' version to wait out a resolution on if you're on >10.5.6.
edit (23/09/2021)
This issue seems to be sticking around for some time, so I'm adding this further edit as reassurance for anyone else unfortunately coming across it. Since downgrading our cluster to 10.5.6 ~5 months ago we've had zero problems. Hoping a successful fix does come along at some point, though.
---
Thanks for the heads-up! We will consider downgrading to 10.4.15 accordingly.
---
We had a cluster of 3 nodes running 10.5.9 and were experiencing the same issues described above, with 3 to 5 crashes a day.
We downgraded tonight to 10.5.6 and so far it looks good after 15 hours.
The one thing that changed in our Munin graphs was the command counters, so I thought it might be worth sharing in case it's relevant for anyone:

The counters look "bursty" on the left side (10.5.9) and more "steady" on the right side (10.5.6).
I will report here if the cluster goes down again.
Meanwhile, thanks Matt for the tips!
Update:
After 10 days on 10.5.6 everything has been very smooth and we've had no issues so far.
---
I can confirm that downgrading from 10.4.18 to 10.4.17 helps. We have a 3-node cluster setup that regularly froze when running 10.4.18.
---
Florian, consider downgrading to 10.4.15 - we experienced another problem with 10.4.17 and have been running stably since Saturday around noon.
---
Is this problem resolved in 10.4.19? Any experience?
---
I'm seeing the same issue on 10.5.10:
| Id | User | Host | db | Command | Time | State | Info |
| 8224328 | xxxxx21 | 10.11.12.15:47006 | | Busy | 368 | Freeing items | NULL |
All writes across all 3 Galera nodes remain hanging. It has happened twice in one week so far. Not sure what data to gather; if you let me know, I will set up monitoring. Everything we monitor looks normal compared to regular operation.
---
Here is one stack trace of an instance paused with "Freeing items", from 10.5.9, if it helps: gdb2.log
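For anyone wanting to collect a comparable trace from a hung node: one common approach is a gdb batch script run against the stuck server process (the file name is illustrative; attaching briefly pauses the server, and the binary may be named mysqld rather than mariadbd depending on version and packaging):

```
# all-bt.gdb (hypothetical file name) -- dump every thread's backtrace
set pagination off
thread apply all bt
detach
quit
```

Invoked as, for example: sudo gdb -p "$(pidof mariadbd)" -x all-bt.gdb > gdb.log 2>&1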
---
And here's another one.
Running 1:10.4.19+maria~focal with Galera-4 26.4.8.
In state: Busy, "Freeing items".
| Id | User | Host | db | Command | Time | State | Info | Progress |
| 1 | system user | | NULL | Sleep | 7762 | wsrep aborter idle | NULL | 0.000 |
| 2 | system user | | NULL | Sleep | 664 | wsrep applier committed | NULL | 0.000 |
| 4 | system user | | NULL | Daemon | NULL | InnoDB purge coordinator | NULL | 0.000 |
| 5 | system user | | NULL | Daemon | NULL | InnoDB shutdown handler | NULL | 0.000 |
| 9 | system user | | NULL | Sleep | 664 | wsrep applier committed | NULL | 0.000 |
| 8 | system user | | NULL | Sleep | 664 | wsrep applier committed | NULL | 0.000 |
| 10 | system user | | NULL | Sleep | 663 | committing | NULL | 0.000 |
| 11 | system user | | NULL | Sleep | 664 | wsrep applier committed | NULL | 0.000 |
| 13 | system user | | NULL | Sleep | 664 | wsrep applier committed | NULL | 0.000 |
| 14 | system user | | | Busy | 663 | committing | NULL | 0.000 |
| 12 | system user | | NULL | Sleep | 665 | wsrep applier committed | NULL | 0.000 |
| 87152 | my_db | my_ip | | Busy | 663 | Freeing items | NULL | 0.000 |
| 87165 | my_db | my_db | my_user | Query | 663 | Commit | INSERT INTO `soll_statistieken` (... | 0.000 |
| 88643 | root | localhost | NULL | Query | 0 | Init | show full processlist | 0.000 |
(All non-system Sleep commands elided.)
My guess would be that IDs 10, 14 and 87152 are threads 40, 41 and 42, but I cannot tell which is which.
Thread 42
I have thread 42, holding lock X (0x7fd73c002758):
(gdb) bt
#0 futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fd7f427adf0, clockid=<optimized out>, expected=0, futex_word=0x7fe108d7c700) at ../sysdeps/nptl/futex-internal.h:320
#1 __pthread_cond_wait_common (abstime=0x7fd7f427adf0, clockid=<optimized out>, mutex=0x55c26646abf8, cond=0x7fe108d7c6d8) at pthread_cond_wait.c:520
#2 __pthread_cond_timedwait (cond=cond@entry=0x7fe108d7c6d8, mutex=mutex@entry=0x55c26646abf8, abstime=abstime@entry=0x7fd7f427adf0) at pthread_cond_wait.c:656
#3 0x00007fe10a0b9716 in gu::Lock::wait (this=<synthetic pointer>, date=<synthetic pointer>..., cond=...) at ./galerautils/src/gu_mutex.hpp:86
#4 galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::wait (wait_until=<synthetic pointer>..., gtid=..., this=0x55c26646abf8) at ./galera/src/monitor.hpp:360
#5 galera::ReplicatorSMM::sync_wait (this=0x55c266469ca0, upto=<optimized out>, tout=<optimized out>, gtid=0x0) at ./galera/src/replicator_smm.cpp:1595
#6 0x00007fe10a09acb4 in galera_sync_wait (wsrep=<optimized out>, upto=<optimized out>, tout=<optimized out>, gtid=<optimized out>) at ./galera/src/wsrep_provider.cpp:475
#7 0x000055c2633a846e in wsrep::wsrep_provider_v26::wait_for_gtid (this=<optimized out>, gtid=..., timeout=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:994
#8 0x000055c262f7d842 in Wsrep_server_service::set_position (this=<optimized out>, c=..., gtid=...) at ./wsrep-lib/include/wsrep/server_state.hpp:308
#9 0x000055c26339c10a in wsrep::transaction::release_commit_order (this=0x7fd73c006c48, lock=...) at ./wsrep-lib/include/wsrep/provider.hpp:141
#10 0x000055c2633a45bd in wsrep::transaction::after_statement (this=this@entry=0x7fd73c006c48) at ./wsrep-lib/src/transaction.cpp:884
#11 0x000055c26338c372 in wsrep::client_state::after_statement (this=this@entry=0x7fd73c006be0) at ./wsrep-lib/src/client_state.cpp:276
#12 0x000055c262bbaa85 in wsrep_after_statement (thd=0x7fd73c000c48) at ./sql/sql_class.h:4939
#13 wsrep_mysql_parse (thd=0x7fd73c000c48,
    rawbuf=0x7fd73c010120 "REPLACE INTO sessions VALUES('xxxxxxxxxxxx', 'browser|O:8:\\\"stdClass\\\":50:{s:18:\\\"browser_name_regex\\\";s:121:\\\"/^mozilla\\\\/5\\\\.0 \\\\(.*windows nt 10\\\\.0.*win64. x64.*\\\\) applewebkit"..., length=5904, parser_state=0x7fd7f427b490, is_com_multi=<optimized out>, is_next_command=<optimized out>) at ./sql/sql_parse.cc:7829
#14 0x000055c262bbdee3 in dispatch_command (command=COM_QUERY, thd=0x7fd73c000c48, packet=<optimized out>, packet_length=<optimized out>, is_com_multi=<optimized out>, is_next_command=<optimized out>)
    at ./sql/sql_class.h:1184
#15 0x000055c262bbe6a0 in do_command (thd=0x7fd73c000c48) at ./sql/sql_parse.cc:1373
#16 0x000055c262cb276e in do_handle_one_connection (connect=connect@entry=0x55c2834791b8) at ./sql/sql_connect.cc:1412
#17 0x000055c262cb288f in handle_one_connection (arg=arg@entry=0x55c2834791b8) at ./sql/sql_connect.cc:1316
#18 0x000055c2632d5f56 in pfs_spawn_thread (arg=0x55c2836322f8) at ./storage/perfschema/pfs.cc:1869
#19 0x00007fe132286609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#20 0x00007fe131e75293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) print *(pthread_mutex_t*)0x7fd73c002758
$1 = pthread_mutex_t = {Type = Adaptive, Status = Acquired, possibly with waiters, Owner ID = 2181645, Robust = No, Shared = No, Protocol = None}
(gdb) thread find 2181645
Thread 42 has target id 'Thread 0x7fd7f427c700 (LWP 2181645)'
Thread 41
There's thread 41, which wants to get that lock X but is holding on to lock Y (0x7fd778002758):
(gdb) print *(pthread_mutex_t*)0x7fd778002758
$2 = pthread_mutex_t = {Type = Adaptive, Status = Acquired, possibly with waiters, Owner ID = 2181252, Robust = No, Shared = No, Protocol = None}
(gdb) thread find 2181252
Thread 41 has target id 'Thread 0x7fdd0c207700 (LWP 2181252)'
(gdb) bt
#0 __lll_lock_wait (futex=futex@entry=0x7fd73c002758, private=0) at lowlevellock.c:52
#1 0x00007fe132289235 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fd73c002758) at ../nptl/pthread_mutex_lock.c:135
#2 0x000055c262d5ddbc in inline_mysql_mutex_lock (src_file=0x55c2633f51e8 "/home/buildbot/buildbot/build/mariadb-10.4.19/sql/service_wsrep.cc", src_line=32, that=0x7fd73c002758)
    at ./include/mysql/psi/mysql_thread.h:719
#3 wsrep_thd_LOCK (thd=thd@entry=0x7fd73c000c48) at ./sql/service_wsrep.cc:32
#4 0x000055c262fa9639 in bg_wsrep_kill_trx (void_arg=0x7fd7783222c0) at ./storage/innobase/handler/ha_innodb.cc:18743
#5 0x000055c262ba882e in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
#6 0x000055c2632d5f56 in pfs_spawn_thread (arg=0x55c283628678) at ./storage/perfschema/pfs.cc:1869
#7 0x00007fe132286609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fe131e75293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 40
And lastly, I have thread 40, which wants lock Y but is stuck waiting forever.
(gdb) bt
#0 __lll_lock_wait (futex=futex@entry=0x7fd778002758, private=0) at lowlevellock.c:52
#1 0x00007fe132289235 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fd778002758) at ../nptl/pthread_mutex_lock.c:135
#2 0x000055c262b6e027 in inline_mysql_mutex_lock (src_line=34, src_file=0x55c2633c5af8 "/home/buildbot/buildbot/build/mariadb-10.4.19/sql/wsrep_mutex.h", that=0x7fd778002758)
    at ./include/mysql/psi/mysql_thread.h:719
#3 Wsrep_mutex::lock (this=<optimized out>) at ./sql/wsrep_mutex.h:34
#4 0x000055c26339f012 in std::unique_lock<wsrep::mutex>::lock (this=0x7fdd0c24fe20, this=0x7fdd0c24fe20) at ./wsrep-lib/include/wsrep/mutex.hpp:76
#5 std::unique_lock<wsrep::mutex>::unique_lock (__m=..., this=0x7fdd0c24fe20) at /usr/include/c++/9/bits/unique_lock.h:71
#6 wsrep::transaction::after_commit (this=this@entry=0x7fd778006c48) at ./wsrep-lib/src/transaction.cpp:600
#7 0x000055c262ddfa50 in wsrep::client_state::after_commit (this=0x7fd778006be0) at ./wsrep-lib/include/wsrep/client_state.hpp:517
#8 wsrep_after_commit (all=true, thd=0x7fd778000c48) at ./sql/wsrep_trans_observer.h:326
#9 ha_commit_trans (thd=thd@entry=0x7fd778000c48, all=all@entry=true) at ./sql/handler.cc:1659
#10 0x000055c262cc13a9 in trans_commit (thd=thd@entry=0x7fd778000c48) at ./sql/transaction.cc:236
#11 0x000055c262d3d558 in Wsrep_high_priority_service::commit (this=0x7fdd0c251cf0, ws_handle=..., ws_meta=...) at ./sql/wsrep_high_priority_service.cc:313
#12 0x000055c26339573b in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:332
#13 0x000055c263396771 in wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:1136
#14 0x000055c2633a7a6c in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x7fdd0c251cf0) at ./wsrep-lib/include/wsrep/high_priority_service.hpp:47
#15 (anonymous namespace)::apply_cb (ctx=ctx@entry=0x7fdd0c251cf0, wsh=wsh@entry=0x7fdd0c250d80, flags=<optimized out>, flags@entry=65, buf=buf@entry=0x7fdd0c250d90, meta=meta@entry=0x7fdd0c251040,
    exit_loop=exit_loop@entry=0x7fdd0c250fff) at ./wsrep-lib/src/wsrep_provider_v26.cpp:504
#16 0x00007fe10a0af64b in galera::TrxHandleSlave::apply (this=this@entry=0x7fd78823fc20, recv_ctx=recv_ctx@entry=0x7fdd0c251cf0,
    apply_cb=0x55c2633a7940 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=...,
    exit_loop=exit_loop@entry=@0x7fdd0c250fff: false) at ./galera/src/trx_handle.cpp:391
#17 0x00007fe10a0c149e in galera::ReplicatorSMM::apply_trx (this=0x55c266469ca0, recv_ctx=0x7fdd0c251cf0, ts=...) at ./galera/src/replicator_smm.cpp:504
#18 0x00007fe10a0c4008 in galera::ReplicatorSMM::process_trx (this=0x55c266469ca0, recv_ctx=0x7fdd0c251cf0, ts_ptr=...) at ./galera/src/replicator_smm.cpp:2127
#19 0x00007fe10a0f7902 in galera::GcsActionSource::process_writeset (this=0x55c266472c30, recv_ctx=0x7fdd0c251cf0, act=..., exit_loop=@0x7fdd0c2518ef: false) at ./galera/src/gcs_action_source.cpp:62
#20 0x00007fe10a0f7a8b in galera::GcsActionSource::dispatch (this=<optimized out>, recv_ctx=<optimized out>, act=..., exit_loop=<optimized out>) at ./galera/src/gcs_action_source.cpp:110
#21 0x00007fe10a0f7c82 in galera::GcsActionSource::process (this=0x55c266472c30, recv_ctx=0x7fdd0c251cf0, exit_loop=@0x7fdd0c2518ef: false) at ./galera/src/gcs_action_source.cpp:183
#22 0x00007fe10a0c4a00 in galera::ReplicatorSMM::async_recv (this=0x55c266469ca0, recv_ctx=0x7fdd0c251cf0) at ./galera/src/replicator_smm.cpp:390
#23 0x00007fe10a09ac41 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at ./galera/src/wsrep_provider.cpp:236
#24 0x000055c2633a8082 in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:796
#25 0x000055c262d574b3 in wsrep_replication_process (thd=0x7fd778000c48, arg=<optimized out>) at ./wsrep-lib/include/wsrep/server_state.hpp:308
#26 0x000055c262d4833c in start_wsrep_THD (arg=arg@entry=0x55c283628300) at ./sql/wsrep_mysqld.h:430
#27 0x000055c2632d5f56 in pfs_spawn_thread (arg=0x55c283628328) at ./storage/perfschema/pfs.cc:1869
#28 0x00007fe132286609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#29 0x00007fe131e75293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 73
Additionally, I have thread 73, which is also waiting for something:
Thread 73 (Thread 0x7fd7e4e01700 (LWP 2185314)):
#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fdd43e747c8) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55c26646aca8, cond=0x7fdd43e747a0) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=cond@entry=0x7fdd43e747a0, mutex=mutex@entry=0x55c26646aca8) at pthread_cond_wait.c:638
#3 0x00007fe10a0cea52 in gu::Lock::wait (this=<synthetic pointer>, cond=...) at ./galerautils/src/gu_mutex.hpp:86
#4 galera::Monitor<galera::ReplicatorSMM::CommitOrder>::enter (this=this@entry=0x55c26646aca8, obj=...) at ./galera/src/monitor.hpp:161
#5 0x00007fe10a0bc153 in galera::ReplicatorSMM::commit_order_enter_local (this=this@entry=0x55c266469ca0, trx=...) at ./galera/src/replicator_smm.cpp:1246
#6 0x00007fe10a09d081 in galera_commit_order_enter (gh=<optimized out>, ws_handle=<optimized out>, meta=<optimized out>) at ./galera/src/wsrep_provider.cpp:730
#7 0x000055c2633a82ea in wsrep::wsrep_provider_v26::commit_order_enter (this=<optimized out>, ws_handle=..., ws_meta=...) at ./wsrep-lib/src/wsrep_provider_v26.cpp:273
#8 0x000055c2633a2fc7 in wsrep::transaction::before_commit (this=this@entry=0x7fd7101a4fe8) at ./wsrep-lib/include/wsrep/server_state.hpp:308
#9 0x000055c262ddfe41 in wsrep::client_state::before_commit (this=0x7fd7101a4f80) at ./wsrep-lib/include/wsrep/client_state.hpp:503
#10 wsrep_before_commit (all=false, thd=0x7fd71019efe8) at ./sql/wsrep_trans_observer.h:273
#11 ha_commit_trans (thd=thd@entry=0x7fd71019efe8, all=all@entry=false) at ./sql/handler.cc:1643
#12 0x000055c262cc16c1 in trans_commit_stmt (thd=thd@entry=0x7fd71019efe8) at ./sql/transaction.cc:436
#13 0x000055c262bb2932 in mysql_execute_command (thd=<optimized out>) at ./sql/sql_parse.cc:6250
#14 0x000055c262bb9f43 in mysql_parse (thd=thd@entry=0x7fd71019efe8, rawbuf=rawbuf@entry=0x7fd71012ae50 "INSERT INTO `soll_statistieken` ("..., length=length@entry=239, parser_state=parser_state@entry=0x7fd7e4e00490, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at ./sql/sql_parse.cc:7995
#15 0x000055c262bba8a8 in wsrep_mysql_parse (thd=0x7fd71019efe8, rawbuf=0x7fd71012ae50 "INSERT INTO `soll_statistieken` ("..., length=239, parser_state=0x7fd7e4e00490, is_com_multi=<optimized out>, is_next_command=<optimized out>) at ./sql/sql_parse.cc:7799
#16 0x000055c262bbdee3 in dispatch_command (command=COM_QUERY, thd=0x7fd71019efe8, packet=<optimized out>, packet_length=<optimized out>, is_com_multi=<optimized out>, is_next_command=<optimized out>) at ./sql/sql_class.h:1184
#17 0x000055c262bbe6a0 in do_command (thd=0x7fd71019efe8) at ./sql/sql_parse.cc:1373
#18 0x000055c262cb276e in do_handle_one_connection (connect=connect@entry=0x55c28376aa18) at ./sql/sql_connect.cc:1412
#19 0x000055c262cb288f in handle_one_connection (arg=arg@entry=0x55c28376aa18) at ./sql/sql_connect.cc:1316
#20 0x000055c2632d5f56 in pfs_spawn_thread (arg=0x55c283733de8) at ./storage/perfschema/pfs.cc:1869
#21 0x00007fe132286609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x00007fe131e75293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
This one should not be the one in stage_freeing_items, as it's still in mysql_execute_command(thd). Probably 87165.
I've salvaged the generated core file (we've since downgraded). If you want, I can extract additional info from it.
If this belongs in a different bug report, I'll gladly move the info.
---
I have the same issue running a MariaDB Galera cluster on versions 10.5.9 and 10.4.18. Is there any update on the issue?
---
Hi. Just a bit of an update: our 10.4.18 Galera MariaDB cluster sits behind HAProxy. We changed HAProxy from roundrobin to active/backup, so that read and write requests are only sent to a single MariaDB node, and we've seen no MariaDB process stuck in "Freeing items" since.
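For reference, the active/backup arrangement can be sketched with HAProxy's `backup` server option. This is an illustrative fragment only; the listener name, addresses, and ports are made up:

```
# haproxy.cfg fragment (illustrative): all traffic goes to node1;
# node2/node3 are only used if node1's health check fails.
listen mariadb
    bind 0.0.0.0:3306
    mode tcp
    option tcpka
    server node1 10.0.0.1:3306 check
    server node2 10.0.0.2:3306 check backup
    server node3 10.0.0.3:3306 check backup
```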
---
We've also been struggling with paused/locked clusters ever since a few upgrades. We have since dealt with MDEV-25405 and MDEV-25378, but I suspect those were just fallout/side effects of this underlying and still unresolved issue...
After 10.5.10 still didn't seem to fix our problems, we gave in for a bit and have been running without Galera, as we didn't have time to investigate all these problems further at the time.
---
This is worrying, as we are planning to upgrade from 10.4.13 to the latest version of 10.5, which is currently 10.5.11.
Is this problem still present in the 10.5 and 10.6 streams?
---
@Stephan Vos It's hard to say. We didn't have time to investigate exactly what our 10.5.10 issue was, but the symptoms seem very comparable to what is described here, including by Ionut Dumitru (also on 10.5.10), and I can't find confirmation that anyone has definitively identified what this problem is and how to fix it, so... I wouldn't upgrade.
We've been pondering switching back to 10.5.6, as described by others in this ticket, by pinning to that version from the archives.
---
Thanks @Derk-Jan Hartman.
It seems that most 10.5.x minor versions after 10.5.6 are plagued by some sort of issue, and any cluster hang is absolutely unacceptable for us, so we will consider upgrading to 10.5.6 or staying on 10.4.13.
It is a bit concerning that there is so little feedback from the developers regarding this issue and other related "Closed" issues.
I would expect it to be high priority, as it cancels out the high-availability feature of a Galera cluster.
---
We stay on 10.5.6 since it is stable with Galera. It is also faster than, for example, 10.5.8 as a standalone server. MariaDB and Codership have failed with the wsrep libraries multiple times; now it is time to put some serious effort into getting this problem permanently fixed.
---
We have now switched back to 10.5.6 as well and are back in Galera cluster mode (6 nodes). We'll keep a close eye on it, and I hope this problem can be fully identified soon.
---
Thanks for the stack traces pp and the stack trace analysis wdoekes, this is a helpful exercise.
Both traces show a similar hang pattern. In Walter's trace, thread 42 has failed certification and has already rolled back; it now has some post-mortem bookkeeping to do, and is waiting for the correct order in which to update the last committed GTID in InnoDB. The actual troublemaker is thread 41, which is executing bg_wsrep_kill_trx(): it is blocking the replication applier (thread 40) from progressing, while itself being blocked by thread 42.
The reason for the hang appears to be in bg_wsrep_kill_trx(). A refactoring of this function is suggested in the pull request for MDEV-25114, which is currently in review.
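To make the three-way cycle concrete: thread 42 holds lock X and waits for the commit order to advance; thread 41 holds lock Y and wants X; the applier (thread 40) wants Y, so the commit order can never advance. Here is a toy model of that wait cycle; every name, lock, and timeout is illustrative only, not MariaDB internals:

```python
import threading

# Toy model of the three-way wait cycle (NOT MariaDB code).
lock_x = threading.Lock()          # "X": lock held by the certifying client (thread 42)
lock_y = threading.Lock()          # "Y": lock held by the bg_wsrep_kill_trx thread (thread 41)
holds_x = threading.Event()        # set once "thread 42" holds X
holds_y = threading.Event()        # set once "thread 41" holds Y
commit_order = threading.Event()   # would be set when the applier (thread 40) progresses
timed_out = {}

def client_42():
    with lock_x:                   # holds X...
        holds_x.set()
        # ...and waits for the applier to advance the commit order.
        # The applier never can, so this wait only ends via its timeout.
        timed_out["42"] = not commit_order.wait(timeout=2.0)

def killer_41():
    with lock_y:                   # holds Y...
        holds_y.set()
        holds_x.wait()             # make sure 42 already holds X
        got = lock_x.acquire(timeout=0.8)   # ...and wants X (held by 42)
        timed_out["41"] = not got
        if got:
            lock_x.release()

def applier_40():
    holds_y.wait()                 # make sure 41 already holds Y
    got = lock_y.acquire(timeout=0.4)       # wants Y (held by 41)
    timed_out["40"] = not got
    if got:                        # unreachable while 41 still holds Y
        commit_order.set()
        lock_y.release()

threads = [threading.Thread(target=f) for f in (client_42, killer_41, applier_40)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(timed_out)   # all three waits gave up: a circular wait
```

Without the timeouts, all three threads would block forever, which matches the observed server state.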
---
different symptom caused by bg_wsrep_kill_trx()
---
@Seppo does this mean you know where the issue lies and that it can be resolved?
---
stephanvos yes, the hang is quite obvious from the attached stack traces.
However, the fix suggested for MDEV-25114 is a large refactoring, removing bg_wsrep_kill_trx() altogether, and the review appears to be taking some time.
---
Facing the same behaviour in 10.5.12. Is the bug affecting this version too, or should I search for/create another bug?
---
10.5.6 is the only version we use; any other 10.5.x is unstable.
---
Does anybody know if 10.6 is affected by this as well?
---
I can confirm that this is happening in 10.5.12 on CentOS 7 x86_64.
---
We are now also on 10.5.6, but any future upgrade will prove risky with this issue outstanding.
How is the review progressing?
---
Is this now fixed in the 10.5 branch as well?
---
Hi, I reproduced it in 10.5.15.
I believed MDEV-25114 fixed it; I guess not.
It appears to happen after a TOI replication in my case (due to MyISAM replication).
An interesting note: I even reproduced it on a single Galera node that was just bootstrapped.
---
Hi again.
In case anyone needs confirmation: I downgraded to 10.5.6 too, and it actually fixed it for me as well.
---
@Seppo has this been fixed in 10.5.x, or is it still an issue? We're planning on upgrading from 10.5.6 to 10.5.17 and are worried this will be a problem.
---
We have (anecdotally) heard from other users who upgraded to later versions that either a similar issue exists, or a different issue was present that caused them to roll back to 10.5.6 again.
It would be fantastic to hear definitively if this issue is actually resolved/the usual high stability has been restored, we have also been very nervous about upgrading from 10.5.6 since my original comment back in April 2021 :/
---
This was supposedly fixed in 10.5.13: MDEV-25114
Can anyone confirm that this is the case?
It is not realistic to be stuck on 10.5.6 or the 10.6 equivalent.
---
I can confirm this was NOT the case on my side for 10.5.15.
Unfortunately I cannot contribute more to this; I no longer have access to my related case.
---
Wondering if it could be related to this: https://rewardgateway.engineering/2021/12/11/a-galera-replication-race-condition/
https://jira.percona.com/browse/PXC-2959
From trawling all these cluster-freeze issues, it seems that more often than not foreign keys and a DELETE or UPDATE are involved.
---
I had a similar lock, see MDEV-29512