[MDEV-31532] Galera cluster blocks concurrent commits and leads to server hang Created: 2023-06-23  Updated: 2023-11-06  Resolved: 2023-11-06

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.6
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Ramesh Sivaraman Assignee: Denis Protivensky
Resolution: Incomplete Votes: 1
Labels: None

Attachments: File full_bt.log     File full_bt_new.log     File node1.cnf     File node2.cnf     File node3.cnf    

 Description   

Testcase

Start 3 node cluster
-- Node1
CREATE TABLE test.tbl (
id int NOT NULL AUTO_INCREMENT, 
t_id varchar(10) NOT NULL, 
dt timestamp DEFAULT current_timestamp(), 
a_id int DEFAULT 0, 
PRIMARY KEY (id), 
INDEX ix_dt (dt)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 DEFAULT COLLATE=latin1_swedish_ci;
 
-- Node3
Run mysqlslap test on node3
sudo mysqlslap -uroot -i 50000 -c 5 --create-schema=test --no-drop --delimiter=";" --query="insert into test.tbl (t_id, a_id) values('100100',1);insert into test.tbl (t_id) values('200200');" &
 
Node3 goes into hang mode with a commit state of "Waiting for prior transaction to commit".
 
|  48 | root        | localhost | NULL    | Sleep   |   19 |                                         | NULL                                                                       |    0.000 |
| 844 | root        | localhost | test | Query   |   18 | Commit                                  | insert into test.tbl (t_id, a_id) values('100100',1) |    0.000 |
| 845 | root        | localhost | test | Query   |   18 | Commit                                  | insert into test.tbl (t_id) values('200200')               |    0.000 |
| 846 | root        | localhost | test | Query   |   18 | Commit                                  | insert into test.tbl (t_id, a_id) values('100100',1) |    0.000 |
| 847 | root        | localhost | test | Query   |   18 | Waiting for prior transaction to commit | insert into test.tbl (t_id, a_id) values('100100',1) |    0.000 |
| 848 | root        | localhost | test | Query   |   18 | Commit                                  | insert into test.tbl (t_id, a_id) values('100100',1) |    0.000 |
| 849 | root        | localhost | NULL    | Query   |    0 | starting                                | show processlist                                                           |    0.000 |
+-----+-------------+-----------+---------+---------+------+-----------------------------------------+----------------------------------------------------------------------------+----------+
48 rows in set (0.001 sec)



 Comments   
Comment by Seppo Jaakola [ 2023-06-26 ]

the stack trace shows that there is one connection executing some DDL statement (actual SQL not shown), and this connection is probable reason for the hang. ramesh do you know what this DDL is about? mysqlslap is not supposed to cause this

Comment by Ramesh Sivaraman [ 2023-06-26 ]

seppo Did not run any DDL statements apart from test table creation to reproduce the issue. I am not sure what DDL is being executed in the background, if any.

Comment by Seppo Jaakola [ 2023-06-26 ]

Thread 3, in the stack trace has following call stack:

Thread 3 (Thread 0x7f4150734700 (LWP 26434)):
#0  0x00007f415fca5a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#1  0x00007f4152a7b7db in gu_cond_wait_SYS (mutex=0x5566d0057868, cond=0x7f3acc074f00) at /home/buildbot/buildbot/build/galerautils/src/gu_threads.h:251
#2  wait (this=<synthetic pointer>, cond=...) at /home/buildbot/buildbot/build/galerautils/src/gu_lock.hpp:46
#3  galera::Monitor<galera::ReplicatorSMM::ApplyOrder>::enter (this=this@entry=0x5566d0057868, obj=...) at /home/buildbot/buildbot/build/galera/src/monitor.hpp:194
#4  0x00007f4152a72df8 in galera::ReplicatorSMM::to_isolation_begin (this=0x5566d00566d0, trx=..., meta=<optimized out>) at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:1760
#5  0x00007f4152a576b7 in galera_to_execute_start (gh=<optimized out>, conn_id=849, keys=<optimized out>, keys_num=<optimized out>, data=0x7f41507326a0, count=1, flags=65, meta=0x7f41507327c8)
    at /home/buildbot/buildbot/build/galera/src/wsrep_provider.cpp:1254
#6  0x00005566cea5d98f in wsrep::wsrep_provider_v26::enter_toi(wsrep::client_id, std::vector<wsrep::key, std::allocator<wsrep::key> > const&, wsrep::const_buffer const&, wsrep::ws_meta&, int) ()
#7  0x00005566cea3c336 in wsrep::client_state::poll_enter_toi(std::unique_lock<wsrep::mutex>&, std::vector<wsrep::key, std::allocator<wsrep::key> > const&, wsrep::const_buffer const&, wsrep::ws_meta&, int, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, bool&) ()
#8  0x00005566cea3e037 in wsrep::client_state::enter_toi_local(std::vector<wsrep::key, std::allocator<wsrep::key> > const&, wsrep::const_buffer const&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >) ()
#9  0x00005566ce6d0f18 in wsrep_to_isolation_begin(THD*, char const*, char const*, TABLE_LIST const*, Alter_info const*, std::vector<wsrep::key, std::allocator<wsrep::key> > const*, HA_CREATE_INFO const*) ()
#10 0x00005566ce1edf6c in mysql_execute_command(THD*, bool) ()
#11 0x00005566ce1ef3eb in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#12 0x00005566ce1efc01 in wsrep_mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#13 0x00005566ce1f2153 in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool) ()
#14 0x00005566ce1f2d53 in do_command(THD*, bool) ()
#15 0x00005566ce2eb877 in do_handle_one_connection(CONNECT*, bool) ()
#16 0x00005566ce2ebb14 in handle_one_connection ()
#17 0x00005566ce65e1dc in pfs_spawn_thread ()
#18 0x00007f415fca1ea5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f415f1bd8dd in clone () from /lib64/libc.so.6

Too bad that SQL statement is not visible

Comment by Ramesh Sivaraman [ 2023-06-26 ]

seppo Attached full bt full_bt_new.log from another run. It appears that there are no DDL statements

Comment by Seppo Jaakola [ 2023-06-27 ]

ok, no DDL in the latest stack traces, and looks that there is genuine binlog group commit deadlock.
ramesh Which mariadb version is used in this testing? Configuration has innodb_locks_unsafe_for_binlog which should be deprecated, which suggest that some old version is in play here. I could not reproduce with 10.6 development head version.

There is a potentially related old issue: MDEV-24954 , this was fixed ~2 years ago.
Configuration has also: repl.commit_order=1, which is generally not a good idea.

Comment by Ramesh Sivaraman [ 2023-06-27 ]

Server version is 10.6.14-MariaDB-log. The issue is happening only when we set repl.commit_order=1. Did not see this issue with the default repl.commit_order value.

Comment by Denis Protivensky [ 2023-11-06 ]

The issue won't be fixed as users should be discouraged from changing `repl.commit_order` variable. This should be documented explicitly.
Track the corresponding issue in the documentation repository: https://github.com/codership/documentation/issues/390

Generated at Thu Feb 08 10:24:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.