Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
10.5.9
-
None
-
None
-
Description
Test case
The Galera provider must be loaded; a single cluster node is enough.
session_1:
create database test;
use test;
create table t1(id int primary key auto_increment, k int);
insert into t1(k) values (1),(2),(3),(101),(102),(103);
begin;
update t1 set k=k+1 where id<100;
session_2:
use test;
set wsrep_OSU_method=RSU;
alter table t1 add key(k);
session_1:
commit;
Result:
Both sessions stay blocked; neither one makes progress.
Expected result:
When session_1 commits, session_2 should continue and perform ALTER TABLE.
I investigated it a little bit and here are my findings:
1. session_1: holds the MDL lock.
2. session_2: RSU starts, which causes a Galera desync and then a pause. The pause enters the LocalOrder lock with seqno N.
3. session_2: blocks on the MDL lock held by session_1.
4. session_1: commits and tries to replicate, but session_2 still holds the LocalOrder lock.
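The four steps above form a wait-for cycle that spans two different locks (MDL and LocalOrder). A minimal sketch of that cycle follows. The WaitFor type and both function names are hypothetical and are not part of the MariaDB or Galera sources; the model only records who waits for whom.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical model: each blocked session waits for exactly one other session.
using WaitFor = std::map<std::string, std::string>;

// Returns true if following the wait-for edges from `start` leads back to `start`.
bool in_wait_cycle(const WaitFor& graph, const std::string& start) {
    std::string current = start;
    // A cycle through `start` can be at most graph.size() edges long.
    for (std::size_t i = 0; i < graph.size(); ++i) {
        auto it = graph.find(current);
        if (it == graph.end()) return false;  // current session is not blocked
        current = it->second;
        if (current == start) return true;
    }
    return false;
}

// The state described in the findings, after session_1 issues COMMIT.
WaitFor reported_state() {
    return {
        {"session_2", "session_1"},  // step 3: ALTER waits for the MDL lock of session_1
        {"session_1", "session_2"},  // step 4: commit waits for LocalOrder held by session_2
    };
}
```

Before session_1 commits, only the first edge exists and there is no cycle; the commit adds the second edge and closes the loop.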
I also see that commit 37deed3f37561f264f65e162146bbc2ad35fb1a2 introduced Galera 4. With Galera 3, wsrep_to_isolation_begin() set thd->wsrep_exec_mode = TOTAL_ORDER regardless of whether the method was TOI or RSU. When session 2 detected the deadlock, the abort action was taken if thd->wsrep_exec_mode == TOTAL_ORDER, so the abort was performed for both TOI and RSU.
With Galera 4, wsrep_exec_mode no longer exists.
wsrep::client_state::m_toi and wsrep::client_state::m_rsu were introduced instead, and wsrep_to_isolation_begin() sets them accordingly. The logic that previously checked thd->wsrep_exec_mode was refactored to check wsrep_thd_is_toi(), which returns true only for wsrep::client_state::m_toi.
This looks like a mechanical refactoring in which wsrep::client_state::m_rsu was simply omitted.
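To make the suspected regression concrete, here is a small self-contained sketch of the abort predicate before and after the refactoring. Every type and function name below is hypothetical and only models the behavior described above; none of them are the real MariaDB or wsrep-lib declarations, and the last function is one possible direction for a fix, not a tested patch.

```cpp
// Hypothetical, simplified model of the check described in this report.

enum class OsuMethod { Toi, Rsu };

// Galera 3 style: one execution mode covered both TOI and RSU.
enum class ExecMode { LocalState, TotalOrder };

// Galera 4 style: TOI and RSU are distinct client modes.
enum class ClientMode { Local, Toi, Rsu };

// Galera 3: isolation begin set TOTAL_ORDER regardless of the OSU method.
ExecMode galera3_isolation_begin(OsuMethod) { return ExecMode::TotalOrder; }

// Galera 4: isolation begin sets the mode that matches the OSU method.
ClientMode galera4_isolation_begin(OsuMethod method) {
    return method == OsuMethod::Toi ? ClientMode::Toi : ClientMode::Rsu;
}

// Galera 3 check: abort the blocker when the waiter is in TOTAL_ORDER.
bool abort_blocker_galera3(ExecMode mode) { return mode == ExecMode::TotalOrder; }

// Galera 4 check as described: only TOI qualifies, so RSU never aborts.
bool abort_blocker_galera4(ClientMode mode) { return mode == ClientMode::Toi; }

// Assumed fix direction: treat RSU the same way as TOI.
bool abort_blocker_proposed(ClientMode mode) {
    return mode == ClientMode::Toi || mode == ClientMode::Rsu;
}
```

In this model an RSU ALTER triggers the abort under the Galera 3 predicate but not under the Galera 4 one, which matches the hang in the test case.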