[MDEV-32122] Node killing could cause applier to fail on write rows event leading to inconsistency Created: 2023-09-07  Updated: 2023-11-08  Resolved: 2023-10-02

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.6
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Jan Lindström Assignee: Jan Lindström
Resolution: Cannot Reproduce Votes: 1
Labels: None

Attachments: File n1.cnf     File n2.cnf     File n3.cnf     File node1.err     File node2.err     File node3.err     File transactions-bkp.yy     File transactions-bkp_1.yy     File transactions.zz    
Issue Links:
Relates
relates to MDEV-30217 Assertion `mode_ == m_local || transa... Closed
relates to MDEV-31658 Deadlock found when trying to get loc... Open

 Description   

3-node cluster under RQG load where node_2 is killed. The applier on node_1 fails with a deadlock error and node_1 drops from the cluster; node_3 then leaves on its own as it is the only node left.

2023-09-07 11:50:07 2 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.table20_int_autoinc; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log FIRST, end_log_pos 398, Internal MariaDB error code: 1213
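
A minimal sketch for comparing the affected table across the surviving nodes (the port mirrors the RQG commands in the comments below; node_3's port is a placeholder):

# hedged sketch: compare the failing table's checksum on the surviving nodes
mysql -h127.0.0.1 -P12107 -uroot test -e "CHECKSUM TABLE table20_int_autoinc"
mysql -h127.0.0.1 -P<node3_port> -uroot test -e "CHECKSUM TABLE table20_int_autoinc"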



 Comments   
Comment by Jan Lindström [ 2023-09-07 ]

node1 (OLTP load was active):

node1:root@localhost> show status like '%wsrep%st%';
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid       | 00000000-0000-0000-0000-000000000000 |
| wsrep_last_committed         | -1                                   |
| wsrep_flow_control_requested | false                                |
| wsrep_cert_deps_distance     | 70.7591                              |
| wsrep_local_state            | 5                                    |
| wsrep_local_state_comment    | Inconsistent                         |
| wsrep_cluster_capabilities   |                                      |
| wsrep_cluster_conf_id        | 18446744073709551615                 |
| wsrep_cluster_size           | 0                                    |
| wsrep_cluster_state_uuid     | 65ef358e-4d5b-11ee-a7ce-76be38a797b3 |
| wsrep_cluster_status         | Disconnected                         |
+------------------------------+--------------------------------------+
11 rows in set (0.001 sec)
 
node1:root@localhost>

Node3:

node3:root@localhost> show status like '%wsrep%st%';
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| wsrep_local_state_uuid       | 65ef358e-4d5b-11ee-a7ce-76be38a797b3 |
| wsrep_last_committed         | 23055                                |
| wsrep_flow_control_requested | false                                |
| wsrep_cert_deps_distance     | 46.5084                              |
| wsrep_local_state            | 0                                    |
| wsrep_local_state_comment    | Initialized                          |
| wsrep_cluster_weight         | 0                                    |
| wsrep_evs_evict_list         |                                      |
| wsrep_evs_state              | OPERATIONAL                          |
| wsrep_gmcast_segment         | 0                                    |
| wsrep_cluster_capabilities   |                                      |
| wsrep_cluster_conf_id        | 18446744073709551615                 |
| wsrep_cluster_size           | 1                                    |
| wsrep_cluster_state_uuid     | 65ef358e-4d5b-11ee-a7ce-76be38a797b3 |
| wsrep_cluster_status         | non-Primary                          |
+------------------------------+--------------------------------------+
15 rows in set (0.001 sec)
 
node3:root@localhost>

Comment by Ramesh Sivaraman [ 2023-09-07 ]

Test case to reproduce the issue
1) Start 3 node cluster
2) Run RQG load on Node1 and Node2 (RQG grammar and cnf files attached)

perl gendata.pl --dsn=dbi:mysql:host=127.0.0.1:port=12107:user=root:database=test --spec=conf/transactions/transactions.zz
perl gentest.pl --dsn=dbi:mysql:host=127.0.0.1:port=12107:user=root:database=test --grammar=conf/transactions/transactions-bkp.yy --threads=32 --duration=30000 --queries=1000000000 &
perl gentest.pl --dsn=dbi:mysql:host=127.0.0.1:port=12943:user=root:database=test --grammar=conf/transactions/transactions-bkp.yy --threads=32 --duration=30000 --queries=1000000000 &

3) Forcefully kill node2 and analyse the remaining cluster nodes' wsrep state (see the sketch below).
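
A minimal sketch of step 3, assuming node2 is the mysqld instance behind port 12943 from the gentest.pl command above; the pgrep pattern and node3 port are placeholders:

# forcefully kill the mysqld process serving node2 (port 12943)
kill -9 $(pgrep -f 'mysqld.*12943')
# inspect the wsrep state on the surviving nodes
mysql -h127.0.0.1 -P12107 -uroot -e "SHOW STATUS LIKE '%wsrep%st%'"
mysql -h127.0.0.1 -P<node3_port> -uroot -e "SHOW STATUS LIKE '%wsrep%st%'"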

Comment by Jan Lindström [ 2023-09-29 ]

Relevant info from node_1, where inconsistency voting starts:

2023-09-07 11:50:07 0 [Note] WSREP: Member 0(galapq) initiates vote on 65ef358e-4d5b-11ee-a7ce-76be38a797b3:9523,ded413bc1b661ec5:  Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213;
2023-09-07 11:50:07 0 [Note] WSREP: Votes over 65ef358e-4d5b-11ee-a7ce-76be38a797b3:9523:
   ded413bc1b661ec5:   1/3
Waiting for more votes.
2023-09-07 11:50:07 0 [Note] WSREP: Member 1(galapq) responds to vote on 65ef358e-4d5b-11ee-a7ce-76be38a797b3:9523,0000000000000000: Success
2023-09-07 11:50:07 0 [Note] WSREP: Votes over 65ef358e-4d5b-11ee-a7ce-76be38a797b3:9523:
   0000000000000000:   1/3
   ded413bc1b661ec5:   1/3
Waiting for more votes.
2023-09-07 11:50:07 0 [Note] WSREP: Member 2(galapq) responds to vote on 65ef358e-4d5b-11ee-a7ce-76be38a797b3:9523,0000000000000000: Success
2023-09-07 11:50:07 0 [Note] WSREP: Votes over 65ef358e-4d5b-11ee-a7ce-76be38a797b3:9523:
   0000000000000000:   2/3
   ded413bc1b661ec5:   1/3
Winner: 0000000000000000
2023-09-07 11:50:07 2 [ERROR] WSREP: Inconsistency detected: Inconsistent by consensus on 65ef358e-4d5b-11ee-a7ce-76be38a797b3:9523
	 at /test/galera_4x_opt/galera/src/replicator_smm.cpp:process_apply_error():1357

Comment by Jan Lindström [ 2023-10-02 ]

Based on testing, this problem no longer reproduces with 10.6 commit 388296a1.

Comment by Ramesh Sivaraman [ 2023-10-03 ]

Could not reproduce the issue on the latest 10.6 build.
