Details
-
Bug
-
Status: Closed (View Workflow)
-
Minor
-
Resolution: Done
-
10.4.27, 10.6.11
-
None
-
https://mariadbcorp.atlassian.net/wiki/spaces/CustomerEngineers/pages/1518829569/snsvr01
CentOS 7 running on 2x Intel Xeon Gold 6134 with 592GB RAM on a PCI-E 3.0 x8 NVMe SSD
Description
Please review comments for reproduction information.
More to the point, this appears to be a direct reproduction of MDEV-20628 on a MariaDB version patched against MDEV-20605.
Primary using binlog_format=mixed. Replica using the below, key configurations-
slave_parallel_mode=optimistic
|
slave_parallel_threads=16
|
slave_domain_parallel_threads=16
|
slave_parallel_max_queued=2147483647
|
Table structure uses the following, critical columns-
CREATE TABLE test ( |
id char(32) NOT NULL PRIMARY KEY, |
name VARCHAR(255) DEFAULT NULL UNIQUE KEY, |
touched datetime DEFAULT NULL |
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci; |
Main workload consists of patterns such as-
INSERT INTO test (id, name, touched) VALUES('000000000000000000000000000000001','SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001', '2019-09-12 19:36:46'); |
UPDATE test SET touched = '2019-09-12 19:36:46' WHERE test.name = 'SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001'; |
DELETE FROM test WHERE test.name = 'SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001' |
As expected to optimistic parallel replication, this kind of workload results in tons of rollbacks on the replica (but to be clear, the primary is fine throughout), such as-
2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) TRANSACTION:
|
 |
TRANSACTION 1757073418, ACTIVE 0 sec inserting
|
mysql tables in use 1, locked 1
|
3 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 1
|
MySQL thread id 16, OS thread handle 139911793866496, query id 182677598 Update
|
INSERT INTO sys_mutex (`sys_id`,`sys_updated_by`,`system_id`,`sys_created_on`,`name`,`sys_mod_count`,`sys_updated_on`,`sys_created_by`) VALUES('cdd75b0d22e849008aa321f01222ec46','replay.admin.29','available','2019-09-12 19:36:46','SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b',0,'2019-09-12 19:36:46','replay.admin.29')
|
2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) HOLDS THE LOCK(S):
|
 |
RECORD LOCKS space id 26 page no 4 n bits 72 index name of table `master_4`.`sys_mutex` trx id 1757073418 lock mode S
|
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
|
0: len 8; hex 73757072656d756d; asc supremum;;
|
 |
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
|
0: len 30; hex 53595354454d5f4c4f434b3a646973636f766572795f6465766963655f68; asc SYSTEM_LOCK:discovery_device_h; (total 69 bytes);
|
1: len 30; hex 643039653233346535356137343363326266616264656532393166313561; asc d09e234e55a743c2bfabdee291f15a; (total 32 bytes);
|
 |
2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
|
 |
RECORD LOCKS space id 26 page no 4 n bits 72 index name of table `master_4`.`sys_mutex` trx id 1757073418 lock_mode X locks gap before rec insert intention waiting
|
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
|
0: len 30; hex 53595354454d5f4c4f434b3a646973636f766572795f6465766963655f68; asc SYSTEM_LOCK:discovery_device_h; (total 69 bytes);
|
1: len 30; hex 643039653233346535356137343363326266616264656532393166313561; asc d09e234e55a743c2bfabdee291f15a; (total 32 bytes);
|
 |
2022-11-11 22:29:39 16 [Note] InnoDB: *** WE ROLL BACK TRANSACTION (1)
|
Despite tons of these deadlocks happening constantly, in my attempts the replica held out for over 24 hours before finally throwing the below error-
2022-11-11 22:29:39 17 [ERROR] Slave SQL: Error 'Duplicate entry 'SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77...' for key 'name'' on query. Default database: 'master_6'. Query: 'INSERT INTO sys_mutex (`sys_id`,`sys_updated_by`,`system_id`,`sys_created_on`,`name`,`sys_mod_count`,`sys_updated_on`,`sys_created_by`) VALUES('a21dfb83830740b38da2db5e67d6c2c6','replay.admin.29','available','2019-09-12 19:36:46','SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b',0,'2019-09-12 19:36:46','replay.admin.29')', Gtid 0-1-445975955, Internal MariaDB error code: 1062
|
2022-11-11 22:29:39 17 [Warning] Slave: Duplicate entry 'SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77...' for key 'name' Error_code: 1062
|
2022-11-11 22:29:39 17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'CS_10.000174' position 791116152; GTID position '0-1-445975954'
|
2022-11-11 22:29:39 16 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
|
2022-11-11 22:29:39 16 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
|
2022-11-11 22:29:39 16 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'CS_10.000174' position 791116152; GTID position '0-1-445975954'
|
2022-11-11 22:29:39 15 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
|
2022-11-11 22:29:39 15 [Warning] Slave: Connection was killed Error_code: 1927
|
2022-11-11 22:29:39 15 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
|
2022-11-11 22:29:39 15 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
|
Resolved binary log from this repro is available via link in comments.
Data leftover on replica after this crash is-
SELECT * FROM master_6.sys_mutex\G |
*************************** 1. row ***************************
|
name: SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b |
system_id: app128033.iad105.service-now.com:ciscotesttemp2021
|
acquired: 2019-09-12 19:36:46
|
touched: 2019-09-12 19:36:46
|
sys_id: 05259991e9ac49cb9b8860533c352800
|
sys_updated_by: replay.admin.29
|
sys_updated_on: 2019-09-12 19:36:46
|
sys_created_by: replay.admin.29
|
sys_created_on: 2019-09-12 19:36:46
|
sys_mod_count: 0
|
The same query on the primary returns-
SELECT * FROM master_6.sys_mutex; |
Empty set (0.000 sec) |
SHOW GLOBAL VARIABLES from primary and replica also attached while the full error log from the replica is in the comments.
Attachments
Issue Links
- relates to
-
MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration
- Closed
-
MDEV-20628 If temporary error occurs with optimistic mode of parallel replication, error message has false information
- Closed
-
MDEV-30225 RR isolation violation with locking unique search
- Closed
- mentioned in
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...