[MDEV-30010] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964 - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Minor
Resolution: Done
Affects Version/s: 10.4.27, 10.6.11
Fix Version/s: 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3
Component/s: Replication, Storage Engine - InnoDB
Labels:
None
Environment:

Hide
https://mariadbcorp.atlassian.net/wiki/spaces/CustomerEngineers/pages/1518829569/snsvr01

CentOS 7 running on 2x Intel Xeon Gold 6134 with 592GB RAM on a PCI-E 3.0 x8 NVMe SSD

Show
https://mariadbcorp.atlassian.net/wiki/spaces/CustomerEngineers/pages/1518829569/snsvr01 CentOS 7 running on 2x Intel Xeon Gold 6134 with 592GB RAM on a PCI-E 3.0 x8 NVMe SSD

Description

Please review comments for reproduction information.

More to the point, this appears to be a direct reproduction of ~~MDEV-20628~~ on a MariaDB version patched against ~~MDEV-20605~~.

Primary using binlog_format=mixed. Replica using the below, key configurations-

slave_parallel_mode=optimistic

slave_parallel_threads=16

slave_domain_parallel_threads=16

slave_parallel_max_queued=2147483647

Table structure uses the following, critical columns-

CREATE TABLE test (

  id char(32) NOT NULL PRIMARY KEY,

  name VARCHAR(255) DEFAULT NULL UNIQUE KEY,

  touched datetime DEFAULT NULL

) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

Main workload consists of patterns such as-

INSERT INTO test (id, name, touched) VALUES('000000000000000000000000000000001','SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001', '2019-09-12 19:36:46');

UPDATE test SET touched = '2019-09-12 19:36:46' WHERE test.name = 'SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001';

DELETE FROM test WHERE test.name = 'SYSTEM_LOCK:discovery_device_history:00000000000000000000000000000001'

As expected to optimistic parallel replication, this kind of workload results in tons of rollbacks on the replica (but to be clear, the primary is fine throughout), such as-

2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) TRANSACTION:

TRANSACTION 1757073418, ACTIVE 0 sec inserting

mysql tables in use 1, locked 1

3 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 1

MySQL thread id 16, OS thread handle 139911793866496, query id 182677598 Update

INSERT INTO sys_mutex (`sys_id`,`sys_updated_by`,`system_id`,`sys_created_on`,`name`,`sys_mod_count`,`sys_updated_on`,`sys_created_by`) VALUES('cdd75b0d22e849008aa321f01222ec46','replay.admin.29','available','2019-09-12 19:36:46','SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b',0,'2019-09-12 19:36:46','replay.admin.29')

2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) HOLDS THE LOCK(S):

RECORD LOCKS space id 26 page no 4 n bits 72 index name of table `master_4`.`sys_mutex` trx id 1757073418 lock mode S

Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0

 0: len 8; hex 73757072656d756d; asc supremum;;

Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 32

 0: len 30; hex 53595354454d5f4c4f434b3a646973636f766572795f6465766963655f68; asc SYSTEM_LOCK:discovery_device_h; (total 69 bytes);

 1: len 30; hex 643039653233346535356137343363326266616264656532393166313561; asc d09e234e55a743c2bfabdee291f15a; (total 32 bytes);

2022-11-11 22:29:39 16 [Note] InnoDB: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 26 page no 4 n bits 72 index name of table `master_4`.`sys_mutex` trx id 1757073418 lock_mode X locks gap before rec insert intention waiting

Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 32

 0: len 30; hex 53595354454d5f4c4f434b3a646973636f766572795f6465766963655f68; asc SYSTEM_LOCK:discovery_device_h; (total 69 bytes);

 1: len 30; hex 643039653233346535356137343363326266616264656532393166313561; asc d09e234e55a743c2bfabdee291f15a; (total 32 bytes);

2022-11-11 22:29:39 16 [Note] InnoDB: *** WE ROLL BACK TRANSACTION (1)

Despite tons of these deadlocks happening constantly, in my attempts the replica held out for over 24 hours before finally throwing the below error-

2022-11-11 22:29:39 17 [ERROR] Slave SQL: Error 'Duplicate entry 'SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77...' for key 'name'' on query. Default database: 'master_6'. Query: 'INSERT INTO sys_mutex (`sys_id`,`sys_updated_by`,`system_id`,`sys_created_on`,`name`,`sys_mod_count`,`sys_updated_on`,`sys_created_by`) VALUES('a21dfb83830740b38da2db5e67d6c2c6','replay.admin.29','available','2019-09-12 19:36:46','SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b',0,'2019-09-12 19:36:46','replay.admin.29')', Gtid 0-1-445975955, Internal MariaDB error code: 1062

2022-11-11 22:29:39 17 [Warning] Slave: Duplicate entry 'SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77...' for key 'name' Error_code: 1062

2022-11-11 22:29:39 17 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'CS_10.000174' position 791116152; GTID position '0-1-445975954'

2022-11-11 22:29:39 16 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

2022-11-11 22:29:39 16 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

2022-11-11 22:29:39 16 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'CS_10.000174' position 791116152; GTID position '0-1-445975954'

2022-11-11 22:29:39 15 [ERROR] Slave (additional info): Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

2022-11-11 22:29:39 15 [Warning] Slave: Connection was killed Error_code: 1927

2022-11-11 22:29:39 15 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213

2022-11-11 22:29:39 15 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964

Resolved binary log from this repro is available via link in comments.

Data leftover on replica after this crash is-

SELECT * FROM master_6.sys_mutex\G

*************************** 1. row ***************************

          name: SYSTEM_LOCK:discovery_device_history:ba1e3b9adb73b30834fa5e77dc96199b

     system_id: app128033.iad105.service-now.com:ciscotesttemp2021

      acquired: 2019-09-12 19:36:46

       touched: 2019-09-12 19:36:46

        sys_id: 05259991e9ac49cb9b8860533c352800

sys_updated_by: replay.admin.29

sys_updated_on: 2019-09-12 19:36:46

sys_created_by: replay.admin.29

sys_created_on: 2019-09-12 19:36:46

 sys_mod_count: 0

The same query on the primary returns-

SELECT * FROM master_6.sys_mutex;

Empty set (0.000 sec)

SHOW GLOBAL VARIABLES from primary and replica also attached while the full error log from the replica is in the comments.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

MDEV-30010.test
3 kB
2022-12-02 13:13
MDEV-30010-pk.test
1 kB
2022-12-09 08:42
primary_global_variables.csv
21 kB
2022-11-14 22:06
replica_global_variables.csv
21 kB
2022-11-14 22:06

Issue Links

relates to

MDEV-20605 Awaken transaction can miss inserted by other transaction records due to wrong persistent cursor restoration

Closed

MDEV-20628 If temporary error occurs with optimistic mode of parallel replication, error message has false information

Closed

MDEV-30225 RR isolation violation with locking unique search

Closed

mentioned in: Page Failed to load; Page Failed to load; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...

(13 mentioned in)

Activity

Transition	Time In Source Status	Execution Times

Vladislav Lesin made transition - 2022-11-15 16:46

Open

In Progress

18h 38m

Vladislav Lesin made transition - 2022-12-02 14:12

In Progress

Stalled

16d 21h 26m

Vladislav Lesin made transition - 2022-12-08 08:45

Stalled

In Progress

5d 18h 33m

Andrei Elkin made transition - 2023-01-24 19:38

In Progress

Closed

47d 10h 53m

People

Assignee:: Andrei Elkin

Reporter:: Rob Schwyzer (Inactive)

Votes:: 1 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 2022-11-14 22:07

Updated:: 2024-09-05 19:52

Resolved:: 2023-01-24 19:38

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration