[MDEV-31755] Replica's DML event deadlocks with online alter table

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 11.2.1
Component/s: Data Definition - Alter Table
Labels: None

Description

On the parallel slave, a two-phase binlogged "start" ALTER deadlocks with the DML statement that follows it in binlog order.
The cause is an incomplete integration of the MDEV-16329 feature with the MDEV-11675 framework: the ALTER takes a stronger MDL lock
than required, which prevents the DML from executing on the slave (while it succeeds on the master).
The following mtr test exposes it:

source include/master-slave.inc;

--connection slave
source include/stop_slave.inc;

# MDEV-31755 Replica's DML event deadlocks with online alter table
# Three threads for SA, U, CA (START ALTER, UPDATE, COMMIT ALTER)
--let $slave_parallel_threads=`select @@global.slave_parallel_threads`
--let $slave_parallel_mode=   `select @@global.slave_parallel_mode`
set global slave_parallel_threads=3;
set global slave_parallel_mode= optimistic;

--connection master
create table t (id int, a int, b text, primary key (id));
insert into t values (1,10,''),(2,20,'');
set @@session.binlog_alter_two_phase=1;
set debug_sync= 'alter_table_online_progress signal ready wait_for go';
send alter table t force, algorithm=copy, lock=none;

connect (con1,localhost,root,,);
set debug_sync= 'now wait_for ready';
update t set a = 1;
set debug_sync= 'now signal go';

--connection master
--reap
--source include/save_master_gtid.inc

--connection slave
source include/start_slave.inc;
--source include/sync_with_master_gtid.inc
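
The hang described above can be illustrated with a toy model. This is a minimal sketch, not server code: the function `replay` and the wait rules inside it are assumptions made for illustration (SA finishes only after CA, CA finishes only after the earlier U, and U is blocked by SA only when the ALTER's lock excludes writers). It replays the three events and reports which of them can finish.

```python
from typing import Dict, List, Set


def replay(alter_blocks_writers: bool) -> List[str]:
    """Return the events that manage to finish, in completion order.

    Assumed (hypothetical) wait rules for the three replicated events:
      SA (START ALTER)  finishes only once CA has finished,
      CA (COMMIT ALTER) finishes only after the earlier U has finished,
      U  (UPDATE)       waits for SA only if the ALTER's lock excludes writers.
    """
    waits_for: Dict[str, Set[str]] = {
        "SA": {"CA"},
        "U": {"SA"} if alter_blocks_writers else set(),
        "CA": {"U"},
    }
    finished: List[str] = []
    progressed = True
    # Keep sweeping until no event can make progress any more.
    while progressed:
        progressed = False
        for event in ("SA", "U", "CA"):
            if event not in finished and waits_for[event] <= set(finished):
                finished.append(event)
                progressed = True
    return finished


if __name__ == "__main__":
    print("lock allows writers:", replay(alter_blocks_writers=False))
    print("lock blocks writers:", replay(alter_blocks_writers=True))
```

In this model all three events finish when the lock allows concurrent writers, and none of them finishes when it blocks writers. That matches the symptom reported here, under the stated assumptions.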

Attachments

- slave_status (2 kB, 2023-07-20 21:03)
- status (65 kB, 2023-07-20 21:03)
- threads_slave (58 kB, 2023-07-20 21:03)
- threads_slave_full (161 kB, 2023-07-20 21:03)
- variables (85 kB, 2023-07-20 21:03)

Issue Links

is caused by: MDEV-16329 Engine-independent online ALTER TABLE (Closed)

Activity

Kristian Nielsen added a comment - 2023-07-20 19:34

How can I view the details of this issue? The Google drive link doesn't seem to work.

Elena Stepanova added a comment - 2023-07-20 21:06 - edited

The Google Drive link should never have been there. I hope nikitamalyavin will write a proper description soon enough.
Meanwhile, I've attached some of the files from that archive:
status, variables, threads_slave_full, threads_slave, slave_status

I cannot, however, attach the actual binlog, which I guess is the most important part. If you need it, ~~I'll upload it to the foundation FTP's public when I find out how~~ it has been uploaded to ftp.mariadb.org/public/MDEV-31755-deadlocked_replication_of_alter.tar.gz

Kristian Nielsen added a comment - 2023-07-21 10:19

Thanks, Elena, Nikita.

I was a bit confused at first having only threads_slave_full to look into, as the GTIDs and sub_ids are a bit strange. But it's possible that they are just incorrect values shown by GDB. It makes sense that different locking for the ALTER on master and slave can cause this kind of hang.

It's something that worries me in general: the START ALTER feels fragile, since it can hang if any kind of different lock conflict occurs against a later query. It feels like we need, for metadata locks, the lock wait report and kill that we currently have for InnoDB row locks. (But that's probably a separate issue from this particular one.)

- Kristian.

Nikita Malyavin added a comment - 2023-07-21 13:08 - edited

knielsen, I have counted 6 locking systems in the server. Some of them (like InnoDB's row/table locks, my_safe_mutex, or MDL) have their own deadlock detection systems; others have none at all. It would be nice to generalize all of this into one deadlock detection module and extend it to the locks that are not covered.

I'm not sure, though, whether we can do it here: in general, condvars can't be deadlock-detected (we don't know who else can signal), but here there is only one producer and it is known to us, so we should be able to.
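
The idea of a single detector covering several kinds of waits can be sketched as a wait-for graph. This is a minimal illustration, not the server's code: the helpers `build_edges` and `find_cycle` are hypothetical, each waiter is assumed to wait for exactly one other thread, and a condvar wait contributes an edge only when its single signaller is known.

```python
from typing import Dict, List, Optional


def build_edges(lock_waits: Dict[str, str],
                condvar_waits: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Merge waits from different subsystems into one waiter -> blocker map.

    lock_waits:    waiter -> current holder (always known).
    condvar_waits: waiter -> the single known signaller, or None if anyone
                   could signal, in which case no edge can be added.
    """
    edges = dict(lock_waits)
    for waiter, producer in condvar_waits.items():
        if producer is not None:
            edges[waiter] = producer
    return edges


def find_cycle(edges: Dict[str, str]) -> Optional[List[str]]:
    """Follow waiter -> blocker links and return the first cycle found."""
    for start in edges:
        path = [start]
        seen = {start}
        node = edges.get(start)
        while node is not None:
            if node in seen:
                return path[path.index(node):]
            path.append(node)
            seen.add(node)
            node = edges.get(node)
    return None


if __name__ == "__main__":
    # Illustrative waits: U waits for SA's lock, CA waits for U to commit.
    lock_waits = {"U": "SA", "CA": "U"}
    print(find_cycle(build_edges(lock_waits, {"SA": "CA"})))   # producer known
    print(find_cycle(build_edges(lock_waits, {"SA": None})))   # producer unknown
```

With the producer known, the graph closes into a cycle and the hang becomes detectable. With an unknown producer the same situation goes unreported, which is the limitation of condvars noted above.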

People

Assignee: Nikita Malyavin

Reporter: Nikita Malyavin

Votes: 0

Watchers: 5

Dates

Created: 2023-07-20 16:38

Updated: 2023-08-16 09:31

Resolved: 2023-08-16 09:31
