  1. MariaDB Server
  2. MDEV-31755

Replica's DML event deadlocks with online alter table

Details

    Description

      On a parallel slave, a two-phase binlogged "start" ALTER deadlocks with the DML statement that follows it in binlog order.
      The cause is an incomplete integration of the MDEV-16329 feature with the MDEV-11675 framework: ALTER takes a stronger MDL lock
      than required, which blocks the DML on the slave (while it succeeds on the master).
      The following mtr test exposes it:

      source include/master-slave.inc;
       
      --connection slave
      source include/stop_slave.inc;
       
      #
      # MDEV-31755 Replica's DML event deadlocks with online alter table
      #
      # Three threads for SA,U,CA
      --let $slave_parallel_threads=`select @@global.slave_parallel_threads`
      --let $slave_parallel_mode=   `select @@global.slave_parallel_mode`
      set global slave_parallel_threads=3;
      set global slave_parallel_mode= optimistic;
       
      --connection master
      create table t (id int, a int, b text, primary key (id));
      insert into t values (1,10,''),(2,20,'');
       
      set @@session.binlog_alter_two_phase=1;
      set debug_sync= 'alter_table_online_progress signal ready wait_for go';
      send alter table t force, algorithm=copy, lock=none;
       
      connect (con1,localhost,root,,);
      set debug_sync= 'now wait_for ready';
       
      update t set a = 1;
       
      set debug_sync= 'now signal go';
       
      --connection master
      --reap
      --source include/save_master_gtid.inc
       
      --connection slave
      source include/start_slave.inc;
      --source include/sync_with_master_gtid.inc
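
      While the slave is hung, the metadata locks involved can be inspected from a separate connection to the slave. A minimal sketch, assuming the metadata_lock_info plugin is available in the build (it is not loaded by default) and using the table name t from the test above:

      ```sql
      -- Assumption: the metadata_lock_info plugin ships with this build.
      INSTALL SONAME 'metadata_lock_info';

      -- Metadata locks currently held on the test table.
      SELECT * FROM information_schema.METADATA_LOCK_INFO WHERE TABLE_NAME = 't';

      -- Slave worker threads and the state each one is waiting in.
      SHOW FULL PROCESSLIST;
      ```

      If ALTER holds a stronger lock than the online copy needs, it should show up here next to a worker thread that is waiting for a metadata lock on t.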
      

      Attachments

        1. slave_status
          2 kB
        2. status
          65 kB
        3. threads_slave
          58 kB
        4. threads_slave_full
          161 kB
        5. variables
          85 kB

          Activity

            knielsen Kristian Nielsen added a comment

            How can I view the details of this issue? The Google drive link doesn't seem to work.

            elenst Elena Stepanova added a comment (edited)

            The Google drive link should never have been there. I hope nikitamalyavin will write a proper description soon enough.
            Meanwhile, I've attached some of the files from that archive:
            status, variables, threads_slave_full, threads_slave, slave_status

            I cannot, however, attach the actual binlog, which I guess is the most important part. If you need it, I'll upload it to the foundation FTP's public directory when I find out how. It has been uploaded to ftp.mariadb.org/public/MDEV-31755-deadlocked_replication_of_alter.tar.gz

            knielsen Kristian Nielsen added a comment

            Thanks, Elena, Nikita.

            I was a bit confused at first, having only threads_slave_full to look into, as the GTIDs and sub_ids are a bit strange. But it's possible that they are just incorrect values shown by GDB. It makes sense that different locking for the ALTER on master and slave can cause this kind of hang.

            It's something that worries me in general: the START ALTER feels fragile, since it can hang if any kind of different lock conflict occurs against a later query. It feels like we need, for metadata locks, the same lock wait report and kill that we currently have for InnoDB row locks. (But that's probably a separate issue from this particular one.)

            - Kristian.

            nikitamalyavin Nikita Malyavin added a comment (edited)

            knielsen I have counted 6 locking systems in the server. Some of them (like InnoDB's row/table locks, or my_safe_mutex, or MDL) have their own deadlock detection systems; others have none at all. It'd be nice to generalize it all in one deadlock detection module and extend it to the locks not covered.

            I'm not sure whether we can do it here, though. In general, condvars can't be deadlock-detected (we don't know who else can signal), but here there is only one producer and it is known to us, so we should be able to.

            People

              nikitamalyavin Nikita Malyavin