Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35110

Deadlock on Replica during BACKUP STAGE BLOCK_COMMIT on XA transactions

Details

    Description

      The user is seeing deadlocks on slave sql thread using parallel replication.
      During the backup it causes the slave_sql_thread to stuck.
      Affected version is all versions up to 10.6.19

      This ticket is an addition to MDEV-30423, which did not completely fix the problem.

      +--------+-------------+---------------------+---------+--------------+--------+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
      | Id     | User        | Host                | db      | Command      | Time   | State                                         | Info  | Progress |
      +--------+-------------+---------------------+---------+--------------+--------+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
      |      6 | system user |                     | NULL    | Slave_IO     | 756023 | Waiting for master to send event              | NULL  |    0.000 |
      |      8 | system user |                     | NULL    | Slave_worker |  44471 | Waiting for backup lock                       | NULL  |    0.000 |
      |      9 | system user |                     | NULL    | Slave_worker |  44471 | Waiting for prior transaction to commit       | NULL  |    0.000 |
      |     10 | system user |                     | NULL    | Slave_worker |  44471 | Waiting for prior transaction to commit       | NULL  |    0.000 |
      |     11 | system user |                     | bankfrm | Slave_worker |  44471 | Waiting for backup lock                       | XA COMMIT  ..... |    0.000 |
      |      7 | system user |                     | NULL    | Slave_SQL    |  38598 | Waiting for room in worker thread event queue | NULL  |    0.000 |
      | 261485 | myoper      | localhost           | NULL    | Query        |  44471 | Waiting for backup lock                       | BACKUP STAGE BLOCK_COMMIT                                                                                                                                 |    0.000 |
      

      One of the commits is a XA transaction which causes the hang as it is waiting for a prior commits while BACKUP STAGE BLOCK_COMMIT blocks the prior commit to continue

      Attachments

        Issue Links

          Activity

            From the commit message:

            The original commit in MDEV-30423 was not complete as some usage in XA of
            MDL_BACKUP_COMMIT locks did not set thd->backup_commit_lock.
            This is required to be set when using parallel replication.

            Fixed by ensuring that all usage of BACKUP_COMMIT lock i XA is uniform and
            all sets thd->backup_commit_lock. I also changed all locks to be
            MDL_EXPLICIT to keep also that part uniform.

            monty Michael Widenius added a comment - From the commit message: The original commit in MDEV-30423 was not complete as some usage in XA of MDL_BACKUP_COMMIT locks did not set thd->backup_commit_lock. This is required to be set when using parallel replication. Fixed by ensuring that all usage of BACKUP_COMMIT lock i XA is uniform and all sets thd->backup_commit_lock. I also changed all locks to be MDL_EXPLICIT to keep also that part uniform.
            susil.behera Susil Behera added a comment -

            I've completed my verification with the following RQG grammar rule,

            query:
              {$rand_xid= "'xid".abs($$)."'"; '' } complete_xa
                                                 | normal_trans 
            ;
             
            complete_xa:
               extra_sql | xa_2phase | xa_1phase | xa_rollback_idle | xa_rollback_prepared
            ;
             
            extra_sql:
                  BACKUP STAGE START 
              ;;  BACKUP STAGE BLOCK_COMMIT
              ;;  BACKUP STAGE END
            ;
             
            xa_2phase:
                 XA START { $rand_xid }
              ;; normal_query_list
              ;; XA END { $rand_xid }
              ;; XA PREPARE { $rand_xid }
              ;; XA COMMIT { $rand_xid }
            ;
             
            xa_1phase:
                 XA START { $rand_xid }
              ;; normal_query_list
              ;; XA END { $rand_xid }
              ;; XA COMMIT { $rand_xid } ONE PHASE
            ;
             
            xa_rollback_idle:
                 XA START { $rand_xid }
              ;; normal_query_list
              ;; XA END { $rand_xid }
              ;; XA ROLLBACK { $rand_xid }
            ;
             
            xa_rollback_prepared:
                 XA START { $rand_xid }
              ;; normal_query_list
              ;; XA END { $rand_xid }
              ;; XA PREPARE { $rand_xid }
              ;; XA ROLLBACK { $rand_xid }
            ;
             
            normal_trans:
              normal_trans_commit | normal_trans_rollback
            ;
             
            normal_trans_commit:
                  START TRANSACTION
              ;;  normal_query_list
              ;;  COMMIT
            ;
             
            normal_trans_rollback:
                  START TRANSACTION
              ;;  normal_query_list
              ;;  ROLLBACK
            ;
             
            normal_query_list:
              normal_query | 
              normal_query ;; normal_query_list
            ;
             
            normal_query:
              select | insert | insert | update | delete ;
             
            select:
                point_select | simple_range | sum_range | order_range | distinct_range
            ;
             
            point_select:
                SELECT k FROM _table WHERE _field_pk = _smallint_unsigned ;
             
            simple_range:
                SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;
             
            sum_range:
                SELECT SUM(k) FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;
             
            order_range:
                SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;
             
            distinct_range:
                SELECT DISTINCT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;
             
            insert:
              INSERT IGNORE INTO _table ( k, c, pad)  VALUES ( _int, _string, _string )
            ;
             
            update:
              index_update | non_index_update ;
             
            delete:
              DELETE FROM _table WHERE _field_pk = _smallint_unsigned ;
             
            index_update:
              UPDATE IGNORE _table SET _field_int_indexed = _field_int_indexed + 1 WHERE _field_pk = _smallint_unsigned ;
             
            # It relies on char fields being unindexed.
            # If char fields happen to be indexed in the table spec, then this update can be indexed as well. No big harm though.
            non_index_update:
              UPDATE _table SET _field_char = _string WHERE _field_pk = _smallint_unsigned ;
            

            The following 3 parallel sessions were run,

            session1 (RQG Run)>
            $ perl ./run.pl --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=mixed --mysqld=--slave_parallel_threads=4 --mysqld=--slave-parallel-mode=minimal --reporters=Backtrace,Deadlock  --gendata=conf/zz/mdev-34466.zz --grammar=conf/yy/mdev-34466.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MDEV-35110-MD161024-mariadb-10.6.20-linux-x86_64-opt --vardir=/dev/shm/var1 --queries=100000 --duration=600 --threads=4
             
            session2 (BACKUP STAGE commands)>
            $ while true; do ./bin/mariadb -h localhost -P 14000 -uroot -e "SELECT SLEEP(2); BACKUP STAGE START ; BACKUP STAGE BLOCK_COMMIT ; BACKUP STAGE END;" ; done
             
            session3 (Some SQL queries)>
            $ while true; do ./bin/mariadb -h localhost -P 14000 -uroot -e "SELECT COUNT(1) FROM oltp_db.oltp1; SELECT * FROM oltp_db.oltp1 WHERE id < 100000 LIMIT 2; SELECT * FROM oltp_db.oltp1 WHERE id > 1 LIMIT 2;" ; done
            

            The following settings/modes have been exercised with the above test runs,
            serial replication
            binlog_format=

            {statement|row|mixed}

            parallel_mode=

            {optimistic|aggressive|minimal}

            binlog_row_image=

            {FULL|MINIMAL|NOBLOB}

            Serial replication mode

            Everything looks good. OK TO PUSH from my side.

            susil.behera Susil Behera added a comment - I've completed my verification with the following RQG grammar rule, query: {$rand_xid= "'xid".abs($$)."'"; '' } complete_xa | normal_trans ;   complete_xa: extra_sql | xa_2phase | xa_1phase | xa_rollback_idle | xa_rollback_prepared ;   extra_sql: BACKUP STAGE START ;; BACKUP STAGE BLOCK_COMMIT ;; BACKUP STAGE END ;   xa_2phase: XA START { $rand_xid } ;; normal_query_list ;; XA END { $rand_xid } ;; XA PREPARE { $rand_xid } ;; XA COMMIT { $rand_xid } ;   xa_1phase: XA START { $rand_xid } ;; normal_query_list ;; XA END { $rand_xid } ;; XA COMMIT { $rand_xid } ONE PHASE ;   xa_rollback_idle: XA START { $rand_xid } ;; normal_query_list ;; XA END { $rand_xid } ;; XA ROLLBACK { $rand_xid } ;   xa_rollback_prepared: XA START { $rand_xid } ;; normal_query_list ;; XA END { $rand_xid } ;; XA PREPARE { $rand_xid } ;; XA ROLLBACK { $rand_xid } ;   normal_trans: normal_trans_commit | normal_trans_rollback ;   normal_trans_commit: START TRANSACTION ;; normal_query_list ;; COMMIT ;   normal_trans_rollback: START TRANSACTION ;; normal_query_list ;; ROLLBACK ;   normal_query_list: normal_query | normal_query ;; normal_query_list ;   normal_query: select | insert | insert | update | delete ;   select: point_select | simple_range | sum_range | order_range | distinct_range ;   point_select: SELECT k FROM _table WHERE _field_pk = _smallint_unsigned ;   simple_range: SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;   sum_range: SELECT SUM(k) FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;   order_range: SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;   distinct_range: SELECT DISTINCT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;   insert: INSERT IGNORE INTO _table ( k, c, pad) VALUES ( _int, _string, _string ) ;   update: index_update | non_index_update ;   delete: DELETE FROM _table WHERE _field_pk = _smallint_unsigned ;   index_update: UPDATE IGNORE _table SET _field_int_indexed = _field_int_indexed + 1 WHERE _field_pk = _smallint_unsigned ;   # It relies on char fields being unindexed. # If char fields happen to be indexed in the table spec, then this update can be indexed as well. No big harm though. non_index_update: UPDATE _table SET _field_char = _string WHERE _field_pk = _smallint_unsigned ; The following 3 parallel sessions were run, session1 (RQG Run)> $ perl ./run.pl --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=mixed --mysqld=--slave_parallel_threads=4 --mysqld=--slave-parallel-mode=minimal --reporters=Backtrace,Deadlock --gendata=conf/zz/mdev-34466.zz --grammar=conf/yy/mdev-34466.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MDEV-35110-MD161024-mariadb-10.6.20-linux-x86_64-opt --vardir=/dev/shm/var1 --queries=100000 --duration=600 --threads=4   session2 (BACKUP STAGE commands)> $ while true; do ./bin/mariadb -h localhost -P 14000 -uroot -e "SELECT SLEEP(2); BACKUP STAGE START ; BACKUP STAGE BLOCK_COMMIT ; BACKUP STAGE END;" ; done   session3 (Some SQL queries)> $ while true; do ./bin/mariadb -h localhost -P 14000 -uroot -e "SELECT COUNT(1) FROM oltp_db.oltp1; SELECT * FROM oltp_db.oltp1 WHERE id < 100000 LIMIT 2; SELECT * FROM oltp_db.oltp1 WHERE id > 1 LIMIT 2;" ; done The following settings/modes have been exercised with the above test runs, serial replication binlog_format= {statement|row|mixed} parallel_mode= {optimistic|aggressive|minimal} binlog_row_image= {FULL|MINIMAL|NOBLOB} Serial replication mode Everything looks good. OK TO PUSH from my side.
            susil.behera Susil Behera added a comment - - edited

            I've added disconnect XAs to the grammar, also performed periodic backups on the slave while RQG was running. I didn't see any problem. The slave seems to be stuck after a while.

            RQG grammar rule:

            query_init:
              # Need debug build. 
              # It removes the restrictions for master, i.e. unmodified records are unlocked for master also, 
              # normally it is done only for slave workers. It would increase the probability to catch some bug in the code
              SET GLOBAL innodb_enable_xap_unlock_unmodified_for_primary_debug = 1
            ;
             
            query:
              {$rand_xid= "'xid".abs($$)."'"; '' } complete_xa
                                                 | disconnect_xa
                                                 | normal_trans 
            ;
             
            thread1_init:
                  USE oltp_db
              ;;  DROP TABLE IF EXISTS t1
              ;;  CREATE TABLE t1 (a int, b int, c int, INDEX i1(a), INDEX i2(b)) 
              ;;  INSERT INTO t1 VALUES (1,1,0), (1,2,0), (2,1,0), (2,2,0)
            ;
             
            complete_xa:
               extra_sql | xa_2phase | xa_1phase | xa_rollback_idle | xa_rollback_prepared
            ;
             
            disconnect_xa:
                xa_2phase_1 | xa_2phase_2 | xa_1phase_1 | xa_1phase_2 | xa_rollback_idle_1 | xa_rollback_idle_2 | xa_rollback_prepared_1 | xa_rollback_prepared_2
            ;
             
            extra_sql:
                  extra_sql_1
              ;;  extra_sql_2
            ;
             
            extra_sql_1:
                  
                  USE oltp_db
              ;;  XA START { $rand_xid }
              ;;  DELETE FROM t1
              ;;  INSERT INTO t1 VALUES (1,1,0), (1,2,0), (2,1,0), (2,2,0)
              ;;  UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=1
              ;;  XA END { $rand_xid }
              ;;  XA PREPARE { $rand_xid }
              ;;  XA COMMIT { $rand_xid }
            ;
             
            extra_sql_2:
                  USE oltp_db
              ;;  SELECT * FROM t1 FORCE INDEX (i1) WHERE a=2 AND b=1 FOR UPDATE
            ;
             
            xa_2phase:
                 XA START { $rand_xid }
              ;; normal_query_list
              ;; XA END { $rand_xid }
              ;; XA PREPARE { $rand_xid }
              ;; XA COMMIT { $rand_xid }
            ;
             
            xa_1phase:
                 XA START { $rand_xid }
              ;; normal_query_list
              ;; XA END { $rand_xid }
              ;; XA COMMIT { $rand_xid } ONE PHASE
            ;
             
            xa_rollback_idle:
                 XA START { $rand_xid }
              ;; normal_query_list
              ;; XA END { $rand_xid }
              ;; XA ROLLBACK { $rand_xid }
            ;
             
            xa_rollback_prepared:
                 XA START { $rand_xid }
              ;; normal_query_list
              ;; XA END { $rand_xid }
              ;; XA PREPARE { $rand_xid }
              ;; XA ROLLBACK { $rand_xid }
            ;
             
            xa_2phase_1:
                 XA START 'xa_2phase'
              ;; normal_query_list
              ;; XA END 'xa_2phase'
              ;; XA PREPARE 'xa_2phase'
            ;
             
            xa_2phase_2:
              XA COMMIT 'xa_2phase'
            ;
             
            xa_1phase_1:
                 XA START 'xa_1phase'
              ;; normal_query_list
              ;; XA END 'xa_1phase'
            ;
             
            xa_1phase_2:
               XA COMMIT 'xa_1phase' ONE PHASE
            ;
             
            xa_rollback_idle_1:
                 XA START 'xa_rollback_idle'
              ;; normal_query_list
              ;; XA END 'xa_rollback_idle'
            ;
             
            xa_rollback_idle_2:
              XA ROLLBACK 'xa_rollback_idle'
            ;
             
            xa_rollback_prepared_1:
                 XA START 'xa_rollback_prepared'
              ;; normal_query_list
              ;; XA END 'xa_rollback_prepared'
              ;; XA PREPARE 'xa_rollback_prepared'
            ;
             
            xa_rollback_prepared_2:
              XA ROLLBACK 'xa_rollback_prepared'
            ;
             
            normal_trans:
              normal_trans_commit | normal_trans_rollback
            ;
             
            normal_trans_commit:
                  START TRANSACTION
              ;;  normal_query_list
              ;;  COMMIT
            ;
             
            normal_trans_rollback:
                  START TRANSACTION
              ;;  normal_query_list
              ;;  ROLLBACK
            ;
             
            normal_query_list:
              normal_query | 
              normal_query ;; normal_query_list
            ;
             
            normal_query:
              select | insert | insert | update | delete ;
             
            select:
                point_select | simple_range | sum_range | order_range | distinct_range
            ;
             
            point_select:
                SELECT k FROM _table WHERE _field_pk = _smallint_unsigned ;
             
            simple_range:
                SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;
             
            sum_range:
                SELECT SUM(k) FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;
             
            order_range:
                SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;
             
            distinct_range:
                SELECT DISTINCT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;
             
            insert:
              INSERT IGNORE INTO _table ( k, c, pad)  VALUES ( _int, _string, _string )
            ;
             
            update:
              index_update | non_index_update ;
             
            delete:
              DELETE FROM _table WHERE _field_pk = _smallint_unsigned ;
             
            index_update:
              UPDATE IGNORE _table SET _field_int_indexed = _field_int_indexed + 1 WHERE _field_pk = _smallint_unsigned ;
             
            # It relies on char fields being unindexed.
            # If char fields happen to be indexed in the table spec, then this update can be indexed as well. No big harm though.
            non_index_update:
              UPDATE _table SET _field_char = _string WHERE _field_pk = _smallint_unsigned ;
            

            RQG command:

            perl ./run.pl --scenario=Replication --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=row --mysqld=--slave_parallel_threads=4  --mysqld=--slave-parallel-mode=optimistic --reporters=Backtrace,Deadlock  --gendata=conf/zz/mdev-34466.zz --grammar=conf/yy/mdev-34466.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MD231024-mariadb-10.6.20-linux-x86_64-opt --server2-basedir=/test/MD231024-mariadb-10.6.20-linux-x86_64-opt --vardir=/dev/shm/var1 --queries=1000000 --duration=1800 --threads=4
            

            backup command run on the slave:

            mariabackup --backup --target-dir /home/mariadb_backup_<index> --socket /dev/shm/var1/s2/mysql.sock -uroot
            

            But I've not tested the above yet to see if it reproduces the problem on the BASE (without the current fix). As per Elkin this validation is important.

            MariaDB [(none)]> SHOW SLAVE STATUS\G\
            *************************** 1. row ***************************
                            Slave_IO_State: Waiting for master to send event
                               Master_Host: 127.0.0.1
                               Master_User: replication
                               Master_Port: 14000
                             Connect_Retry: 60
                           Master_Log_File: mysql-bin.000001
                       Read_Master_Log_Pos: 9100679
                            Relay_Log_File: mysql-relay-bin.000002
                             Relay_Log_Pos: 3862766
                     Relay_Master_Log_File: mysql-bin.000001
                          Slave_IO_Running: Yes
                         Slave_SQL_Running: Yes
                           Replicate_Do_DB: 
                       Replicate_Ignore_DB: 
                        Replicate_Do_Table: 
                    Replicate_Ignore_Table: 
                   Replicate_Wild_Do_Table: 
               Replicate_Wild_Ignore_Table: 
                                Last_Errno: 0
                                Last_Error: 
                              Skip_Counter: 0
                       Exec_Master_Log_Pos: 3862467
                           Relay_Log_Space: 9101287
                           Until_Condition: None
                            Until_Log_File: 
                             Until_Log_Pos: 0
                        Master_SSL_Allowed: No
                        Master_SSL_CA_File: 
                        Master_SSL_CA_Path: 
                           Master_SSL_Cert: 
                         Master_SSL_Cipher: 
                            Master_SSL_Key: 
                     Seconds_Behind_Master: 753
             Master_SSL_Verify_Server_Cert: No
                             Last_IO_Errno: 0
                             Last_IO_Error: 
                            Last_SQL_Errno: 0
                            Last_SQL_Error: 
               Replicate_Ignore_Server_Ids: 
                          Master_Server_Id: 1
                            Master_SSL_Crl: 
                        Master_SSL_Crlpath: 
                                Using_Gtid: No
                               Gtid_IO_Pos: 
                   Replicate_Do_Domain_Ids: 
               Replicate_Ignore_Domain_Ids: 
                             Parallel_Mode: optimistic
                                 SQL_Delay: 0
                       SQL_Remaining_Delay: NULL
                   Slave_SQL_Running_State: Waiting for room in worker thread event queue
                          Slave_DDL_Groups: 10
            Slave_Non_Transactional_Groups: 1
                Slave_Transactional_Groups: 9375
            1 row in set (0.000 sec)
            

            susil.behera Susil Behera added a comment - - edited I've added disconnect XAs to the grammar, also performed periodic backups on the slave while RQG was running. I didn't see any problem. The slave seems to be stuck after a while. RQG grammar rule: query_init: # Need debug build. # It removes the restrictions for master, i.e. unmodified records are unlocked for master also, # normally it is done only for slave workers. It would increase the probability to catch some bug in the code SET GLOBAL innodb_enable_xap_unlock_unmodified_for_primary_debug = 1 ;   query: {$rand_xid= "'xid".abs($$)."'"; '' } complete_xa | disconnect_xa | normal_trans ;   thread1_init: USE oltp_db ;; DROP TABLE IF EXISTS t1 ;; CREATE TABLE t1 (a int, b int, c int, INDEX i1(a), INDEX i2(b)) ;; INSERT INTO t1 VALUES (1,1,0), (1,2,0), (2,1,0), (2,2,0) ;   complete_xa: extra_sql | xa_2phase | xa_1phase | xa_rollback_idle | xa_rollback_prepared ;   disconnect_xa: xa_2phase_1 | xa_2phase_2 | xa_1phase_1 | xa_1phase_2 | xa_rollback_idle_1 | xa_rollback_idle_2 | xa_rollback_prepared_1 | xa_rollback_prepared_2 ;   extra_sql: extra_sql_1 ;; extra_sql_2 ;   extra_sql_1: USE oltp_db ;; XA START { $rand_xid } ;; DELETE FROM t1 ;; INSERT INTO t1 VALUES (1,1,0), (1,2,0), (2,1,0), (2,2,0) ;; UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=1 ;; XA END { $rand_xid } ;; XA PREPARE { $rand_xid } ;; XA COMMIT { $rand_xid } ;   extra_sql_2: USE oltp_db ;; SELECT * FROM t1 FORCE INDEX (i1) WHERE a=2 AND b=1 FOR UPDATE ;   xa_2phase: XA START { $rand_xid } ;; normal_query_list ;; XA END { $rand_xid } ;; XA PREPARE { $rand_xid } ;; XA COMMIT { $rand_xid } ;   xa_1phase: XA START { $rand_xid } ;; normal_query_list ;; XA END { $rand_xid } ;; XA COMMIT { $rand_xid } ONE PHASE ;   xa_rollback_idle: XA START { $rand_xid } ;; normal_query_list ;; XA END { $rand_xid } ;; XA ROLLBACK { $rand_xid } ;   xa_rollback_prepared: XA START { $rand_xid } ;; normal_query_list ;; XA END { $rand_xid } ;; XA PREPARE { $rand_xid } ;; XA ROLLBACK { $rand_xid } ;   xa_2phase_1: XA START 'xa_2phase' ;; normal_query_list ;; XA END 'xa_2phase' ;; XA PREPARE 'xa_2phase' ;   xa_2phase_2: XA COMMIT 'xa_2phase' ;   xa_1phase_1: XA START 'xa_1phase' ;; normal_query_list ;; XA END 'xa_1phase' ;   xa_1phase_2: XA COMMIT 'xa_1phase' ONE PHASE ;   xa_rollback_idle_1: XA START 'xa_rollback_idle' ;; normal_query_list ;; XA END 'xa_rollback_idle' ;   xa_rollback_idle_2: XA ROLLBACK 'xa_rollback_idle' ;   xa_rollback_prepared_1: XA START 'xa_rollback_prepared' ;; normal_query_list ;; XA END 'xa_rollback_prepared' ;; XA PREPARE 'xa_rollback_prepared' ;   xa_rollback_prepared_2: XA ROLLBACK 'xa_rollback_prepared' ;   normal_trans: normal_trans_commit | normal_trans_rollback ;   normal_trans_commit: START TRANSACTION ;; normal_query_list ;; COMMIT ;   normal_trans_rollback: START TRANSACTION ;; normal_query_list ;; ROLLBACK ;   normal_query_list: normal_query | normal_query ;; normal_query_list ;   normal_query: select | insert | insert | update | delete ;   select: point_select | simple_range | sum_range | order_range | distinct_range ;   point_select: SELECT k FROM _table WHERE _field_pk = _smallint_unsigned ;   simple_range: SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;   sum_range: SELECT SUM(k) FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;   order_range: SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;   distinct_range: SELECT DISTINCT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;   insert: INSERT IGNORE INTO _table ( k, c, pad) VALUES ( _int, _string, _string ) ;   update: index_update | non_index_update ;   delete: DELETE FROM _table WHERE _field_pk = _smallint_unsigned ;   index_update: UPDATE IGNORE _table SET _field_int_indexed = _field_int_indexed + 1 WHERE _field_pk = _smallint_unsigned ;   # It relies on char fields being unindexed. # If char fields happen to be indexed in the table spec, then this update can be indexed as well. No big harm though. non_index_update: UPDATE _table SET _field_char = _string WHERE _field_pk = _smallint_unsigned ; RQG command: perl ./run.pl --scenario=Replication --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=row --mysqld=--slave_parallel_threads=4 --mysqld=--slave-parallel-mode=optimistic --reporters=Backtrace,Deadlock --gendata=conf/zz/mdev-34466.zz --grammar=conf/yy/mdev-34466.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MD231024-mariadb-10.6.20-linux-x86_64-opt --server2-basedir=/test/MD231024-mariadb-10.6.20-linux-x86_64-opt --vardir=/dev/shm/var1 --queries=1000000 --duration=1800 --threads=4 backup command run on the slave: mariabackup --backup --target-dir /home/mariadb_backup_<index> --socket /dev/shm/var1/s2/mysql.sock -uroot But I've not tested the above yet to see if it reproduces the problem on the BASE (without the current fix). As per Elkin this validation is important. MariaDB [(none)]> SHOW SLAVE STATUS\G\ *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: replication Master_Port: 14000 Connect_Retry: 60 Master_Log_File: mysql-bin.000001 Read_Master_Log_Pos: 9100679 Relay_Log_File: mysql-relay-bin.000002 Relay_Log_Pos: 3862766 Relay_Master_Log_File: mysql-bin.000001 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 3862467 Relay_Log_Space: 9101287 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 753 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: No Gtid_IO_Pos: Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: optimistic SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Waiting for room in worker thread event queue Slave_DDL_Groups: 10 Slave_Non_Transactional_Groups: 1 Slave_Transactional_Groups: 9375 1 row in set (0.000 sec)
            susil.behera Susil Behera added a comment -

            I'm reconfirming that there is a problem. After a while the slave is stuck.

            susil.behera Susil Behera added a comment - I'm reconfirming that there is a problem. After a while the slave is stuck.
            Elkin Andrei Elkin added a comment - - edited

            monty, susil.behera: today I've reviewed and tested to find out a glitch in the patch that ASAN reacted with

            ==1071988==ERROR: AddressSanitizer: stack-use-after-scope on address 0x713255064078 at pc 0x635a77703400 bp 0x7132562fe710 sp 0x7132562fe700
            READ of size 8 at 0x713255064078 thread T18
                #0 0x635a777033ff in trans_xa_release_backup_lock /home/andrei/MDB/WTs/CS/Review/II/sql/xa.cc:538
                #1 0x635a77703d30 in trans_xa_prepare(THD*) /home/andrei/MDB/WTs/CS/Review/II/sql/xa.cc:603
                #2 0x635a76fe4009 in mysql_execute_command(THD*, bool) /home/andrei/MDB/WTs/CS/Review/II/sql/sql_parse.cc:5995
            

            I pushed my commit on the top of Monty's branch HEAD but in a separate bb-10.6-monty_andrei-xa rebased onto the current 10.6.
            It passes the grammar, does not hit an innodb assert (thanks to the rebase) but I did not have time to exercize the grammar with concurrent BACKUP.

            Elkin Andrei Elkin added a comment - - edited monty , susil.behera : today I've reviewed and tested to find out a glitch in the patch that ASAN reacted with == 1071988 ==ERROR: AddressSanitizer: stack-use-after-scope on address 0x713255064078 at pc 0x635a77703400 bp 0x7132562fe710 sp 0x7132562fe700 READ of size 8 at 0x713255064078 thread T18 # 0 0x635a777033ff in trans_xa_release_backup_lock /home/andrei/MDB/WTs/CS/Review/II/sql/xa.cc: 538 # 1 0x635a77703d30 in trans_xa_prepare(THD*) /home/andrei/MDB/WTs/CS/Review/II/sql/xa.cc: 603 # 2 0x635a76fe4009 in mysql_execute_command(THD*, bool) /home/andrei/MDB/WTs/CS/Review/II/sql/sql_parse.cc: 5995 I pushed my commit on the top of Monty's branch HEAD but in a separate bb-10.6-monty_andrei-xa rebased onto the current 10.6. It passes the grammar, does not hit an innodb assert (thanks to the rebase) but I did not have time to exercize the grammar with concurrent BACKUP.
            susil.behera Susil Behera added a comment -

            Verified against the new branch bb-10.6-monty_andrei-xa. I'm not seeing any slave hang now. The replication is in good state even after 45min RQG run.

            RQG run cmd:

            perl ./run.pl --scenario=Replication --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=row --mysqld=--slave_parallel_threads=4  --mysqld=--slave-parallel-mode=optimistic --reporters=Backtrace,Deadlock  --gendata=conf/zz/mdev-35110.zz --grammar=conf/yy/mdev-35110.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MD261024-mariadb-10.6.20-linux-x86_64-dbg --server2-basedir=/test/MD261024-mariadb-10.6.20-linux-x86_64-dbg --vardir=/dev/shm/var1 --queries=1000000 --duration=1800 --threads=4
            

            I've added the BACKUP STAGE commands to the grammar.

            --gendata config file
            --grammar config file

            susil.behera Susil Behera added a comment - Verified against the new branch bb-10.6-monty_andrei-xa. I'm not seeing any slave hang now. The replication is in good state even after 45min RQG run. RQG run cmd: perl ./run.pl --scenario=Replication --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=row --mysqld=--slave_parallel_threads=4 --mysqld=--slave-parallel-mode=optimistic --reporters=Backtrace,Deadlock --gendata=conf/zz/mdev-35110.zz --grammar=conf/yy/mdev-35110.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MD261024-mariadb-10.6.20-linux-x86_64-dbg --server2-basedir=/test/MD261024-mariadb-10.6.20-linux-x86_64-dbg --vardir=/dev/shm/var1 --queries=1000000 --duration=1800 --threads=4 I've added the BACKUP STAGE commands to the grammar. --gendata config file --grammar config file
            susil.behera Susil Behera added a comment -

            The branch bb-10.6-monty_andrei-xa looks good. No hangs seen on the slave, replication has been successful after 45m RQG run.

            susil.behera Susil Behera added a comment - The branch bb-10.6-monty_andrei-xa looks good. No hangs seen on the slave, replication has been successful after 45m RQG run.

            People

              Elkin Andrei Elkin
              monty Michael Widenius
              Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.