Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
None
-
None
Description
The user is seeing deadlocks on slave sql thread using parallel replication.
During the backup it causes the slave_sql_thread to stuck.
Affected version is all versions up to 10.6.19
This ticket is an addition to MDEV-30423, which did not completely fix the problem.
+--------+-------------+---------------------+---------+--------------+--------+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|
| Id | User | Host | db | Command | Time | State | Info | Progress |
|
+--------+-------------+---------------------+---------+--------------+--------+-----------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|
| 6 | system user | | NULL | Slave_IO | 756023 | Waiting for master to send event | NULL | 0.000 |
|
| 8 | system user | | NULL | Slave_worker | 44471 | Waiting for backup lock | NULL | 0.000 |
|
| 9 | system user | | NULL | Slave_worker | 44471 | Waiting for prior transaction to commit | NULL | 0.000 |
|
| 10 | system user | | NULL | Slave_worker | 44471 | Waiting for prior transaction to commit | NULL | 0.000 |
|
| 11 | system user | | bankfrm | Slave_worker | 44471 | Waiting for backup lock | XA COMMIT ..... | 0.000 |
|
| 7 | system user | | NULL | Slave_SQL | 38598 | Waiting for room in worker thread event queue | NULL | 0.000 |
|
| 261485 | myoper | localhost | NULL | Query | 44471 | Waiting for backup lock | BACKUP STAGE BLOCK_COMMIT | 0.000 |
|
One of the commits is a XA transaction which causes the hang as it is waiting for a prior commits while BACKUP STAGE BLOCK_COMMIT blocks the prior commit to continue
Attachments
Issue Links
- includes
-
MDEV-30423 Deadlock on Replica during BACKUP STAGE BLOCK_COMMIT on XA transactions
-
- Closed
-
Activity
I've completed my verification with the following RQG grammar rule,
query:
|
{$rand_xid= "'xid".abs($$)."'"; '' } complete_xa
|
| normal_trans
|
;
|
|
complete_xa:
|
extra_sql | xa_2phase | xa_1phase | xa_rollback_idle | xa_rollback_prepared
|
;
|
|
extra_sql:
|
BACKUP STAGE START
|
;; BACKUP STAGE BLOCK_COMMIT
|
;; BACKUP STAGE END
|
;
|
|
xa_2phase:
|
XA START { $rand_xid }
|
;; normal_query_list
|
;; XA END { $rand_xid }
|
;; XA PREPARE { $rand_xid }
|
;; XA COMMIT { $rand_xid }
|
;
|
|
xa_1phase:
|
XA START { $rand_xid }
|
;; normal_query_list
|
;; XA END { $rand_xid }
|
;; XA COMMIT { $rand_xid } ONE PHASE
|
;
|
|
xa_rollback_idle:
|
XA START { $rand_xid }
|
;; normal_query_list
|
;; XA END { $rand_xid }
|
;; XA ROLLBACK { $rand_xid }
|
;
|
|
xa_rollback_prepared:
|
XA START { $rand_xid }
|
;; normal_query_list
|
;; XA END { $rand_xid }
|
;; XA PREPARE { $rand_xid }
|
;; XA ROLLBACK { $rand_xid }
|
;
|
|
normal_trans:
|
normal_trans_commit | normal_trans_rollback
|
;
|
|
normal_trans_commit:
|
START TRANSACTION
|
;; normal_query_list
|
;; COMMIT
|
;
|
|
normal_trans_rollback:
|
START TRANSACTION
|
;; normal_query_list
|
;; ROLLBACK
|
;
|
|
normal_query_list:
|
normal_query |
|
normal_query ;; normal_query_list
|
;
|
|
normal_query:
|
select | insert | insert | update | delete ;
|
|
select:
|
point_select | simple_range | sum_range | order_range | distinct_range
|
;
|
|
point_select:
|
SELECT k FROM _table WHERE _field_pk = _smallint_unsigned ;
|
|
simple_range:
|
SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;
|
|
sum_range:
|
SELECT SUM(k) FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;
|
|
order_range:
|
SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;
|
|
distinct_range:
|
SELECT DISTINCT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;
|
|
insert:
|
INSERT IGNORE INTO _table ( k, c, pad) VALUES ( _int, _string, _string )
|
;
|
|
update:
|
index_update | non_index_update ;
|
|
delete:
|
DELETE FROM _table WHERE _field_pk = _smallint_unsigned ;
|
|
index_update:
|
UPDATE IGNORE _table SET _field_int_indexed = _field_int_indexed + 1 WHERE _field_pk = _smallint_unsigned ;
|
|
# It relies on char fields being unindexed.
|
# If char fields happen to be indexed in the table spec, then this update can be indexed as well. No big harm though.
|
non_index_update:
|
UPDATE _table SET _field_char = _string WHERE _field_pk = _smallint_unsigned ;
|
The following 3 parallel sessions were run,
session1 (RQG Run)>
|
$ perl ./run.pl --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=mixed --mysqld=--slave_parallel_threads=4 --mysqld=--slave-parallel-mode=minimal --reporters=Backtrace,Deadlock --gendata=conf/zz/mdev-34466.zz --grammar=conf/yy/mdev-34466.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MDEV-35110-MD161024-mariadb-10.6.20-linux-x86_64-opt --vardir=/dev/shm/var1 --queries=100000 --duration=600 --threads=4
|
|
session2 (BACKUP STAGE commands)>
|
$ while true; do ./bin/mariadb -h localhost -P 14000 -uroot -e "SELECT SLEEP(2); BACKUP STAGE START ; BACKUP STAGE BLOCK_COMMIT ; BACKUP STAGE END;" ; done
|
|
session3 (Some SQL queries)>
|
$ while true; do ./bin/mariadb -h localhost -P 14000 -uroot -e "SELECT COUNT(1) FROM oltp_db.oltp1; SELECT * FROM oltp_db.oltp1 WHERE id < 100000 LIMIT 2; SELECT * FROM oltp_db.oltp1 WHERE id > 1 LIMIT 2;" ; done
|
The following settings/modes have been exercised with the above test runs,
serial replication
binlog_format=
parallel_mode=
{optimistic|aggressive|minimal}binlog_row_image=
{FULL|MINIMAL|NOBLOB}Serial replication mode
Everything looks good. OK TO PUSH from my side.
I've added disconnect XAs to the grammar, also performed periodic backups on the slave while RQG was running. I didn't see any problem. The slave seems to be stuck after a while.
RQG grammar rule:
query_init:
|
# Need debug build.
|
# It removes the restrictions for master, i.e. unmodified records are unlocked for master also,
|
# normally it is done only for slave workers. It would increase the probability to catch some bug in the code
|
SET GLOBAL innodb_enable_xap_unlock_unmodified_for_primary_debug = 1
|
;
|
|
query:
|
{$rand_xid= "'xid".abs($$)."'"; '' } complete_xa
|
| disconnect_xa
|
| normal_trans
|
;
|
|
thread1_init:
|
USE oltp_db
|
;; DROP TABLE IF EXISTS t1
|
;; CREATE TABLE t1 (a int, b int, c int, INDEX i1(a), INDEX i2(b))
|
;; INSERT INTO t1 VALUES (1,1,0), (1,2,0), (2,1,0), (2,2,0)
|
;
|
|
complete_xa:
|
extra_sql | xa_2phase | xa_1phase | xa_rollback_idle | xa_rollback_prepared
|
;
|
|
disconnect_xa:
|
xa_2phase_1 | xa_2phase_2 | xa_1phase_1 | xa_1phase_2 | xa_rollback_idle_1 | xa_rollback_idle_2 | xa_rollback_prepared_1 | xa_rollback_prepared_2
|
;
|
|
extra_sql:
|
extra_sql_1
|
;; extra_sql_2
|
;
|
|
extra_sql_1:
|
|
USE oltp_db
|
;; XA START { $rand_xid }
|
;; DELETE FROM t1
|
;; INSERT INTO t1 VALUES (1,1,0), (1,2,0), (2,1,0), (2,2,0)
|
;; UPDATE t1 FORCE INDEX (i2) SET c=c+1 WHERE a=1 AND b=1
|
;; XA END { $rand_xid }
|
;; XA PREPARE { $rand_xid }
|
;; XA COMMIT { $rand_xid }
|
;
|
|
extra_sql_2:
|
USE oltp_db
|
;; SELECT * FROM t1 FORCE INDEX (i1) WHERE a=2 AND b=1 FOR UPDATE
|
;
|
|
xa_2phase:
|
XA START { $rand_xid }
|
;; normal_query_list
|
;; XA END { $rand_xid }
|
;; XA PREPARE { $rand_xid }
|
;; XA COMMIT { $rand_xid }
|
;
|
|
xa_1phase:
|
XA START { $rand_xid }
|
;; normal_query_list
|
;; XA END { $rand_xid }
|
;; XA COMMIT { $rand_xid } ONE PHASE
|
;
|
|
xa_rollback_idle:
|
XA START { $rand_xid }
|
;; normal_query_list
|
;; XA END { $rand_xid }
|
;; XA ROLLBACK { $rand_xid }
|
;
|
|
xa_rollback_prepared:
|
XA START { $rand_xid }
|
;; normal_query_list
|
;; XA END { $rand_xid }
|
;; XA PREPARE { $rand_xid }
|
;; XA ROLLBACK { $rand_xid }
|
;
|
|
xa_2phase_1:
|
XA START 'xa_2phase'
|
;; normal_query_list
|
;; XA END 'xa_2phase'
|
;; XA PREPARE 'xa_2phase'
|
;
|
|
xa_2phase_2:
|
XA COMMIT 'xa_2phase'
|
;
|
|
xa_1phase_1:
|
XA START 'xa_1phase'
|
;; normal_query_list
|
;; XA END 'xa_1phase'
|
;
|
|
xa_1phase_2:
|
XA COMMIT 'xa_1phase' ONE PHASE
|
;
|
|
xa_rollback_idle_1:
|
XA START 'xa_rollback_idle'
|
;; normal_query_list
|
;; XA END 'xa_rollback_idle'
|
;
|
|
xa_rollback_idle_2:
|
XA ROLLBACK 'xa_rollback_idle'
|
;
|
|
xa_rollback_prepared_1:
|
XA START 'xa_rollback_prepared'
|
;; normal_query_list
|
;; XA END 'xa_rollback_prepared'
|
;; XA PREPARE 'xa_rollback_prepared'
|
;
|
|
xa_rollback_prepared_2:
|
XA ROLLBACK 'xa_rollback_prepared'
|
;
|
|
normal_trans:
|
normal_trans_commit | normal_trans_rollback
|
;
|
|
normal_trans_commit:
|
START TRANSACTION
|
;; normal_query_list
|
;; COMMIT
|
;
|
|
normal_trans_rollback:
|
START TRANSACTION
|
;; normal_query_list
|
;; ROLLBACK
|
;
|
|
normal_query_list:
|
normal_query |
|
normal_query ;; normal_query_list
|
;
|
|
normal_query:
|
select | insert | insert | update | delete ;
|
|
select:
|
point_select | simple_range | sum_range | order_range | distinct_range
|
;
|
|
point_select:
|
SELECT k FROM _table WHERE _field_pk = _smallint_unsigned ;
|
|
simple_range:
|
SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;
|
|
sum_range:
|
SELECT SUM(k) FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ;
|
|
order_range:
|
SELECT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;
|
|
distinct_range:
|
SELECT DISTINCT k FROM _table WHERE _field_pk BETWEEN _smallint_unsigned AND _smallint_unsigned ORDER BY _field ;
|
|
insert:
|
INSERT IGNORE INTO _table ( k, c, pad) VALUES ( _int, _string, _string )
|
;
|
|
update:
|
index_update | non_index_update ;
|
|
delete:
|
DELETE FROM _table WHERE _field_pk = _smallint_unsigned ;
|
|
index_update:
|
UPDATE IGNORE _table SET _field_int_indexed = _field_int_indexed + 1 WHERE _field_pk = _smallint_unsigned ;
|
|
# It relies on char fields being unindexed.
|
# If char fields happen to be indexed in the table spec, then this update can be indexed as well. No big harm though.
|
non_index_update:
|
UPDATE _table SET _field_char = _string WHERE _field_pk = _smallint_unsigned ;
|
RQG command:
perl ./run.pl --scenario=Replication --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=row --mysqld=--slave_parallel_threads=4 --mysqld=--slave-parallel-mode=optimistic --reporters=Backtrace,Deadlock --gendata=conf/zz/mdev-34466.zz --grammar=conf/yy/mdev-34466.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MD231024-mariadb-10.6.20-linux-x86_64-opt --server2-basedir=/test/MD231024-mariadb-10.6.20-linux-x86_64-opt --vardir=/dev/shm/var1 --queries=1000000 --duration=1800 --threads=4
|
backup command run on the slave:
mariabackup --backup --target-dir /home/mariadb_backup_<index> --socket /dev/shm/var1/s2/mysql.sock -uroot
|
But I've not tested the above yet to see if it reproduces the problem on the BASE (without the current fix). As per Elkin this validation is important.
MariaDB [(none)]> SHOW SLAVE STATUS\G\
|
*************************** 1. row ***************************
|
Slave_IO_State: Waiting for master to send event |
Master_Host: 127.0.0.1
|
Master_User: replication
|
Master_Port: 14000
|
Connect_Retry: 60
|
Master_Log_File: mysql-bin.000001
|
Read_Master_Log_Pos: 9100679
|
Relay_Log_File: mysql-relay-bin.000002
|
Relay_Log_Pos: 3862766
|
Relay_Master_Log_File: mysql-bin.000001
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 3862467
|
Relay_Log_Space: 9101287
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No |
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: 753
|
Master_SSL_Verify_Server_Cert: No |
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 1
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: No |
Gtid_IO_Pos:
|
Replicate_Do_Domain_Ids:
|
Replicate_Ignore_Domain_Ids:
|
Parallel_Mode: optimistic
|
SQL_Delay: 0
|
SQL_Remaining_Delay: NULL |
Slave_SQL_Running_State: Waiting for room in worker thread event queue |
Slave_DDL_Groups: 10
|
Slave_Non_Transactional_Groups: 1
|
Slave_Transactional_Groups: 9375
|
1 row in set (0.000 sec) |
I'm reconfirming that there is a problem. After a while the slave is stuck.
monty, susil.behera: today I've reviewed and tested to find out a glitch in the patch that ASAN reacted with
==1071988==ERROR: AddressSanitizer: stack-use-after-scope on address 0x713255064078 at pc 0x635a77703400 bp 0x7132562fe710 sp 0x7132562fe700 |
READ of size 8 at 0x713255064078 thread T18 |
#0 0x635a777033ff in trans_xa_release_backup_lock /home/andrei/MDB/WTs/CS/Review/II/sql/xa.cc:538 |
#1 0x635a77703d30 in trans_xa_prepare(THD*) /home/andrei/MDB/WTs/CS/Review/II/sql/xa.cc:603 |
#2 0x635a76fe4009 in mysql_execute_command(THD*, bool) /home/andrei/MDB/WTs/CS/Review/II/sql/sql_parse.cc:5995 |
I pushed my commit on the top of Monty's branch HEAD but in a separate bb-10.6-monty_andrei-xa rebased onto the current 10.6.
It passes the grammar, does not hit an innodb assert (thanks to the rebase) but I did not have time to exercize the grammar with concurrent BACKUP.
Verified against the new branch bb-10.6-monty_andrei-xa. I'm not seeing any slave hang now. The replication is in good state even after 45min RQG run.
RQG run cmd:
perl ./run.pl --scenario=Replication --mysqld=--lock-wait-timeout=10 --mysqld=--innodb-lock-wait-timeout=5 --mysqld=--binlog-format=row --mysqld=--slave_parallel_threads=4 --mysqld=--slave-parallel-mode=optimistic --reporters=Backtrace,Deadlock --gendata=conf/zz/mdev-35110.zz --grammar=conf/yy/mdev-35110.yy --engine=InnoDB --base-port=14000 --server1-basedir=/test/MD261024-mariadb-10.6.20-linux-x86_64-dbg --server2-basedir=/test/MD261024-mariadb-10.6.20-linux-x86_64-dbg --vardir=/dev/shm/var1 --queries=1000000 --duration=1800 --threads=4
|
I've added the BACKUP STAGE commands to the grammar.
--gendata config file
--grammar config file
The branch bb-10.6-monty_andrei-xa looks good. No hangs seen on the slave, replication has been successful after 45m RQG run.
From the commit message:
The original commit in
MDEV-30423was not complete as some usage in XA ofMDL_BACKUP_COMMIT locks did not set thd->backup_commit_lock.
This is required to be set when using parallel replication.
Fixed by ensuring that all usage of BACKUP_COMMIT lock i XA is uniform and
all sets thd->backup_commit_lock. I also changed all locks to be
MDL_EXPLICIT to keep also that part uniform.