[MDEV-33578] Replica severely lags on XA ROLLBACK when deleting rows from a non-transactional table while another table receives an INSERT inside the same XA transaction - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.4(EOL), 10.5, 10.6, 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL)
Fix Version/s: 10.5, 10.6, 10.11, 11.4
Component/s: Replication, XA
Labels:

Description

# Use RBR, standard m/s setup

CREATE TABLE t1 (c INT) ENGINE=MyISAM;  # Or MEMORY,Aria

CREATE TABLE t2 (c INT);  # Engine choice does not matter

INSERT t2 SELECT SEQ FROM seq_1_to_200000;

XA START 'x';

DELETE FROM t2;

INSERT INTO t1 VALUES(1);

XA END 'x';

XA ROLLBACK 'x';

SHOW WARNINGS;  # Warning: 1196: Some non-transactional changed tables couldn't be rolled back

Leads to:

11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Seconds_Behind_Master: 1213
...
Slave_SQL_Running_State: Waiting for room in worker thread event queue

And

11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

2024-03-02  8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance.

The DELETE application continues for 20+ minutes: (after which I stopped it)

11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)
11.4.0-opt>SHOW PROCESSLIST;
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
\| Id \| User \| Host \| db \| Command \| Time \| State \| Info \| Progress \|
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
\| 6 \| system user \| \| NULL \| Slave_IO \| 1219 \| Waiting for master to send event \| NULL \| 0.000 \|
\| 9 \| system user \| \| NULL \| Slave_worker \| 1215 \| Waiting for work from SQL thread \| NULL \| 0.000 \|
\| 8 \| system user \| \| NULL \| Slave_worker \| 1219 \| Waiting for work from SQL thread \| NULL \| 0.000 \|
\| 10 \| system user \| \| NULL \| Slave_worker \| 1215 \| Waiting for work from SQL thread \| NULL \| 0.000 \|
\| 11 \| system user \| \| test \| Slave_worker \| 21 \| Delete_rows_log_event::find_row(-1) on table `t2` \| NULL \| 0.000 \|
\| 7 \| system user \| \| NULL \| Slave_SQL \| 1219 \| Slave has read all relay log; waiting for more updates \| NULL \| 0.000 \|
\| 12 \| root \| localhost \| test \| Query \| 0 \| starting \| SHOW PROCESSLIST \| 0.000 \|
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
7 rows in set (0.000 sec)

Findings and questions:

The Delete_rows_log_event::find_row(-1) time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?)
While adding a PRIMARY KEY indeed helps with the DELETE for t2 (no wait), the issue does not happen when the INSERT INTO t1 VALUES(1); is not present even without adding a key - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT).
Why does the XA rollback need to fail on the master (leaving 200k records in t2, even when it is InnoDB)?
There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the DELETE could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).

Attachments

Activity

Ascending order - Click to sort in descending order

Roel Van de Paar created issue - 2024-03-01 22:09

Roel Van de Paar made changes - 2024-03-01 22:17

Field	Original Value	New Value
Description	{code:sql} # Use RBR, standard m/s setup CREATE TABLE t1 (c INT) ENGINE=MyISAM; # Or MEMORY,Aria CREATE TABLE t2 (c INT); # Engine choice does not matter INSERT t2 SELECT SEQ FROM seq_1_to_200000; XA START 'x'; DELETE FROM t2; INSERT INTO t1 VALUES(1); XA END 'x'; XA ROLLBACK 'x'; SHOW WARNINGS; # Warning: 1196: Some non-transactional changed tables couldn't be rolled back {code} Leads to: {noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)} Slave_IO_Running: Yes Slave_SQL_Running: Yes ... Seconds_Behind_Master: 1213 ... Slave_SQL_Running_State: Waiting for room in worker thread event queue {noformat} And {noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)} 2024-03-02 8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance. {noformat} The {{DELETE}} application continues for 20+ minutes: (after which I stopped it) {noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)} 11.4.0-opt>SHOW PROCESSLIST; +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+ \| Id \| User \| Host \| db \| Command \| Time \| State \| Info \| Progress \| +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+ \| 6 \| system user \| \| NULL \| Slave_IO \| 1219 \| Waiting for master to send event \| NULL \| 0.000 \| \| 9 \| system user \| \| NULL \| Slave_worker \| 1215 \| Waiting for work from SQL thread \| NULL \| 0.000 \| \| 8 \| system user \| \| NULL \| Slave_worker \| 1219 \| Waiting for work from SQL thread \| NULL \| 0.000 \| \| 10 \| system user \| \| NULL \| Slave_worker \| 1215 \| Waiting for work from SQL thread \| NULL \| 0.000 \| \| 11 \| system user \| \| test \| Slave_worker \| 21 \| Delete_rows_log_event::find_row(-1) on table `t2` \| NULL \| 0.000 \| \| 7 \| system user \| \| NULL \| Slave_SQL \| 1219 \| Slave has read all relay log; waiting for more updates \| NULL \| 0.000 \| \| 12 \| root \| localhost \| test \| Query \| 0 \| starting \| SHOW PROCESSLIST \| 0.000 \| +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+ 7 rows in set (0.000 sec) {noformat} Findings: * The {{Delete_rows_log_event::find_row(-1)}} time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?) * While adding a PRIMARY KEY indeed helps with the {{DELETE}} for t2 (no wait), the issue does not happen when the {{INSERT INTO t1 VALUES(1);}} is not present even without adding a key - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT). * On a sidenote, why does the XA rollback need to fail on the master (leaving 200k records in t2)? * There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the {{DELETE}} could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).	{code:sql} # Use RBR, standard m/s setup CREATE TABLE t1 (c INT) ENGINE=MyISAM; # Or MEMORY,Aria CREATE TABLE t2 (c INT); # Engine choice does not matter INSERT t2 SELECT SEQ FROM seq_1_to_200000; XA START 'x'; DELETE FROM t2; INSERT INTO t1 VALUES(1); XA END 'x'; XA ROLLBACK 'x'; SHOW WARNINGS; # Warning: 1196: Some non-transactional changed tables couldn't be rolled back {code} Leads to: {noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)} Slave_IO_Running: Yes Slave_SQL_Running: Yes ... Seconds_Behind_Master: 1213 ... Slave_SQL_Running_State: Waiting for room in worker thread event queue {noformat} And {noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)} 2024-03-02 8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance. {noformat} The {{DELETE}} application continues for 20+ minutes: (after which I stopped it) {noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)} 11.4.0-opt>SHOW PROCESSLIST; +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+ \| Id \| User \| Host \| db \| Command \| Time \| State \| Info \| Progress \| +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+ \| 6 \| system user \| \| NULL \| Slave_IO \| 1219 \| Waiting for master to send event \| NULL \| 0.000 \| \| 9 \| system user \| \| NULL \| Slave_worker \| 1215 \| Waiting for work from SQL thread \| NULL \| 0.000 \| \| 8 \| system user \| \| NULL \| Slave_worker \| 1219 \| Waiting for work from SQL thread \| NULL \| 0.000 \| \| 10 \| system user \| \| NULL \| Slave_worker \| 1215 \| Waiting for work from SQL thread \| NULL \| 0.000 \| \| 11 \| system user \| \| test \| Slave_worker \| 21 \| Delete_rows_log_event::find_row(-1) on table `t2` \| NULL \| 0.000 \| \| 7 \| system user \| \| NULL \| Slave_SQL \| 1219 \| Slave has read all relay log; waiting for more updates \| NULL \| 0.000 \| \| 12 \| root \| localhost \| test \| Query \| 0 \| starting \| SHOW PROCESSLIST \| 0.000 \| +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+ 7 rows in set (0.000 sec) {noformat} Findings: * The {{Delete_rows_log_event::find_row(-1)}} time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?) * While adding a PRIMARY KEY indeed helps with the {{DELETE}} for t2 (no wait), the issue does not happen when the {{INSERT INTO t1 VALUES(1);}} is not present even without adding a key - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT). * Why does the XA rollback need to fail on the master (leaving 200k records in t2)? The warning is also not correct as the same warning happens when t2 is InnoDB. * There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the {{DELETE}} could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).

Roel Van de Paar added a comment - 2024-03-01 22:19 - edited

MTR Testcase

-- source include/master-slave.inc

-- source include/have_innodb.inc

-- source include/have_sequence.inc

CREATE TABLE t1 (c INT) ENGINE=MyISAM;  # Or MEMORY,Aria

CREATE TABLE t2 (c INT) ENGINE=InnoDB;  # Engine choice does not matter

INSERT t2 SELECT SEQ FROM seq_1_to_200000;

XA START 'x';

DELETE FROM t2;

INSERT INTO t1 VALUES(1);

XA END 'x';

XA ROLLBACK 'x';  # Warning: 1196: Some non-transactional changed tables couldn't be rolled back

--sync_slave_with_master

--source include/rpl_end.inc

It will show [pass] but then semi-hang, and terminate on the --sync_slave_with_master after 300 seconds:

11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)
CURRENT_TEST: main.test
analyze: sync_with_master
mysqltest: At line 13: sync_slave_with_master failed: 'select master_pos_wait('master-bin.000001', 2009383, 300, '')' returned -1 indicating timeout after 300 seconds

CLI to MTR replication slave before MTR terminates:

11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)
$ /test/MD060224-mariadb-11.4.0-linux-x86_64-opt/bin/mariadb -uroot -P16001
WARNING: option --ssl-verify-server-cert is disabled, because of an insecure passwordless login.
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 11
Server version: 11.4.0-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW PROCESSLIST;
+----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+
\| Id \| User \| Host \| db \| Command \| Time \| State \| Info \| Progress \|
+----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+
\| 5 \| root \| localhost:46154 \| test \| Sleep \| 260 \| \| NULL \| 0.000 \|
\| 6 \| system user \| \| NULL \| Slave_IO \| 260 \| Waiting for master to send event \| NULL \| 0.000 \|
\| 7 \| system user \| \| test \| Slave_SQL \| 17 \| Delete_rows_log_event::find_row(-1) on table `t2` \| DELETE FROM t2 \| 0.000 \|
\| 8 \| root \| localhost:46166 \| test \| Query \| 256 \| Waiting for the slave SQL thread to advance position \| select master_pos_wait('master-bin.000001', 2009383, 300, '') \| 0.000 \|
\| 9 \| root \| localhost:46174 \| test \| Sleep \| 260 \| \| NULL \| 0.000 \|
\| 11 \| root \| localhost:21434 \| NULL \| Query \| 0 \| starting \| SHOW PROCESSLIST \| 0.000 \|
+----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+
6 rows in set (0.000 sec)

Roel Van de Paar added a comment - 2024-03-01 22:19 - edited MTR Testcase -- source include/master-slave.inc -- source include/have_innodb.inc -- source include/have_sequence.inc CREATE TABLE t1 (c INT ) ENGINE=MyISAM; # Or MEMORY,Aria CREATE TABLE t2 (c INT ) ENGINE=InnoDB; # Engine choice does not matter INSERT t2 SELECT SEQ FROM seq_1_to_200000; XA START 'x' ; DELETE FROM t2; INSERT INTO t1 VALUES (1); XA END 'x' ; XA ROLLBACK 'x' ; # Warning: 1196: Some non-transactional changed tables couldn't be rolled back --sync_slave_with_master --source include/rpl_end.inc It will show [pass] but then semi-hang, and terminate on the --sync_slave_with_master after 300 seconds: 11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized) CURRENT_TEST: main.test analyze: sync_with_master mysqltest: At line 13: sync_slave_with_master failed: 'select master_pos_wait('master-bin.000001', 2009383, 300, '')' returned -1 indicating timeout after 300 seconds CLI to MTR replication slave before MTR terminates: 11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized) $ /test/MD060224-mariadb-11.4.0-linux-x86_64-opt/bin/mariadb -uroot -P16001 WARNING: option --ssl-verify-server-cert is disabled, because of an insecure passwordless login. Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 11 Server version: 11.4.0-MariaDB-log MariaDB Server Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [(none)]> SHOW PROCESSLIST; +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+ | 5 | root | localhost:46154 | test | Sleep | 260 | | NULL | 0.000 | | 6 | system user | | NULL | Slave_IO | 260 | Waiting for master to send event | NULL | 0.000 | | 7 | system user | | test | Slave_SQL | 17 | Delete_rows_log_event::find_row(-1) on table `t2` | DELETE FROM t2 | 0.000 | | 8 | root | localhost:46166 | test | Query | 256 | Waiting for the slave SQL thread to advance position | select master_pos_wait('master-bin.000001', 2009383, 300, '') | 0.000 | | 9 | root | localhost:46174 | test | Sleep | 260 | | NULL | 0.000 | | 11 | root | localhost:21434 | NULL | Query | 0 | starting | SHOW PROCESSLIST | 0.000 | +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+ 6 rows in set (0.000 sec)

Roel Van de Paar made changes - 2024-03-01 22:24

Description

{code:sql}
# Use RBR, standard m/s setup
CREATE TABLE t1 (c INT) ENGINE=MyISAM; # Or MEMORY,Aria
CREATE TABLE t2 (c INT); # Engine choice does not matter
INSERT t2 SELECT SEQ FROM seq_1_to_200000;
XA START 'x';
DELETE FROM t2;
INSERT INTO t1 VALUES(1);
XA END 'x';
XA ROLLBACK 'x';
SHOW WARNINGS; # Warning: 1196: Some non-transactional changed tables couldn't be rolled back
{code}
Leads to:
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
         Seconds_Behind_Master: 1213
...
       Slave_SQL_Running_State: Waiting for room in worker thread event queue
{noformat}
And
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
2024-03-02 8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance.
{noformat}
The {{DELETE}} application continues for 20+ minutes: (after which I stopped it)
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
11.4.0-opt>SHOW PROCESSLIST;
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
| 6 | system user | | NULL | Slave_IO | 1219 | Waiting for master to send event | NULL | 0.000 |
| 9 | system user | | NULL | Slave_worker | 1215 | Waiting for work from SQL thread | NULL | 0.000 |
| 8 | system user | | NULL | Slave_worker | 1219 | Waiting for work from SQL thread | NULL | 0.000 |
| 10 | system user | | NULL | Slave_worker | 1215 | Waiting for work from SQL thread | NULL | 0.000 |
| 11 | system user | | test | Slave_worker | 21 | Delete_rows_log_event::find_row(-1) on table `t2` | NULL | 0.000 |
| 7 | system user | | NULL | Slave_SQL | 1219 | Slave has read all relay log; waiting for more updates | NULL | 0.000 |
| 12 | root | localhost | test | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
7 rows in set (0.000 sec)
{noformat}
Findings:
* The {{Delete_rows_log_event::find_row(-1)}} time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?)
* While adding a PRIMARY KEY indeed helps with the {{DELETE}} for t2 (no wait), the issue does not happen when the {{INSERT INTO t1 VALUES(1);}} is not present *even without adding a key* - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT).
* Why does the XA rollback need to fail on the master (leaving 200k records in t2)? The warning is also not correct as the same warning happens when t2 is InnoDB.
* There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the {{DELETE}} could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).

{code:sql}
# Use RBR, standard m/s setup
CREATE TABLE t1 (c INT) ENGINE=MyISAM; # Or MEMORY,Aria
CREATE TABLE t2 (c INT); # Engine choice does not matter
INSERT t2 SELECT SEQ FROM seq_1_to_200000;
XA START 'x';
DELETE FROM t2;
INSERT INTO t1 VALUES(1);
XA END 'x';
XA ROLLBACK 'x';
SHOW WARNINGS; # Warning: 1196: Some non-transactional changed tables couldn't be rolled back
{code}
Leads to:
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
         Seconds_Behind_Master: 1213
...
       Slave_SQL_Running_State: Waiting for room in worker thread event queue
{noformat}
And
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
2024-03-02 8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance.
{noformat}
The {{DELETE}} application continues for 20+ minutes: (after which I stopped it)
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
11.4.0-opt>SHOW PROCESSLIST;
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
| 6 | system user | | NULL | Slave_IO | 1219 | Waiting for master to send event | NULL | 0.000 |
| 9 | system user | | NULL | Slave_worker | 1215 | Waiting for work from SQL thread | NULL | 0.000 |
| 8 | system user | | NULL | Slave_worker | 1219 | Waiting for work from SQL thread | NULL | 0.000 |
| 10 | system user | | NULL | Slave_worker | 1215 | Waiting for work from SQL thread | NULL | 0.000 |
| 11 | system user | | test | Slave_worker | 21 | Delete_rows_log_event::find_row(-1) on table `t2` | NULL | 0.000 |
| 7 | system user | | NULL | Slave_SQL | 1219 | Slave has read all relay log; waiting for more updates | NULL | 0.000 |
| 12 | root | localhost | test | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
7 rows in set (0.000 sec)
{noformat}
Findings and questions:
* The {{Delete_rows_log_event::find_row(-1)}} time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?)
* While adding a PRIMARY KEY indeed helps with the {{DELETE}} for t2 (no wait), the issue does not happen when the {{INSERT INTO t1 VALUES(1);}} is not present *even without adding a key* - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT).
* Why does the XA rollback need to fail on the master (leaving 200k records in t2)? The warning is also not correct as the same warning happens when t2 is InnoDB.
* There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the {{DELETE}} could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).

Roel Van de Paar made changes - 2024-03-01 22:34

Labels

performance replication

RBR performance replication

Roel Van de Paar made changes - 2024-03-01 22:34

Fix Version/s		10.4 [ 22408 ]
Fix Version/s		10.5 [ 23123 ]
Fix Version/s		10.6 [ 24028 ]
Fix Version/s		10.11 [ 27614 ]
Fix Version/s		11.0 [ 28320 ]
Fix Version/s		11.1 [ 28549 ]
Fix Version/s		11.2 [ 28603 ]
Fix Version/s		11.3 [ 28565 ]
Affects Version/s		10.4 [ 22408 ]
Affects Version/s		10.5 [ 23123 ]
Affects Version/s		10.6 [ 24028 ]
Affects Version/s		10.11 [ 27614 ]
Affects Version/s		11.0 [ 28320 ]
Affects Version/s		11.1 [ 28549 ]
Affects Version/s		11.2 [ 28603 ]
Affects Version/s		11.3 [ 28565 ]
Affects Version/s		11.5 [ 29506 ]

Roel Van de Paar made changes - 2024-03-01 22:37

Description

{code:sql}
# Use RBR, standard m/s setup
CREATE TABLE t1 (c INT) ENGINE=MyISAM; # Or MEMORY,Aria
CREATE TABLE t2 (c INT); # Engine choice does not matter
INSERT t2 SELECT SEQ FROM seq_1_to_200000;
XA START 'x';
DELETE FROM t2;
INSERT INTO t1 VALUES(1);
XA END 'x';
XA ROLLBACK 'x';
SHOW WARNINGS; # Warning: 1196: Some non-transactional changed tables couldn't be rolled back
{code}
Leads to:
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
         Seconds_Behind_Master: 1213
...
       Slave_SQL_Running_State: Waiting for room in worker thread event queue
{noformat}
And
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
2024-03-02 8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance.
{noformat}
The {{DELETE}} application continues for 20+ minutes: (after which I stopped it)
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
11.4.0-opt>SHOW PROCESSLIST;
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
| 6 | system user | | NULL | Slave_IO | 1219 | Waiting for master to send event | NULL | 0.000 |
| 9 | system user | | NULL | Slave_worker | 1215 | Waiting for work from SQL thread | NULL | 0.000 |
| 8 | system user | | NULL | Slave_worker | 1219 | Waiting for work from SQL thread | NULL | 0.000 |
| 10 | system user | | NULL | Slave_worker | 1215 | Waiting for work from SQL thread | NULL | 0.000 |
| 11 | system user | | test | Slave_worker | 21 | Delete_rows_log_event::find_row(-1) on table `t2` | NULL | 0.000 |
| 7 | system user | | NULL | Slave_SQL | 1219 | Slave has read all relay log; waiting for more updates | NULL | 0.000 |
| 12 | root | localhost | test | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
7 rows in set (0.000 sec)
{noformat}
Findings and questions:
* The {{Delete_rows_log_event::find_row(-1)}} time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?)
* While adding a PRIMARY KEY indeed helps with the {{DELETE}} for t2 (no wait), the issue does not happen when the {{INSERT INTO t1 VALUES(1);}} is not present *even without adding a key* - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT).
* Why does the XA rollback need to fail on the master (leaving 200k records in t2)? The warning is also not correct as the same warning happens when t2 is InnoDB.
* There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the {{DELETE}} could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).

{code:sql}
# Use RBR, standard m/s setup
CREATE TABLE t1 (c INT) ENGINE=MyISAM; # Or MEMORY,Aria
CREATE TABLE t2 (c INT); # Engine choice does not matter
INSERT t2 SELECT SEQ FROM seq_1_to_200000;
XA START 'x';
DELETE FROM t2;
INSERT INTO t1 VALUES(1);
XA END 'x';
XA ROLLBACK 'x';
SHOW WARNINGS; # Warning: 1196: Some non-transactional changed tables couldn't be rolled back
{code}
Leads to:
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
         Seconds_Behind_Master: 1213
...
       Slave_SQL_Running_State: Waiting for room in worker thread event queue
{noformat}
And
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
2024-03-02 8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance.
{noformat}
The {{DELETE}} application continues for 20+ minutes: (after which I stopped it)
{noformat:title=11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)}
11.4.0-opt>SHOW PROCESSLIST;
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
| 6 | system user | | NULL | Slave_IO | 1219 | Waiting for master to send event | NULL | 0.000 |
| 9 | system user | | NULL | Slave_worker | 1215 | Waiting for work from SQL thread | NULL | 0.000 |
| 8 | system user | | NULL | Slave_worker | 1219 | Waiting for work from SQL thread | NULL | 0.000 |
| 10 | system user | | NULL | Slave_worker | 1215 | Waiting for work from SQL thread | NULL | 0.000 |
| 11 | system user | | test | Slave_worker | 21 | Delete_rows_log_event::find_row(-1) on table `t2` | NULL | 0.000 |
| 7 | system user | | NULL | Slave_SQL | 1219 | Slave has read all relay log; waiting for more updates | NULL | 0.000 |
| 12 | root | localhost | test | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
+----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
7 rows in set (0.000 sec)
{noformat}
Findings and questions:
* The {{Delete_rows_log_event::find_row(-1)}} time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?)
* While adding a PRIMARY KEY indeed helps with the {{DELETE}} for t2 (no wait), the issue does not happen when the {{INSERT INTO t1 VALUES(1);}} is not present *even without adding a key* - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT).
* Why does the XA rollback need to fail on the master (leaving 200k records in t2, even when it is InnoDB)?
* There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the {{DELETE}} could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).

Roel Van de Paar made changes - 2024-03-01 22:37

Summary

Replica severely lags on XA rollback when deleting rows from a non-transactional table while another table receives an INSERT inside the same XA transaction

Replica severely lags on XA ROLLBACK when deleting rows from a non-transactional table while another table receives an INSERT inside the same XA transaction

Roel Van de Paar made changes - 2024-03-02 00:50

Link

This issue relates to TODO-4062 [ TODO-4062 ]

Julien Fritsch made changes - 2024-05-03 14:32

Fix Version/s	11.0 [ 28320 ]
Fix Version/s	11.3 [ 28565 ]

Julien Fritsch made changes - 2024-09-10 15:01

Fix Version/s

10.4 [ 22408 ]

Julien Fritsch made changes - 2024-09-10 15:26

Fix Version/s

11.1 [ 28549 ]

Julien Fritsch made changes - 2024-11-22 17:11

Fix Version/s

11.2(EOL) [ 28603 ]

People

Assignee:: Andrei Elkin

Reporter:: Roel Van de Paar

Votes:: 1 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2024-03-01 22:09

Updated:: 2024-12-05 14:52

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.