Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33578

Replica severely lags on XA ROLLBACK when deleting rows from a non-transactional table while another table receives an INSERT inside the same XA transaction

Details

    Description

      # Use RBR, standard m/s setup
      CREATE TABLE t1 (c INT) ENGINE=MyISAM;  # Or MEMORY,Aria
      CREATE TABLE t2 (c INT);  # Engine choice does not matter
      INSERT t2 SELECT SEQ FROM seq_1_to_200000;
      XA START 'x';
      DELETE FROM t2;
      INSERT INTO t1 VALUES(1);
      XA END 'x';
      XA ROLLBACK 'x';
      SHOW WARNINGS;  # Warning: 1196: Some non-transactional changed tables couldn't be rolled back
      

      Leads to:

      11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
      ...
               Seconds_Behind_Master: 1213
      ...
             Slave_SQL_Running_State: Waiting for room in worker thread event queue
      

      And

      11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

      2024-03-02  8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance.
      

      The DELETE application continues for 20+ minutes: (after which I stopped it)

      11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

      11.4.0-opt>SHOW PROCESSLIST;
      +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
      | Id | User        | Host      | db   | Command      | Time | State                                                  | Info             | Progress |
      +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
      |  6 | system user |           | NULL | Slave_IO     | 1219 | Waiting for master to send event                       | NULL             |    0.000 |
      |  9 | system user |           | NULL | Slave_worker | 1215 | Waiting for work from SQL thread                       | NULL             |    0.000 |
      |  8 | system user |           | NULL | Slave_worker | 1219 | Waiting for work from SQL thread                       | NULL             |    0.000 |
      | 10 | system user |           | NULL | Slave_worker | 1215 | Waiting for work from SQL thread                       | NULL             |    0.000 |
      | 11 | system user |           | test | Slave_worker |   21 | Delete_rows_log_event::find_row(-1) on table `t2`      | NULL             |    0.000 |
      |  7 | system user |           | NULL | Slave_SQL    | 1219 | Slave has read all relay log; waiting for more updates | NULL             |    0.000 |
      | 12 | root        | localhost | test | Query        |    0 | starting                                               | SHOW PROCESSLIST |    0.000 |
      +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
      7 rows in set (0.000 sec)
      

      Findings and questions:

      • The Delete_rows_log_event::find_row(-1) time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?)
      • While adding a PRIMARY KEY indeed helps with the DELETE for t2 (no wait), the issue does not happen when the INSERT INTO t1 VALUES(1); is not present even without adding a key - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT).
      • Why does the XA rollback need to fail on the master (leaving 200k records in t2, even when it is InnoDB)?
      • There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the DELETE could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).

      Attachments

        Activity

          Roel Roel Van de Paar added a comment - - edited

          MTR Testcase

          -- source include/master-slave.inc
          -- source include/have_innodb.inc
          -- source include/have_sequence.inc
          CREATE TABLE t1 (c INT) ENGINE=MyISAM;  # Or MEMORY,Aria
          CREATE TABLE t2 (c INT) ENGINE=InnoDB;  # Engine choice does not matter
          INSERT t2 SELECT SEQ FROM seq_1_to_200000;
          XA START 'x';
          DELETE FROM t2;
          INSERT INTO t1 VALUES(1);
          XA END 'x';
          XA ROLLBACK 'x';  # Warning: 1196: Some non-transactional changed tables couldn't be rolled back
          --sync_slave_with_master
          --source include/rpl_end.inc
          

          It will show [pass] but then semi-hang, and terminate on the --sync_slave_with_master after 300 seconds:

          11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

          CURRENT_TEST: main.test
          analyze: sync_with_master
          mysqltest: At line 13: sync_slave_with_master failed: 'select master_pos_wait('master-bin.000001', 2009383, 300, '')' returned -1 indicating timeout after 300 seconds
          

          CLI to MTR replication slave before MTR terminates:

          11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

          $ /test/MD060224-mariadb-11.4.0-linux-x86_64-opt/bin/mariadb -uroot -P16001
          WARNING: option --ssl-verify-server-cert is disabled, because of an insecure passwordless login.
          Welcome to the MariaDB monitor.  Commands end with ; or \g.
          Your MariaDB connection id is 11
          Server version: 11.4.0-MariaDB-log MariaDB Server
           
          Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
           
          Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
           
          MariaDB [(none)]> SHOW PROCESSLIST;
          +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+
          | Id | User        | Host            | db   | Command   | Time | State                                                | Info                                                          | Progress |
          +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+
          |  5 | root        | localhost:46154 | test | Sleep     |  260 |                                                      | NULL                                                          |    0.000 |
          |  6 | system user |                 | NULL | Slave_IO  |  260 | Waiting for master to send event                     | NULL                                                          |    0.000 |
          |  7 | system user |                 | test | Slave_SQL |   17 | Delete_rows_log_event::find_row(-1) on table `t2`    | DELETE FROM t2                                                |    0.000 |
          |  8 | root        | localhost:46166 | test | Query     |  256 | Waiting for the slave SQL thread to advance position | select master_pos_wait('master-bin.000001', 2009383, 300, '') |    0.000 |
          |  9 | root        | localhost:46174 | test | Sleep     |  260 |                                                      | NULL                                                          |    0.000 |
          | 11 | root        | localhost:21434 | NULL | Query     |    0 | starting                                             | SHOW PROCESSLIST                                              |    0.000 |
          +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+
          6 rows in set (0.000 sec)
          

          Roel Roel Van de Paar added a comment - - edited MTR Testcase -- source include/master-slave.inc -- source include/have_innodb.inc -- source include/have_sequence.inc CREATE TABLE t1 (c INT ) ENGINE=MyISAM; # Or MEMORY,Aria CREATE TABLE t2 (c INT ) ENGINE=InnoDB; # Engine choice does not matter INSERT t2 SELECT SEQ FROM seq_1_to_200000; XA START 'x' ; DELETE FROM t2; INSERT INTO t1 VALUES (1); XA END 'x' ; XA ROLLBACK 'x' ; # Warning: 1196: Some non-transactional changed tables couldn't be rolled back --sync_slave_with_master --source include/rpl_end.inc It will show [pass] but then semi-hang, and terminate on the --sync_slave_with_master after 300 seconds: 11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized) CURRENT_TEST: main.test analyze: sync_with_master mysqltest: At line 13: sync_slave_with_master failed: 'select master_pos_wait('master-bin.000001', 2009383, 300, '')' returned -1 indicating timeout after 300 seconds CLI to MTR replication slave before MTR terminates: 11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized) $ /test/MD060224-mariadb-11.4.0-linux-x86_64-opt/bin/mariadb -uroot -P16001 WARNING: option --ssl-verify-server-cert is disabled, because of an insecure passwordless login. Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 11 Server version: 11.4.0-MariaDB-log MariaDB Server   Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.   Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.   MariaDB [(none)]> SHOW PROCESSLIST; +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+ | 5 | root | localhost:46154 | test | Sleep | 260 | | NULL | 0.000 | | 6 | system user | | NULL | Slave_IO | 260 | Waiting for master to send event | NULL | 0.000 | | 7 | system user | | test | Slave_SQL | 17 | Delete_rows_log_event::find_row(-1) on table `t2` | DELETE FROM t2 | 0.000 | | 8 | root | localhost:46166 | test | Query | 256 | Waiting for the slave SQL thread to advance position | select master_pos_wait('master-bin.000001', 2009383, 300, '') | 0.000 | | 9 | root | localhost:46174 | test | Sleep | 260 | | NULL | 0.000 | | 11 | root | localhost:21434 | NULL | Query | 0 | starting | SHOW PROCESSLIST | 0.000 | +----+-------------+-----------------+------+-----------+------+------------------------------------------------------+---------------------------------------------------------------+----------+ 6 rows in set (0.000 sec)

          People

            Elkin Andrei Elkin
            Roel Roel Van de Paar
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.