Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33578

Replica severely lags on XA ROLLBACK when deleting rows from a non-transactional table while another table receives an INSERT inside the same XA transaction

    XMLWordPrintable

Details

    Description

      # Use RBR, standard m/s setup
      CREATE TABLE t1 (c INT) ENGINE=MyISAM;  # Or MEMORY,Aria
      CREATE TABLE t2 (c INT);  # Engine choice does not matter
      INSERT t2 SELECT SEQ FROM seq_1_to_200000;
      XA START 'x';
      DELETE FROM t2;
      INSERT INTO t1 VALUES(1);
      XA END 'x';
      XA ROLLBACK 'x';
      SHOW WARNINGS;  # Warning: 1196: Some non-transactional changed tables couldn't be rolled back
      

      Leads to:

      11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
      ...
               Seconds_Behind_Master: 1213
      ...
             Slave_SQL_Running_State: Waiting for room in worker thread event queue
      

      And

      11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

      2024-03-02  8:30:07 8 [Note] The slave is applying a ROW event on behalf of a DELETE statement on table t2 and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning the table while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance.
      

      The DELETE application continues for 20+ minutes: (after which I stopped it)

      11.4.0 9b1ea6904965dd345478dedd80e181ad54c767da (Optimized)

      11.4.0-opt>SHOW PROCESSLIST;
      +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
      | Id | User        | Host      | db   | Command      | Time | State                                                  | Info             | Progress |
      +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
      |  6 | system user |           | NULL | Slave_IO     | 1219 | Waiting for master to send event                       | NULL             |    0.000 |
      |  9 | system user |           | NULL | Slave_worker | 1215 | Waiting for work from SQL thread                       | NULL             |    0.000 |
      |  8 | system user |           | NULL | Slave_worker | 1219 | Waiting for work from SQL thread                       | NULL             |    0.000 |
      | 10 | system user |           | NULL | Slave_worker | 1215 | Waiting for work from SQL thread                       | NULL             |    0.000 |
      | 11 | system user |           | test | Slave_worker |   21 | Delete_rows_log_event::find_row(-1) on table `t2`      | NULL             |    0.000 |
      |  7 | system user |           | NULL | Slave_SQL    | 1219 | Slave has read all relay log; waiting for more updates | NULL             |    0.000 |
      | 12 | root        | localhost | test | Query        |    0 | starting                                               | SHOW PROCESSLIST |    0.000 |
      +----+-------------+-----------+------+--------------+------+--------------------------------------------------------+------------------+----------+
      7 rows in set (0.000 sec)
      

      Findings and questions:

      • The Delete_rows_log_event::find_row(-1) time cycles over and over from 0 to 8 seconds at the start. Later on the rounds become longer and we see 0 to 24 seconds for example. (what is happening?)
      • While adding a PRIMARY KEY indeed helps with the DELETE for t2 (no wait), the issue does not happen when the INSERT INTO t1 VALUES(1); is not present even without adding a key - which in a way disqualifies the error message (i.e. the issue can be resolved without adding a key, or again - another issue is at play here, and it is related to the INSERT).
      • Why does the XA rollback need to fail on the master (leaving 200k records in t2, even when it is InnoDB)?
      • There is a performance issue in that the slave "work" (and a 20 minute+ slave lockup for a single relatively small transaction) is unnecessary and can be avoided, especially given that the DELETE could not be actioned on the primary and/or can be avoided when the INSERT is not present (and the tables are not related).

      Attachments

        Activity

          People

            Elkin Andrei Elkin
            Roel Roel Van de Paar
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.