MariaDB Server
MDEV-26670

Unable to maintain replication since upgrading from 10.4 to 10.5

Details

    Description

      We're seeing the following error on a very regular basis when replicating a workload of XA transactions between a 10.5.12 primary and secondary:

      210923 09:27:20 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
      210923 09:34:52 mysqld_safe Starting mariadbd daemon with databases from /opt/mysql
      2021-09-23  9:34:52 0 [Note] /usr/sbin/mariadbd (mysqld 10.5.12-MariaDB-1:10.5.12+maria~focal-log) starting as process 3812964 ...
      2021-09-23  9:34:52 0 [Note] Using unique option prefix 'myisam_recover' is error-prone and can break in the future. Please use the full name 'myisam-recover-options' instead.
      2021-09-23  9:34:52 0 [Warning] The parameter innodb_buffer_pool_instances is deprecated and has no effect.
      2021-09-23  9:34:52 0 [Warning] The parameter innodb_log_files_in_group is deprecated and has no effect.
      2021-09-23  9:34:52 0 [Note] InnoDB: Uses event mutexes
      2021-09-23  9:34:52 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
      2021-09-23  9:34:52 0 [Note] InnoDB: Number of pools: 1
      2021-09-23  9:34:52 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
      2021-09-23  9:34:52 0 [Note] mariadbd: O_TMPFILE is not supported on /opt/tmp (disabling future attempts)
      2021-09-23  9:34:52 0 [Note] InnoDB: Using Linux native AIO
      2021-09-23  9:34:52 0 [Note] InnoDB: Initializing buffer pool, total size = 5100273664, chunk size = 134217728
      2021-09-23  9:34:53 0 [Note] InnoDB: Completed initialization of buffer pool
      2021-09-23  9:34:53 0 [Note] InnoDB: Transaction 51592635 was in the XA prepared state.
      2021-09-23  9:34:53 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
      2021-09-23  9:34:53 0 [Note] InnoDB: Trx id counter is 51592647
      2021-09-23  9:34:53 0 [Note] InnoDB: 128 rollback segments are active.
      2021-09-23  9:34:53 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
      2021-09-23  9:34:53 0 [Note] InnoDB: Rollback of non-prepared transactions completed
      2021-09-23  9:34:53 0 [Note] InnoDB: Creating shared tablespace for temporary tables
      2021-09-23  9:34:53 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
      2021-09-23  9:34:53 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
      2021-09-23  9:34:53 0 [Note] InnoDB: 10.5.12 started; log sequence number 111796551705; transaction id 51592648
      2021-09-23  9:34:53 0 [Note] Plugin 'FEEDBACK' is disabled.
      2021-09-23  9:34:53 0 [Note] InnoDB: Loading buffer pool(s) from /opt/mysql/ib_buffer_pool
      2021-09-23  9:34:53 0 [Note] InnoDB: Starting recovery for XA transactions...
      2021-09-23  9:34:53 0 [Note] InnoDB: Transaction 51592635 in prepared state after recovery
      2021-09-23  9:34:53 0 [Note] InnoDB: Transaction contains changes to 3 rows
      2021-09-23  9:34:53 0 [Note] InnoDB: 1 transactions in prepared state after recovery
      2021-09-23  9:34:53 0 [Note] Found 1 prepared transaction(s) in InnoDB
      2021-09-23  9:34:53 0 [Warning] Found 1 prepared XA transactions
      2021-09-23  9:34:55 0 [Note] Server socket created on IP: '0.0.0.0'.
      2021-09-23  9:34:55 0 [Note] Reading of all Master_info entries succeeded
      2021-09-23  9:34:55 0 [Note] Added new Master_info '' to hash table
      2021-09-23  9:34:55 0 [Note] /usr/sbin/mariadbd: ready for connections.
      Version: '10.5.12-MariaDB-1:10.5.12+maria~focal-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
      2021-09-23  9:34:58 8 [Note] Slave I/O thread: Start asynchronous replication to master 'repl@ras2:3306' in log 'mariadb-bin.013324' at position 33344260
      2021-09-23  9:34:58 9 [Note] Slave SQL thread initialized, starting replication in log 'mariadb-bin.013323' at position 253811069, relay log '/opt/mysql/mariadb-relay-bin.000110' position: 28258665
      2021-09-23  9:34:58 8 [Note] Slave I/O thread: connected to master 'repl@ras2:3306',replication started in log 'mariadb-bin.013324' at position 33344260
      2021-09-23  9:36:08 0 [Note] InnoDB: Buffer pool(s) load completed at 210923  9:36:08
      2021-09-23 10:23:27 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:24:18 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:25:10 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:26:03 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:26:57 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:27:52 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:28:48 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:29:44 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:30:40 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:31:36 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:32:32 9 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table DBNAME.TBNAME; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mariadb-bin.013324, end_log_pos 248101224, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:32:32 9 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
      2021-09-23 10:32:32 9 [ERROR] Slave SQL: Lock wait timeout exceeded; try restarting transaction, Gtid 0-15-2237601904, Internal MariaDB error code: 1205
      2021-09-23 10:32:32 9 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
      2021-09-23 10:32:32 9 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.013324' position 248100321
      2021-09-23 10:32:32 9 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb-bin.013324' at position 248100321
      2021-09-23 10:32:32 9 [Note] master was ras2:3306
      2021-09-23 10:33:06 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
      2021-09-23 10:33:06 0 [Note] Event Scheduler: Purging the queue. 4 events
      2021-09-23 10:33:06 8 [Note] Slave I/O thread exiting, read up to log 'mariadb-bin.013325', position 27431871
      2021-09-23 10:33:06 8 [Note] master was ras2:3306
      2021-09-23 10:33:06 0 [Note] InnoDB: FTS optimize thread exiting.
      2021-09-23 10:33:06 0 [Note] InnoDB: Starting shutdown...
      2021-09-23 10:33:06 0 [Note] InnoDB: Dumping buffer pool(s) to /opt/mysql/ib_buffer_pool
      2021-09-23 10:33:06 0 [Note] InnoDB: Restricted to 76617 pages due to innodb_buf_pool_dump_pct=25
      2021-09-23 10:33:06 0 [Note] InnoDB: Buffer pool(s) dump completed at 210923 10:33:06
      2021-09-23 10:34:01 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
      2021-09-23 10:34:01 0 [Note] InnoDB: Shutdown completed; log sequence number 111905493822; transaction id 51806218
      2021-09-23 10:34:01 0 [Note] /usr/sbin/mariadbd: Shutdown complete
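
      For reference, the prepared XA transaction behind the "Found 1 prepared XA transactions" warning above can be listed with XA RECOVER. A minimal sketch of inspecting it (and, only if it is confirmed orphaned, resolving it by hand; 'xid1' is a placeholder for the value XA RECOVER prints):

          XA RECOVER;           -- lists transactions still in the prepared state after recovery
          -- Only if the transaction is orphaned (its COMMIT/ROLLBACK will never arrive):
          XA COMMIT 'xid1';
          -- or
          XA ROLLBACK 'xid1';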
      

      The replication snapshot is set up with:

       mysqldump \
           -u"${REP_USER_NAME}" -p"${REP_USER_PASSWORD}" \
           --single-transaction \
           --routines --events \
           --master-data -h"${DB_MASTER_IP}" \
           DBNAME
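
      (For completeness, a rough sketch of the rest of the setup on the secondary; the host and user come from the log above, the password is a placeholder, and --master-data embeds the binlog coordinates in the dump itself:)

          -- On the secondary, before loading the dump, point it at the primary (placeholder credentials):
          CHANGE MASTER TO
            MASTER_HOST='ras2',
            MASTER_USER='repl',
            MASTER_PASSWORD='********';
          -- Load the dump (its embedded CHANGE MASTER TO sets MASTER_LOG_FILE/MASTER_LOG_POS), then:
          START SLAVE;
          SHOW SLAVE STATUS\G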
      

      We've tried changing slave_parallel_mode back to conservative and setting non-zero values for slave_parallel_threads, but nothing keeps replication alive consistently. We're seeing this across multiple installs.
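
      (The changes tried were along these lines; the thread count shown is illustrative:)

          STOP SLAVE;
          SET GLOBAL slave_parallel_mode = 'conservative';  -- back from the 10.5 default of 'optimistic'
          SET GLOBAL slave_parallel_threads = 4;            -- illustrative non-zero value
          START SLAVE;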

      What can we do to debug / analyze this problem?

          Activity

            jellisgwn Jon Ellis added a comment -

            Sergei,

            when you said:

            But there should be significantly less gap locks on the READ-COMMITTED isolation level, so trying that on the slave can show if your problem is caused by gap locks.

            I'm confused by what it means to try the READ-COMMITTED isolation level "on the slave". Our application is using READ-COMMITTED, but is there a replication configuration that changes the slave's isolation level? That seems strange...


            serg Sergei Golubchik added a comment -

            The default isolation level is REPEATABLE-READ, and I assume that's what most people use. I suggested that if you use REPEATABLE-READ, then you can set --transaction-isolation=READ-COMMITTED (only on the slave so that it'd have less effect on your application) and see if it'll help.

            If you're already using READ-COMMITTED, then, of course, this suggestion doesn't apply. You can still try 10.6 to see if you're affected by MDEV-26652.
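
            A minimal sketch of trying that only on the slave, either at runtime or via the startup option mentioned above:

                SET GLOBAL tx_isolation = 'READ-COMMITTED';
                -- applies to sessions started afterwards, so restart the slave threads
                -- (STOP SLAVE; START SLAVE) to pick it up; equivalently, start the
                -- slave server with --transaction-isolation=READ-COMMITTED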

            jellisgwn Jon Ellis added a comment -

            Sergei,

            Your response prompted some further investigation and experiments. By snapshotting a database and binlog and bringing them to a test environment, we were able to run the secondary with READ_COMMITTED as the default isolation... and this, in turn, made it possible to play the binlog back without issue.

            This was (is?) surprising. As previously noted, our workload is created via an XA connection at READ_COMMITTED, yet that causes problems when the binlog is processed on a secondary whose default is REPEATABLE_READ. My expectation was that the locking would be encapsulated within the binlog and that the default isolation of the secondary would have no impact... this is obviously not the case. Was this behavior something that changed between 10.4 and 10.5?

            With the above progress we were able to take the step of setting the default isolation to READ_COMMITTED on both the primary and the secondary. Since then, replication has returned to its normal, pre-10.5 state of stability.

            One other thing to note: using 10.6 had no impact on the issue; only changing the default isolation was required.

            Elkin Andrei Elkin added a comment -

            jellisgwn. It's good to hear you have sorted it out!

            Locks are not represented within binlog events, and apparently in your case the slave creates unnecessary ones. 10.5 changed XA binlogging to log the prepared part of the transaction separately from the following COMMIT|ROLLBACK part.
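
            For illustration, a minimal sketch of such a transaction; the xid and table are placeholders. In 10.5 the part up to and including XA PREPARE is written to the binlog (and replayed on the slave) as its own event group, and the XA COMMIT arrives later as a separate group:

                XA START 'trx1';
                INSERT INTO DBNAME.TBNAME VALUES (1);
                XA END 'trx1';
                XA PREPARE 'trx1';   -- binlogged as its own event group since 10.5
                XA COMMIT 'trx1';    -- binlogged later as a separate event group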


            serg Sergei Golubchik added a comment -

            I'll close it as a duplicate of MDEV-26682. It seems that we've established that this issue was caused by gap locks, and MDEV-26682 should solve gap lock related deadlocks on the slave side.


            People

              marko Marko Mäkelä
              jellisgwn Jon Ellis