MariaDB Server
MDEV-24845

Oddities around innodb_fatal_semaphore_wait_threshold and global.innodb_disallow_writes

Details

    Description

      Discussing with Marko.

      # mysqld options required for replay:  --innodb_fatal_semaphore_wait_threshold=7
      SET @@global.innodb_disallow_writes=ON;
      CREATE TABLE t (c INT);
      

      Then connect with another client:

      10.6.0 bfb4761ca04704d68dba51f76d7c9967f880a6ee (Debug)

      10.6.0>SHOW FULL PROCESSLIST;
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      | Id | User | Host      | db   | Command | Time | State          | Info                   | Progress |
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      |  4 | root | localhost | test | Query   |  225 | creating table | CREATE TABLE t (c INT) |    0.000 |
      |  5 | root | localhost | test | Query   |    0 | starting       | SHOW FULL PROCESSLIST  |    0.000 |
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      2 rows in set (0.000 sec)
      ...
      10.6.0>SHOW FULL PROCESSLIST;
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      | Id | User | Host      | db   | Command | Time | State          | Info                   | Progress |
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      |  4 | root | localhost | test | Query   |  604 | creating table | CREATE TABLE t (c INT) |    0.000 |
      |  5 | root | localhost | test | Query   |    0 | starting       | SHOW FULL PROCESSLIST  |    0.000 |
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      2 rows in set (0.000 sec)
      

      This looks normal, except that the query should have been aborted after 7 seconds due to the innodb_fatal_semaphore_wait_threshold=7 setting. Furthermore, CTRL+C kills the query but makes the client hang:

      10.6.0 bfb4761ca04704d68dba51f76d7c9967f880a6ee (Debug)

      10.6.0>CREATE TABLE t (c INT);
      ^CCtrl-C -- query killed. Continuing normally.
      

      And in the other client we can then see a hanging killed thread:

      10.6.0 bfb4761ca04704d68dba51f76d7c9967f880a6ee (Debug)

      10.6.0>SHOW FULL PROCESSLIST;
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      | Id | User | Host      | db   | Command | Time | State          | Info                   | Progress |
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      |  4 | root | localhost | test | Killed  |  683 | creating table | CREATE TABLE t (c INT) |    0.000 |
      |  5 | root | localhost | test | Query   |    0 | starting       | SHOW FULL PROCESSLIST  |    0.000 |
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      2 rows in set (0.000 sec)
      

      Another CTRL+C in the CLI makes us lose the connection to the server (ERROR 2013 (HY000): Lost connection to MySQL server during query). We can then exit to the command line and re-enter the client, but the killed, hanging CREATE TABLE remains.

      10.6.0 bfb4761ca04704d68dba51f76d7c9967f880a6ee (Debug)

      10.6.0>SHOW FULL PROCESSLIST;
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      | Id | User | Host      | db   | Command | Time | State          | Info                   | Progress |
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      |  4 | root | localhost | test | Killed  |  811 | creating table | CREATE TABLE t (c INT) |    0.000 |
      |  5 | root | localhost | test | Query   |    0 | starting       | SHOW FULL PROCESSLIST  |    0.000 |
      |  8 | root | localhost | test | Sleep   |   83 |                | NULL                   |    0.000 |
      +----+------+-----------+------+---------+------+----------------+------------------------+----------+
      3 rows in set (0.001 sec)
      

      Various other related test cases seem to give different outcomes (more testing to be done).


          Activity

            Roel Roel Van de Paar added a comment - - edited

            Note to self: other things to try later:
            1) Add a failed CREATE statement before disallowing writes (ENGINE=none for example)
            2) Leave off innodb_fatal_semaphore_wait_threshold


            I think that the parameter innodb_disallow_writes was badly designed and implemented. Instead of blocking arbitrary user threads that happen to trigger I/O, it would be better to disallow the writes at the high level.

            The backup locks that were introduced in MDEV-5336 are a better solution: blocking writes at the high level. Starting with MDEV-16678 in 10.5, the backup locks might actually block the purge of history. Starting with MDEV-19514 in 10.5, no change buffer merge is happening in the background.

            I think that the only writes that might happen while the strongest backup lock is granted is page flushing related to an InnoDB log checkpoint. (Note that if the backup lock does not block reads from the SQL layer, change buffer merge may still be triggered by a read, even after MDEV-19514.)

            The easy fix to avoid hangs related to innodb_disallow_writes would be to only set this parameter while holding a backup lock that prevents any access to InnoDB tables. (Reads would be OK if it were not for the change buffer merge).

            The InnoDB watchdog was rewritten in MDEV-21452, and it currently only covers dict_sys.mutex. I am reluctant to relax the watchdog for the previous cases where it was relaxed: innodb_disallow_writes is set, or a CHECK TABLE is running.
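The watchdog semantics under discussion can be modelled with a toy sketch. This is purely an illustration, not InnoDB code: all names below (start_wait, end_wait, watchdog, THRESHOLD) are invented for the demo, and the 7-second threshold is scaled down so the demo runs quickly. The point is that a monitor thread flags a fatal condition as soon as any registered wait exceeds the threshold, which is why the stuck CREATE TABLE above should have been aborted after 7 seconds.

```python
import threading
import time

# Toy model of a fatal semaphore-wait watchdog (not InnoDB code).
# THRESHOLD plays the role of innodb_fatal_semaphore_wait_threshold,
# scaled down from 7 s to 0.2 s for the demo.
THRESHOLD = 0.2
waits = {}                 # wait id -> monotonic start time
lock = threading.Lock()
fatal = threading.Event()  # set when the watchdog decides to "abort"

def start_wait(wid):
    with lock:
        waits[wid] = time.monotonic()

def end_wait(wid):
    with lock:
        waits.pop(wid, None)

def watchdog():
    while not fatal.is_set():
        with lock:
            longest = max((time.monotonic() - t for t in waits.values()),
                          default=0.0)
        if longest > THRESHOLD:
            fatal.set()    # the real server would print diagnostics and abort
            return
        time.sleep(0.01)

monitor = threading.Thread(target=watchdog, daemon=True)
monitor.start()
start_wait(1)              # simulate a thread stuck on a semaphore
time.sleep(0.3)            # stuck for longer than the threshold
monitor.join(1)
print("fatal:", fatal.is_set())   # prints: fatal: True
```

The hang reported here corresponds to the case where the wait is never registered with (or is exempted from) the watchdog, so the fatal condition never fires.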

            Roel, please try to also reproduce the watchdog kill during CHECK TABLE. I think that the proper fix for that one would be to acquire MDL_EXCLUSIVE before acquiring the dict_index_t::lock in btr_validate_level().


            Out of curiosity, why can the following functions be paused by innodb_disallow_writes?

            • row_merge_file_create_low(): This is used during certain forms of ALTER TABLE. Shouldn’t any DDL be blocked by MDL before snapshot transfer?
            • os_file_create_tmpfile(): Probably duplicating the above.
            • os_file_flush_func(): This is only executing fsync() or fdatasync(). What purpose does the blocking serve?
            • os_file_create_simple_func(): This could be executed during DDL. But, why do we not block DDL by MDL?
            • os_file_create_directory(): Invoked only during DDL when a DATA DIRECTORY clause is present. Why not use MDL?
            • os_file_create_simple_no_error_handling(): Invoked when opening a file for *reading* (yes, the naming is confusing), at database startup or IMPORT TABLESPACE. Again, why not use MDL?
            • os_file_delete_if_exists_func(): Invoked during crash recovery or during DDL. All DDL could easily be blocked by suitable MDL.
            • os_file_delete_func(): Ditto.
            • os_file_rename_func(): Ditto.
            • os_file_create_func(): Blocked when creating (not merely opening) a file. File creation should only happen during DDL and be easily blocked by MDL.
            • os_file_set_eof(): Only invoked on some temporary files that buffer InnoDB status messages. These temporary files should not be part of a snapshot transfer. Why block these?
            • os_file_write_func(): This is where *synchronous* writes to InnoDB files (possibly including temporary files) happen.

            It looks like when innodb_use_native_aio is in effect, innodb_disallow_writes will not block the actual writes to the InnoDB data files at all. Writes to the redo log are being blocked in os_file_write_func().

            I would propose the following:

            1. Consider flushing the entire buffer pool to minimize the recovery time at the receiver: SET GLOBAL innodb_max_dirty_pages_pct=0;
            2. Acquire a metadata lock (FLUSH TABLES WITH READ LOCK or backup locks) to block all writes from the SQL layer.
            3. Pause the purge of transaction history: purge_sys.stop().
            4. Block buf_flush_page() or its callers, except for the temporary tablespace. This is what initiates persistent InnoDB page writes.
            5. Ensure that log_free_check() will block.
            6. Block the log file writes in log0log.cc.
            7. Invoke os_aio_wait_until_no_pending_writes() to ensure that all writes to data pages have completed.

            Some of these steps are redundant.

            The fix would be very different for versions before 10.5, because the page flushing has been heavily refactored in 10.5.


            I think there is some misunderstanding about the usage of innodb_disallow_writes. It is not intended to be used by users at all (it is actually worse than SET GLOBAL wsrep_on=OFF). The only reasonable usage is when Galera does an SST, and even there I think we could use something better starting from 10.4. I do not see much hope of fixing this issue on 10.2-10.3, but I will let seppo decide. Note that I decreased the priority after careful consideration because I do not see any customers affected by this issue.


            As far as I can tell, the parameter innodb_disallow_writes is being set by the Galera function sst_disallow_writes(), right after sst_flush_tables() was invoked. I see that sst_flush_tables() invokes FLUSH TABLES WITH READ LOCK. What it definitely fails to do is

            SET GLOBAL innodb_max_dirty_pages_pct=0;
            

            Currently, innodb_disallow_writes would block any ‘new’ InnoDB writes of persistent files by blocking writes to the redo log. But if the buffer pool contains any modified pages, writes of such pages can be submitted asynchronously via io_submit() or io_uring(). On those code paths, there is no check for innodb_disallow_writes.

            I hope that someone would test the following when using a very large innodb_log_file_size and innodb_buffer_pool_size (so that checkpoints and page eviction will not occur), and innodb_page_size=64k:

            SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
            -- Modify lots of InnoDB data pages, and ensure that a large number is reported:
            SELECT variable_value FROM information_schema.global_status
            WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
            -- Initiate snapshot transfer. While it is running, issue
            SET GLOBAL innodb_max_dirty_pages_pct=0;
            

            I would expect recovery to fail on the new replica when using the rsync snapshot algorithm, because as far as I can tell, InnoDB data page writes may run concurrently with rsync. Recovery should fail if rsync copies a page that is being simultaneously written. If recovery does not fail, also try running CHECK TABLE on the new replica.
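The torn-page failure mode described above can be illustrated with a small sketch. This is a deliberately simplified page layout, not the real InnoDB on-disk format: InnoDB stores the page LSN in the page header and (in its classic checksum formats) the low 32 bits of that LSN in the page trailer, and a mismatch between the two is one way a partially copied page is detected. The page size, offsets, and function names below are invented for the demo.

```python
import struct

PAGE_SIZE = 1024  # toy page size; the scenario above uses 64 KiB pages

def make_page(lsn: int) -> bytes:
    """Build a toy page: 8-byte LSN header, zero body, 4-byte LSN trailer."""
    header = struct.pack(">Q", lsn)
    trailer = struct.pack(">I", lsn & 0xFFFFFFFF)
    body = b"\x00" * (PAGE_SIZE - len(header) - len(trailer))
    return header + body + trailer

def torn(page: bytes) -> bool:
    """A page is torn if the header LSN and trailer LSN disagree."""
    (header_lsn,) = struct.unpack(">Q", page[:8])
    (trailer_lsn,) = struct.unpack(">I", page[-4:])
    return (header_lsn & 0xFFFFFFFF) != trailer_lsn

old, new = make_page(100), make_page(200)
# rsync reads the first half of the page before a concurrent write
# and the second half after it:
copied = new[: PAGE_SIZE // 2] + old[PAGE_SIZE // 2 :]
print(torn(old), torn(new), torn(copied))   # prints: False False True
```

This is the kind of inconsistency recovery (or CHECK TABLE) should flag on the new replica if data page writes really do run concurrently with rsync.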


            I may have misunderstood something, but most tests in the galera suite that have rsync in their name seem to initiate a snapshot transfer on an idle cluster.

            We only seem to have one test that may cover some sort of concurrent writes during an rsync snapshot transfer. Reproducing the race condition between asynchronous page writes and rsync would probably require a much larger amount of page writes. I was unable to reproduce problems with the following:

            diff --git a/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test b/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
            --- a/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
            +++ b/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
            @@ -9,6 +9,9 @@
             --source include/big_test.inc
             --source include/have_innodb.inc
             --source include/galera_cluster.inc
            +--source include/have_sequence.inc
            +
            +SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
             
             # As node #3 is not a Galera node, and galera_cluster.inc does not open connetion to it
             # we open the node_3 connection here
            @@ -59,6 +62,8 @@ START TRANSACTION;
             INSERT INTO t1 VALUES ('node1_committed_before');
             INSERT INTO t1 VALUES ('node1_committed_before');
             COMMIT;
            +BEGIN;INSERT INTO t2 SELECT seq,seq*10 FROM seq_9_to_1000;ROLLBACK;
            +SET GLOBAL innodb_max_dirty_pages_pct=0;
             --source include/save_master_gtid.inc
             --echo #Connection 2
             --connection node_2
            

            I also tried to pad the table so that a little more pages would be written, but it did not help:

            diff --git a/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test b/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
            --- a/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
            +++ b/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
            @@ -9,6 +9,9 @@
             --source include/big_test.inc
             --source include/have_innodb.inc
             --source include/galera_cluster.inc
            +--source include/have_sequence.inc
            +
            +SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
             
             # As node #3 is not a Galera node, and galera_cluster.inc does not open connetion to it
             # we open the node_3 connection here
            @@ -23,10 +26,10 @@ START SLAVE;
             
             --echo #Connection 3
             --connection node_3
            -CREATE TABLE t2 (f1 INTEGER PRIMARY KEY, f2 int unique) ENGINE=InnoDB;
            -INSERT INTO t2 VALUES(1,11);
            -INSERT INTO t2 VALUES(2,22);
            -INSERT INTO t2 VALUES(3,33);
            +CREATE TABLE t2 (f1 INTEGER PRIMARY KEY, f2 int unique, f3 CHAR(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT '') ENGINE=InnoDB;
            +INSERT INTO t2 VALUES(1,11,'');
            +INSERT INTO t2 VALUES(2,22,'');
            +INSERT INTO t2 VALUES(3,33,'');
             
             SELECT @@global.gtid_binlog_state;
             --source include/save_master_gtid.inc
            @@ -35,9 +38,9 @@ SELECT @@global.gtid_binlog_state;
             --source include/sync_with_master_gtid.inc
             SELECT @@global.gtid_binlog_state;
             
            -INSERT INTO t2 VALUES(4,44);
            -INSERT INTO t2 VALUES(5,55);
            -INSERT INTO t2 VALUES(6,66);
            +INSERT INTO t2 VALUES(4,44,'');
            +INSERT INTO t2 VALUES(5,55,'');
            +INSERT INTO t2 VALUES(6,66,'');
             SELECT @@global.gtid_binlog_state;
             --echo #Connection 1
             --connection node_1
            @@ -46,8 +49,8 @@ SELECT @@global.gtid_binlog_state;
             --let $wait_condition = SELECT COUNT(*) = 6 FROM t2;
             --source include/wait_condition.inc
             
            -INSERT INTO t2 VALUES(7,77);
            -INSERT INTO t2 VALUES(8,88);
            +INSERT INTO t2 VALUES(7,77,'');
            +INSERT INTO t2 VALUES(8,88,'');
             SELECT @@global.gtid_binlog_state;
             
             #Perform SST
            @@ -59,6 +62,8 @@ START TRANSACTION;
             INSERT INTO t1 VALUES ('node1_committed_before');
             INSERT INTO t1 VALUES ('node1_committed_before');
             COMMIT;
            +BEGIN;INSERT INTO t2 SELECT seq,seq*10,seq FROM seq_9_to_1000;ROLLBACK;
            +SET GLOBAL innodb_max_dirty_pages_pct=0;
             --source include/save_master_gtid.inc
             --echo #Connection 2
             --connection node_2
            

            For some reason, this test is ignoring the --parallel setting. My invocation was as follows:

            ./mtr --mysqld=--innodb-page-size=64k --mysqld=--innodb-buffer-pool-size=28m --parallel=auto --repeat=10 galera.galera_gtid_slave_sst_rsync{,,,,,,,}
            

            Note: I may have misinterpreted something. It is not clear to me what exactly is triggering the snapshot transfer and from which cluster member the data is being copied.


            I tried with the attached test case but could not reproduce anything bad. Either I have too few dirty pages or too slow a laptop.

            jan@jan-HP-ZBook-15u-G5:~/mysql/10.4/mysql-test$ ./mtr galera.galera_sst_rsync_load
            Logging: ./mtr  galera.galera_sst_rsync_load
            VS config: 
            vardir: /home/jan/mysql/10.4/mysql-test/var
            Checking leftover processes...
             - found old pid 2745195 in 'mysqld.1.pid', killing it...
               process did not exist!
            Removing old var directory...
            Creating var directory '/home/jan/mysql/10.4/mysql-test/var'...
            Checking supported features...
            MariaDB Version 10.4.23-MariaDB-debug
             - SSL connections supported
             - binaries are debug compiled
             - binaries built with wsrep patch
            Collecting tests...
            Installing system database...
             
            ==============================================================================
             
            TEST                                      RESULT   TIME (ms) or COMMENT
            --------------------------------------------------------------------------
             
            worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
            connection node_2;
            connection node_1;
            connection node_1;
            connection node_2;
            connection node_1;
            CREATE TABLE t1(id int not null primary key auto_increment,
            b int, c int, d char(255), key (b)) engine=innodb;
            connection node_2;
            Shutting down server ...
            connection node_1;
            SET GLOBAL innodb_max_dirty_pages_pct=99;
            SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
            create procedure innodb_insert_proc (repeat_count int)
            begin
            declare current_num int;
            set current_num = 0;
            while current_num < repeat_count do
            insert into t1 values (NULL,current_num,current_num, substring(MD5(RAND()), -200));
            update t1 set b = current_num +100,
            d =  substring(MD5(RAND()), -200)
            where b = current_num;
            set current_num = current_num + 1;
            end while;
            end//
            commit;
            SET autocommit=0;
            call innodb_insert_proc(7000);
            COMMIT;
            SELECT variable_value FROM information_schema.global_status
            WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
            variable_value
            158
            SET GLOBAL innodb_max_dirty_pages_pct_lwm=0;
            SET GLOBAL innodb_max_dirty_pages_pct=0;
            connection node_2;
            Starting server ...
            SELECT COUNT(*) FROM t1;
            COUNT(*)
            7000
            CHECK TABLE t1;
            Table	Op	Msg_type	Msg_text
            test.t1	check	status	OK
            DROP PROCEDURE innodb_insert_proc;
            DROP TABLE t1;
            galera.galera_sst_rsync_load 'innodb'    [ pass ]  217410
            


            A buf_pool.flush_list size of 158 pages is not really much, and due to MDEV-25113, that list may actually contain clean pages (with the special value oldest_modification()==1) that have already been written out.

            To reproduce this, we only need one change per index page. I would suggest bloating the records as much as possible with CHAR(255) NOT NULL DEFAULT '' columns, so that there will be few records in an innodb_page_size=64k page. About 120 such columns should fit in a 64KiB page, or maybe it is about 60 if we are limiting the maximum record size to 16383 bytes, as I vaguely remember. With such heavily padded records, we should have 2 to 4 records per leaf page.
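Marko's back-of-the-envelope figures can be checked with a quick sketch (the arithmetic below is my own, not from the ticket, and it ignores page and record header overhead):

```python
# Rough estimate of the records-per-page figures for heavily padded rows.
PAGE_SIZE = 64 * 1024        # innodb_page_size=64k
CHAR_COL = 255               # one CHAR(255) column in a single-byte charset

# InnoDB needs at least 2 records per index page, so a record may occupy
# at most about half a page: ~128 CHAR(255) columns ("about 120" with overhead).
cols_per_half_page = (PAGE_SIZE // 2) // CHAR_COL

# If the maximum record size is capped at 16383 bytes: ~64 columns
# ("maybe it is about 60" with overhead).
REC_SIZE_CAP = 16383
cols_per_capped_record = REC_SIZE_CAP // CHAR_COL

# The test table below uses 50 CHAR(255) columns, so a record is ~12750
# bytes and only a handful of records fit on each 64KiB leaf page
# (real counts are lower once overhead is included).
records_per_leaf = PAGE_SIZE // (50 * CHAR_COL)
```

This matches the 2-to-4-records-per-leaf-page goal once per-record and per-page overhead is subtracted.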

            Writing constant strings should be faster than writing pseudorandom strings. I do not see any benefit in writing pseudorandom data to non-indexed columns.

            I do not think that we need any secondary indexes. We will only have to modify one non-indexed column in at least one record in each page. The type of that column could be an integer.

            It may be beneficial to use a release build, just like our users do.

            marko Marko Mäkelä added a comment

            I still can't repeat anything bad. This time I used a 10G buffer pool and galapq with a release build. I let my test run for 20 iterations and still could not see any crashes. Based on this I do not see any evidence that the rsync SST method is broken on Galera. I'm still running a bigger 50-iteration run on that machine.

            jan@galapq:~/10.4/mysql-test$ ./mtr galera.galera_sst_rsync_load --mysqld=--innodb-page-size=64K
            Logging: ./mtr  galera.galera_sst_rsync_load --mysqld=--innodb-page-size=64K
            VS config: 
            vardir: /home/jan/10.4/mysql-test/var
            Checking leftover processes...
             - found old pid 925082 in 'mysqld.1.pid', killing it...
               process did not exist!
            Removing old var directory...
            Creating var directory '/home/jan/10.4/mysql-test/var'...
            Checking supported features...
            MariaDB Version 10.4.23-MariaDB
             - SSL connections supported
             - binaries built with wsrep patch
            Collecting tests...
            Installing system database...
             
            ==============================================================================
             
            TEST                                      RESULT   TIME (ms) or COMMENT
            --------------------------------------------------------------------------
             
            worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
            connection node_2;
            connection node_1;
            connection node_1;
            connection node_2;
            connection node_1;
            CREATE TABLE t1(id int not null primary key auto_increment,
            d1 CHAR(255) NOT NULL DEFAULT '',
            d2 CHAR(255) NOT NULL DEFAULT '',
            d3 CHAR(255) NOT NULL DEFAULT '',
            d4 CHAR(255) NOT NULL DEFAULT '',
            d5 CHAR(255) NOT NULL DEFAULT '',
            d6 CHAR(255) NOT NULL DEFAULT '',
            d7 CHAR(255) NOT NULL DEFAULT '',
            d8 CHAR(255) NOT NULL DEFAULT '',
            d9 CHAR(255) NOT NULL DEFAULT '',
            d10 CHAR(255) NOT NULL DEFAULT '',
            d11 CHAR(255) NOT NULL DEFAULT '',
            d12 CHAR(255) NOT NULL DEFAULT '',
            d13 CHAR(255) NOT NULL DEFAULT '',
            d14 CHAR(255) NOT NULL DEFAULT '',
            d15 CHAR(255) NOT NULL DEFAULT '',
            d16 CHAR(255) NOT NULL DEFAULT '',
            d17 CHAR(255) NOT NULL DEFAULT '',
            d18 CHAR(255) NOT NULL DEFAULT '',
            d19 CHAR(255) NOT NULL DEFAULT '',
            d20 CHAR(255) NOT NULL DEFAULT '',
            d21 CHAR(255) NOT NULL DEFAULT '',
            d22 CHAR(255) NOT NULL DEFAULT '',
            d23 CHAR(255) NOT NULL DEFAULT '',
            d24 CHAR(255) NOT NULL DEFAULT '',
            d25 CHAR(255) NOT NULL DEFAULT '',
            d26 CHAR(255) NOT NULL DEFAULT '',
            d27 CHAR(255) NOT NULL DEFAULT '',
            d28 CHAR(255) NOT NULL DEFAULT '',
            d29 CHAR(255) NOT NULL DEFAULT '',
            d30 CHAR(255) NOT NULL DEFAULT '',
            d31 CHAR(255) NOT NULL DEFAULT '',
            d32 CHAR(255) NOT NULL DEFAULT '',
            d33 CHAR(255) NOT NULL DEFAULT '',
            d34 CHAR(255) NOT NULL DEFAULT '',
            d35 CHAR(255) NOT NULL DEFAULT '',
            d36 CHAR(255) NOT NULL DEFAULT '',
            d37 CHAR(255) NOT NULL DEFAULT '',
            d38 CHAR(255) NOT NULL DEFAULT '',
            d39 CHAR(255) NOT NULL DEFAULT '',
            d40 CHAR(255) NOT NULL DEFAULT '',
            d41 CHAR(255) NOT NULL DEFAULT '',
            d42 CHAR(255) NOT NULL DEFAULT '',
            d43 CHAR(255) NOT NULL DEFAULT '',
            d44 CHAR(255) NOT NULL DEFAULT '',
            d45 CHAR(255) NOT NULL DEFAULT '',
            d46 CHAR(255) NOT NULL DEFAULT '',
            d47 CHAR(255) NOT NULL DEFAULT '',
            d48 CHAR(255) NOT NULL DEFAULT '',
            d49 CHAR(255) NOT NULL DEFAULT '',
            d50 CHAR(255) NOT NULL DEFAULT ''
            ) engine=innodb;
            connection node_2;
            Shutting down server ...
            connection node_1;
            SET GLOBAL innodb_max_dirty_pages_pct=99;
            SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
            create procedure innodb_insert_proc (repeat_count int)
            begin
            declare current_num int;
            set current_num = 0;
            while current_num < repeat_count do
            insert into t1(id,d1,d5,d10,d20,d25,d30) values (NULL,
            substring(MD5(RAND()), -200),
            substring(MD5(RAND()), -200),
            substring(MD5(RAND()), -200),
            substring(MD5(RAND()), -200),
            substring(MD5(RAND()), -200),
            substring(MD5(RAND()), -200)
            );
            set current_num = current_num + 1;
            end while;
            end//
            commit;
            SET autocommit=0;
            call innodb_insert_proc(350000);
            COMMIT;
            SET autocommit=1;
            # Make sure we have dirty pages on buffer pool
            SELECT variable_value FROM information_schema.global_status
            WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
            variable_value
            50656
            # Start flushing dirty pages
            SET GLOBAL innodb_max_dirty_pages_pct_lwm=0;
            SET GLOBAL innodb_max_dirty_pages_pct=0;
            SELECT variable_value FROM information_schema.global_status
            WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
            variable_value
            50755
            # Start SST node_1 is donor and node_2 joiner
            connection node_2;
            Starting server ...
            SELECT COUNT(*) FROM t1;
            COUNT(*)
            350000
            CHECK TABLE t1;
            Table	Op	Msg_type	Msg_text
            test.t1	check	status	OK
            

            jplindst Jan Lindström (Inactive) added a comment

            jplindst, can you please try to reproduce the problem with the following patch:

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 8f5c9ab56ed..51cf116747f 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -19654,8 +19654,7 @@ static MYSQL_SYSVAR_ULONG(buf_dump_status_frequency, srv_buf_dump_status_frequen
              *    innobase_disallow_writes variable definition     *
              *******************************************************/
              
            -/* Must always init to FALSE. */
            -static my_bool	innobase_disallow_writes	= FALSE;
            +my_bool innobase_disallow_writes;
             
             /**************************************************************************
             An "update" method for innobase_disallow_writes variable. */
            diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
            index 9e1eeff202d..38b3c3811da 100644
            --- a/storage/innobase/os/os0file.cc
            +++ b/storage/innobase/os/os0file.cc
            @@ -3890,6 +3890,8 @@ dberr_t os_aio(const IORequest &type, void *buf, os_offset_t offset, size_t n)
             	if (type.is_read()) {
             		++os_n_file_reads;
             	} else {
            +		extern my_bool innobase_disallow_writes;
            +		ut_a(!innobase_disallow_writes);
             		++os_n_file_writes;
             	}
             
            

            The assertion did not fail for me when I tried to run a modified galera.galera_gtid_slave_sst_rsync (as in my previous comment). We would want to make sure that on the donor node, there are dirty pages in the buffer pool, and that buf_flush_page_cleaner will be woken up to issue page writes, shortly before the Galera rsync snapshot transfer mechanism is initiated. My understanding is that SET GLOBAL innodb_disallow_writes=ON only blocks synchronous writes (such as writes to the redo log or writes via innodb_use_native_aio=0) but not asynchronous page writes.
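Marko's hypothesis — the flag gates only the synchronous write path — can be illustrated with a toy model (the class and method names below are mine, not InnoDB's):

```python
class WriteGate:
    """Toy model of the suspected innodb_disallow_writes gap: only the
    synchronous write path consults the flag, so asynchronous page
    writes submitted by the page cleaner slip through."""

    def __init__(self):
        self.disallow_writes = False
        self.log = []

    def sync_write(self, page):
        # e.g. redo log writes, or page writes with innodb_use_native_aio=0
        if self.disallow_writes:
            raise RuntimeError("write blocked")
        self.log.append(("sync", page))

    def async_write(self, page):
        # asynchronous page write: the flag is never checked here, which
        # is exactly the kind of hole the ut_a() patch above is meant to catch
        self.log.append(("async", page))
```

With `disallow_writes` set, `sync_write` raises but `async_write` still goes through, mirroring the suspected behaviour.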

            Once we have something that repeats the above assertion failure, we can correct the innodb_disallow_writes=ON logic so that it does the necessary and sufficient blocking, and nothing else. (The global shared metadata lock acquired by FLUSH TABLES WITH READ LOCK should already block any file creation and renaming, for example.)

            marko Marko Mäkelä added a comment

            marko I can repeat the above assertion using 10.4. I used ./mtr --suite=galera,galera_3nodes --force --parallel=4 --big-test.

            os/os0file.cc:6404(os_aio_func(IORequest&, unsigned long, char const*, pfs_os_file_t, void*, unsigned long, unsigned long, bool, fil_node_t*, void*))[0x560b8d803d5b]
            include/os0file.ic:253(pfs_os_aio_func(IORequest&, unsigned long, char const*, pfs_os_file_t, void*, unsigned long, unsigned long, bool, fil_node_t*, void*, char const*, unsigned int))[0x560b8dad7b76]
            fil/fil0fil.cc:4270(fil_io(IORequest const&, bool, page_id_t, unsigned long, unsigned long, unsigned long, void*, void*, bool))[0x560b8dae8d3b]
            buf/buf0flu.cc:1048(buf_flush_write_block_low(buf_page_t*, buf_flush_t, bool))[0x560b8da5816f]
            buf/buf0flu.cc:1210(buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool))[0x560b8da58ccd]
            buf/buf0flu.cc:1431(buf_flush_try_neighbors(page_id_t, buf_flush_t, unsigned long, unsigned long))[0x560b8da597f6]
            buf/buf0flu.cc:1505(buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*))[0x560b8da59bb7]
            buf/buf0flu.cc:1763(buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long))[0x560b8da5aa4e]
            buf/buf0flu.cc:1834(buf_flush_batch(buf_pool_t*, buf_flush_t, unsigned long, unsigned long, flush_counters_t*))[0x560b8da5b14d]
            buf/buf0flu.cc:1994(buf_flush_do_batch(buf_pool_t*, buf_flush_t, unsigned long, unsigned long, flush_counters_t*))[0x560b8da5b9c6]
            buf/buf0flu.cc:2094(buf_flush_lists(unsigned long, unsigned long, unsigned long*))[0x560b8da5bcd2]
            buf/buf0flu.cc:3252(buf_flush_page_cleaner_coordinator)[0x560b8da5ffa4]
            

            Update: the crash does not reproduce easily when running only a single test case, and the actual test case where it crashes with the above command line varies.

            jplindst Jan Lindström (Inactive) added a comment - edited

            jplindst thank you for your dedication on this issue

            Roel Roel Van de Paar added a comment

            https://github.com/MariaDB/server/commit/1002f9c7af4a6487a045743d925e0c37fc3134d0

            I spent a few hours trying to understand why the innodb.undo_truncate test case fails. However, I could not understand what goes wrong.

            jplindst Jan Lindström (Inactive) added a comment
            jplindst Jan Lindström (Inactive) added a comment - 10.6 version that works fine: https://github.com/MariaDB/server/commit/41eae76ee94a4eba9068c29726122f06ef17deaa

            I reviewed the 10.6 version. Apart from an MDL wait while holding fil_crypt_threads_mutex, it looks like it could work. But, please simplify it further. I do not think that we need the proposed function ha_force_checkpoint() at all. It should be possible to wait for a full buffer pool flush by waiting for the status variable Innodb_buffer_pool_pages_dirty to reach 0.
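Marko's suggestion — poll the status variable rather than add a dedicated ha_force_checkpoint() hook — could look roughly like the sketch below. The function name and the injected fetch_dirty_pages callable (which would, for example, query SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty') are hypothetical names of mine:

```python
import time

def wait_for_clean_buffer_pool(fetch_dirty_pages, timeout=60.0, interval=0.1):
    """Poll until Innodb_buffer_pool_pages_dirty reaches 0.

    fetch_dirty_pages is expected to return the current value of the
    status variable as an integer.  Returns True if the buffer pool
    became clean within the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if fetch_dirty_pages() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The callable is injected so the same loop can be driven by any client library, or by a fake in tests.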

            marko Marko Mäkelä added a comment
            mleich Matthias Leich added a comment - edited

            Results of RQG testing (~10000 tests with different grammars, setups, etc.) on
            origin/bb-10.6-MDEV-24845-galera 41eae76ee94a4eba9068c29726122f06ef17deaa 2022-03-03T18:51:12+02:00
            New:
                Server started on the backed-up data with success.
                A 'CHECK TABLE `test`.`D` EXTENDED' failed with error 2013.
                mysqld: /data/Server/bb-10.6-MDEV-24845-galera/storage/innobase/include/btr0pcur.inl:84: ulint btr_pcur_get_low_match(const btr_pcur_t*): Assertion `btr_cursor->low_match != ((ulint)(-1))' failed.
                ...
                # 2022-03-04T21:42:44 [257480] | /data/Server_bin/bb-10.6-MDEV-24845-galera_asan/bin/mysqld(+0x2b73d06)[0x5638fc359d06]
                # 2022-03-04T21:42:44 [257480] | include/btr0pcur.inl:86(btr_pcur_get_low_match(btr_pcur_t const*))[0x5638fc3619cd]
                # 2022-03-04T21:42:44 [257480] | row/row0purge.cc:534(row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*))[0x5638fc34b744]
                # 2022-03-04T21:42:44 [257480] | row/row0purge.cc:643(row_purge_remove_sec_if_poss(purge_node_t*, dict_index_t*, dtuple_t const*))[0x5638fc34c137]
                # 2022-03-04T21:42:44 [257480] | row/row0purge.cc:714(row_purge_del_mark(purge_node_t*))[0x5638fc34c577]
                # 2022-03-04T21:42:44 [257480] | row/row0purge.cc:1150(row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool))[0x5638fc350294]
                # 2022-03-04T21:42:44 [257480] | row/row0purge.cc:1211(row_purge(purge_node_t*, unsigned char*, que_thr_t*))[0x5638fc3508e7]
                # 2022-03-04T21:42:44 [257480] | row/row0purge.cc:1262(row_purge_step(que_thr_t*))[0x5638fc350cda]
                # 2022-03-04T21:42:44 [257480] | que/que0que.cc:653(que_thr_step(que_thr_t*))[0x5638fc2422bf]
                # 2022-03-04T21:42:44 [257480] | que/que0que.cc:709(que_run_threads_low(que_thr_t*))[0x5638fc242667]
                # 2022-03-04T21:42:44 [257480] | que/que0que.cc:731(que_run_threads(que_thr_t*))[0x5638fc242809]
                # 2022-03-04T21:42:44 [257480] | srv/srv0srv.cc:1715(srv_task_execute())[0x5638fc3ceecb]
                # 2022-03-04T21:42:44 [257480] | srv/srv0srv.cc:1957(purge_worker_callback(void*))[0x5638fc3cfa7b]
                # 2022-03-04T21:42:44 [257480] | tpool/task_group.cc:56(tpool::task_group::execute(tpool::task*))[0x5638fc78e81c]
                # 2022-03-04T21:42:44 [257480] | tpool/task.cc:40(tpool::task::execute())[0x5638fc78f0e5]
                # 2022-03-04T21:42:44 [257480] | tpool/tpool_generic.cc:547(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0x5638fc77b893]
                # 2022-03-04T21:42:44 [257480] | bits/invoke.h:73(void std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(std::__invoke_memfun_deref, void (tpool::thread_pool_generic::*&&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&))[0x5638fc78e27d]
                # 2022-03-04T21:42:44 [257480] | bits/invoke.h:96(std::__invoke_result<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>::type std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(void (tpool::thread_pool_generic::*&&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&))[0x5638fc78e028]
                # 2022-03-04T21:42:44 [257480] | 9/thread:244(void std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>))[0x5638fc78deb9]
                # 2022-03-04T21:42:44 [257480] | 9/thread:247(std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator()())[0x5638fc78de03]
                # 2022-03-04T21:42:44 [257480] | 9/thread:195(std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run())[0x5638fc78dd68]
                # 2022-03-04T21:42:44 [257480] | ??:0(std::error_code::default_error_condition() const)[0x7f3b986dede4]
                # 2022-03-04T21:42:44 [257480] | nptl/pthread_create.c:478(start_thread)[0x7f3b987f9609]
                # 2022-03-04T21:42:44 [257480] | ??:0(clone)[0x7f3b983cc293]
                Query (0x0): (null)
                Connection ID (thread ID): 0
                Status: NOT_KILLED
            Old:
             -   [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.*
                 This problem sometimes occurs when a too-short semaphore timeout, fat tables,
                 and small buffer pools combine. Then it is rather a false alarm,
                 but on the MDEV-24845 tree it is seriously more frequent.
            - Mariabackup --backup fails to finish its operation within the timeout set by the RQG
              component (180s, or 360s if using rr).
              This problem usually occurs extremely rarely, but on the current tree it is very frequent.
               
            


            https://github.com/MariaDB/server/commit/7cbf809fb18420b47e5536c17506f89b1b78488b

            I addressed those review comments that I found relevant. I do not want to remove the forced flushing of the buffer pool and the forced checkpoint, because I would like the joiner to start from a cleaner state. If we just waited for the dirty-page count, the InnoDB checkpoint from which recovery in the joiner starts could be old.

            jplindst Jan Lindström (Inactive) added a comment

            Can you please provide your analysis of the previously provided rr replay traces for hangs, and how you addressed them?

            Ever since MDEV-23855 and MDEV-27416, it is the dedicated buf_flush_page_cleaner thread that is solely responsible for initiating log checkpoints. A checkpoint can be initiated by SET GLOBAL innodb_max_dirty_pages_pct=0 or by invoking buf_flush_sync(). If writes to the redo log are blocked, a single call of buf_flush_sync() should suffice to clean the buffer pool. However, buf_flush_sync() will not wait for the checkpoint to be updated, because buf_pool.done_flush_list will be broadcast already after the pages have been written out from the buffer pool. If you want the server to start up fast after the snapshot transfer, you will also have to wait for the checkpoint to be updated. The function log_make_checkpoint() would do exactly that.

            marko Marko Mäkelä added a comment

            I analyzed one hang.

            rr replay /data/results/1647535234/Backup_freeze/dev/shm/rqg/1647535234/3/1/rr/latest-trace
            

            The statement BACKUP STAGE BLOCK_COMMIT is blocked. I could not figure out where to set watchpoints in the MDL data structures, but by checking the stack traces shortly before the SIGKILL, I noticed that there is an ALTER TABLE waiting in fts_optimize_remove_table() for fts_optimize_callback() to remove the table from fts_optimize_wq. That will never happen, because the MDL acquisition that you added to fts_optimize_callback() will always fail:

            2866				if (thd_get_backup_lock(fts_opt_thd, &mdl)) {
            Value returned is $6 = 1
            (rr) n
            2867					if (n_tables)
            (rr) 
            2868						timer->set_time(5000, 0);
            (rr) 
            2869					return;
            

            Even if the backup lock acquisition fails, I think that fts_optimize_callback() must process those requests that will not result in any writes, namely anything else except fts_optimize_sync_table().

            marko Marko Mäkelä added a comment

            mleich, I pushed a new version that addresses Marko's review comments and the above problem to bb-10.6-MDEV-24845-galera. Can you run the tests again?

            jplindst Jan Lindström (Inactive) added a comment

            I posted some review comments to https://github.com/MariaDB/server/commit/e56f90eaf0faffa0d31f8fd11e2625293f422eff

            Most notably, it seems to me that there is a busy loop in dict_stats_func().

            marko Marko Mäkelä added a comment

            Fix for rsync race condition pushed to 10.3. Ref: https://github.com/MariaDB/server/pull/2053

            jplindst Jan Lindström (Inactive) added a comment

            MDEV-28185 needs to be fixed before the variable innodb_disallow_writes can be removed.

            marko Marko Mäkelä added a comment

            jplindst, I see that you already pushed something with this ticket number. I suggest that this ticket be closed and the parameter innodb_disallow_writes be removed in 10.5 or a later release under MDEV-25975, because I do not think that it is easy to refactor the locking around the Galera rsync snapshot transfer in older releases.

            marko Marko Mäkelä added a comment

            I submitted a proposed simpler fix in MDEV-25975.

            marko Marko Mäkelä added a comment

            The variable innodb_disallow_writes was removed in MDEV-25975.

            marko Marko Mäkelä added a comment

            People

              jplindst Jan Lindström (Inactive)
              Roel Roel Van de Paar
              Votes: 1
              Watchers: 9
