Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6
Description
Discussing with Marko.

# mysqld options required for replay: --innodb_fatal_semaphore_wait_threshold=7

SET @@global.innodb_disallow_writes=ON;
CREATE TABLE t (c INT);

Then connect with another client:
10.6.0 bfb4761ca04704d68dba51f76d7c9967f880a6ee (Debug)
10.6.0>SHOW FULL PROCESSLIST;
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
| Id | User | Host      | db   | Command | Time | State          | Info                   | Progress |
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
|  4 | root | localhost | test | Query   |  225 | creating table | CREATE TABLE t (c INT) |    0.000 |
|  5 | root | localhost | test | Query   |    0 | starting       | SHOW FULL PROCESSLIST  |    0.000 |
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
2 rows in set (0.000 sec)
...

10.6.0>SHOW FULL PROCESSLIST;
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
| Id | User | Host      | db   | Command | Time | State          | Info                   | Progress |
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
|  4 | root | localhost | test | Query   |  604 | creating table | CREATE TABLE t (c INT) |    0.000 |
|  5 | root | localhost | test | Query   |    0 | starting       | SHOW FULL PROCESSLIST  |    0.000 |
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
2 rows in set (0.000 sec)
This looks normal, except that the server should abort after 7 seconds due to the innodb_fatal_semaphore_wait_threshold=7 setting. Also, CTRL+C kills the query but leaves the client hanging:
10.6.0 bfb4761ca04704d68dba51f76d7c9967f880a6ee (Debug)
10.6.0>CREATE TABLE t (c INT);
^CCtrl-C -- query killed. Continuing normally.
And in the other client we can then see a hanging killed thread:
10.6.0 bfb4761ca04704d68dba51f76d7c9967f880a6ee (Debug)
10.6.0>SHOW FULL PROCESSLIST;
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
| Id | User | Host      | db   | Command | Time | State          | Info                   | Progress |
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
|  4 | root | localhost | test | Killed  |  683 | creating table | CREATE TABLE t (c INT) |    0.000 |
|  5 | root | localhost | test | Query   |    0 | starting       | SHOW FULL PROCESSLIST  |    0.000 |
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
2 rows in set (0.000 sec)
Another CTRL+C in the CLI makes us lose the connection to the server (ERROR 2013 (HY000): Lost connection to MySQL server during query). We can then exit to the command line and re-enter the client, but the killed, hanging CREATE TABLE remains:
10.6.0 bfb4761ca04704d68dba51f76d7c9967f880a6ee (Debug)
10.6.0>SHOW FULL PROCESSLIST;
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
| Id | User | Host      | db   | Command | Time | State          | Info                   | Progress |
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
|  4 | root | localhost | test | Killed  |  811 | creating table | CREATE TABLE t (c INT) |    0.000 |
|  5 | root | localhost | test | Query   |    0 | starting       | SHOW FULL PROCESSLIST  |    0.000 |
|  8 | root | localhost | test | Sleep   |   83 |                | NULL                   |    0.000 |
+----+------+-----------+------+---------+------+----------------+------------------------+----------+
3 rows in set (0.001 sec)
Various other related test cases seem to give different outcomes (more testing to be done).
Issue Links
- is blocked by
  - MDEV-27934 Test SST after removing innodb_disallow_writes (Closed)
- is duplicated by
  - MDEV-26491 SIGABRT when setting innodb_disallow_writes and --innodb-fatal-semaphore-wait-threshold=2 and attempting DROP TABLE (Closed)
- relates to
  - MDEV-25975 Turning on innodb_disallow_writes will cause mysqladmin shutdown to hang (Closed)
  - MDEV-27053 Crash on assertion failure in btr0cur.cc - apparent index corruption (Closed)
  - MDEV-28185 InnoDB generates redundant log checkpoints (Closed)
  - MDEV-32115 Assertion `!recv_no_log_write' failed in bool log_checkpoint(bool) (Open)
  - MDEV-5336 Implement BACKUP STAGE for safe external backups (Closed)
  - MDEV-21452 Use condition variables and normal mutexes instead of InnoDB os_event and mutex (Closed)
Activity
I think that the parameter innodb_disallow_writes was badly designed and implemented. Instead of blocking arbitrary user threads that happen to trigger I/O, it would be better to disallow the writes at the high level.
The backup locks that were introduced in MDEV-5336 are a better solution: blocking writes at the high level. Starting with MDEV-16678 in 10.5, the backup locks might actually block the purge of history. Starting with MDEV-19514 in 10.5, no change buffer merge is happening in the background.
I think that the only writes that might happen while the strongest backup lock is granted is page flushing related to an InnoDB log checkpoint. (Note that if the backup lock does not block reads from the SQL layer, change buffer merge may still be triggered by a read, even after MDEV-19514.)
The easy fix to avoid hangs related to innodb_disallow_writes would be to only set this parameter while holding a backup lock that prevents any access to InnoDB tables. (Reads would be OK if it were not for the change buffer merge).
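For illustration only, a minimal SQL sketch of that idea (the statement sequence here is an assumption, not what the Galera SST scripts currently execute):
-- hypothetical sketch: hold the strongest backup lock around the write freeze
BACKUP STAGE START;
BACKUP STAGE FLUSH;
BACKUP STAGE BLOCK_DDL;
BACKUP STAGE BLOCK_COMMIT;
SET GLOBAL innodb_disallow_writes=ON;
-- ... file-level copy (e.g. rsync) would happen here ...
SET GLOBAL innodb_disallow_writes=OFF;
BACKUP STAGE END;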
The InnoDB watchdog was rewritten in MDEV-21452, and it currently only covers dict_sys.mutex. I am reluctant to relax the watchdog for the previous cases where it was relaxed: innodb_disallow_writes is set, or a CHECK TABLE is running.
Roel, please try to also reproduce the watchdog kill during CHECK TABLE. I think that the proper fix for that one would be to acquire MDL_EXCLUSIVE before acquiring the dict_index_t::lock in btr_validate_level().
Out of curiosity, why can the following functions be paused by innodb_disallow_writes?
- row_merge_file_create_low(): This is used during certain forms of ALTER TABLE. Shouldn’t any DDL be blocked by MDL before snapshot transfer?
- os_file_create_tmpfile(): Probably duplicating the above.
- os_file_flush_func(): This is only executing fsync() or fdatasync(). What purpose does the blocking serve?
- os_file_create_simple_func(): This could be executed during DDL. But, why do we not block DDL by MDL?
- os_file_create_directory(): Invoked only during DDL when a DATA DIRECTORY clause is present. Why not use MDL?
- os_file_create_simple_no_error_handling(): Invoked when opening a file for *reading* (yes, the naming is confusing), at database startup or IMPORT TABLESPACE. Again, why not use MDL?
- os_file_delete_if_exists_func(): Invoked during crash recovery or during DDL. All DDL could easily be blocked by suitable MDL.
- os_file_delete_func(): Ditto.
- os_file_rename_func(): Ditto.
- os_file_create_func(): Blocked when creating (not merely opening) a file. File creation should only happen during DDL and be easily blocked by MDL.
- os_file_set_eof(): Only invoked on some temporary files that buffer InnoDB status messages. These temporary files should not be part of a snapshot transfer. Why block these?
- os_file_write_func(): This is where *synchronous* writes to InnoDB files (possibly including temporary files) happen.
It looks like when innodb_use_native_aio is in effect, innodb_disallow_writes will not block the actual writes to the InnoDB data files at all. Writes to the redo log are being blocked in os_file_write_func().
I would propose the following:
- Consider flushing the entire buffer pool to minimize the recovery time at the receiver: SET GLOBAL innodb_max_dirty_pages_pct=0;
- Acquire a metadata lock (FLUSH TABLES WITH READ LOCK or backup locks) to block all writes from the SQL layer.
- Pause the purge of transaction history: purge_sys.stop().
- Block buf_flush_page() or its callers, except for the temporary tablespace. This is what initiates persistent InnoDB page writes.
- Ensure that log_free_check() will block.
- Block the log file writes in log0log.cc.
- Invoke os_aio_wait_until_no_pending_writes() to ensure that all writes to data pages have completed.
Some of these steps are redundant.
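Only part of that proposal is reachable from SQL; a rough sketch of the SQL-visible steps follows (the purge, page-flush and redo-log blocking would have to be implemented inside InnoDB and are only noted in comments):
-- sketch: SQL-visible portion of the proposal only
SET GLOBAL innodb_max_dirty_pages_pct=0;   -- ask the page cleaner to flush everything
FLUSH TABLES WITH READ LOCK;               -- block all writes from the SQL layer
-- internal steps, not expressible in SQL: purge_sys.stop(), blocking buf_flush_page()
-- and log_free_check(), blocking redo log writes, os_aio_wait_until_no_pending_writes()
SELECT variable_value FROM information_schema.global_status
 WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';  -- should eventually report 0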
The fix would be very different for versions before 10.5, because the page flushing has been heavily refactored in 10.5.
I think there is some misunderstanding about the usage of innodb_disallow_writes. It is not intended to be used by users at all (it is actually worse than SET GLOBAL wsrep_on=OFF). The only reasonable usage is when Galera performs an SST, and even there I think we could use something better starting from 10.4. I do not see much hope of fixing this issue in 10.2-10.3, but I will let seppo decide. Note that I decreased the priority after careful consideration, because I do not see any customers affected by this issue.
As far as I can tell, the parameter innodb_disallow_writes is being set by the Galera function sst_disallow_writes(), right after sst_flush_tables() was invoked. I see that sst_flush_tables() invokes FLUSH TABLES WITH READ LOCK. What it definitely fails to do is
SET GLOBAL innodb_max_dirty_pages_pct=0;
Currently, innodb_disallow_writes would block any ‘new’ InnoDB writes of persistent files by blocking writes to the redo log. But, if the buffer pool contains any modified pages, writes of such pages can be submitted asynchronously via io_submit() or io_uring. On those code paths, there is no check of innodb_disallow_writes.
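Paraphrasing the donor-side sequence as plain SQL (a sketch only; the actual wsrep_sst scripts drive this through their own connections), the gap would be roughly here:
FLUSH TABLES WITH READ LOCK;                 -- sst_flush_tables()
-- missing: SET GLOBAL innodb_max_dirty_pages_pct=0; and a wait for the flush to complete
SET GLOBAL innodb_disallow_writes=ON;        -- sst_disallow_writes()
-- rsync copies the data directory here, while the page cleaner may still submit
-- asynchronous page writes that bypass the innodb_disallow_writes check
SET GLOBAL innodb_disallow_writes=OFF;
UNLOCK TABLES;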
I hope that someone would test the following when using a very large innodb_log_file_size and innodb_buffer_pool_size (so that checkpoints and page eviction will not occur), and innodb_page_size=64k:
SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
-- Modify lots of InnoDB data pages, and ensure that a large number is reported:
SELECT variable_value FROM information_schema.global_status
WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
-- Initiate snapshot transfer. While it is running, issue
SET GLOBAL innodb_max_dirty_pages_pct=0;
I would expect recovery to fail on the new replica when using the rsync snapshot algorithm, because as far as I can tell, InnoDB data page writes may run concurrently with rsync. Recovery should fail if rsync copies a page that is being simultaneously written. If recovery does not fail, also try running CHECK TABLE on the new replica.
I may have misunderstood something, but most tests in the galera suite that have rsync in their name seem to initiate a snapshot transfer on an idle cluster.
We only seem to have one test that may cover some sort of concurrent writes during an rsync snapshot transfer. Reproducing the race condition between asynchronous page writes and rsync would probably require a much larger amount of page writes. I was unable to reproduce problems with the following:
diff --git a/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test b/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
--- a/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
+++ b/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
@@ -9,6 +9,9 @@
 --source include/big_test.inc
 --source include/have_innodb.inc
 --source include/galera_cluster.inc
+--source include/have_sequence.inc
+
+SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
 
 # As node #3 is not a Galera node, and galera_cluster.inc does not open connetion to it
 # we open the node_3 connection here
@@ -59,6 +62,8 @@ START TRANSACTION;
 INSERT INTO t1 VALUES ('node1_committed_before');
 INSERT INTO t1 VALUES ('node1_committed_before');
 COMMIT;
+BEGIN;INSERT INTO t2 SELECT seq,seq*10 FROM seq_9_to_1000;ROLLBACK;
+SET GLOBAL innodb_max_dirty_pages_pct=0;
 --source include/save_master_gtid.inc
 --echo #Connection 2
 --connection node_2
I also tried to pad the table so that a few more pages would be written, but it did not help:
diff --git a/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test b/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
--- a/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
+++ b/mysql-test/suite/galera/t/galera_gtid_slave_sst_rsync.test
@@ -9,6 +9,9 @@
 --source include/big_test.inc
 --source include/have_innodb.inc
 --source include/galera_cluster.inc
+--source include/have_sequence.inc
+
+SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
 
 # As node #3 is not a Galera node, and galera_cluster.inc does not open connetion to it
 # we open the node_3 connection here
@@ -23,10 +26,10 @@ START SLAVE;
 
 --echo #Connection 3
 --connection node_3
-CREATE TABLE t2 (f1 INTEGER PRIMARY KEY, f2 int unique) ENGINE=InnoDB;
-INSERT INTO t2 VALUES(1,11);
-INSERT INTO t2 VALUES(2,22);
-INSERT INTO t2 VALUES(3,33);
+CREATE TABLE t2 (f1 INTEGER PRIMARY KEY, f2 int unique, f3 CHAR(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT '') ENGINE=InnoDB;
+INSERT INTO t2 VALUES(1,11,'');
+INSERT INTO t2 VALUES(2,22,'');
+INSERT INTO t2 VALUES(3,33,'');
 
 SELECT @@global.gtid_binlog_state;
 --source include/save_master_gtid.inc
@@ -35,9 +38,9 @@ SELECT @@global.gtid_binlog_state;
 --source include/sync_with_master_gtid.inc
 SELECT @@global.gtid_binlog_state;
 
-INSERT INTO t2 VALUES(4,44);
-INSERT INTO t2 VALUES(5,55);
-INSERT INTO t2 VALUES(6,66);
+INSERT INTO t2 VALUES(4,44,'');
+INSERT INTO t2 VALUES(5,55,'');
+INSERT INTO t2 VALUES(6,66,'');
 SELECT @@global.gtid_binlog_state;
 --echo #Connection 1
 --connection node_1
@@ -46,8 +49,8 @@ SELECT @@global.gtid_binlog_state;
 --let $wait_condition = SELECT COUNT(*) = 6 FROM t2;
 --source include/wait_condition.inc
 
-INSERT INTO t2 VALUES(7,77);
-INSERT INTO t2 VALUES(8,88);
+INSERT INTO t2 VALUES(7,77,'');
+INSERT INTO t2 VALUES(8,88,'');
 SELECT @@global.gtid_binlog_state;
 
 #Perform SST
@@ -59,6 +62,8 @@ START TRANSACTION;
 INSERT INTO t1 VALUES ('node1_committed_before');
 INSERT INTO t1 VALUES ('node1_committed_before');
 COMMIT;
+BEGIN;INSERT INTO t2 SELECT seq,seq*10,seq FROM seq_9_to_1000;ROLLBACK;
+SET GLOBAL innodb_max_dirty_pages_pct=0;
 --source include/save_master_gtid.inc
 --echo #Connection 2
 --connection node_2
For some reason, this test is ignoring the --parallel setting. My invocation was as follows:
./mtr --mysqld=--innodb-page-size=64k --mysqld=--innodb-buffer-pool-size=28m --parallel=auto --repeat=10 galera.galera_gtid_slave_sst_rsync{,,,,,,,}
Note: I may have misinterpreted something. It is not clear to me what exactly is triggering the snapshot transfer and from which cluster member the data is being copied.
I tried with the attached test case but could not reproduce anything bad. Either I have too few dirty pages or too slow a laptop.
jan@jan-HP-ZBook-15u-G5:~/mysql/10.4/mysql-test$ ./mtr galera.galera_sst_rsync_load
Logging: ./mtr galera.galera_sst_rsync_load
VS config:
vardir: /home/jan/mysql/10.4/mysql-test/var
Checking leftover processes...
- found old pid 2745195 in 'mysqld.1.pid', killing it...
process did not exist!
Removing old var directory...
Creating var directory '/home/jan/mysql/10.4/mysql-test/var'...
Checking supported features...
MariaDB Version 10.4.23-MariaDB-debug
- SSL connections supported
- binaries are debug compiled
- binaries built with wsrep patch
Collecting tests...
Installing system database...

==============================================================================

TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
connection node_2;
connection node_1;
connection node_1;
connection node_2;
connection node_1;
CREATE TABLE t1(id int not null primary key auto_increment,
b int, c int, d char(255), key (b)) engine=innodb;
connection node_2;
Shutting down server ...
connection node_1;
SET GLOBAL innodb_max_dirty_pages_pct=99;
SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
create procedure innodb_insert_proc (repeat_count int)
begin
declare current_num int;
set current_num = 0;
while current_num < repeat_count do
insert into t1 values (NULL,current_num,current_num, substring(MD5(RAND()), -200));
update t1 set b = current_num +100,
d = substring(MD5(RAND()), -200)
where b = current_num;
set current_num = current_num + 1;
end while;
end//
commit;
SET autocommit=0;
call innodb_insert_proc(7000);
COMMIT;
SELECT variable_value FROM information_schema.global_status
WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
variable_value
158
SET GLOBAL innodb_max_dirty_pages_pct_lwm=0;
SET GLOBAL innodb_max_dirty_pages_pct=0;
connection node_2;
Starting server ...
SELECT COUNT(*) FROM t1;
COUNT(*)
7000
CHECK TABLE t1;
Table Op Msg_type Msg_text
test.t1 check status OK
DROP PROCEDURE innodb_insert_proc;
DROP TABLE t1;
galera.galera_sst_rsync_load 'innodb' [ pass ] 217410
A buf_pool.flush_list size of 158 pages is not really much, and due to MDEV-25113, that list may actually contain clean pages (with the special value oldest_modification()==1) that have already been written out.
To reproduce this, we only need one change per index page. I would suggest bloating the records as much as possible with CHAR(255) NOT NULL DEFAULT '' columns, so that there will be few records in an innodb_page_size=64k page. About 120 such columns should fit in a 64KiB page, or maybe about 60 if we are limiting the maximum record size to 16383 bytes, as I vaguely remember. With such heavily padded records, we should have 2 to 4 records per leaf page.
Writing constant strings should be faster than writing pseudorandom strings. I do not see any benefit in writing pseudorandom data to non-indexed columns.
I do not think that we need any secondary indexes. We will only have to modify one non-indexed column in at least one record in each page. The type of that column could be an integer.
It may be beneficial to use a release build, just like our users do.
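A table along those lines could look as follows (a sketch only; the column and row counts are guesses that would need tuning, and the sequence engine is assumed to be available):
-- heavily padded rows, no secondary indexes, one small non-indexed column to modify
CREATE TABLE t_padded (
  id INT NOT NULL PRIMARY KEY,
  counter INT NOT NULL DEFAULT 0,
  pad01 CHAR(255) NOT NULL DEFAULT '',
  pad02 CHAR(255) NOT NULL DEFAULT '',
  pad03 CHAR(255) NOT NULL DEFAULT '',
  pad04 CHAR(255) NOT NULL DEFAULT ''
  -- ... repeat the padding columns up to the record size limit ...
) ENGINE=InnoDB;
INSERT INTO t_padded (id) SELECT seq FROM seq_1_to_100000;
-- dirty at least one record in every leaf page, writing only a constant value
UPDATE t_padded SET counter=counter+1;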
I still can't reproduce anything bad. This time I used a 10G buffer pool and the galapq machine with a release build. I let my test run for 20 iterations and still could not see any crashes. Based on this I do not see any evidence that the rsync SST method would be broken in Galera. I am still running a bigger batch of 50 iterations on that machine.
jan@galapq:~/10.4/mysql-test$ ./mtr galera.galera_sst_rsync_load --mysqld=--innodb-page-size=64K
Logging: ./mtr galera.galera_sst_rsync_load --mysqld=--innodb-page-size=64K
VS config:
vardir: /home/jan/10.4/mysql-test/var
Checking leftover processes...
- found old pid 925082 in 'mysqld.1.pid', killing it...
process did not exist!
Removing old var directory...
Creating var directory '/home/jan/10.4/mysql-test/var'...
Checking supported features...
MariaDB Version 10.4.23-MariaDB
- SSL connections supported
- binaries built with wsrep patch
Collecting tests...
Installing system database...

==============================================================================

TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
connection node_2;
connection node_1;
connection node_1;
connection node_2;
connection node_1;
CREATE TABLE t1(id int not null primary key auto_increment,
d1 CHAR(255) NOT NULL DEFAULT '',
d2 CHAR(255) NOT NULL DEFAULT '',
d3 CHAR(255) NOT NULL DEFAULT '',
d4 CHAR(255) NOT NULL DEFAULT '',
d5 CHAR(255) NOT NULL DEFAULT '',
d6 CHAR(255) NOT NULL DEFAULT '',
d7 CHAR(255) NOT NULL DEFAULT '',
d8 CHAR(255) NOT NULL DEFAULT '',
d9 CHAR(255) NOT NULL DEFAULT '',
d10 CHAR(255) NOT NULL DEFAULT '',
d11 CHAR(255) NOT NULL DEFAULT '',
d12 CHAR(255) NOT NULL DEFAULT '',
d13 CHAR(255) NOT NULL DEFAULT '',
d14 CHAR(255) NOT NULL DEFAULT '',
d15 CHAR(255) NOT NULL DEFAULT '',
d16 CHAR(255) NOT NULL DEFAULT '',
d17 CHAR(255) NOT NULL DEFAULT '',
d18 CHAR(255) NOT NULL DEFAULT '',
d19 CHAR(255) NOT NULL DEFAULT '',
d20 CHAR(255) NOT NULL DEFAULT '',
d21 CHAR(255) NOT NULL DEFAULT '',
d22 CHAR(255) NOT NULL DEFAULT '',
d23 CHAR(255) NOT NULL DEFAULT '',
d24 CHAR(255) NOT NULL DEFAULT '',
d25 CHAR(255) NOT NULL DEFAULT '',
d26 CHAR(255) NOT NULL DEFAULT '',
d27 CHAR(255) NOT NULL DEFAULT '',
d28 CHAR(255) NOT NULL DEFAULT '',
d29 CHAR(255) NOT NULL DEFAULT '',
d30 CHAR(255) NOT NULL DEFAULT '',
d31 CHAR(255) NOT NULL DEFAULT '',
d32 CHAR(255) NOT NULL DEFAULT '',
d33 CHAR(255) NOT NULL DEFAULT '',
d34 CHAR(255) NOT NULL DEFAULT '',
d35 CHAR(255) NOT NULL DEFAULT '',
d36 CHAR(255) NOT NULL DEFAULT '',
d37 CHAR(255) NOT NULL DEFAULT '',
d38 CHAR(255) NOT NULL DEFAULT '',
d39 CHAR(255) NOT NULL DEFAULT '',
d40 CHAR(255) NOT NULL DEFAULT '',
d41 CHAR(255) NOT NULL DEFAULT '',
d42 CHAR(255) NOT NULL DEFAULT '',
d43 CHAR(255) NOT NULL DEFAULT '',
d44 CHAR(255) NOT NULL DEFAULT '',
d45 CHAR(255) NOT NULL DEFAULT '',
d46 CHAR(255) NOT NULL DEFAULT '',
d47 CHAR(255) NOT NULL DEFAULT '',
d48 CHAR(255) NOT NULL DEFAULT '',
d49 CHAR(255) NOT NULL DEFAULT '',
d50 CHAR(255) NOT NULL DEFAULT ''
) engine=innodb;
connection node_2;
Shutting down server ...
connection node_1;
SET GLOBAL innodb_max_dirty_pages_pct=99;
SET GLOBAL innodb_max_dirty_pages_pct_lwm=99;
create procedure innodb_insert_proc (repeat_count int)
begin
declare current_num int;
set current_num = 0;
while current_num < repeat_count do
insert into t1(id,d1,d5,d10,d20,d25,d30) values (NULL,
substring(MD5(RAND()), -200),
substring(MD5(RAND()), -200),
substring(MD5(RAND()), -200),
substring(MD5(RAND()), -200),
substring(MD5(RAND()), -200),
substring(MD5(RAND()), -200)
);
set current_num = current_num + 1;
end while;
end//
commit;
SET autocommit=0;
call innodb_insert_proc(350000);
COMMIT;
SET autocommit=1;
# Make sure we have dirty pages on buffer pool
SELECT variable_value FROM information_schema.global_status
WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
variable_value
50656
# Start flushing dirty pages
SET GLOBAL innodb_max_dirty_pages_pct_lwm=0;
SET GLOBAL innodb_max_dirty_pages_pct=0;
SELECT variable_value FROM information_schema.global_status
WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
variable_value
50755
# Start SST node_1 is donor and node_2 joiner
connection node_2;
Starting server ...
SELECT COUNT(*) FROM t1;
COUNT(*)
350000
CHECK TABLE t1;
Table Op Msg_type Msg_text
test.t1 check status OK
jplindst, can you please try to reproduce the problem with the following patch:
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index 8f5c9ab56ed..51cf116747f 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -19654,8 +19654,7 @@ static MYSQL_SYSVAR_ULONG(buf_dump_status_frequency, srv_buf_dump_status_frequen
 * innobase_disallow_writes variable definition *
 *******************************************************/
 
-/* Must always init to FALSE. */
-static my_bool innobase_disallow_writes = FALSE;
+my_bool innobase_disallow_writes;
 
 /**************************************************************************
 An "update" method for innobase_disallow_writes variable. */
diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
index 9e1eeff202d..38b3c3811da 100644
--- a/storage/innobase/os/os0file.cc
+++ b/storage/innobase/os/os0file.cc
@@ -3890,6 +3890,8 @@ dberr_t os_aio(const IORequest &type, void *buf, os_offset_t offset, size_t n)
 if (type.is_read()) {
 ++os_n_file_reads;
 } else {
+ extern my_bool innobase_disallow_writes;
+ ut_a(!innobase_disallow_writes);
 ++os_n_file_writes;
 }
The assertion did not fail for me when I tried to run a modified galera.galera_gtid_slave_sst_rsync (as in my previous comment). We would want to make sure that on the donor node, there are dirty pages in the buffer pool, and that buf_flush_page_cleaner will be woken up to issue page writes, shortly before the Galera rsync snapshot transfer mechanism is initiated. My understanding is that SET GLOBAL innodb_disallow_writes=ON only blocks synchronous writes (such as writes to the redo log or writes via innodb_use_native_aio=0) but not asynchronous page writes.
Once we have something that repeats the above assertion failure, we can correct the innodb_disallow_writes=ON logic so that it does the necessary and sufficient blocking, and nothing else. (The global shared metadata lock acquired by FLUSH TABLES WITH READ LOCK should already block any file creation and renaming, for example.)
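That FTWRL property is easy to sanity-check from two client sessions (expected behaviour sketched below, not a captured log):
-- session 1
FLUSH TABLES WITH READ LOCK;
-- session 2: these should block on the global read lock until session 1 runs
-- UNLOCK TABLES, i.e. file creation and renaming never reach InnoDB meanwhile
CREATE TABLE t_new (a INT) ENGINE=InnoDB;
RENAME TABLE t_new TO t_renamed;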
marko, I can repeat the above assertion failure using 10.4. I used ./mtr --suite=galera,galera_3nodes --force --parallel=4 --big-test.
os/os0file.cc:6404(os_aio_func(IORequest&, unsigned long, char const*, pfs_os_file_t, void*, unsigned long, unsigned long, bool, fil_node_t*, void*))[0x560b8d803d5b]
include/os0file.ic:253(pfs_os_aio_func(IORequest&, unsigned long, char const*, pfs_os_file_t, void*, unsigned long, unsigned long, bool, fil_node_t*, void*, char const*, unsigned int))[0x560b8dad7b76]
fil/fil0fil.cc:4270(fil_io(IORequest const&, bool, page_id_t, unsigned long, unsigned long, unsigned long, void*, void*, bool))[0x560b8dae8d3b]
buf/buf0flu.cc:1048(buf_flush_write_block_low(buf_page_t*, buf_flush_t, bool))[0x560b8da5816f]
buf/buf0flu.cc:1210(buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool))[0x560b8da58ccd]
buf/buf0flu.cc:1431(buf_flush_try_neighbors(page_id_t, buf_flush_t, unsigned long, unsigned long))[0x560b8da597f6]
buf/buf0flu.cc:1505(buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*))[0x560b8da59bb7]
buf/buf0flu.cc:1763(buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long))[0x560b8da5aa4e]
buf/buf0flu.cc:1834(buf_flush_batch(buf_pool_t*, buf_flush_t, unsigned long, unsigned long, flush_counters_t*))[0x560b8da5b14d]
buf/buf0flu.cc:1994(buf_flush_do_batch(buf_pool_t*, buf_flush_t, unsigned long, unsigned long, flush_counters_t*))[0x560b8da5b9c6]
buf/buf0flu.cc:2094(buf_flush_lists(unsigned long, unsigned long, unsigned long*))[0x560b8da5bcd2]
buf/buf0flu.cc:3252(buf_flush_page_cleaner_coordinator)[0x560b8da5ffa4]
Update: the crash does not reproduce easily when running a single test case only, and the actual test case where it crashes on the above command line varies.
https://github.com/MariaDB/server/commit/1002f9c7af4a6487a045743d925e0c37fc3134d0
I spent a few hours trying to understand why the innodb.undo_truncate test case fails. However, I could not understand what goes wrong.
10.6 version that works fine: https://github.com/MariaDB/server/commit/41eae76ee94a4eba9068c29726122f06ef17deaa
I reviewed the 10.6 version. Apart from an MDL wait while holding fil_crypt_threads_mutex, it looks like it could work. But, please simplify it further. I do not think that we need the proposed function ha_force_checkpoint() at all. It should be possible to wait for a full buffer pool flush by waiting for the status variable Innodb_buffer_pool_pages_dirty to reach 0.
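In the test/SST context, that wait could be expressed without any new server function, for example (a sketch using the existing wait_condition.inc include, not the reviewed code):
SET GLOBAL innodb_max_dirty_pages_pct=0;
--let $wait_condition = SELECT variable_value = 0 FROM information_schema.global_status WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY'
--source include/wait_condition.inc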
Results of RQG testing (~10000 tests with different grammars, setups etc.) on
origin/bb-10.6-MDEV-24845-galera 41eae76ee94a4eba9068c29726122f06ef17deaa 2022-03-03T18:51:12+02:00

New:
Server started on backupped data with success.
A 'CHECK TABLE `test`.`D` EXTENDED' failed with : 2013.
mysqld: /data/Server/bb-10.6-MDEV-24845-galera/storage/innobase/include/btr0pcur.inl:84: ulint btr_pcur_get_low_match(const btr_pcur_t*): Assertion `btr_cursor->low_match != ((ulint)(-1))' failed.
...
# 2022-03-04T21:42:44 [257480] | /data/Server_bin/bb-10.6-MDEV-24845-galera_asan/bin/mysqld(+0x2b73d06)[0x5638fc359d06]
# 2022-03-04T21:42:44 [257480] | include/btr0pcur.inl:86(btr_pcur_get_low_match(btr_pcur_t const*))[0x5638fc3619cd]
# 2022-03-04T21:42:44 [257480] | row/row0purge.cc:534(row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*))[0x5638fc34b744]
# 2022-03-04T21:42:44 [257480] | row/row0purge.cc:643(row_purge_remove_sec_if_poss(purge_node_t*, dict_index_t*, dtuple_t const*))[0x5638fc34c137]
# 2022-03-04T21:42:44 [257480] | row/row0purge.cc:714(row_purge_del_mark(purge_node_t*))[0x5638fc34c577]
# 2022-03-04T21:42:44 [257480] | row/row0purge.cc:1150(row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool))[0x5638fc350294]
# 2022-03-04T21:42:44 [257480] | row/row0purge.cc:1211(row_purge(purge_node_t*, unsigned char*, que_thr_t*))[0x5638fc3508e7]
# 2022-03-04T21:42:44 [257480] | row/row0purge.cc:1262(row_purge_step(que_thr_t*))[0x5638fc350cda]
# 2022-03-04T21:42:44 [257480] | que/que0que.cc:653(que_thr_step(que_thr_t*))[0x5638fc2422bf]
# 2022-03-04T21:42:44 [257480] | que/que0que.cc:709(que_run_threads_low(que_thr_t*))[0x5638fc242667]
# 2022-03-04T21:42:44 [257480] | que/que0que.cc:731(que_run_threads(que_thr_t*))[0x5638fc242809]
# 2022-03-04T21:42:44 [257480] | srv/srv0srv.cc:1715(srv_task_execute())[0x5638fc3ceecb]
# 2022-03-04T21:42:44 [257480] | srv/srv0srv.cc:1957(purge_worker_callback(void*))[0x5638fc3cfa7b]
# 2022-03-04T21:42:44 [257480] | tpool/task_group.cc:56(tpool::task_group::execute(tpool::task*))[0x5638fc78e81c]
# 2022-03-04T21:42:44 [257480] | tpool/task.cc:40(tpool::task::execute())[0x5638fc78f0e5]
# 2022-03-04T21:42:44 [257480] | tpool/tpool_generic.cc:547(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0x5638fc77b893]
# 2022-03-04T21:42:44 [257480] | bits/invoke.h:73(void std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(std::__invoke_memfun_deref, void (tpool::thread_pool_generic::*&&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&))[0x5638fc78e27d]
# 2022-03-04T21:42:44 [257480] | bits/invoke.h:96(std::__invoke_result<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>::type std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(void (tpool::thread_pool_generic::*&&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&))[0x5638fc78e028]
# 2022-03-04T21:42:44 [257480] | 9/thread:244(void std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>))[0x5638fc78deb9]
# 2022-03-04T21:42:44 [257480] | 9/thread:247(std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator()())[0x5638fc78de03]
# 2022-03-04T21:42:44 [257480] | 9/thread:195(std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run())[0x5638fc78dd68]
# 2022-03-04T21:42:44 [257480] | ??:0(std::error_code::default_error_condition() const)[0x7f3b986dede4]
# 2022-03-04T21:42:44 [257480] | nptl/pthread_create.c:478(start_thread)[0x7f3b987f9609]
# 2022-03-04T21:42:44 [257480] | ??:0(clone)[0x7f3b983cc293]
Query (0x0): (null)
Connection ID (thread ID): 0
Status: NOT_KILLED
Old:
- [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.*
  This problem sometimes occurs when a too-short semaphore timeout, fat tables and small buffer pools combine. Then it is rather a false alarm.
  But it is noticeably more frequent on the MDEV-24845 tree.
- Mariabackup --backup fails to finish its operation within the timeout set by the RQG component (180s, or 360s if using rr).
  This problem usually occurs extremely rarely. But on the current tree it is very frequent.
https://github.com/MariaDB/server/commit/7cbf809fb18420b47e5536c17506f89b1b78488b
I addressed those review comments that I found relevant. I do not want to remove the forced flushing of the buffer pool and the forced checkpoint, because I would like the joiner to start from a cleaner state. If we just waited for the dirty page count, the InnoDB checkpoint from which recovery on the joiner starts could be old.
Can you please provide your analysis of the previously provided rr replay traces for hangs, and how you addressed them?
Ever since MDEV-23855 and MDEV-27416, it is the dedicated buf_flush_page_cleaner thread that is solely responsible for initiating log checkpoints. A checkpoint can be initiated by SET GLOBAL innodb_max_dirty_pages_pct=0 or by invoking buf_flush_sync(). If writes to the redo log are blocked, a single call of buf_flush_sync() should suffice to clean the buffer pool. However, buf_flush_sync() will not wait for the checkpoint to be updated, because buf_pool.done_flush_list will be broadcast already after the pages have been written out from the buffer pool. If you want the server to start up fast after the snapshot transfer, you will also have to wait for the checkpoint to be updated. The function log_make_checkpoint() would do exactly that.
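From the SQL layer the difference is roughly observable as follows (a sketch; it assumes the Innodb_checkpoint_age status variable is available in the build being tested):
SET GLOBAL innodb_max_dirty_pages_pct=0;     -- wake the page cleaner
-- reaching 0 here only proves that the dirty pages were written out ...
SELECT variable_value FROM information_schema.global_status
 WHERE variable_name = 'INNODB_BUFFER_POOL_PAGES_DIRTY';
-- ... a fast restart on the joiner additionally needs the checkpoint to have been
-- advanced (log_make_checkpoint()), which would show up as a near-zero value here
SELECT variable_value FROM information_schema.global_status
 WHERE variable_name = 'INNODB_CHECKPOINT_AGE';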
I analyzed one hang.
rr replay /data/results/1647535234/Backup_freeze/dev/shm/rqg/1647535234/3/1/rr/latest-trace
The statement BACKUP STAGE BLOCK_COMMIT is blocked. I could not figure out where to set watchpoints in the MDL data structures, but by checking the stack traces shortly before the SIGKILL, I noticed that there is an ALTER TABLE waiting in fts_optimize_remove_table() for fts_optimize_callback() to remove the table from fts_optimize_wq. That will never happen, because the MDL acquisition that you added to fts_optimize_callback() will always fail:
2866 if (thd_get_backup_lock(fts_opt_thd, &mdl)) {
Value returned is $6 = 1
(rr) n
2867 if (n_tables)
(rr)
2868 timer->set_time(5000, 0);
(rr)
2869 return;
Even if the backup lock acquisition fails, I think that fts_optimize_callback() must process those requests that will not result in any writes, that is, anything except fts_optimize_sync_table().
mleich, I pushed a new version that addresses Marko's review comments and the above problem to bb-10.6-MDEV-24845-galera. Can you run the tests again?
I posted some review comments to https://github.com/MariaDB/server/commit/e56f90eaf0faffa0d31f8fd11e2625293f422eff
Most notably, it seems to me that there is a busy loop in dict_stats_func().
Fix for rsync race condition pushed to 10.3. Ref: https://github.com/MariaDB/server/pull/2053
MDEV-28185 needs to be fixed before the variable innodb_disallow_writes can be removed.
jplindst, I see that you already pushed something with this ticket number. I suggest that this ticket be closed and the parameter innodb_disallow_writes be removed in 10.5 or a later release under MDEV-25975, because I do not think that it is easy to refactor the locking around the Galera rsync snapshot transfer in older releases.
Note to self: other things to try later:
1) Add a failed CREATE statement before disallowing writes (ENGINE=none for example)
2) Leave off innodb_fatal_semaphore_wait_threshold