|
angelique.sklavounos, let me ask you to take over, at least to establish whether the test's slowness in the MSAN environment is to blame. The first thing to do is to try a manual timeout increase:
diff --git a/mysql-test/suite/rpl/t/rpl_start_alter_7.test b/mysql-test/suite/rpl/t/rpl_start_alter_7.test
|
index 7225c075ea7..79acf1f2c69 100644
|
--- a/mysql-test/suite/rpl/t/rpl_start_alter_7.test
|
+++ b/mysql-test/suite/rpl/t/rpl_start_alter_7.test
|
@@ -79,9 +79,11 @@ set default_master_connection = 'm2';
|
|
set default_master_connection = 'm1';
|
--let $master_pos= $master_pos_1
|
+--let $slave_timeout= 600
|
--source include/sync_with_master_gtid.inc
|
set default_master_connection = 'm2';
|
--let $master_pos= $master_pos_2
|
+--let $slave_timeout= 600
|
--source include/sync_with_master_gtid.inc
|
|
--echo # cleanup
|
and if that helps (as expected), go on to improve include/sync_with_master_gtid.inc, which has always had
a valgrind branch that now needs to be generalized to cover MSAN as well.
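A sketch of what the generalized branch might look like. This is hypothetical: I have not checked the exact variable names the include uses; `$VALGRIND_TEST` is the conventional mtr environment flag, and an analogous `$MSAN_TEST` flag is assumed to exist or be added:

```
# Hypothetical generalization inside include/sync_with_master_gtid.inc:
# bump the default timeout under any slow instrumented build, not just valgrind.
if (!$slave_timeout)
{
  --let $slave_timeout= 300
  if ($VALGRIND_TEST)
  {
    --let $slave_timeout= 1200
  }
  # Assumed: an analogous flag exported for MSAN builds
  if ($MSAN_TEST)
  {
    --let $slave_timeout= 1200
  }
}
```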
|
|
The fail has not been seen on the newer amd64-ubuntu-2004-msan builder, but is still occasionally seen on kvm-ubsan, amd64-ubuntu-1804-valgrind, kvm-asan, and kvm-fulltest2.
|
|
The replica error log shows a duplicate key error (mysqld.3.err):
2022-11-24 4:29:00 31 [Note] Master 'm1': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mysqld-relay-bin-m1.000001' position: 4; GTID position '12-2-2'
|
2022-11-24 4:29:02 24 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '11-605' for key 'PRIMARY', Gtid 11-1-281, Internal MariaDB error code: 1942
|
2022-11-24 4:29:02 24 [ERROR] Slave (additional info): Duplicate entry '11-605' for key 'PRIMARY' Error_code: 1062
|
2022-11-24 4:29:02 24 [Warning] Slave: Duplicate entry '11-605' for key 'PRIMARY' Error_code: 1062
|
2022-11-24 4:29:02 24 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 49198; GTID position '11-1-280,12-2-334'
|
2022-11-24 4:29:02 31 [Note] Master 'm1': Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 49198; GTID position '11-1-280,12-2-334', master: 127.0.0.1:16040
|
Error logs attached as logs.tar.gz.
|
11.0 81c1f526a
|
rpl.rpl_start_alter_mysqlbinlog_2 'innodb' w4 [ fail ]
|
Test ended at 2022-11-24 04:31:01
|
|
CURRENT_TEST: rpl.rpl_start_alter_mysqlbinlog_2
|
mysqltest: In included file "./include/sync_with_master_gtid.inc":
|
included from /mnt/buildbot/build/mariadb-11.0.1/mysql-test/suite/rpl/t/rpl_start_alter_mysqlbinlog_2.test at line 117:
|
At line 48: Failed to sync with master
|
|
The result from queries just before the failure was:
|
< snip >
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
connection server_2;
|
DROP DATABASE s2;
|
connection server_3;
|
START ALL SLAVES;
|
Warnings:
|
Note 1937 SLAVE 'm2' started
|
Note 1937 SLAVE 'm1' started
|
SET default_master_connection = 'm1';
|
include/wait_for_slave_to_start.inc
|
SET default_master_connection = 'm2';
|
include/wait_for_slave_to_start.inc
|
SET default_master_connection = 'm1';
|
include/sync_with_master_gtid.inc
|
Timeout in master_gtid_wait('11-1-412', 120), current slave GTID position is: 11-1-280,12-2-412.
|
|
|
https://buildbot.mariadb.net/buildbot/builders/kvm-rpm-centos74-amd64-debug/builds/10374
|
11.0 d5d7c8ba9
|
rpl.rpl_start_alter_4 'innodb,stmt' w1 [ fail ]
|
Test ended at 2023-02-21 15:45:30
|
|
CURRENT_TEST: rpl.rpl_start_alter_4
|
mysqltest: In included file "./include/sync_with_master_gtid.inc":
|
included from /usr/share/mysql-test/suite/rpl/include/start_alter_concurrent.inc at line 193:
|
included from /usr/share/mysql-test/suite/rpl/include/start_alter_include.inc at line 55:
|
included from /usr/share/mysql-test/suite/rpl/t/rpl_start_alter_4.test at line 37:
|
At line 48: Failed to sync with master
|
|
The result from queries just before the failure was:
|
< snip >
|
include/sync_with_master_gtid.inc
|
# Concurrent DML
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
# Rollback tests
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
Timeout in master_gtid_wait('0-1-36,11-1-135,12-1-137', 120), current slave GTID position is: 0-1-36,11-1-130,12-1-124.
|
https://buildbot.mariadb.net/buildbot/builders/kvm-rpm-centos74-amd64-debug/builds/10374/steps/mtr/logs/mysqld.2.err.1
2023-02-21 15:43:28 41 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '12-701' for key 'PRIMARY', Gtid 12-1-125, Internal MariaDB error code: 1942
|
2023-02-21 15:43:28 41 [ERROR] Slave (additional info): Duplicate entry '12-701' for key 'PRIMARY' Error_code: 1062
|
2023-02-21 15:43:28 41 [Warning] Slave: Duplicate entry '12-701' for key 'PRIMARY' Error_code: 1062
|
2023-02-21 15:43:28 41 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 52213; GTID position '0-1-36,11-1-130,12-1-124'
|
|
|
https://buildbot.mariadb.org/#/builders/168/builds/21371
|
bb-11.2-midenok-MDEV-27180 86c327775
|
rpl.rpl_start_alter_3 'innodb,mix' w7 [ fail ]
|
Test ended at 2023-05-18 11:04:58
|
|
CURRENT_TEST: rpl.rpl_start_alter_3
|
mysqltest: In included file "./include/sync_with_master_gtid.inc":
|
included from /buildbot/amd64-ubuntu-1804-clang10-asan/build/mysql-test/suite/rpl/include/start_alter_concurrent.inc at line 193:
|
included from /buildbot/amd64-ubuntu-1804-clang10-asan/build/mysql-test/suite/rpl/include/start_alter_include.inc at line 55:
|
included from /buildbot/amd64-ubuntu-1804-clang10-asan/build/mysql-test/suite/rpl/t/rpl_start_alter_3.test at line 38:
|
At line 48: Failed to sync with master
|
|
The result from queries just before the failure was:
|
< snip >
|
include/sync_with_master_gtid.inc
|
# Concurrent DML
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
# Rollback tests
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
Timeout in master_gtid_wait('0-1-308', 120), current slave GTID position is: 0-1-281.
|
https://ci.mariadb.org/35116/logs/amd64-ubuntu-1804-clang10-asan/mysqld.2.err.7
2023-05-18 11:02:56 13 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-282' for key 'PRIMARY', Gtid 0-1-282, Internal MariaDB error code: 1942
|
2023-05-18 11:02:56 13 [ERROR] Slave (additional info): Duplicate entry '0-282' for key 'PRIMARY' Error_code: 1062
|
2023-05-18 11:02:56 13 [Warning] Slave: Duplicate entry '0-282' for key 'PRIMARY' Error_code: 1062
|
2023-05-18 11:02:56 13 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 50683; GTID position '0-1-281'
|
|
|
https://buildbot.mariadb.org/#/builders/537/builds/6057
|
refs/pull/2570/merge 11d7b1c61
|
rpl.rpl_start_alter_6 'innodb,row' w31 [ fail ]
|
Test ended at 2023-05-23 01:08:10
|
|
CURRENT_TEST: rpl.rpl_start_alter_6
|
mysqltest: In included file "./include/sync_with_master_gtid.inc":
|
included from /home/buildbot/ppc64le-ubuntu-2004-debug/build/mysql-test/suite/rpl/include/start_alter_concurrent.inc at line 193:
|
included from /home/buildbot/ppc64le-ubuntu-2004-debug/build/mysql-test/suite/rpl/include/start_alter_include.inc at line 55:
|
included from /home/buildbot/ppc64le-ubuntu-2004-debug/build/mysql-test/suite/rpl/t/rpl_start_alter_6.test at line 40:
|
At line 48: Failed to sync with master
|
|
The result from queries just before the failure was:
|
< snip >
|
include/sync_with_master_gtid.inc
|
# Concurrent DML
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
# Rollback tests
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
Timeout in master_gtid_wait('0-1-36,11-1-135,12-1-137', 120), current slave GTID position is: 0-1-36,11-1-133,12-1-126.
|
https://ci.mariadb.org/35225/logs/ppc64le-ubuntu-2004-debug/mysqld.2.err.31
2023-05-23 1:06:09 44 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '12-701' for key 'PRIMARY', Error_code: 1062; the event's master log master-bin.000001, end_log_pos 55999, Gtid 12-1-127, Internal MariaDB error code: 1942
|
2023-05-23 1:06:09 44 [ERROR] Slave (additional info): Duplicate entry '12-701' for key 'PRIMARY' Error_code: 1062
|
2023-05-23 1:06:09 44 [Warning] Slave: Duplicate entry '12-701' for key 'PRIMARY' Error_code: 1062
|
2023-05-23 1:06:09 44 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 57733; GTID position '0-1-36,11-1-133,12-1-126'
|
|
|
MDEV-31120 contains the repeatable test case that produces:
Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-1212750' for key 'PRIMARY'
|
|
|
From monty in MDEV-31120: mtr changing mysql.gtid_slave_pos to Aria is what exposes the problem. With InnoDB, the default user experience, it does not generate errors like those in the attached test.
|
|
The test rpl.rpl_start_alter_6 is occasionally failing like this:
|
10.8 600751e7693dcf6236d3e6b64fa24d19
|
rpl.rpl_start_alter_6 'innodb,stmt' w2 [ fail ]
|
Test ended at 2022-06-02 06:08:49
|
|
CURRENT_TEST: rpl.rpl_start_alter_6
|
mysqltest: In included file "./include/sync_with_master_gtid.inc":
|
included from /home/buildbot/buildbot/build/mariadb-10.8.4/mysql-test/suite/rpl/include/start_alter_concurrent.inc at line 193:
|
included from /home/buildbot/buildbot/build/mariadb-10.8.4/mysql-test/suite/rpl/include/start_alter_include.inc at line 55:
|
included from /home/buildbot/buildbot/build/mariadb-10.8.4/mysql-test/suite/rpl/t/rpl_start_alter_6.test at line 40:
|
At line 48: Failed to sync with master
|
|
The result from queries just before the failure was:
|
< snip >
|
include/sync_with_master_gtid.inc
|
# Concurrent DML
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
# Rollback tests
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
Timeout in master_gtid_wait('0-1-36,11-1-135,12-1-137', 120), current slave GTID position is: 0-1-36,11-1-127,12-1-123.
|
|
|
Here is a more recent failure, from a mandatory staging build of 10.9:
|
10.9 8760fe11b89450d56269ed062b36b4238d45707f
|
rpl.rpl_start_alter_6 'innodb,mix' w21 [ fail ]
|
Test ended at 2023-08-02 09:17:22
|
CURRENT_TEST: rpl.rpl_start_alter_6
|
mysqltest: In included file "./include/sync_with_master_gtid.inc":
|
included from /home/buildbot/amd64-debian-11-msan/build/mysql-test/suite/rpl/include/start_alter_concurrent.inc at line 193:
|
included from /home/buildbot/amd64-debian-11-msan/build/mysql-test/suite/rpl/include/start_alter_include.inc at line 55:
|
included from /home/buildbot/amd64-debian-11-msan/build/mysql-test/suite/rpl/t/rpl_start_alter_6.test at line 40:
|
At line 48: Failed to sync with master
|
The result from queries just before the failure was:
|
< snip >
|
include/sync_with_master_gtid.inc
|
# Concurrent DML
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
# Rollback tests
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
ERROR 23000: Duplicate entry '2' for key 'b'
|
# Sync slave
|
include/save_master_gtid.inc
|
include/sync_with_master_gtid.inc
|
Timeout in master_gtid_wait('0-1-36,11-1-135,12-1-137', 120), current slave GTID position is: 0-1-36,11-1-130,12-1-124.
|
More results from queries before failure can be found in /home/buildbot/amd64-debian-11-msan/build/mysql-test/var/21/log/rpl_start_alter_6.log
|
Judging by mysqld.2.err.21, the replica server was killed while the test execution was still in progress. There is also a var.tar.gz available for download.
|
|
Agree with Daniel, the direct cause of these failures seems to be that mysql.gtid_slave_pos is using engine=Aria, not InnoDB.
Optimistic parallel replication requires InnoDB so that updates of mysql.gtid_slave_pos can be rolled back in case a retry is needed after a deadlock.
I'm not sure exactly what is being retried here. It looks like it must be the START or ROLLBACK ALTER that is being deadlock-killed after it has reached wait_for_prior_commit.
But I think ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB will fix these failures.
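As a sketch, the per-test fix suggested above could look like this. The placement, the connection name, and restoring Aria at cleanup are illustrative assumptions, not the actual patch:

```
# Illustrative placement in an affected rpl_start_alter_*.test file:
--connection server_2
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;

# ... test body ...

# cleanup: restore the engine mtr set up originally
ALTER TABLE mysql.gtid_slave_pos ENGINE=Aria;
```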
|
|
knielsen, thank you. Do you know why mtr is altering the table from InnoDB to Aria in the first place? The SQL bootstrap script initially creates the table in InnoDB.
If there are any non-InnoDB replication tests, maybe only those tests should use a non-InnoDB engine for the table?
|
|
marko, no, I don't know. I believe the database installation script creates the table in InnoDB if that engine is available, and MyISAM (or these days maybe Aria) if not. Maybe the mtr behavior is historical, or maybe there was some reason for it.
Before blindly pushing the mysql.gtid_slave_pos engine change for these test cases, though, I want to understand why these DDL statements are being rolled back for retry, and whether it is correct/safe to do so, so that I do not hide a bug in the START ALTER code.
|
|
elenst, maybe you can shed some light on this:
> Do you know why mtr is altering the table from InnoDB to Aria in the first place?
Thanks anyway!
|
|
My git blame is as good as anyone's:
commit e22c3810f059e4f6e3ec52f09d35486e0ff80fb6
|
Author: Sergei Golubchik
|
Date: Thu Jun 5 09:04:43 2014 +0200
|
|
MDEV-6243 mysql_install_db or mysql_upgrade fails when default_engine=archive
|
|
don't use the default storage engine for mysql.gtid_slave_pos, prefer innodb.
|
but alter it to myisam in mtr, because many tests run without innodb.
|
And naturally MyISAM was later switched to Aria along with the system-table engine change, in one of Monty's commits.
|
|
I guess mtr uses the same mysql_install_db template for all tests, regardless of restart options (I didn't check)? Then we cannot have InnoDB tables in there. Maybe we could change it in include/have_innodb.inc.
Anyway, I debugged the rpl_start_alter_6 failure. It turns out the replication deadlock kill+retry is caused by persistent statistics inside InnoDB, dict_stats_save(). This function creates an internal transaction that gets assigned the user query's THD. The function then goes on to take table(?) locks on the dict tables. When these locks conflict, they cause parallel replication to deadlock-kill the later transaction, because it has the replication THD assigned. I think this is a redundant/false-alarm retry, since the dict system transaction is internal and should not be able to cause a deadlock.
I'm wondering if it is correct to assign the query/replication THD to the separate InnoDB-internal trx that updates the dict tables. In this case, it causes unnecessary and unexpected transaction rollback and retry in parallel replication.
On the other hand, maybe having a NULL trx->mysql_thd would cause problems in other places. Spurious rollback+retry in parallel replication is not a fatal problem (it is handled on the upper layer), but it would still be preferable to avoid if possible. A user might have carefully arranged for no conflicts to be possible between replicated transactions, and then be surprised / experience problems when such conflicts occur from persistent-statistics updates.
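The include/have_innodb.inc idea mentioned above could be sketched roughly like this. This is purely hypothetical; whether an unconditional alter is cheap and safe enough to run in every InnoDB test was not verified:

```
# Hypothetical addition to include/have_innodb.inc:
# mtr's install template alters mysql.gtid_slave_pos away from InnoDB
# because many tests run without that engine; switch it back for tests
# that declare they need InnoDB.
--disable_query_log
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
--enable_query_log
```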
|
|
Starting with MDEV-16678, InnoDB pretty much requires a valid trx_t::mysql_thd. The assignment of THD is outside the control of InnoDB, other than that there is some preallocation of purge-related THD objects related to MDEV-16264 and MDEV-11024. But those cannot be used by dict_stats_save().
|
|
Thanks, Marko. I'll change the tests to use InnoDB for the mysql.gtid_slave_pos table.
For now, I don't think we need to do anything else. Some transaction rollback+retry is expected in in-order parallel replication, and presumably conflicts due to dict stats update will be rare. Let's just keep in mind that this can be a source of otherwise unexpected conflicts in parallel replication, in case we see it in other tests or user reports.
|
|
Fix pushed to 10.11. A bunch of rpl_start_alter_*.test files now change the mysql.gtid_slave_pos table to use InnoDB.
|