
MDEV-27849: rpl.rpl_start_alter_7 (and 8, mysqlbinlog_2) fail in buildbot, [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos

Details

    Description

      https://buildbot.mariadb.org/#/builders/195/builds/4566/steps/7/logs/stdio
      https://buildbot.mariadb.org/#/builders/195/builds/4419/steps/7/logs/stdio

      Occurs for rpl.rpl_start_alter_7, rpl.rpl_start_alter_8, rpl.rpl_start_alter_mysqlbinlog_2, rpl.rpl_start_alter_4, rpl.rpl_start_alter_3, rpl.rpl_start_alter_6, rpl.rpl_start_alter_5

      rpl.rpl_start_alter_7 'innodb'           w1 [ fail ]
              Test ended at 2022-02-15 02:25:31
      CURRENT_TEST: rpl.rpl_start_alter_7
      mysqltest: In included file "./include/sync_with_master_gtid.inc": 
      included from /buildbot/amd64-ubuntu-1804-msan/build/mysql-test/suite/rpl/t/rpl_start_alter_7.test at line 83:
      At line 48: Failed to sync with master
      The result from queries just before the failure was:
      < snip >
      ERROR 23000: Duplicate entry '2' for key 'b'
      ERROR 23000: Duplicate entry '2' for key 'b'
      ERROR 23000: Duplicate entry '2' for key 'b'
      connection server_2;
      drop database s2;
      select @@gtid_binlog_pos;
      @@gtid_binlog_pos
      12-2-412
      connection server_3;
      start all slaves;
      Warnings:
      Note	1937	SLAVE 'm2' started
      Note	1937	SLAVE 'm1' started
      set default_master_connection = 'm1';
      include/wait_for_slave_to_start.inc
      set default_master_connection = 'm2';
      include/wait_for_slave_to_start.inc
      set default_master_connection = 'm1';
      include/sync_with_master_gtid.inc
      Timeout in master_gtid_wait('11-1-412', 120), current slave GTID position is: 11-1-291,12-2-412.
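
      For reference, include/sync_with_master_gtid.inc essentially waits on the built-in MASTER_GTID_WAIT() function on the replica. A minimal sketch of the check that times out above, using the values from this log (simplified; the real include file also propagates the master's GTID position and reports errors):

        # On the replica (server_3 in this test):
        # master_gtid_wait(gtid_pos, timeout_in_seconds) returns 0 once the
        # replica's GTID position covers gtid_pos, or -1 on timeout.
        SELECT master_gtid_wait('11-1-412', 120);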
      

      This was not checked against the original failures, but the replica error log in the multi-master setup shows:

      2022-11-24  4:29:00 31 [Note] Master 'm1': Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mysqld-relay-bin-m1.000001' position: 4; GTID position '12-2-2'
      2022-11-24  4:29:02 24 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '11-605' for key 'PRIMARY', Gtid 11-1-281, Internal MariaDB error code: 1942
      2022-11-24  4:29:02 24 [ERROR] Slave (additional info): Duplicate entry '11-605' for key 'PRIMARY' Error_code: 1062
      2022-11-24  4:29:02 24 [Warning] Slave: Duplicate entry '11-605' for key 'PRIMARY' Error_code: 1062
      2022-11-24  4:29:02 24 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 49198; GTID position '11-1-280,12-2-334'
      2022-11-24  4:29:02 31 [Note] Master 'm1': Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 49198; GTID position '11-1-280,12-2-334', master: 127.0.0.1:16040
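
      The duplicate key value '11-605' in the errors above is against mysql.gtid_slave_pos's composite primary key (domain_id, sub_id), i.e. domain_id=11, sub_id=605. For reference, the standard definition of the system table (columns abbreviated; details may vary across versions):

        SHOW CREATE TABLE mysql.gtid_slave_pos;
        # CREATE TABLE gtid_slave_pos (
        #   domain_id INT UNSIGNED NOT NULL,
        #   sub_id    BIGINT UNSIGNED NOT NULL,
        #   server_id INT UNSIGNED NOT NULL,
        #   seq_no    BIGINT UNSIGNED NOT NULL,
        #   PRIMARY KEY (domain_id, sub_id)
        # )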
      

      Attachments

        1. logs.tar.gz
          1.13 MB
          Angelique Sklavounos
        2. MDEV-27849.cnf
          0.7 kB
          Daniel Black
        3. MDEV-27849.test
          2 kB
          Daniel Black


          Activity

            elenst Elena Stepanova added a comment - - edited

            My git blame is as good as anyone's:

            commit e22c3810f059e4f6e3ec52f09d35486e0ff80fb6
            Author: Sergei Golubchik
            Date:   Thu Jun 5 09:04:43 2014 +0200
             
                MDEV-6243 mysql_install_db or mysql_upgrade fails when default_engine=archive
                
                don't use the default storage engine for mysql.gtid_slave_pos, prefer innodb.
                but alter it to myisam in mtr, because many tests run without innodb.
            

            And naturally MyISAM was later switched to Aria along with the system-table engine change, in one of Monty's commits.
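
            For context, the engine switch referred to above is a plain ALTER of the system table, roughly along these lines (a sketch; the exact statements in the mtr bootstrap scripts may differ):

              ALTER TABLE mysql.gtid_slave_pos ENGINE=myisam;
              # and after the later system-table engine change:
              ALTER TABLE mysql.gtid_slave_pos ENGINE=Aria;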

            knielsen Kristian Nielsen added a comment - - edited

            I guess mtr uses the same mysql_install_db template for all tests, regardless of restart options (I didn't check)? If so, we cannot have InnoDB tables in there. Maybe we could change it in include/have_innodb.inc.

            Anyway, I debugged the rpl_start_alter_6 failure. It turns out the replication deadlock kill+retry is caused by persistent statistics inside InnoDB, dict_stats_save(). This function creates an internal transaction that gets assigned the user-query's THD. The function then goes on to take table(?) locks on the dict tables. When these locks conflict, they cause parallel replication to deadlock kill the later transaction because it has the replication THD assigned. I think this is a redundant/false-alarm retry, since the dict system transaction is internal and should not be able to cause a deadlock.

            I'm wondering if it's correct to assign the query/replication THD to the separate InnoDB-internal trx that updates the dict tables. In this case, it causes unnecessary and unexpected transaction rollback and retry in parallel replication.

            On the other hand, maybe having a NULL trx->mysql_thd would cause problems in other places. It's not a fatal problem to get a spurious rollback+retry in parallel replication (it's handled on the upper layer), but it would still be preferable to avoid if possible. The user might have carefully arranged for no conflicts to be possible between replicated transactions, and then be surprised / experience problems when such conflicts occur from persistent stats updates.
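
            If such stats-induced conflicts ever need to be kept out of a test, one workaround sketch (not the fix that was applied here) is to disable InnoDB persistent statistics, either server-wide or per table:

              # Server-wide (affects all InnoDB tables):
              SET GLOBAL innodb_stats_persistent = OFF;
              # Or per table at creation time (t1 is a hypothetical example):
              CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB STATS_PERSISTENT=0;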


            marko Marko Mäkelä added a comment

            Starting with MDEV-16678, InnoDB pretty much requires a valid trx_t::mysql_thd. The assignment of THD is outside the control of InnoDB, other than that there is some preallocation of purge-related THD objects related to MDEV-16264 and MDEV-11024. But those cannot be used by dict_stats_save().

            knielsen Kristian Nielsen added a comment

            Thanks, Marko. I'll change the tests to use InnoDB for the mysql.gtid_slave_pos table.

            For now, I don't think we need to do anything else. Some transaction rollback+retry is expected in in-order parallel replication, and presumably conflicts due to dict stats updates will be rare. Let's just keep in mind that this can be a source of otherwise unexpected conflicts in parallel replication, in case we see it in other tests or user reports.

            knielsen Kristian Nielsen added a comment

            Fix pushed to 10.11. A number of the rpl_start_alter_*.test files now change the mysql.gtid_slave_pos table to use InnoDB.
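
            The per-test change described above amounts to something like the following mysqltest snippet (a sketch of the idea; the pushed patch may differ in detail):

              --disable_query_log
              ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
              --enable_query_log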

            People

              knielsen Kristian Nielsen
              angelique.sklavounos Angelique Sklavounos (Inactive)