[MDEV-30232] rpl.rpl_gtid_crash fails sporadically in BB with Timeout wait for SQL thread to catch up with IO thread - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.4(EOL), 10.5, 10.6, 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL)
Fix Version/s: 11.4.1, 10.5.25, 10.11.8, 10.6.18, 11.0.6, 11.1.5, 11.2.4, 11.4.2, 11.5.1
Component/s: Replication, Tests
Labels:
None

Description

https://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/34174

10.8 e8a2a70cf
rpl.rpl_gtid_crash 'innodb,row' w4 [ fail ]
Test ended at 2022-03-08 06:14:34

CURRENT_TEST: rpl.rpl_gtid_crash
mysqltest: At line 367: Timeout wait for SQL thread to catch up with IO thread

The result from queries just before the failure was:
< snip >
call mtr.add_suppression("Unexpected change of master binlog file name in the middle of GTID");
set sql_log_bin= 1;
connection server_1;
SET GLOBAL debug_dbug="+d,inject_error_writing_xid";
BEGIN;
INSERT INTO t1 VALUES (11);
COMMIT;
ERROR HY000: Error writing file 'master-bin' (errno: 28 "No space left on device")
SET GLOBAL debug_dbug="+d,crash_dispatch_command_before";
COMMIT;
Got one of the listed errors
SELECT @@GLOBAL.server_id;
@@GLOBAL.server_id
3
SELECT * from t1 WHERE a > 10 ORDER BY a;
a
gtid_check
Binlog pos ok
# Wait 30 seconds for SQL thread to catch up with IO thread
connection server_2;

Also seen to have failed at the second wait for the SQL thread:

10.6 c4ce012e4
rpl.rpl_gtid_crash 'innodb,row' w4 [ fail ]
Test ended at 2022-11-07 19:05:14

CURRENT_TEST: rpl.rpl_gtid_crash
mysqltest: At line 464: Timeout wait for SQL thread to catch up with IO thread

The result from queries just before the failure was:
< snip >
BEGIN;
INSERT INTO t1 VALUES (21);
COMMIT;
ERROR HY000: Error writing file 'master-bin' (errno: 28 "No space left on device")
SET GLOBAL debug_dbug="+d,crash_dispatch_command_before";
COMMIT;
Got one of the listed errors
SELECT @@GLOBAL.server_id;
@@GLOBAL.server_id
1
SELECT * from t1 WHERE a > 10 ORDER BY a;
a
13
14
gtid_check
Binlog pos ok
gtid_check
Current pos ok
# Wait 30 seconds for SQL thread to catch up with IO thread
connection server_2;

Attachments

Issue Links

is duplicated by

MDEV-30231 rpl.rpl_gtid_crash fails sporadically in BB with Timeout wait for IO thread to connect and SQL thread to catch up with IO thread

Closed

relates to

MDEV-36311 Sporadic 120 seconds connect timeouts, test suite failures

Open

Activity

Ascending order - Click to sort in descending order

Andrei Elkin added a comment - 2023-11-22 15:48

A warning needs suppression:

https://buildbot.mariadb.org/#/builders/572/builds/4473/steps/8/logs/stdio

rpl.rpl_gtid_crash 'innodb,row' w14 [ fail ] Found warnings/errors in server log file!
Test ended at 2023-11-22 14:23:59
line
2023-11-22 14:23:50 5 [Warning] Slave I/O: SET @master_heartbeat_period to master failed with error: Lost connection to server during query, Internal MariaDB error code: 2013
^ Found warnings in /home/buildbot/amd64-debian-11-msan/build/mysql-test/var/14/log/mysqld.2.err

According to the slave side error log it may happened.

Andrei Elkin added a comment - 2023-11-22 15:48 A warning needs suppression: https://buildbot.mariadb.org/#/builders/572/builds/4473/steps/8/logs/stdio rpl.rpl_gtid_crash 'innodb,row' w14 [ fail ] Found warnings/errors in server log file! Test ended at 2023-11-22 14:23:59 line 2023-11-22 14:23:50 5 [Warning] Slave I/O: SET @master_heartbeat_period to master failed with error: Lost connection to server during query, Internal MariaDB error code: 2013 ^ Found warnings in /home/buildbot/amd64-debian-11-msan/build/mysql-test/var/14/log/mysqld.2.err According to the slave side error log it may happened.

Kristian Nielsen added a comment - 2024-03-20 15:10

The root cause of this appears to be a problem in the linux kernel networking code:

https://lore.kernel.org/netdev/87sf0ldk41.fsf@urd.knielsen-hq.org/T/#u

When the race is hit, in the test case rpl.rpl_gtid_crash, the slave gets a successful socket reconnection to the (old, crashed) master, but then times out in poll() during the connection setup because of the bug/race in the kernel. Since the connect timeout happens to be longer than the 60 seconds timeout coded in the test case, this ends up as a test case timeout failure.

Given that the root cause is a problem in the OS, a work-around could be implemented in the test to simply reduce the connection timeout so that the test case will complete even when it gets a stale reconnect.

Kristian Nielsen added a comment - 2024-03-20 15:10 The root cause of this appears to be a problem in the linux kernel networking code: https://lore.kernel.org/netdev/87sf0ldk41.fsf@urd.knielsen-hq.org/T/#u When the race is hit, in the test case rpl.rpl_gtid_crash, the slave gets a successful socket reconnection to the (old, crashed) master, but then times out in poll() during the connection setup because of the bug/race in the kernel. Since the connect timeout happens to be longer than the 60 seconds timeout coded in the test case, this ends up as a test case timeout failure. Given that the root cause is a problem in the OS, a work-around could be implemented in the test to simply reduce the connection timeout so that the test case will complete even when it gets a stale reconnect.

Kristian Nielsen added a comment - 2024-04-15 17:47

I think it is --slave-net-timeout that gets triggered by the Linux kernel bug and needs to be reduced (from the default 120 currently).

The failure is quite tricky to reproduce, but does occur relatively frequently in buildbot.

Need to decide if the value should be reduced only in this rpl.rpl_gtid_crash test case, or should be reduced generally for all tests, in case others are affected by this as well.

Kristian Nielsen added a comment - 2024-04-15 17:47 I think it is --slave-net-timeout that gets triggered by the Linux kernel bug and needs to be reduced (from the default 120 currently). The failure is quite tricky to reproduce, but does occur relatively frequently in buildbot. Need to decide if the value should be reduced only in this rpl.rpl_gtid_crash test case, or should be reduced generally for all tests, in case others are affected by this as well.

Kristian Nielsen added a comment - 2024-04-16 11:29

Fix for this in st-10.5-knielsen, but I'm prevented from pushing by broken branch protection.
Feel free to pull if desired.

Kristian Nielsen added a comment - 2024-04-16 11:29 Fix for this in st-10.5-knielsen, but I'm prevented from pushing by broken branch protection. Feel free to pull if desired.

Kristian Nielsen added a comment - 2024-04-20 12:05

Pushed to 10.5.

Kristian Nielsen added a comment - 2024-04-20 12:05 Pushed to 10.5.

People

Assignee:: Kristian Nielsen

Reporter:: Angelique Sklavounos (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2022-12-14 19:44

Updated:: 2025-03-17 18:21

Resolved:: 2024-04-20 12:05

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server