[MDEV-30214] rpl.rpl_manual_change_index_file occasionally fails in BB with Result length mismatch - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.10(EOL), 10.11, 11.0(EOL)
Fix Version/s: 10.4.31, 10.5.22, 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2
Component/s: Replication, Tests
Labels:
None

Description

https://buildbot.mariadb.org/#/builders/208/builds/9538

10.11 12786f0e7
rpl.rpl_manual_change_index_file 'row' w30 [ fail ]
Test ended at 2022-12-12 10:53:42

CURRENT_TEST: rpl.rpl_manual_change_index_file
--- /home/buildbot/amd64-ubuntu-1804-bigtest/build/mysql-test/suite/rpl/r/rpl_manual_change_index_file.result 2022-12-12 07:52:58.000000000 +0000
+++ /home/buildbot/amd64-ubuntu-1804-bigtest/build/mysql-test/suite/rpl/r/rpl_manual_change_index_file.reject 2022-12-12 10:53:42.390734033 +0000
@@ -13,7 +13,7 @@
call mtr.add_suppression('Got fatal error 1236 from master when reading data from binary log: .*could not find next log');
connection slave;
include/wait_for_slave_io_error.inc [errno=1236]
-Last_IO_Error = 'Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'master-bin.000001' at XXX, the last event read from 'master-bin.000002' at XXX, the last byte read from 'master-bin.000002' at XXX.''
+Last_IO_Error = 'Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event '.' at XXX, the last event read from 'master-bin.000002' at XXX, the last byte read from 'master-bin.000002' at XXX.''
connection master;
CREATE TABLE t2(c1 INT);
FLUSH LOGS;

mysqltest: Result length mismatch

Error logs and var dir attached (mysqld.x.err.30, var_30214.tar.gz).
Error logs for passing equivalent test in mysqld.x.15. (https://buildbot.mariadb.org/#/builders/208/builds/12052/)

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

mysqld.1.err.15
445 kB
2023-05-02 19:15
mysqld.1.err.30
458 kB
2022-12-12 17:41
mysqld.2.err.15
558 kB
2023-05-02 19:15
mysqld.2.err.30
650 kB
2022-12-12 17:41
var_30214.tar.gz
1.66 MB
2022-12-12 17:41

Activity

Ascending order - Click to sort in descending order

Angelique Sklavounos (Inactive) added a comment - 2023-05-02 19:36

The test appears to always be retry-pass. The result file expects the error message to contain ‘master-bin.000001’, but instead the error log shows the IO thread with an empty string or 'FIRST' for the log file:

2022-12-12 10:53:42 241 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16540' in log '' at position 4

2022-12-12 10:53:42 242 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4

2022-12-12 10:53:42 241 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16540',replication started in log 'FIRST' at position 4

2022-12-12 10:53:42 241 [ERROR] Error reading packet from server: could not find next log; the first event '.' at 4, the last event read from 'master-bin.000002' at 514, the last byte read from 'master-bin.000002' at 562. (server_errno=1236)

This failure was first seen with: https://buildbot.mariadb.org/#/builders/121/builds/6504 and is only seen on 10.10 and higher.

The error log for the failure shows (attached mysqld.2.err.30) that the CHANGE MASTER statement at the start of the test does not contain the filename for the New state:

2022-12-12 10:53:42 239 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16540', master_log_file='master-bin.000001', master_log_pos='329'. New state master_host='127.0.0.1', master_port='16540', master_log_file='', master_log_pos='4'.

2022-12-12 10:53:42 239 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No

before this, when the slave is stopped so CM can be changed to USING_GTID=NO, the IO thread knows of the master log file but not the SQL thread (let’s call this Case A):

2022-12-12 10:53:42 235 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos

2022-12-12 10:53:42 237 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16540' in log '' at position 4

2022-12-12 10:53:42 237 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16540',replication starts at GTID position ''

2022-12-12 10:53:42 238 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave-relay-bin.000001' position: 4; GTID position ''

2022-12-12 10:53:42 238 [Note] Slave SQL thread exiting, replication stopped in log 'FIRST' at position 4; GTID position '', master: 127.0.0.1:16540

2022-12-12 10:53:42 237 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16540

while for the passing test in another build (https://buildbot.mariadb.org/#/builders/208) the error log shows that the CHANGE MASTER statement contains the master_log_file filename:

2023-04-14 16:13:00 239 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16240', master_log_file='master-bin.000001', master_log_pos='329'. New state master_host='127.0.0.1', master_port='16240', master_log_file='master-bin.000001', master_log_pos='329'.

2023-04-14 16:13:00 239 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No

and both IO and SQL threads know of the log file:

2023-04-14 16:13:00 235 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos

2023-04-14 16:13:00 237 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16240' in log '' at position 4

2023-04-14 16:13:00 238 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave-relay-bin.000001' position: 4; GTID position ''

2023-04-14 16:13:00 237 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16240',replication starts at GTID position ''

2023-04-14 16:13:00 238 [Note] Error reading relay log event: slave SQL thread was killed

2023-04-14 16:13:00 238 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 329; GTID position '', master: 127.0.0.1:16240

2023-04-14 16:13:00 237 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16240

Notably, the error log for the failure does NOT show the message “Error reading relay log event: slave SQL thread was killed”

Elkin I am unsure if there is a timing bug with Case A to investigate. Otherwise, the expected error message can simply be made more general so that it would accept either ‘.’ or ‘master-bin.000001’, or the "USING_GTID=NO" could be removed as the 1236 error is still achievable when the index file is manually changed.

Angelique Sklavounos (Inactive) added a comment - 2023-05-02 19:36 The test appears to always be retry-pass. The result file expects the error message to contain ‘master-bin.000001’, but instead the error log shows the IO thread with an empty string or 'FIRST' for the log file: 2022-12-12 10:53:42 241 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16540' in log '' at position 4 2022-12-12 10:53:42 242 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4 2022-12-12 10:53:42 241 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16540',replication started in log 'FIRST' at position 4 2022-12-12 10:53:42 241 [ERROR] Error reading packet from server: could not find next log; the first event '.' at 4, the last event read from 'master-bin.000002' at 514, the last byte read from 'master-bin.000002' at 562. (server_errno=1236) This failure was first seen with: https://buildbot.mariadb.org/#/builders/121/builds/6504 and is only seen on 10.10 and higher. The error log for the failure shows (attached mysqld.2.err.30) that the CHANGE MASTER statement at the start of the test does not contain the filename for the New state: 2022-12-12 10:53:42 239 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16540', master_log_file='master-bin.000001', master_log_pos='329'. New state master_host='127.0.0.1', master_port='16540', master_log_file='', master_log_pos='4'. 2022-12-12 10:53:42 239 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No before this, when the slave is stopped so CM can be changed to USING_GTID=NO, the IO thread knows of the master log file but not the SQL thread (let’s call this Case A): 2022-12-12 10:53:42 235 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos 2022-12-12 10:53:42 237 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16540' in log '' at position 4 2022-12-12 10:53:42 237 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16540',replication starts at GTID position '' 2022-12-12 10:53:42 238 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave-relay-bin.000001' position: 4; GTID position '' 2022-12-12 10:53:42 238 [Note] Slave SQL thread exiting, replication stopped in log 'FIRST' at position 4; GTID position '', master: 127.0.0.1:16540 2022-12-12 10:53:42 237 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16540 while for the passing test in another build ( https://buildbot.mariadb.org/#/builders/208 ) the error log shows that the CHANGE MASTER statement contains the master_log_file filename: 2023-04-14 16:13:00 239 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16240', master_log_file='master-bin.000001', master_log_pos='329'. New state master_host='127.0.0.1', master_port='16240', master_log_file='master-bin.000001', master_log_pos='329'. 2023-04-14 16:13:00 239 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No and both IO and SQL threads know of the log file: 2023-04-14 16:13:00 235 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos 2023-04-14 16:13:00 237 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16240' in log '' at position 4 2023-04-14 16:13:00 238 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave-relay-bin.000001' position: 4; GTID position '' 2023-04-14 16:13:00 237 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16240',replication starts at GTID position '' 2023-04-14 16:13:00 238 [Note] Error reading relay log event: slave SQL thread was killed 2023-04-14 16:13:00 238 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 329; GTID position '', master: 127.0.0.1:16240 2023-04-14 16:13:00 237 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16240 Notably, the error log for the failure does NOT show the message “Error reading relay log event: slave SQL thread was killed” Elkin I am unsure if there is a timing bug with Case A to investigate. Otherwise, the expected error message can simply be made more general so that it would accept either ‘.’ or ‘master-bin.000001’, or the "USING_GTID=NO" could be removed as the 1236 error is still achievable when the index file is manually changed.

Andrei Elkin added a comment - 2023-05-03 08:47 - edited

angelique.sklavounos, I agree with making 'the expected error message ... more general'. After all the test verifies what belongs to
a class of unspecified behavior (thankfully a way was found to avoid that endless looping).

If you have it reproducible with rr we could gain some better understanding. But no need to persevere.

Andrei Elkin added a comment - 2023-05-03 08:47 - edited angelique.sklavounos , I agree with making 'the expected error message ... more general'. After all the test verifies what belongs to a class of unspecified behavior (thankfully a way was found to avoid that endless looping). If you have it reproducible with rr we could gain some better understanding. But no need to persevere.

MariaDB Server

rpl.rpl_manual_change_index_file occasionally fails in BB with Result length mismatch

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration