[MDEV-29816] rpl.rpl_parallel_29322 occasionally fails in BB with [ERROR] I/O error reading the header from the binary log, errno=175, io cache code=0
Created: 2022-10-18  Updated: 2023-12-11  Resolved: 2023-12-11

Status: Closed
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0
Fix Version/s: 10.5.24, 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3, 11.3.2

Type: Bug Priority: Major
Reporter: Angelique Sklavounos (Inactive) Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: None


 Description   

10.6 44fd2c4b2

rpl.rpl_parallel_29322 'mix'             w3 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2022-09-20 17:13:53
line
2022-09-20 17:13:52 12 [ERROR] I/O error reading the header from the binary log, errno=175, io cache code=0
^ Found warnings in /dev/shm/var/3/log/mysqld.1.err
ok



 Comments   
Comment by Kristian Nielsen [ 2023-11-23 ]

The root cause appears to be as follows:

  • In rare cases, the dump thread on the master survives for some time after STOP SLAVE on the slave.
  • The test case removes the old master-bin.000002, then copies in a new one.
  • If the old dump thread reads master-bin.000002 just after it is created, while it is still of size 0, we get this error in the log.
  • The test is otherwise unaffected, because the slave connection to the old dump thread is already closed at this point.
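
The race described in the bullets above can be illustrated with a minimal Python sketch. This is not the server's code: `read_binlog_magic` and `simulate_race` are hypothetical helpers, and the only detail taken from the binlog format is the 4-byte magic `\xfebin` at the start of each file. It shows how a reader that opens the file between its creation and the copy of its contents sees a short read on the header.

```python
import os
import tempfile

BINLOG_MAGIC = b"\xfebin"  # 4-byte magic at the start of a binlog file


def read_binlog_magic(path):
    """Hypothetical helper: read and validate the binlog magic header.

    Raises EOFError on a short read, which is what a reader would hit
    if it opened the file while it is still empty.
    """
    with open(path, "rb") as f:
        header = f.read(len(BINLOG_MAGIC))
    if len(header) < len(BINLOG_MAGIC):
        raise EOFError(
            f"short read: got {len(header)} of {len(BINLOG_MAGIC)} bytes"
        )
    if header != BINLOG_MAGIC:
        raise ValueError("bad magic")
    return header


def simulate_race():
    """Return (result_on_empty_file, result_after_copy)."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "master-bin.000002")

    # Step 1: the replacement file has been created but is still of size 0.
    open(path, "wb").close()
    try:
        read_binlog_magic(path)
        early = "ok"
    except EOFError:
        early = "short read"

    # Step 2: the copy has completed; the same read now succeeds.
    with open(path, "wb") as f:
        f.write(BINLOG_MAGIC + b"\x00" * 16)  # padding stands in for events
    late = "ok" if read_binlog_magic(path) == BINLOG_MAGIC else "bad"
    return early, late
```

Calling `simulate_race()` performs one read in each state; only the read against the empty file fails, which matches the observation that the test itself is otherwise unaffected.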

I was not able to easily reproduce the condition where the dump thread survives for longer, but it seems clear that this can happen. The dump thread terminates when it tries to send an event to the slave on a TCP connection that is closed. However, the master may only see the close of the TCP socket (TCP RESET packet) after some delay, which in turn delays the stop of the dump thread.

So I think the solution is to ensure the dump thread is gone before manipulating binlog files. Or alternatively just suppress this error in the log with a suitable comment.
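
The first option, waiting for the dump thread to go away before touching binlog files, amounts to a bounded polling loop. The following is a minimal sketch of that idea in Python, not the fix that was pushed: `wait_until_gone` and its `count_dump_threads` argument are hypothetical names, and the caller is assumed to supply a function that counts the master's threads whose processlist Command is `Binlog Dump`.

```python
import time


def wait_until_gone(count_dump_threads, timeout=30.0, interval=0.1,
                    sleep=time.sleep):
    """Poll until count_dump_threads() returns 0 or the timeout expires.

    count_dump_threads is a caller-supplied (hypothetical) callable, for
    example one that counts processlist rows with Command = 'Binlog Dump'.
    Returns True if the dump threads went away in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if count_dump_threads() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A test would call this after STOP SLAVE and only remove or copy binlog files once it returns True. The timeout keeps a stuck dump thread from hanging the test indefinitely.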

Comment by Kristian Nielsen [ 2023-12-11 ]

Pushed to 10.5.

Comment by Andrei Elkin [ 2023-12-11 ]

Thanks, knielsen, for working on this one!

As a future enhancement for handling such "zombie" dump threads, an idea arose during the analysis of MDEV-32551: engage the semi-sync ack thread. Since it already accepts the slave-stop message, its handling only needs to be extended to translate the message into actions, such as killing the respective dump thread.
bnestere ^
