[MDEV-33331] IO Thread Relay Log Inconsistent Statistics After MDEV-32551

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6, 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL)
Fix Version/s: 10.6, 10.11
Component/s: Replication, Tests
Labels: None

Description

After MDEV-32551, in a master/slave setup, if the replica's IO thread reconnects quickly (i.e., STOP SLAVE IO_THREAD is followed immediately by START SLAVE IO_THREAD), the relay log rotation behavior changes. Additionally, there is a short window in which the SHOW SLAVE STATUS field Slave_IO_Running can be Yes while Master_Log_File is still empty, which was not observed before MDEV-32551.

These issues make for unstable MTR tests that either 1) rely on consistent relay logging behavior, e.g. rpl_mariadb_slave_capability (in 10.11+), or 2) rely on binlog coordinates after `start_slave.inc` on a replica with empty state, e.g. after `RESET SLAVE` in rpl_using_gtid_default.

The cause is the primary-side change to how zombie dump threads are killed:

kill_zombie_dump_threads() now does killing of dump threads properly.

It can now kill several threads (should be impossible but could
happen if IO slaves reconnects very fast).

We now wait until the dump thread is done before starting the
dump.

In other words, kill_zombie_dump_threads() now kills threads properly, and a binlog dump thread now kills itself if it sees another connection with the same server_id. Concretely, the sequence that produces inconsistent relay logs is:

Slave: START SLAVE IO_THREAD; --source include/wait_for_slave_io_to_start.inc: Start the IO thread as normal and wait for Slave_IO_Running to be Yes. This only waits for the initial handshake to complete; the slave has not yet received anything from the master's binlog dump thread (in particular, the fake rotate event, which initializes the name of the binary log to read).
Master: The master's binlog dump thread tries to send the fake rotate event to initialize the binlog name on the slave.
Slave: STOP SLAVE IO_THREAD. The slave can be stopped before it receives the initial fake rotate event, because receipt of that event is outside the check done by start_slave.inc.

I wonder if the initial fake rotate log event should be part of the "handshake", so that Slave_IO_Running is not changed to "Yes" until the replica receives it from the primary.
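
The window described above, and the effect of the suggested change, can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not server code: the class `ReplicaIOThread`, the switch `fake_rotate_in_handshake`, the helper `status_after_handshake`, and the binlog name `master-bin.000001` are all hypothetical names introduced for illustration, and the model assumes Slave_IO_Running stays at "Connecting" until the handshake is considered complete.

```python
from dataclasses import dataclass


@dataclass
class ReplicaIOThread:
    """Toy model of the two SHOW SLAVE STATUS fields discussed in this report."""

    # Hypothetical switch: False models the behavior described in this report,
    # True models the suggestion (fake rotate counted as part of the handshake).
    fake_rotate_in_handshake: bool
    slave_io_running: str = "No"
    master_log_file: str = ""
    handshake_done: bool = False

    def start(self) -> None:
        # START SLAVE IO_THREAD: the connection attempt begins.
        self.slave_io_running = "Connecting"
        self.handshake_done = False

    def complete_handshake(self) -> None:
        # Initial handshake with the primary has finished.
        self.handshake_done = True
        if not self.fake_rotate_in_handshake:
            # Reported behavior: "Yes" is visible before any event arrives.
            self.slave_io_running = "Yes"

    def receive_fake_rotate(self, binlog_name: str) -> None:
        # The fake rotate event initializes the name of the binlog to read.
        self.master_log_file = binlog_name
        if self.handshake_done:
            self.slave_io_running = "Yes"

    def status(self) -> tuple:
        return (self.slave_io_running, self.master_log_file)


def status_after_handshake(fake_rotate_in_handshake: bool) -> tuple:
    """Status a test would see after the handshake, before the fake rotate."""
    io = ReplicaIOThread(fake_rotate_in_handshake)
    io.start()
    io.complete_handshake()
    return io.status()


print(status_after_handshake(False))  # ('Yes', '')
print(status_after_handshake(True))   # ('Connecting', '')
```

In the first case a test that waits only for Slave_IO_Running to be Yes can proceed (and issue STOP SLAVE IO_THREAD) while Master_Log_File is still empty; in the second case the same wait would not finish until the fake rotate event has been received.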

Issue Links

is caused by: MDEV-32551 "Read semi-sync reply magic number error" warnings on master (Closed)

People

Assignee: Brandon Nesterenko

Reporter: Brandon Nesterenko

Votes: 0

Watchers: 2

Dates

Created: 2024-01-30 15:24

Updated: 2024-11-22 17:12

MariaDB Server