[MDEV-34778] Inconsistent Seconds_Behind_Master after slave stop+start (rpl.rpl_old_master sporadic failure) - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 11.4
Fix Version/s: 11.4
Component/s: Replication
Labels:
None

Description

This is seen as sporadic failure of test case rpl.rpl_old_master, for
example in commit 5dc2fe4815ddfb19c406f225576040882fe62b02:

rpl.rpl_old_master 'mix'                 [ fail ]

        Test ended at 2024-08-19 19:37:08

CURRENT_TEST: rpl.rpl_old_master

mysqltest: In included file "./include/wait_for_slave_param.inc":

included from /kvm/src/my/sn-mdev30255/mysql-test/suite/rpl/t/rpl_old_master.test at line 38:

At line 111: Timeout in include/wait_for_slave_param.inc

The root cause is a real bug or at least inconsistency in the handling of
Seconds_Behind_Master around slave restart. The issue seems to be related to
changes done in this commit:

commit d69e835787f9ce9848cb6c2a5343887dd0eec2ce

Author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>

Date:   Thu Nov 3 14:56:50 2022 -0600

    MDEV-29639: Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas

I am not sure it is worth to try and fix the problem in the code.
Seconds_Behind_Master has historically had many problems with inconsistent
behaviour, and ~~MDEV-33856~~ provides a better solution. But I need to silence
the failure in the test case, so I wanted to document the underlying root
cause.

The failure can be triggered reliably with the following patch, on commit
5dc2fe4815ddfb19c406f225576040882fe62b02 for example:

diff --git a/mysql-test/suite/rpl/t/rpl_old_master.test b/mysql-test/suite/rpl/t/rpl_old_master.test

index 6faa8212d66..37594444472 100644

--- a/mysql-test/suite/rpl/t/rpl_old_master.test

+++ b/mysql-test/suite/rpl/t/rpl_old_master.test

@@ -6,6 +6,7 @@

 --source include/master-slave.inc

 --connection slave

+--sleep 15

 --source include/stop_slave.inc

 --connection master

diff --git a/sql/slave.cc b/sql/slave.cc

index d768d3e6442..22928196990 100644

--- a/sql/slave.cc

+++ b/sql/slave.cc

@@ -4670,6 +4670,7 @@ static int exec_relay_log_event(THD* thd, Relay_log_info* rli,

});

 #endif

+if (typ == GTID_LIST_EVENT) my_sleep(10000000);

     DBUG_RETURN(exec_res);

   mysql_mutex_unlock(&rli->data_lock);

@@ -6117,6 +6118,7 @@ static int queue_event(Master_info* mi, const uchar *buf, ulong event_len)

               checksum_alg == BINLOG_CHECKSUM_ALG_CRC32);

   DBUG_ENTER("queue_event");

+  if (buf[EVENT_TYPE_OFFSET] == GTID_LIST_EVENT) my_sleep(10000000);

/*

     FD_queue checksum alg description does not apply in a case of

     FD itself. The one carries both parts of the checksum data.

What happens is the following:

1. The slave starts at the start of the testcase (include/master-slave.inc).

2. The I/O thread writes the event prior to the Gtid_List event (for
example) to the relay log, then goes to sleep waiting for more events from
the master.

3. The SQL thread in next_event() reaches the current end of the relay log,
it sets sql_thread_caught_up=true and goes to sleep waiting for more events
to be written:

    if (!cur_log->error) /* EOF */ {

      if (hot_log) {

        rli->sql_thread_caught_up= true;

4. The I/O thread wakes up, writes the next event (Gtid_List in this case)
to the relay log.

5. The SQL thread wakes up, reads Gtid_List, returns it from next_event().

6. The sql thread in exec_relay_log_event() does not set
sql_thread_caught_up back to false, because the Gtid_List event has
event_can_update_last_master_timestamp()=false; this code from the
~~MDEV-29639~~ patch:

    if ((!rli->mi->using_parallel()) &&

        event_can_update_last_master_timestamp(ev)) {

      rli->last_master_timestamp= ev->when + (time_t) ev->exec_time;

      rli->sql_thread_caught_up= false;

7. The SQL thread gets killed just after exec_relay_log_event() has
applied the Gtid_List event. This causes the SQL thread to exit from a
different code path than this one in next_event from ~~MDEV-29639~~:

/*

    Set that we are not caught up so if there is a hang/problem on restart,

    Seconds_Behind_Master will still grow.

*/

  rli->sql_thread_caught_up= false;

  DBUG_RETURN(0);

This is the inconsistency (or bug) seemingly left from the ~~MDEV-29639~~ patch.
The sql_thread_caught_up will be set false or not at random at slave stop,
depending on exactly where the kill signal is detected in the SQL thread.

8. The slave stops with sql_thread_caught_up left at the value true. When
the slave starts again, it will initially report Seconds_Behind_Master as 0
until sql_thread_caught_up gets set back to false, due to this code in
send_show_master_info_data():

        idle= mi->rli.sql_thread_caught_up;

...

      if (idle)

        time_diff= 0;

9. In the test case rpl.rpl_old_master, the slave is started while FLUSH
TABLES WITH READ LOCK is in effect. After ~~MDEV-12179~~, this blocks the slave
thread from doing anything (due to find_gtid_slave_pos_tables()) until
UNLOCK TABLES. Since the test case waits for Seconds_Behind_Master to become
non-zero before it does UNLOCK TABLES, this causes the test case to hang and
eventually time out.

So long and complex story.
A fix could be to ensure that sql_thread_caught_up gets set consistently to
false when the SQL thread exits due to being killed, no matter where in the
code the kill is detected. But as written above, I do not think it is worth
it to try to fix this (and risk introducing other inconsistencies or bugs in
the very complex behaviour of Seconds_Behind_Master).

For the test failure, the test is anyway no longer testing what it was
supposed to, after ~~MDEV-12179~~. So I will try to fix the test or just remove
that bit from the test. It is testing ~~MDEV-14528~~, that Seconds_Behind_Master
will be updated when doing parallel replication from a pre-10.0 master,
which does not seem a critical thing to test for now that 10.5 is earliest
supported release.

Attachments

Issue Links

relates to

MDEV-12179 Per-engine mysql.gtid_slave_pos tables

Closed

MDEV-14528 Track master timestamp in case rolling back to serial replication

Closed

MDEV-29639 Seconds_Behind_Master is incorrect for Delayed, Parallel Replicas

Closed

Activity

People

Assignee:: Brandon Nesterenko

Reporter:: Kristian Nielsen

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2024-08-19 18:13

Updated:: 2024-09-23 20:07

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server