MariaDB Server / MDEV-17516

Replication lag issue using parallel replication

    Description

      With parallel replication, Seconds_Behind_Master wrongly reports 0 when the SQL thread is stopped and then restarted a long time afterwards.

      This happens by design:
      https://lists.launchpad.net/maria-developers/msg08958.html

      but it is really a show stopper for most proxies, which send traffic to such a slave thinking it is in sync with the master.

      My understanding is that Seconds_Behind_Master is computed after the first commit, so in this case the master is 2 days ahead, and on a freshly restarted slave we get this:

      |   30 | system user  |                      | tsce_unedic | Connect | 2211 | altering table                                                                 | OPTIMIZE TABLE `requetes` |    0.000 |
      

      And we can see the wrong Seconds_Behind_Master:

              Seconds_Behind_Master: 0
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 0-21-28557589
                      Parallel_Mode: conservative 
      

      but on its master:

      gtid_current_pos       | 0-21-28570301 
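
The distance between the two positions above can be quantified directly, which is the kind of check a proxy could make instead of trusting Seconds_Behind_Master alone. A minimal sketch, assuming both GTIDs are in the same replication domain and that sequence numbers grow by one per transaction; `parse_gtid` and `gtid_gap` are hypothetical helper names, not MariaDB functions:

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse a single MariaDB GTID of the form domain-server-sequence."""
    domain, server, seq = text.strip().split("-")
    return Gtid(int(domain), int(server), int(seq))


def gtid_gap(master_pos: str, slave_pos: str) -> int:
    """Return how many sequence numbers the slave position trails the master."""
    master, slave = parse_gtid(master_pos), parse_gtid(slave_pos)
    if master.domain_id != slave.domain_id:
        raise ValueError("positions belong to different replication domains")
    return master.seq_no - slave.seq_no


# Values taken from the report: master gtid_current_pos vs slave Gtid_IO_Pos.
print(gtid_gap("0-21-28570301", "0-21-28557589"))
```

Note that Gtid_IO_Pos is the position fetched by the IO thread, so the applied position on the slave can only be at or behind it; a non-zero gap here already contradicts a reported lag of 0.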
      

      A possible solution would be to update Seconds_Behind_Master by injecting a fake event at START SLAVE that carries the max timestamp of all events read by the leader thread, and sending it to the worker threads.

      To reproduce:

      --source include/have_innodb.inc
      --source include/have_binlog_format_mixed.inc
      --let $rpl_topology=1->2
      --source include/rpl_init.inc
       
      # Test various aspects of parallel replication.
       
      --connection server_1
      ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
       
      CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
      --save_master_pos
       
      --connection server_2
      --sync_with_master
      --source include/stop_slave_sql_thread.inc
      SET GLOBAL slave_parallel_threads=4;
       
      --connection server_2
      --sync_with_master
      --source include/stop_slave.inc
      SET GLOBAL slave_parallel_threads=1;
       
      --connection server_1
      --disable_warnings
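      # Let the slave fall behind while it is stopped.
      INSERT INTO t1 VALUES (1, SLEEP(100));
      --sleep 100
      # Use a distinct primary key so the second INSERT does not hit a duplicate-key error.
      INSERT INTO t1 VALUES (2, SLEEP(1));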
       
      --connection server_2
      --source include/start_slave.inc
      --let $status_items= Seconds_Behind_Master
      --source include/show_slave_status.inc
      --sync_with_master
      --let $status_items= Seconds_Behind_Master
      --source include/show_slave_status.inc
      

          Activity

            stephane@skysql.com VAROQUI Stephane added a comment

            "Memorize the last gained SBM" is not valid when the slave is stopped with no delay but the next event is a long query.

            To preserve the definition of SBM = time difference between the last event in the queue and the oldest event in the queue being COMMITTED, I'm still curious why using a heartbeat from the leader, enriched with the timestamp of the last binary log event, would not be more accurate, with the slave not starting to fetch events before the first heartbeat? The SBM definition would then become: time difference between the last event on the leader and the oldest event COMMITTED in the queue.
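
The two SBM definitions discussed in this comment differ only in the reference timestamp. A rough sketch of that difference, assuming synchronized clocks and using made-up epoch values; `seconds_behind` is a hypothetical helper and not how the server computes the value:

```python
def seconds_behind(reference_ts: int, oldest_committing_ts: int) -> int:
    """Lag as the difference between a reference event timestamp and the
    timestamp of the oldest event currently being committed on the slave."""
    return max(0, reference_ts - oldest_committing_ts)


TWO_DAYS = 2 * 24 * 3600

# Illustrative values only: the master's last binlog event is "now",
# the slave resumes with an event written two days earlier.
master_last_event_ts = 1_700_000_000
slave_resumed_event_ts = master_last_event_ts - TWO_DAYS

# Queue-based definition: right after restart the only queued event is also
# the one being committed, so the difference collapses to 0.
queue_based = seconds_behind(slave_resumed_event_ts, slave_resumed_event_ts)

# Heartbeat-based proposal: the reference is the master's last event timestamp.
heartbeat_based = seconds_behind(master_last_event_ts, slave_resumed_event_ts)

print(queue_based, heartbeat_based)
```

With the queue-based reference the freshly restarted slave looks caught up, while the heartbeat-based reference exposes the full two-day gap.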

            stephane@skysql.com VAROQUI Stephane added a comment

            Andrei, I have checked that the state Seconds_Behind_Master: NULL with Slave_IO_Running: No already exists, so it will not break any tool. So initializing SBM to NULL and transitioning Slave_IO_Running to Yes after the first event fetch is correct.
            Elkin Andrei Elkin added a comment (edited)

            stephane@skysql.com, regarding the HB exploitation, your direction is great. Just not HB: when necessary, it is feasible to add something like what you propose to the master-slave connection handshake.
            E.g. when the slave service is started for the first time on the (restarted) slave server. In the recommended CM...master_use_gtid = slave_pos, at handshake time the slave would receive back the end-of-transaction timestamp corresponding to its last executed GTID (without this timestamp the slave would be aware only of the GTID details of its last executed trx).
            This measure refines 'a maximum SBM'.
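
The handshake idea, combined with the NULL initial state mentioned earlier in the thread, can be modeled as follows. This is only a sketch under assumptions: `LagEstimate`, `on_handshake` and `max_sbm` are hypothetical names, the timestamps are illustrative, and master and slave clocks are assumed to be in sync.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LagEstimate:
    """Tracks an upper bound on SBM from a handshake-provided timestamp."""

    # End-of-transaction timestamp of the slave's last executed GTID,
    # unknown (None) until the handshake has delivered it.
    last_executed_trx_end_ts: Optional[int] = None

    def on_handshake(self, end_of_trx_ts: int) -> None:
        """Record the timestamp the master returned for the slave's last GTID."""
        self.last_executed_trx_end_ts = end_of_trx_ts

    def max_sbm(self, now: int) -> Optional[int]:
        """Return the maximum lag in seconds, or None (NULL) before the handshake."""
        if self.last_executed_trx_end_ts is None:
            return None
        return max(0, now - self.last_executed_trx_end_ts)


estimate = LagEstimate()
before = estimate.max_sbm(now=1_700_000_000)
estimate.on_handshake(end_of_trx_ts=1_700_000_000 - 172_800)
after = estimate.max_sbm(now=1_700_000_000)
print(before, after)
```

Reporting None before the handshake mirrors the existing Seconds_Behind_Master: NULL state, so a proxy would not mistake an unknown lag for zero.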

            stephane@skysql.com VAROQUI Stephane added a comment

            handshake to both of you so

            stephane@skysql.com VAROQUI Stephane added a comment

            At the same time, handshake + heartbeat would refine SBM by accounting for network time and long-waiting requests.

            People

              bnestere Brandon Nesterenko
              stephane@skysql.com VAROQUI Stephane