Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-17516

Replication lag issue using parallel replication

Details

    Description

      Using parallel replication second behind master is wrongly reporting 0 when SQL thread is stopped and restarted long time after.

      This happen by design
      https://lists.launchpad.net/maria-developers/msg08958.html

      but is really a show stopper for most proxy that send traffic to such slave thinking it's in sync with master.

      My understanding is that slave_behind_master is computed after first commit so in this case the master is 2 days in advance and on a fresh restarted slave we get this

      |   30 | system user  |                      | tsce_unedic | Connect | 2211 | altering table                                                                 | OPTIMIZE TABLE `requetes` |    0.000 |
      

      And we can see wrong second behind master

              Seconds_Behind_Master: 0
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 0-21-28557589
                      Parallel_Mode: conservative 
      

      but on his master

      gtid_current_pos       | 0-21-28570301 
      

      A possible solution would be to update Seconds_Behind_Master by injecting a fake event in start slave with the max timestamp of all events read by the leader thread and send to to the worker threads .

      To reproduce :

      --source include/have_innodb.inc
      --source include/have_binlog_format_mixed.inc
      --let $rpl_topology=1->2
      --source include/rpl_init.inc
       
      # Test various aspects of parallel replication.
       
      --connection server_1
      ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
       
      CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
      --save_master_pos
       
      --connection server_2
      --sync_with_master
      --source include/stop_slave_sql_thread.inc
      SET GLOBAL slave_parallel_threads=4;
       
      --connection server_2
      --sync_with_master
      --source include/stop_slave.inc
      SET GLOBAL slave_parallel_threads=1;
       
      --connection server_1
      --disable_warnings
      INSERT INTO t1 VALUES (1, SLEEP(100));
      --wait 100s
      INSERT INTO t1 VALUES (1, SLEEP(1));
       
      --connection server_2
      --source include/start_slave.inc
      --let $status_items= Seconds_Behind_Master
      --source include/show_slave_status.inc
      --sync_with_master
      --let $status_items= Seconds_Behind_Master
      --source include/show_slave_status.inc
      

      Attachments

        Issue Links

          Activity

            stephane@skysql.com VAROQUI Stephane created issue -
            stephane@skysql.com VAROQUI Stephane made changes -
            Field Original Value New Value
            Description Using parallel replication second behind master is wrongly reporting 0 when SQL thread is stopped and restarted long time after.


            This happen by design
            https://lists.launchpad.net/maria-developers/msg08958.html

            but is really a show stopper for most proxy that send traffic to such slave thinking it's in sync with master.

            My understanding is that slave_behind_master is computed after first commit so in this case the master is 2 days in advance and on a fresh restarted slave we get this

            {noformat}
            | 30 | system user | | tsce_unedic | Connect | 2211 | altering table | OPTIMIZE TABLE `requetes` | 0.000 |
            {noformat}
            And we can see wrong second behind master
            {noformat}
                    Seconds_Behind_Master: 0
                               Using_Gtid: Slave_Pos
                              Gtid_IO_Pos: 0-21-28557589
                            Parallel_Mode: conservative
            {noformat}
            but on his master
            {noformat}
            gtid_current_pos | 0-21-28570301
            {noformat}
            A possible solution would be to update Seconds_Behind_Master by injecting a fake event ion start slave with the timestamp of the first event read by the leader thread

            To reproduce :
            {noformat}
            --source include/have_innodb.inc
            --source include/have_binlog_format_mixed.inc
            --let $rpl_topology=1->2
            --source include/rpl_init.inc

            # Test various aspects of parallel replication.

            --connection server_1
            ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;

            CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
            --save_master_pos

            --connection server_2
            --sync_with_master
            --source include/stop_slave_sql_thread.inc
            SET GLOBAL slave_parallel_threads=4;

            --connection server_2
            --sync_with_master
            --source include/stop_slave.inc
            SET GLOBAL slave_parallel_threads=1;

            --connection server_1
            --disable_warnings
            INSERT INTO t1 VALUES (1, SLEEP(3600));

            --connection server_2
            --source include/start_slave.inc
            --let $status_items= Seconds_Behind_Master
            --source include/show_slave_status.inc
            --sync_with_master
            --let $status_items= Seconds_Behind_Master
            --source include/show_slave_status.inc
            {noformat}
            Using parallel replication second behind master is wrongly reporting 0 when SQL thread is stopped and restarted long time after.


            This happen by design
            https://lists.launchpad.net/maria-developers/msg08958.html

            but is really a show stopper for most proxy that send traffic to such slave thinking it's in sync with master.

            My understanding is that slave_behind_master is computed after first commit so in this case the master is 2 days in advance and on a fresh restarted slave we get this

            {noformat}
            | 30 | system user | | tsce_unedic | Connect | 2211 | altering table | OPTIMIZE TABLE `requetes` | 0.000 |
            {noformat}
            And we can see wrong second behind master
            {noformat}
                    Seconds_Behind_Master: 0
                               Using_Gtid: Slave_Pos
                              Gtid_IO_Pos: 0-21-28557589
                            Parallel_Mode: conservative
            {noformat}
            but on his master
            {noformat}
            gtid_current_pos | 0-21-28570301
            {noformat}

            A possible solution would be to update Seconds_Behind_Master by injecting a fake event in start slave with the max timestamp of all events read by the leader thread and send to to the worker threads .

            To reproduce :
            {noformat}
            --source include/have_innodb.inc
            --source include/have_binlog_format_mixed.inc
            --let $rpl_topology=1->2
            --source include/rpl_init.inc

            # Test various aspects of parallel replication.

            --connection server_1
            ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;

            CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
            --save_master_pos

            --connection server_2
            --sync_with_master
            --source include/stop_slave_sql_thread.inc
            SET GLOBAL slave_parallel_threads=4;

            --connection server_2
            --sync_with_master
            --source include/stop_slave.inc
            SET GLOBAL slave_parallel_threads=1;

            --connection server_1
            --disable_warnings
            INSERT INTO t1 VALUES (1, SLEEP(100));
            --wait 100s
            INSERT INTO t1 VALUES (1, SLEEP(1));

            --connection server_2
            --source include/start_slave.inc
            --let $status_items= Seconds_Behind_Master
            --source include/show_slave_status.inc
            --sync_with_master
            --let $status_items= Seconds_Behind_Master
            --source include/show_slave_status.inc
            {noformat}
            stephane@skysql.com VAROQUI Stephane made changes -
            elenst Elena Stepanova made changes -
            Assignee Andrei Elkin [ elkin ]
            Elkin Andrei Elkin made changes -
            Labels seconds-behind-master
            elenst Elena Stepanova made changes -
            Fix Version/s 10.1 [ 16100 ]
            Elkin Andrei Elkin made changes -
            Assignee Andrei Elkin [ elkin ] Sujatha Sivakumar [ sujatha.sivakumar ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10..4 [ 24902 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.5 [ 23123 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10..4 [ 24902 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.1 [ 16100 ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Sujatha Sivakumar [ sujatha.sivakumar ] Andrei Elkin [ elkin ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 90206 ] MariaDB v4 [ 140987 ]
            Elkin Andrei Elkin made changes -
            Assignee Andrei Elkin [ elkin ] Brandon Nesterenko [ JIRAUSER48702 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.2 [ 14601 ]
            bnestere Brandon Nesterenko made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            bnestere Brandon Nesterenko made changes -

            Note that updating strategy of Seconds_Behind_Master also needs to consider the behavior of delayed slaves. See MDEV-29639 for details.

            bnestere Brandon Nesterenko added a comment - Note that updating strategy of Seconds_Behind_Master also needs to consider the behavior of delayed slaves. See MDEV-29639 for details.
            julien.fritsch Julien Fritsch made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]

            Hi Andrei!

            This is ready for review. PR-2296

            bnestere Brandon Nesterenko added a comment - Hi Andrei! This is ready for review. PR-2296
            bnestere Brandon Nesterenko made changes -
            Assignee Brandon Nesterenko [ JIRAUSER48702 ] Andrei Elkin [ elkin ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            bnestere Brandon Nesterenko made changes -
            bnestere Brandon Nesterenko made changes -
            Assignee Andrei Elkin [ elkin ] Brandon Nesterenko [ JIRAUSER48702 ]
            bnestere Brandon Nesterenko made changes -
            Status In Review [ 10002 ] Stalled [ 10000 ]

            The cause of Seconds_Behind_Master (SBM) reporting as 0 after slave restart is that the relay log is empty until the IO thread is able to re-queue the events. That is, after STOP SLAVE has been issued, SBM is invalidated, and presented as NULL by SHOW SLAVE STATUS. Currently, this state is maintained until the slave threads are running again. However, SBM will report as 0 for the time it takes the IO thread to receive an event from the master, queue it to the relay log, and until the SQL thread reads it. This is misleading, as SBM is not actually 0, but should still be invalid because it is not yet known when the last timestamp from the master is.

            The proposed fix (from Elkin) is to extend the duration of the NULL/invalid state of SBM until the first event is read by the SQL thread.

            bnestere Brandon Nesterenko added a comment - The cause of Seconds_Behind_Master (SBM) reporting as 0 after slave restart is that the relay log is empty until the IO thread is able to re-queue the events. That is, after STOP SLAVE has been issued, SBM is invalidated, and presented as NULL by SHOW SLAVE STATUS. Currently, this state is maintained until the slave threads are running again. However, SBM will report as 0 for the time it takes the IO thread to receive an event from the master, queue it to the relay log, and until the SQL thread reads it. This is misleading, as SBM is not actually 0, but should still be invalid because it is not yet known when the last timestamp from the master is. The proposed fix (from Elkin ) is to extend the duration of the NULL/invalid state of SBM until the first event is read by the SQL thread.
            stephane@skysql.com VAROQUI Stephane added a comment - - edited

            NULL looks like a bad idea what about 666 ? None of the tools around monitoring will have to be changed with a valid numeric value .

            Unless io_thread reports "No" when second behind_masster is reporting NULL

            Best would be to refetch the first event from the relay log to extract last event time ever played from the rejoining node and compute the delay from current_time - first_relay_time. and later from connection set it to 0 if the first event receive from leader is same GTID as this one , i don't think this could happen as if exists multiple entry in relay log it's because of a late situation

            For my knowledge what reason trigger the relay log to be deleted, if they get replayed the time can be read from sql_thread ?

            stephane@skysql.com VAROQUI Stephane added a comment - - edited NULL looks like a bad idea what about 666 ? None of the tools around monitoring will have to be changed with a valid numeric value . Unless io_thread reports "No" when second behind_masster is reporting NULL Best would be to refetch the first event from the relay log to extract last event time ever played from the rejoining node and compute the delay from current_time - first_relay_time. and later from connection set it to 0 if the first event receive from leader is same GTID as this one , i don't think this could happen as if exists multiple entry in relay log it's because of a late situation For my knowledge what reason trigger the relay log to be deleted, if they get replayed the time can be read from sql_thread ?
            Elkin Andrei Elkin added a comment - - edited

            stephane@skysql.com, you have a good point and a nice recommendation. It's better not to endanger monitoring tools.
            The relay log can be deleted at the slave start time due to the gtid slave setup via Change-Master...master_use_gtid=slave_pos (as well as by the legacy --relay-log-recovery).
            It's not a big deal to memorize the last gained SBM to use it to compute an estimate while the first slave-restart event is still coming.
            In case the last time SBM is unknown (the slave server restart), a maximum SBM would be displayed (rather than zero as of current).
            bnestere, what you'd say as well?

            Elkin Andrei Elkin added a comment - - edited stephane@skysql.com , you have a good point and a nice recommendation. It's better not to endanger monitoring tools. The relay log can be deleted at the slave start time due to the gtid slave setup via Change-Master...master_use_gtid=slave_pos (as well as by the legacy --relay-log-recovery ). It's not a big deal to memorize the last gained SBM to use it to compute an estimate while the first slave-restart event is still coming. In case the last time SBM is unknown (the slave server restart), a maximum SBM would be displayed (rather than zero as of current). bnestere , what you'd say as well?

            "memorize the last gained SBM", non valid for stop slave with no delay but the next event is a long query

            To preserve the definition of SBM = time difference of last event in queue and oldest event in the queue being COMMITTED,
            i'm still curious why not using a heartbeat from the leader enrich with timestamp of last binary log event would be more accurate, a slave would not start fetching event before first heartbeat ? And SBM definition become time difference last event in the leader and oldest event COMMITTED in the queue

            stephane@skysql.com VAROQUI Stephane added a comment - "memorize the last gained SBM", non valid for stop slave with no delay but the next event is a long query To preserve the definition of SBM = time difference of last event in queue and oldest event in the queue being COMMITTED, i'm still curious why not using a heartbeat from the leader enrich with timestamp of last binary log event would be more accurate, a slave would not start fetching event before first heartbeat ? And SBM definition become time difference last event in the leader and oldest event COMMITTED in the queue

            Andrei i have check the state Seconds_Behind_Master: NULL Slave_IO_Running: No already exist so will not break any tool , So init SBM NULL and transition of Slave_IO_Running to yes after first event fetch is correct

            stephane@skysql.com VAROQUI Stephane added a comment - Andrei i have check the state Seconds_Behind_Master: NULL Slave_IO_Running: No already exist so will not break any tool , So init SBM NULL and transition of Slave_IO_Running to yes after first event fetch is correct
            Elkin Andrei Elkin added a comment - - edited

            stephane@skysql.com, to the HB exploitation , your direction is great. Just not HB, but when necessary it's feasible to add up to the master-slave connection handshake something like you propose.
            E.g that the slave service is started for the 1st time on the (restarted) slave server. In the recommended CM...master_use_gtid = slave_pos in the handshake time slave would receive back the end-of-transaction timestamp corresponding to its last GTID executed (without this ts piece slave would be aware only of the GTID details of its last executed trx).
            This measure refines 'a maximum SBM'.

            Elkin Andrei Elkin added a comment - - edited stephane@skysql.com , to the HB exploitation , your direction is great. Just not HB, but when necessary it's feasible to add up to the master-slave connection handshake something like you propose. E.g that the slave service is started for the 1st time on the (restarted) slave server. In the recommended CM...master_use_gtid = slave_pos in the handshake time slave would receive back the end-of-transaction timestamp corresponding to its last GTID executed (without this ts piece slave would be aware only of the GTID details of its last executed trx). This measure refines 'a maximum SBM'.

            handshake to both of you so

            stephane@skysql.com VAROQUI Stephane added a comment - handshake to both of you so

            At the same time Hancheck + Heartbeat would refine SBM by accounting network time, long waiting request

            stephane@skysql.com VAROQUI Stephane added a comment - At the same time Hancheck + Heartbeat would refine SBM by accounting network time, long waiting request
            Elkin Andrei Elkin made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.3 [ 22126 ]
            bnestere Brandon Nesterenko made changes -
            bnestere Brandon Nesterenko made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.4 [ 22408 ]
            bnestere Brandon Nesterenko made changes -

            People

              bnestere Brandon Nesterenko
              stephane@skysql.com VAROQUI Stephane
              Votes:
              1 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.