[MDEV-20122] Deprecate MASTER_USE_GTID=Current_Pos to favor new MASTER_DEMOTE_TO_SLAVE option Created: 2019-07-22  Updated: 2023-12-08  Resolved: 2022-07-30

Status: Closed
Project: MariaDB Server
Component/s: Replication
Fix Version/s: 10.10.0

Type: Task Priority: Critical
Reporter: Geoff Montee (Inactive) Assignee: Brandon Nesterenko
Resolution: Fixed Votes: 3
Labels: Preview_10.10, gtid_current_pos

Issue Links:
Issue split
split to MDEV-32976 Un-deprecate MASTER_USE_GTID=Current_Pos Open
Relates
relates to MDEV-10279 gtid_current_pos is not updated with ... Open
relates to MDEV-16834 GTID current_pos easily breaks replic... Closed
relates to MDEV-17156 Local transactions on a Slave don't u... Closed
relates to MDEV-28839 Convert master_use_gtid = current_pos... Closed
relates to MDEV-30647 Remove Current_Pos from MASTER_USE_GT... Closed

 Description   

======================================
Description update after problem discussion:
======================================

This work deprecates Current_Pos as an option to CHANGE MASTER TO MASTER_USE_GTID and adds a safe replacement option, MASTER_DEMOTE_TO_SLAVE=<bool>. The use case of Current_Pos is to transition a master into a slave; however, it can break the replication state because gtid_current_pos keeps being updated from both gtid_binlog_pos and gtid_slave_pos.

MASTER_DEMOTE_TO_SLAVE changes this use case by forcing users to set Using_Gtid=Slave_Pos and merging gtid_binlog_pos into gtid_slave_pos once at CHANGE MASTER TO time. Note that if gtid_slave_pos is more recent than gtid_binlog_pos (as in the case of chain replication), the replication state should be preserved.

Then, MASTER_USE_GTID=Current_Pos is deprecated in favor of using Slave_Pos in combination with MASTER_DEMOTE_TO_SLAVE=1.
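
The one-time merge described above can be illustrated with a small model. The following Python sketch is not the server's implementation. It assumes that a GTID position is a comma-separated list of domain-server-sequence triples with one entry per domain, and that "more recent" means a higher sequence number within the same domain. The function names are hypothetical.

```python
def parse_pos(pos):
    """Parse '1-1-95,3-1-1' into {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in filter(None, pos.split(",")):
        domain, server, seq = (int(part) for part in gtid.strip().split("-"))
        result[domain] = (server, seq)
    return result


def format_pos(pos):
    """Format {domain_id: (server_id, seq_no)} back into a position string."""
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(pos.items()))


def demote_to_slave(gtid_slave_pos, gtid_binlog_pos):
    """Model of the one-time merge (assumption): per domain, keep the GTID
    with the higher sequence number, so that a newer gtid_slave_pos entry
    is preserved."""
    merged = parse_pos(gtid_slave_pos)
    for domain, (server, seq) in parse_pos(gtid_binlog_pos).items():
        if domain not in merged or seq > merged[domain][1]:
            merged[domain] = (server, seq)
    return format_pos(merged)


# An old master that has local transactions in domain 4.
print(demote_to_slave("1-1-95,3-1-1", "1-1-95,3-1-1,4-2-2"))
```

In this model the merge happens exactly once, when the function is called. Later local transactions would change gtid_binlog_pos only and would not move the merged slave position.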

==========================
Original Description:
==========================

When a slave is configured to replicate with "MASTER_USE_GTID=current_pos", the slave uses its value of gtid_current_pos to replicate from the master.

https://mariadb.com/kb/en/library/change-master-to/#master_use_gtid

https://mariadb.com/kb/en/library/gtid/#gtid_current_pos

The value of gtid_current_pos includes GTIDs from both gtid_slave_pos and gtid_binlog_pos:

https://mariadb.com/kb/en/library/gtid/#gtid_slave_pos

https://mariadb.com/kb/en/library/gtid/#gtid_binlog_pos

Since both gtid_slave_pos and gtid_binlog_pos are used, this means that the position takes into account both local transactions and replicated transactions. This can be somewhat problematic, since it means that executing a single local transaction on the slave can end up breaking replication, due to the fact that the local transaction would cause the slave's GTID position to become inconsistent with the master's GTID position. However, in my opinion, this makes sense, given the design of the GTID functionality. To prevent this specific issue, if a slave is using "MASTER_USE_GTID=current_pos", then it should have read_only=ON set.

However, the more problematic issue is that MariaDB will not alert users to the inconsistent GTID position until the slave threads are restarted. If the slave is running smoothly, then the slave threads may not be restarted for weeks or months.

The root cause of this appears to be that the slave's I/O thread only initializes its local value of gtid_current_pos when the thread is first started in start_slave_threads():

https://github.com/MariaDB/server/blob/mariadb-10.4.6/sql/slave.cc#L1400

This means that if a local transaction is executed on the slave, then the slave won't notice that its GTID position is inconsistent with the master until the slave threads are restarted.

For example, let's say that I have a master and a slave.

The master's GTID position:

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
+------------------------+--------------------+
| Variable_name          | Value              |
+------------------------+--------------------+
| gtid_binlog_pos        | 1-1-95,3-1-1,4-2-1 |
| gtid_binlog_state      | 1-1-95,3-1-1,4-2-1 |
| gtid_current_pos       | 1-1-95,3-1-1,4-2-1 |
| gtid_domain_id         | 3                  |
| gtid_ignore_duplicates | OFF                |
| gtid_pos_auto_engines  |                    |
| gtid_slave_pos         | 1-1-95,3-1-1,4-2-1 |
| gtid_strict_mode       | OFF                |
| wsrep_gtid_domain_id   | 0                  |
| wsrep_gtid_mode        | OFF                |
+------------------------+--------------------+
10 rows in set (0.001 sec)

The slave's GTID position:

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
+------------------------+--------------------+
| Variable_name          | Value              |
+------------------------+--------------------+
| gtid_binlog_pos        | 1-1-95,3-1-1,4-2-1 |
| gtid_binlog_state      | 1-1-95,3-1-1,4-2-1 |
| gtid_current_pos       | 1-1-95,3-1-1,4-2-1 |
| gtid_domain_id         | 4                  |
| gtid_ignore_duplicates | OFF                |
| gtid_pos_auto_engines  |                    |
| gtid_slave_pos         | 1-1-95,3-1-1       |
| gtid_strict_mode       | OFF                |
| wsrep_gtid_domain_id   | 0                  |
| wsrep_gtid_mode        | OFF                |
+------------------------+--------------------+
10 rows in set (0.001 sec)

And let's say that the slave is configured to use "MASTER_USE_GTID=current_pos":

MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='172.30.0.105', MASTER_USER='maxscale', MASTER_PASSWORD='password', MASTER_USE_GTID=current_pos;
Query OK, 0 rows affected (0.009 sec)
 
MariaDB [(none)]> START SLAVE;
Query OK, 0 rows affected (0.045 sec)

And the slave is initially replicating normally:

MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 172.30.0.105
                   Master_User: maxscale
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: mariadb-bin.000001
           Read_Master_Log_Pos: 376
                Relay_Log_File: ip-172-30-0-96-relay-bin.000002
                 Relay_Log_Pos: 717
         Relay_Master_Log_File: mariadb-bin.000001
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 376
               Relay_Log_Space: 1035
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 1
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Current_Pos
                   Gtid_IO_Pos: 1-1-95,4-2-1,3-1-1
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)

But then let's say that we execute a local transaction on the slave. We can see that the slave's gtid_binlog_pos changes:

MariaDB [(none)]> CREATE DATABASE slave_db;
Query OK, 1 row affected (0.000 sec)
 
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
+------------------------+--------------------+
| Variable_name          | Value              |
+------------------------+--------------------+
| gtid_binlog_pos        | 1-1-95,3-1-1,4-2-2 |
| gtid_binlog_state      | 1-1-95,3-1-1,4-2-2 |
| gtid_current_pos       | 1-1-95,3-1-1,4-2-2 |
| gtid_domain_id         | 4                  |
| gtid_ignore_duplicates | OFF                |
| gtid_pos_auto_engines  |                    |
| gtid_slave_pos         | 1-1-95,3-1-1       |
| gtid_strict_mode       | OFF                |
| wsrep_gtid_domain_id   | 0                  |
| wsrep_gtid_mode        | OFF                |
+------------------------+--------------------+
10 rows in set (0.001 sec)

But at first, the slave doesn't actually notice that its position is inconsistent with the master:

MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 172.30.0.105
                   Master_User: maxscale
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: mariadb-bin.000001
           Read_Master_Log_Pos: 376
                Relay_Log_File: ip-172-30-0-96-relay-bin.000002
                 Relay_Log_Pos: 717
         Relay_Master_Log_File: mariadb-bin.000001
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 376
               Relay_Log_Space: 1035
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 1
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Current_Pos
                   Gtid_IO_Pos: 1-1-95,4-2-1,3-1-1
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)

The slave only notices when the slave threads are restarted:

MariaDB [(none)]> STOP SLAVE;
Query OK, 0 rows affected (0.002 sec)
 
MariaDB [(none)]> START SLAVE;
Query OK, 0 rows affected (0.005 sec)
 
MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: 172.30.0.105
                   Master_User: maxscale
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: mariadb-bin.000001
           Read_Master_Log_Pos: 376
                Relay_Log_File: ip-172-30-0-96-relay-bin.000001
                 Relay_Log_Pos: 4
         Relay_Master_Log_File: mariadb-bin.000001
              Slave_IO_Running: No
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 376
               Relay_Log_Space: 296
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 1236
                 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 4-2-2, which is not in the master's binlog'
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 1
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Current_Pos
                   Gtid_IO_Pos: 1-1-95,4-2-2,3-1-1
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)

I think the slave should warn the user about this, so that users can be aware of inconsistent positions, even when the slave threads are not restarted.

For example, here's one potential fix:

If a slave has "MASTER_USE_GTID=current_pos" set, then the slave's I/O thread could periodically compare the thread's local value of gtid_current_pos (i.e. mi->gtid_current_pos) to the slave's global value of gtid_binlog_pos. If the global value of gtid_binlog_pos contains GTIDs that are greater than the GTIDs in the thread's local value of gtid_current_pos (i.e. mi->gtid_current_pos), then the slave could write a warning to the error log. If gtid_strict_mode were enabled, then maybe the warning could be changed to an error.
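
The comparison proposed above can be sketched as follows. This Python model only illustrates the idea and is not server code. The function names are hypothetical, positions are assumed to hold one GTID per domain, and "greater" is assumed to mean a higher sequence number in the same domain or a domain that is absent from the snapshot. The `io_snapshot` argument stands in for mi->gtid_current_pos.

```python
def parse_pos(pos):
    """Parse '1-1-95,3-1-1' into {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in filter(None, pos.split(",")):
        domain, server, seq = (int(part) for part in gtid.strip().split("-"))
        result[domain] = (server, seq)
    return result


def gtids_ahead_of_snapshot(io_snapshot, gtid_binlog_pos):
    """Return the GTIDs in gtid_binlog_pos that are newer than the
    position the IO thread captured when it started."""
    snapshot = parse_pos(io_snapshot)
    ahead = []
    for domain, (server, seq) in sorted(parse_pos(gtid_binlog_pos).items()):
        if domain not in snapshot or seq > snapshot[domain][1]:
            ahead.append(f"{domain}-{server}-{seq}")
    return ahead


# Snapshot taken at START SLAVE, then a local transaction on the slave.
for gtid in gtids_ahead_of_snapshot("1-1-95,3-1-1,4-2-1", "1-1-95,3-1-1,4-2-2"):
    print(f"Warning: local GTID {gtid} is not part of the IO thread's position")
```

With the values from the example above, the only GTID reported is the one created by the local CREATE DATABASE statement.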



 Comments   
Comment by Andrei Elkin [ 2019-07-23 ]

GeoffMontee, thanks for the report and the analysis. We might consider your proposals. However, let me first copy-paste a mail pertaining to the MDEV-18404 discussion with knielsen about 'current_pos', its goal, semantics and, de facto, a recommendation not to use it.

Could we work around this case by switching to slave_pos instead of elaborating on
current_pos? Could you please consider that first. [While I am on vacation, I can read mails anyway. Feel free to escalate the issue if really necessary so my colleagues could start working on it earlier than when I am back.]

Quote, unquote:
_The @@gtid_current_pos exists for one sole purpose. This is to let the user
promote a slave as the new master and attach the old master as a slave to
the new master.
By using master_use_gtid=current_pos, the exact same command can be used to
attach a slave to the new master, regardless of whether that slave was
previously a slave or a master:

CHANGE MASTER TO master_host=new_promoted_master

If not using gtid_current_pos (ie master_use_gtid=slave_pos), then to let
the old master become a slave of the new master, the old master's position
must explicitly be set:

SET GLOBAL gtid_slave_pos=@@gtid_binlog_pos

This is because for efficiency reasons, the master doesn't update the
mysql.gtid_slave_pos in each commit.

So now we can see why only GTIDs with the server's own server_id should
contribute to @@gtid_current_pos. If a GTID was replicated from another
server, that GTID will appear in the @@gtid_slave_pos. If the GTID
originated on this server, it will appear in @@gtid_binlog_pos. The
@@gtid_current_pos is the @@gtid_slave_pos extended with GTIDs originating
on this server, hence only GTIDs with our own server id.

Normally, every GTID in the binlog with a different server id than our own
will already be in the @@gtid_slave_pos as well - since it originated on
another server and was replicated to this server.

Thus, in the normal case, where user did not play tricks with the binlog and
slave state, extending the @@gtid_current_pos as suggested in MDEV-18404 has
no effect - the GTIDs are already in the @@gtid_slave_pos, so
@@gtid_current_pos is unaffected.

And in case the user deliberately modified the state, it should be up to the
user to decide what goes into @@gtid_slave_state and @@gtid_binlog_state.
For example, the MDEV-18404 change would make it impossible on a server to
remove a replicated GTID from the @@gtid_current_pos if --log-slave-updates
is enabled (without the drastic RESET MASTER).

Another problem is that the server cannot reliably compare GTIDs with
distinct server ids to decide which one is the most recent. There is no
guarantee that sequence numbers are monotonic across different server ids.
Thus the MDEV-18404 method could create completely invalid positions in
some setups where @@gtid_strict_mode=0 and replication domains are not
strictly maintained.

I don't see the value in MDEV-18404. If the user is updating
@@gtid_binlog_state (itself a very drastic operation), and wants a specific
GTID to go into the slave position - just update the @@gtid_slave_pos with
the desired GTID, don't leave the server with an inconsistent replication
state.
_

*And finally, let me reiterate: I consider the @@gtid_current_pos a design
mistake. Better to just transfer the @@gtid_binlog_pos to the
@@gtid_slave_pos only at the point where an old master is turned into a
slave. This can be done manually already, and it would be simple to
implement automatic support for this with an extra option for CHANGE MASTER.*
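
The rule in the quoted mail, that @@gtid_current_pos is @@gtid_slave_pos extended with GTIDs originating on this server, can be written as a short model. The Python below is a sketch, not the server's implementation. The function names are hypothetical, one GTID per domain is assumed, and the slave's server_id is assumed to be 2 because its local GTIDs in the description have the form 4-2-N.

```python
def parse_pos(pos):
    """Parse '1-1-95,3-1-1' into {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in filter(None, pos.split(",")):
        domain, server, seq = (int(part) for part in gtid.strip().split("-"))
        result[domain] = (server, seq)
    return result


def format_pos(pos):
    """Format {domain_id: (server_id, seq_no)} back into a position string."""
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(pos.items()))


def current_pos(gtid_slave_pos, gtid_binlog_pos, server_id):
    """Start from gtid_slave_pos and add binlog GTIDs only if they carry
    this server's own server_id and are newer within their domain."""
    current = parse_pos(gtid_slave_pos)
    for domain, (server, seq) in parse_pos(gtid_binlog_pos).items():
        if server != server_id:
            continue  # replicated GTIDs are expected in gtid_slave_pos already
        if domain not in current or seq > current[domain][1]:
            current[domain] = (server, seq)
    return format_pos(current)


# Slave values from the description, after the local CREATE DATABASE.
print(current_pos("1-1-95,3-1-1", "1-1-95,3-1-1,4-2-2", server_id=2))
```

Under these assumptions the model gives the same gtid_current_pos that the slave reports in the description. It also shows why a single local transaction is enough to change the position that current_pos presents to the master.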

Comment by Geoff Montee (Inactive) [ 2019-07-23 ]

Hi Elkin,

> While I am on vacation, I can read mails anyway. Feel free to escalate the issue if really necessary so my colleagues could start working on it earlier than when I am back.

Thanks for the response! This issue isn't really urgent. I hope you enjoy your vacation!

> However, let me first copy-paste a mail pertaining to the MDEV-18404 discussion with Kristian Nielsen about 'current_pos', its goal, semantics and, de facto, a recommendation not to use it.

I'm not quite sure how MDEV-18404 is relevant to this specific issue, but I appreciate you sharing Kristian's comments on that issue. My reasoning for submitting MDEV-18404 was only related to my interest in finding a way to back up and restore GTID state using Mariabackup. I agree that it would not generally be a good idea to manually try out the steps in MDEV-18404. However, I was trying to determine if it would be feasible for Mariabackup to back up and restore a server's gtid_binlog_pos value without backing up all of the binary logs. Mariabackup already backs up and restores a server's gtid_slave_pos, since it is stored in an InnoDB table. The full details are in MDEV-18405. But anyway, after finding out from Kristian that gtid_current_pos intentionally excludes transactions from gtid_binlog_pos that don't have the server's own server_id component, I mentioned in MDEV-18405 that if we want to back up and restore gtid_binlog_pos, then it would probably make more sense to restore its value to gtid_slave_pos. I see Kristian's perspective, and I don't have any issue with it.

Regardless, I understand the purpose of current_pos, and I understand the semantics. I also understand that it is very easy for users to accidentally break a slave using current_pos. Currently, if a slave is using current_pos, then the slave doesn't do anything to try to detect if the user has done any unsafe operations that may cause the slave to break.

If we want to continue to support current_pos, then I am just suggesting that the slave should try to detect if the user has done any unsafe operations that may cause the slave to break. Maybe it could write a warning to the error log. Maybe the warning should suggest that the user may want to switch to slave_pos instead.

> Could we work around this case by switching to slave_pos instead of elaborating on
> current_pos? Could you please consider that first.

Yes, I always recommend using slave_pos, rather than current_pos.

Our Mariabackup documentation on how to build a slave also recommends using slave_pos.

https://mariadb.com/kb/en/library/setting-up-a-replication-slave-with-mariabackup/#gtids

However, a lot of users are already using current_pos for whatever reason.

> And finally, let me reiterate: I consider the @@gtid_current_pos a design
> mistake. Better to just transfer the @@gtid_binlog_pos to the
> @@gtid_slave_pos only at the point where an old master is turned into a
> slave. This can be done manually already, and it would be simple to
> implement automatic support for this with an extra option for CHANGE MASTER.

Do we have plans to remove current_pos or change the way it works?

Comment by Andrei Elkin [ 2020-10-16 ]

GeoffMontee, howdy.

Let's first settle our opinions on the Semantics of MASTER_USE_GTID=current_pos.

Like slave_pos, it is a connection mode that presents the
slave's gtid state to the master. That is, on the slave server it only
affects the slave IO thread. The current_pos mode makes the IO thread
regard gtid_current_pos as the slave gtid state. More
specifically, the IO thread takes a snapshot of gtid_current_pos at
connection time and presents it to the master, which then validates the
slave's state.

When gtid_current_pos is updated locally later, after the validation
has succeeded, it is fair to claim that the local update
should not affect the current slave connection. As for the
inconsistency itself, it will be caught (though not necessarily
instantly) when the slave executes events and gtid_strict_mode is set.

Note too that the preferred slave_pos mode is also vulnerable
to this issue in the multi-source scenario: the second source,
playing the role of a local connection, desynchronizes the slave_pos state.

Personally I prefer this interpretation of a "dumb" simple IO that
is not concerned with what gtids it carries in.

Secondly, it might still be useful for the slave to learn about a potential inconsistency.
A watching mechanism should write to the error log any online changes to gtid_current_pos or gtid_slave_pos made
locally on the slave or through a second source in the domains of concern.
E.g. when a replication source is defined as
CHANGE MASTER ... do_domain_ids = (d1,d2), that would be domains d1 or d2.

I would limit this watcher to gtid_strict_mode = ON.

We're considering a technical implementation in which the IO threads do
the marking, while local transaction handlers and slave appliers do
the checking and warning. This method addresses the natural
interest in exactly when the slave state gets screwed up.

GeoffMontee, feel free to remove the SI association if it's no longer relevant to the customer.

Cheers,
Andrei

Comment by Sujatha Sivakumar (Inactive) [ 2020-10-20 ]

Hello GeoffMontee

The current issue is observed with "GTID_STRICT_MODE=off".
I tried to reproduce MDEV-20122 with "GTID_STRICT_MODE=on".

Enable circular replication between master-slave.
Do 'CREATE TABLE t' on master and 'INSERT INTO t' on slave.
Following state is achieved.

Master:
========

MariaDB [test]> show global variables like '%gtid%';
+------------------------+-------------+
| Variable_name          | Value       |
+------------------------+-------------+
| gtid_binlog_pos        | 0-2-2       |
| gtid_binlog_state      | 0-1-1,0-2-2 |
| gtid_current_pos       | 0-2-2       |
| gtid_domain_id         | 0           |
| gtid_ignore_duplicates | OFF         |
| gtid_slave_pos         | 0-2-2       |
| gtid_strict_mode       | ON          |
| wsrep_gtid_domain_id   | 0           |
| wsrep_gtid_mode        | OFF         |
+------------------------+-------------+
9 rows in set (0.01 sec)

Slave:
======

MariaDB [test]> show global variables like '%gtid%';
+------------------------+-------------+
| Variable_name          | Value       |
+------------------------+-------------+
| gtid_binlog_pos        | 0-2-2       |
| gtid_binlog_state      | 0-1-1,0-2-2 |
| gtid_current_pos       | 0-2-2       |
| gtid_domain_id         | 0           |
| gtid_ignore_duplicates | OFF         |
| gtid_slave_pos         | 0-2-2       |
| gtid_strict_mode       | ON          |
| wsrep_gtid_domain_id   | 0           |
| wsrep_gtid_mode        | OFF         |
+------------------------+-------------+
9 rows in set (0.01 sec)

Now do 'STOP SLAVE' on 'Server_2'. Execute 'CHANGE MASTER TO' with 'MASTER_USE_GTID=current_pos'

Case 1:
======
With circular replication in effect, MDEV-20122 will never occur. Replication will be smooth with both 'current_pos' and 'slave_pos', as both servers are in sync.

Case 2: [No circular replication between master and slave, i.e. the slave becomes the new 'master' and its 'slave' is using 'current_pos']
=======

MariaDB [test]> start slave;
Query OK, 0 rows affected (0.01 sec)
MariaDB [test]> insert into t values (30);
Query OK, 1 row affected (0.00 sec)
Please note: "gtid_binlog_pos" got updated.
MariaDB [test]> show global variables like '%gtid%';
+------------------------+-------------+
| Variable_name          | Value       |
+------------------------+-------------+
| gtid_binlog_pos        | 0-2-3       |
| gtid_binlog_state      | 0-1-1,0-2-3 |
| gtid_current_pos       | 0-2-3       |
| gtid_domain_id         | 0           |
| gtid_ignore_duplicates | OFF         |
| gtid_slave_pos         | 0-2-2       |
| gtid_strict_mode       | ON          |
| wsrep_gtid_domain_id   | 0           |
| wsrep_gtid_mode        | OFF         |
+------------------------+-------------+
9 rows in set (0.01 sec)

As long as the master is muted/silent, the slave works fine.
Now do a DML on the master and observe that the slave stops.

MariaDB [test]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: localhost
                  Master_User: root
                  Master_Port: 16000
                Connect_Retry: 60
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 842
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 658
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table: test.t_ignored1
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1950
                   Last_Error: An attempt was made to binlog GTID 0-1-3 which would create an out-of-order sequence number with existing GTID 0-2-3, and gtid strict mode is enabled.

The slave stops with an error upon processing the first GTID received from the master; it doesn't have to reconnect to observe the discrepancy.
Hence there is no bug in the case where 'GTID_STRICT_MODE=ON'.

Please let us know your thoughts.

Comment by Andrei Elkin [ 2020-10-20 ]

GeoffMontee, to add to the latest update from Sujatha on gtid_strict_mode: in your bug description the slave applier may not run, as the master is muted. In such a scenario the strict mode error won't show up, so the slave would see the error from the description on reconnect instead.
I'd rate this as a sort of inconvenience rather than a critical issue.

As to the non-strict mode, I bet you would also never rate that as critical.

Comment by Geoff Montee (Inactive) [ 2020-10-20 ]

Hi Elkin,

> Personally I prefer this interpretation of a "dumb" simple IO that is not concerned with what gtids it carries in.

You know more than me about the GTID implementation, but I personally disagree. The IO thread currently seems a bit too dumb regarding GTIDs.

The IO thread doesn't seem quite so "dumb" in other areas. As far as I know, the IO thread filters out events that contain the slave's server_id. I think the IO thread also handles filtering for IGNORE_SERVER_IDS, DO_DOMAIN_IDS, and IGNORE_DOMAIN_IDS. If the IO thread already reads the server_id and gtid_domain_id from each event, it does not seem like it would be unreasonable to also read the GTID from the event, and then to compare that GTID to the local values.

> Secondly, it might still be useful for the slave to learn about a potential inconsistency.
> A watching mechanism should write to the error log any online changes to gtid_current_pos or gtid_slave_pos made
> locally on the slave or through a second source in the domains of concern.
> E.g. when a replication source is defined as
> CHANGE MASTER ... do_domain_ids = (d1,d2), that would be domains d1 or d2.
>
> I would limit this watcher to gtid_strict_mode = ON.

That sounds like it could be a useful way to solve problems like this.

> feel free to remove the SI association if it's no longer relevant to the customer.

No comment on that. You'll have to ask nicklamb or ccalender.

Comment by Geoff Montee (Inactive) [ 2020-10-20 ]

Hi sujatha.sivakumar,

Your test case with gtid_strict_mode=ON proves that the slave raises an error in the case where an "out-of-order sequence number" is written to the binary log. However, this test case does not prove that setting gtid_strict_mode=ON can prevent the slave's GTIDs from getting out of sync with the master's GTIDs, because the slave's GTIDs can become inconsistent without raising an "out-of-order sequence number".

For example, if you had set gtid_domain_id=1 on the slave, then the slave's local transaction would have been written to the binary log with GTID 1-2-1. This would not raise an "out-of-order sequence number" error, so gtid_strict_mode would not notice the inconsistency. In this case, the slave would only notice the inconsistent GTID position after the IO thread is stopped and restarted.
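
A simplified model of the per-domain ordering check makes the distinction concrete. This Python sketch assumes that gtid_strict_mode only compares sequence numbers within a single domain. The function name is hypothetical, and the real server check covers more cases than this.

```python
def is_out_of_order(binlog_pos, new_gtid):
    """Return True if new_gtid's sequence number is not higher than the
    last sequence number already binlogged in the same domain."""
    new_domain, _, new_seq = (int(part) for part in new_gtid.split("-"))
    for gtid in filter(None, binlog_pos.split(",")):
        domain, _, seq = (int(part) for part in gtid.strip().split("-"))
        if domain == new_domain and new_seq <= seq:
            return True
    return False


# Test case from the earlier comment: the slave's local INSERT produced
# 0-2-3, then the master sends 0-1-3 in the same domain.
print(is_out_of_order("0-2-3", "0-1-3"))

# Variant with gtid_domain_id=1 on the slave: the local transaction is
# 1-2-1, so the master's 0-1-3 does not collide with it.
print(is_out_of_order("0-2-2,1-2-1", "0-1-3"))
```

In the second call nothing looks out of order within domain 0, yet the slave's position now contains a GTID (1-2-1) that the master has never seen.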

Comment by Andrei Elkin [ 2020-12-01 ]

GeoffMontee: regarding your correct remark that the
> IO thread filters out events

Note that while doing so the IO thread is not concerned with out-of-order GTIDs, which
are left by design to the applier thread. It's fair to say that what the IO thread does is maintain the integrity of the replicated gtid domains configuration. (Consistency, including that of the initial replication gtid [this bug's immediate worry], is therefore, imo, the applier's burden.)

If at all possible, I suggest we don't refine anything that relates to gtid_current_pos.

Comment by Andrei Elkin [ 2021-05-27 ]

julien.fritsch, GeoffMontee, (esa.korhonen) I suggest (and have suggested in this comment) that we start deprecating CM..master_use_gtid=current_pos (and then the related gtid_current_pos) in 10.6, and that's what we'll do in this ticket.

Another task for 10.7 should be reported (by me) to complete the deprecation, which means replacing gtid_current_pos in all features that use it.

Comment by Andrei Elkin [ 2021-06-10 ]

GeoffMontee, esa.korhonen, knielsen: While deprecating the current behaviour of the CM..master_use_gtid=current_pos option, which computes the slave's effective gtid state dynamically (at START SLAVE time), we could salvage the syntax part.
What would you think of turning master_use_gtid=current_pos into an option that computes the new value of gtid_slave_pos
at the time of executing CHANGE MASTER?
That is, the CM option would imply a SET GLOBAL gtid_slave_pos = value, where value is computed according to the current specification as a "constrained" union of the slave and binlog gtid states.

It'd be great to decide on this step in order to formulate a meaningful deprecation message.

Also, START SLAVE would regard gtid_slave_pos as the single source of the slave gtid state.

My use case is obviously an ex-master that is demoted to the slave role.
As you can see, with this change we also cover this issue's complaint.
Once CM..master_use_gtid=current_pos has run and the server has settled its slave gtid state, the server is free to create local gtids, which won't mess with the slave gtid state anymore.

Comment by Geoff Montee (Inactive) [ 2021-06-10 ]

Hi Elkin,

> What would you think of turning master_use_gtid=current_pos to compute the new value to gtid_slave_pos at the time of executing CHANGE MASTER? That is the CM's option would imply a SET GLOBAL gtid_slave_pos = value, where value is computed according to the current specification as a "constrained" union of the slave and binlog gtid states.

That sounds good to me. It simplifies how the slave threads handle GTID tracking, but it still maintains the advantages of the master_use_gtid=current_pos syntax.

Comment by Kristian Nielsen [ 2021-06-11 ]

Hi Andrei,

The idea has a lot of appeal; it feels like a much nicer semantics for master_use_gtid=current_pos. It means that as a host changes role from master to slave, it will use its master position (with local changes) as the starting point for replicating as a slave. That's a much better semantics for what current_pos was intended to do when I originally implemented it.

I see a problem with the proposal as stated (if I understood it correctly). The problem is that "host changes role from master to slave" is not always what a CHANGE MASTER command means.
CHANGE MASTER is used to switch a master to become a slave, but it is also used in many other situations - to change a slave (that was never a master) to another master, to change the credentials on the master, to configure ssl, etc. etc.

If any CHANGE MASTER command was to magically change the current gtid_position with local transactions, we are back to the problems that START SLAVE had in this respect.

I'm not sure there currently is a well-defined way - from the point of view of the server - to know that the user is switching a master to become a slave.

One possibility is to add an explicit option to CHANGE MASTER that says "this is a master becoming a slave". CHANGE MASTER TO master_demote_to_slave=1 or something (can't think of a better name off the top of my head). This could then imply the master_use_gtid=current_pos semantics you suggested, and possibly imply other unrelated semantics that are useful for the "master becomes a slave" case.

I think that's one way to keep the much better semantics of your proposal and avoid magic gtid_pos changes on unrelated CHANGE MASTER commands. Though it's not as clean as the server just doing the right thing (i.e. if the user forgets the option to CHANGE MASTER, then the slave just starts from the wrong position).

 - Kristian.

Comment by Andrei Elkin [ 2021-06-15 ]

knielsen, howdy! Yours is a nice refinement.
Indeed, a new option that states the user's intent explicitly has a clear advantage. I'll proceed from here to see through all major use cases of the role transition. Thank you!

As far as this task is concerned, agreement is reached then.
master_use_gtid = current_pos is to be deprecated.
Its purpose, to facilitate failover, will be captured by a new master_demote_to_slave = <bool> option.
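
The agreed behaviour can be illustrated with a toy Python model. This is a sketch under stated assumptions, not server code: positions are reduced to `{domain_id: seq_no}`, server_ids are ignored, and the class `ServerModel` with its methods `local_write`, `change_master` and `start_slave_position` is hypothetical. It shows only that the binlog position is folded into the slave position when the user states the demotion intent explicitly, and never on an unrelated CHANGE MASTER.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ServerModel:
    """Toy model of one server's gtid state: {domain_id: seq_no}."""
    slave_pos: Dict[int, int] = field(default_factory=dict)
    binlog_pos: Dict[int, int] = field(default_factory=dict)

    def local_write(self, domain: int) -> None:
        # A local transaction advances only the binlog position.
        self.binlog_pos[domain] = self.binlog_pos.get(domain, 0) + 1

    def change_master(self, demote_to_slave: bool = False) -> None:
        # Only an explicit demotion merges binlog_pos into slave_pos, once.
        if not demote_to_slave:
            return
        for domain, seq_no in self.binlog_pos.items():
            if seq_no > self.slave_pos.get(domain, 0):
                self.slave_pos[domain] = seq_no

    def start_slave_position(self) -> Dict[int, int]:
        # In this model START SLAVE reads slave_pos and nothing else.
        return dict(self.slave_pos)


server = ServerModel()
for _ in range(3):
    server.local_write(0)                  # the server acts as a master

server.change_master()                     # e.g. only credentials change
print(server.start_slave_position())       # {}

server.change_master(demote_to_slave=True) # master becomes a slave
print(server.start_slave_position())       # {0: 3}

server.local_write(0)                      # later local gtid
print(server.start_slave_position())       # {0: 3}
```

The last step is the point of the design: a local transaction after the demotion moves the binlog position but leaves the starting position for replication untouched.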

Comment by Sujatha Sivakumar (Inactive) [ 2021-09-15 ]

Hello julien.fritsch

The deprecation warning is implemented. I will request a review.

Comment by Sujatha Sivakumar (Inactive) [ 2021-09-15 ]

Hello Andrei,

Please review the following changes.

https://github.com/MariaDB/server/commit/47476b09638f6c3a57ee40d318be7a98cda9c83d

http://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.6-sujatha

Thank you.

Comment by Andrei Elkin [ 2021-09-27 ]

The patch looks good, though the warning should be issued starting in 10.7.
I am pushing the commit after double-checking about 10.7 with serg.

Comment by Andrei Elkin [ 2021-09-27 ]

ralf.gebhardt@mariadb.com, according to Serg: no.

Regarding deprecation policies, is the upcoming 10.7.1 good enough for us to deprecate a value of the CHANGE MASTER TO ... master_use_gtid enum, that is, to deprecate current_pos?

serg 2:40 PM
no, there was no preview release with this deprecation, so it cannot be in 10.7.1 anymore

Comment by Brandon Nesterenko [ 2022-06-06 ]

Howdy Andrei!

I have updated Sujatha's patch which deprecates master_use_gtid=current_pos for 10.10 and it is ready for review:
Patch 57a7c5c
BB bb-10.10-MDEV-20122-deprecate-current-pos

Comment by Andrei Elkin [ 2022-06-08 ]

The deprecation part of a two part work is requested.

Comment by Andrei Elkin [ 2022-06-08 ]

Review is done as a commit to the feature branch:
57a7c5c4ee6..172508da770 HEAD -> bb-10.10-MDEV-20122-deprecate-current-pos

(The review commits may become irrelevant to the feature after the eventual approval, and so are to be discarded.)

Comment by Brandon Nesterenko [ 2022-06-14 ]

Hi Andrei! The latest commits in PR-2155 are ready for review.

Comment by Andrei Elkin [ 2022-06-14 ]

Approved, as the latest patch implements the requirements.

Comment by Brandon Nesterenko [ 2022-06-15 ]

Hi angelique.sklavounos!

I am also re-assigning this ticket to you for testing. The preview branch is preview-10.10-gtid.

Comment by Angelique Sklavounos (Inactive) [ 2022-07-21 ]

OK to push

Comment by Brandon Nesterenko [ 2022-07-26 ]

Howdy Andrei!

This is ready for a final round of review before pushing into 10.10

https://github.com/MariaDB/server/pull/2199

Comment by Andrei Elkin [ 2022-07-26 ]

Approved on GH.

Comment by Brandon Nesterenko [ 2022-07-30 ]

pushed into 10.10 as 90c3b28

Comment by Andrey Khizhnyakov [ 2023-06-13 ]

Good afternoon. Please tell me, is this bug present in MariaDB 10.4.12?

Comment by Alice Sherepa [ 2023-06-13 ]

andreitech, this was added in 10.10.0, so the feature is not present in any earlier version (10.3+, 10.4+, etc.), and that includes 10.4.12.

Comment by Kfir Itzhak [ 2023-12-05 ]

Hi,

Please do not deprecate master_use_gtid=current_pos. I use it for Active<->Active replication, and I believe many others do as well, so please do not remove that feature.

Comment by Brandon Nesterenko [ 2023-12-08 ]

Hi mastertheknife!

Thanks for your input here. We've discussed it and filed MDEV-32976 to remove the deprecation status of the option.

Generated at Thu Feb 08 08:56:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.