MariaDB Server / MDEV-13431

wsrep_gtid_mode uses wrong GTID for transaction committed by slave thread

Details

    Description

      This is similar to MDEV-8458, but this issue also affects MariaDB 10.1 with wsrep_gtid_mode=ON.

      When wsrep_gtid_mode is enabled, transactions that are replicated within a cluster by Galera receive a GTID where the domain_id is specified by wsrep_gtid_domain_id, the server_id is specified by server_id, and the seq_no is incremented for each transaction that is committed in the domain.
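
      A GTID in this scheme is simply the triple domain_id-server_id-seq_no described above. As a minimal illustration (the value 1-1-2 is taken from the binlog dumps further down), the components can be split apart in the shell:

      ```shell
      # Split a MariaDB GTID "domain-server-seqno" into its three components.
      gtid="1-1-2"
      IFS=- read -r domain_id server_id seq_no <<< "$gtid"
      echo "domain=$domain_id server=$server_id seqno=$seq_no"
      # prints: domain=1 server=1 seqno=2
      ```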

      It does not seem to work this way for transactions that are replicated by an asynchronous slave thread within a Galera cluster.

      For example, let's say that we have two clusters and one cluster replicates from the other using GTID replication.

      On cluster1, we see the following:

      MariaDB [(none)]> show global variables like '%gtid%';
      +------------------------+-------+
      | Variable_name          | Value |
      +------------------------+-------+
      | gtid_binlog_pos        | 1-1-1 |
      | gtid_binlog_state      | 1-1-1 |
      | gtid_current_pos       | 1-1-1 |
      | gtid_domain_id         | 1     |
      | gtid_ignore_duplicates | OFF   |
      | gtid_slave_pos         |       |
      | gtid_strict_mode       | OFF   |
      | wsrep_gtid_domain_id   | 1     |
      | wsrep_gtid_mode        | ON    |
      +------------------------+-------+
      9 rows in set (0.00 sec)
      

      On cluster2, we see the following:

      MariaDB [(none)]> show global variables like '%gtid%';
      +------------------------+-------------+
      | Variable_name          | Value       |
      +------------------------+-------------+
      | gtid_binlog_pos        | 2-2-2       |
      | gtid_binlog_state      | 2-2-2       |
      | gtid_current_pos       | 1-1-1,2-2-2 |
      | gtid_domain_id         | 2           |
      | gtid_ignore_duplicates | OFF         |
      | gtid_slave_pos         | 1-1-1       |
      | gtid_strict_mode       | OFF         |
      | wsrep_gtid_domain_id   | 2           |
      | wsrep_gtid_mode        | ON          |
      +------------------------+-------------+
      9 rows in set (0.00 sec)
      

      One node in cluster2 is a slave of one node in cluster1:

      MariaDB [(none)]> show slave status\G
      *************************** 1. row ***************************
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 172.30.0.32
                        Master_User: repl
                        Master_Port: 3306
                      Connect_Retry: 60
                    Master_Log_File: mariadb-bin.000002
                Read_Master_Log_Pos: 428
                     Relay_Log_File: ip-172-30-0-46-relay-bin.000002
                      Relay_Log_Pos: 644
              Relay_Master_Log_File: mariadb-bin.000002
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: Yes
                    Replicate_Do_DB:
                Replicate_Ignore_DB:
                 Replicate_Do_Table:
             Replicate_Ignore_Table:
            Replicate_Wild_Do_Table:
        Replicate_Wild_Ignore_Table:
                         Last_Errno: 0
                         Last_Error:
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 428
                    Relay_Log_Space: 951
                    Until_Condition: None
                     Until_Log_File:
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File:
                 Master_SSL_CA_Path:
                    Master_SSL_Cert:
                  Master_SSL_Cipher:
                     Master_SSL_Key:
              Seconds_Behind_Master: 0
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:
                     Last_SQL_Errno: 0
                     Last_SQL_Error:
        Replicate_Ignore_Server_Ids:
                   Master_Server_Id: 1
                     Master_SSL_Crl:
                 Master_SSL_Crlpath:
                         Using_Gtid: Slave_Pos
                        Gtid_IO_Pos: 1-1-1
            Replicate_Do_Domain_Ids:
        Replicate_Ignore_Domain_Ids:
                      Parallel_Mode: conservative
      1 row in set (0.00 sec)
      

      If we commit a transaction on cluster1, we would expect it to have the GTID 1-1-2 on cluster1, and either 2-1-3 or 2-2-3 on cluster2, depending on whether cluster2 keeps the server_id of the originating cluster or replaces it with its own. Does that actually happen?
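
      The expectation above is plain seq_no arithmetic within a domain. A small sketch (the <server_id> placeholder reflects the open question of which server_id is used):

      ```shell
      # Given cluster2's current gtid_binlog_pos, the next GTID in the domain
      # should keep domain_id 2 and use seq_no + 1; the server_id is the open
      # question (the originating cluster's 1, or the local 2).
      current="2-2-2"
      IFS=- read -r domain server seqno <<< "$current"
      echo "expected: ${domain}-<server_id>-$((seqno + 1))"
      # prints: expected: 2-<server_id>-3
      ```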

      Let's say that we execute the following on cluster1:

      MariaDB [(none)]> insert into db1.tab values (1, 'str1');
      Query OK, 1 row affected (0.00 sec)
      

      What GTID does this transaction have on each cluster?

      Here's the binlog event on the node in cluster1 where the transaction originated:

      # at 428
      #170802 16:51:02 server id 1  end_log_pos 466   GTID 1-1-2 trans
      /*!100001 SET @@session.gtid_seq_no=2*//*!*/;
      BEGIN
      /*!*/;
      # at 466
      # at 523
      #170802 16:51:02 server id 1  end_log_pos 523   Annotate_rows:
      #Q> insert into db1.tab values (1, 'str1')
      #170802 16:51:02 server id 1  end_log_pos 567   Table_map: `db1`.`tab` mapped to number 18
      # at 567
      #170802 16:51:02 server id 1  end_log_pos 606   Write_rows: table id 18 flags: STMT_END_F
       
      BINLOG '
      NjuCWRMBAAAALAAAADcCAAAAABIAAAAAAAEAA2RiMQADdGFiAAIDDwKWAAI=
      NjuCWRcBAAAAJwAAAF4CAAAAABIAAAAAAAEAAv/8AQAAAARzdHIx
      '/*!*/;
      ### INSERT INTO `db1`.`tab`
      ### SET
      ###   @1=1
      ###   @2='str1'
      # at 606
      #170802 16:51:02 server id 1  end_log_pos 633   Xid = 10
      COMMIT/*!*/;
      

      And here's the binlog event on the node in cluster2 that is acting as a slave to cluster1:

      # at 617
      #170802 16:51:02 server id 1  end_log_pos 655   GTID 2-1-2 trans
      /*!100001 SET @@session.server_id=1*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=2*//*!*/;
      BEGIN
      /*!*/;
      # at 655
      #170802 16:51:02 server id 1  end_log_pos 699   Table_map: `db1`.`tab` mapped to number 21
      # at 699
      #170802 16:51:02 server id 1  end_log_pos 738   Write_rows: table id 21 flags: STMT_END_F
       
      BINLOG '
      NjuCWRMBAAAALAAAALsCAAAAABUAAAAAAAEAA2RiMQADdGFiAAIDDwKWAAI=
      NjuCWRcBAAAAJwAAAOICAAAAABUAAAAAAAEAAgP8AQAAAARzdHIx
      '/*!*/;
      ### INSERT INTO `db1`.`tab`
      ### SET
      ###   @1=1
      ###   @2='str1'
      # at 738
      #170802 16:51:02 server id 1  end_log_pos 765   Xid = 3
      COMMIT/*!*/;
      

      And here's the binlog event on another node in cluster2:

      # at 576
      #170802 16:51:02 server id 1  end_log_pos 614   GTID 2-1-3 trans
      /*!100001 SET @@session.server_id=1*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=3*//*!*/;
      BEGIN
      /*!*/;
      # at 614
      #170802 16:51:02 server id 1  end_log_pos 658   Table_map: `db1`.`tab` mapped to number 20
      # at 658
      #170802 16:51:02 server id 1  end_log_pos 697   Write_rows: table id 20 flags: STMT_END_F
       
      BINLOG '
      NjuCWRMBAAAALAAAAJICAAAAABQAAAAAAAEAA2RiMQADdGFiAAIDDwKWAAI=
      NjuCWRcBAAAAJwAAALkCAAAAABQAAAAAAAEAAgP8AQAAAARzdHIx
      '/*!*/;
      ### INSERT INTO `db1`.`tab`
      ### SET
      ###   @1=1
      ###   @2='str1'
      # at 697
      #170802 16:51:02 server id 1  end_log_pos 724   Xid = 3
      COMMIT/*!*/;
      

      So the transaction has the expected GTID (1-1-2) in cluster1, and the expected GTID (2-1-3) on the non-slave nodes in cluster2, but an unexpected GTID (2-1-2) on the slave node in cluster2: the slave node reused seq_no 2 instead of incrementing to 3.

      Activity

            anikitin Andrii Nikitin (Inactive) added a comment - - edited

            I confirm the problem with the script below, which sets up two local clusters on ports (3307,3308) and (3310,3311); 3310 replicates with GTID from 3307.
            The node aliases are:
            m1 - 3307
            m2 - 3308
            m4 - 3310
            m5 - 3311

            So the topology looks like this:

            (m1,m2)  - cluster "c1"
             ^ gtid master
             |
             V gtid slave
            (m4,m5) - cluster "c2"
            

            The script will also download and unpack the 10.1.25 tarball into _depot/m-tar/10.1.25:

            set -e
            M7VER=10.1.25
             
            # just use current directory if called from framework
            if [ ! -f common.sh ] ; then
              [ -d mariadb-environs ] || git clone http://github.com/AndriiNikitin/mariadb-environs
              cd mariadb-environs
              ./get_plugin.sh galera
            fi
             
            echo CLEANING UP ...
            c1/cleanup.sh || :
            c2/cleanup.sh || :
            rm -rf c1/*
            rm -rf c2/*
             
            echo GENERATE TEMPLATES ...
            _template/plant_cluster.sh c1
            _template/plant_cluster.sh c2
            echo m1 > c1/nodes.lst
            echo m2 >> c1/nodes.lst
            echo m4 > c2/nodes.lst
            echo m5 >> c2/nodes.lst
            c1/replant.sh $M7VER
            c2/replant.sh $M7VER
             
            echo DOWNLOAD IF NEEDED ...
            ./build_or_download.sh m1
             
            echo INSTALL DATA DIRECTORY ...
            c1/gen_cnf.sh
            c2/gen_cnf.sh
            c1/install_db.sh
            c2/install_db.sh
             
            echo SET UP ACCOUNTS ...
            c1/galera_setup_acl.sh
            c2/galera_setup_acl.sh
             
            echo ENABLE GTID ...
            c1/gen_cnf.sh log-bin=qq log_slave_updates=1
            c2/gen_cnf.sh log-bin=qq log_slave_updates=1 wsrep_gtid_domain_id=5 gtid_domain_id=5 wsrep_gtid_mode=1
             
            echo START CLUSTERS ...
            c1/galera_start_new.sh
            c2/galera_start_new.sh
             
            echo SETUP REPLICATION
            m4*/replicate_gtid.sh m1
             
            echo PERFORM TRANSACTION
            m1*/sql.sh 'create table test.t1(i int)'
             
            m1*/sql.sh 'insert into test.t1 select 1'
            m2*/sql.sh 'insert into test.t1 select 2'
             
            sleep 1
             
            m5*/sql.sh 'create table test.t2(a int)'
            m1*/sql.sh 'insert into test.t1 select 3'
             
            sleep 1
             
            echo CHECK DATA
            c1/sql.sh 'select * from t1'
            c2/sql.sh 'select * from t1'
             
            set +e
             
            echo GTID in m1:
            _depot/m-tar/$M7VER/bin/mysqlbinlog -vvv m1*/dt/qq*.00* | grep -E 'GTID|INSERT|create'
            echo GTID in m2:
            _depot/m-tar/$M7VER/bin/mysqlbinlog -vvv m2*/dt/qq*.00* | grep -E 'GTID|INSERT|create'
            echo GTID in m4:
            _depot/m-tar/$M7VER/bin/mysqlbinlog -vvv m4*/dt/qq*.00* | grep -E 'GTID|INSERT|create'
            echo GTID in m5:
            _depot/m-tar/$M7VER/bin/mysqlbinlog -vvv m5*/dt/qq*.00* | grep -E 'GTID|INSERT|create'
             
            c1/sql.sh 'show variables like "gtid_%_pos"'
            c2/sql.sh 'show variables like "gtid_%_pos"'
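
            The mysqlbinlog | grep pipeline at the end of the script can be reduced to a self-contained sketch of how the GTID is pulled out of each matched event header (the sample line is copied from the dumps below):

            ```shell
            # Extract the GTID from a mysqlbinlog event header line, as the
            # grep pipeline in the script above does across whole binlogs.
            line='#170808 22:06:31 server id 1  end_log_pos 1186  GTID 5-1-6 trans'
            gtid=$(echo "$line" | grep -oE 'GTID [0-9]+-[0-9]+-[0-9]+' | awk '{print $2}')
            echo "$gtid"
            # prints: 5-1-6
            ```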
            

            Please note that the GTID is initially correct in cluster2 for the first two inserts, executed on the two nodes of cluster1:

            GTID in m1:
            #170808 21:43:45 server id 1  end_log_pos 650   GTID 0-1-3 ddl
            create table test.t1(i int)
            #170808 21:43:45 server id 1  end_log_pos 778   GTID 0-1-4 trans
            ### INSERT INTO `test`.`t1`
            #170808 21:43:45 server id 2  end_log_pos 918   GTID 0-2-5 trans
            ### INSERT INTO `test`.`t1`
            

            Direct Slave:

            GTID in m4:
            #170808 22:06:30 server id 1  end_log_pos 650   GTID 5-1-3 ddl
            create table test.t1(i int)
            #170808 22:06:30 server id 1  end_log_pos 778   GTID 5-1-4 trans
            ### INSERT INTO `test`.`t1`
            #170808 22:06:30 server id 2  end_log_pos 918   GTID 5-2-5 trans
            ### INSERT INTO `test`.`t1`
            

            And the other node in cluster2:

            GTID in m5:
            #170808 22:06:30 server id 1  end_log_pos 634   GTID 5-1-3 ddl
            create table test.t1(i int)
            #170808 22:06:30 server id 1  end_log_pos 762   GTID 5-1-4 trans
            ### INSERT INTO `test`.`t1`
            #170808 22:06:30 server id 2  end_log_pos 902   GTID 5-2-5 trans
            

            So far so good: GTID works as expected (5-1-4 and 5-2-5 on both nodes in cluster2). But once we execute a transaction directly in cluster2, the other node in cluster2 shows a different GTID (5-1-6 vs. 5-1-7 below):

            Master from cluster 1 (m1):

            #170808 22:06:31 server id 1  end_log_pos 1058  GTID 0-1-6 trans
            ### INSERT INTO `test`.`t1`
            

            Direct Slave from cluster2 (m4):

            #170808 22:06:31 server id 5  end_log_pos 1058  GTID 5-5-6 ddl
            create table test.t2(a int)
            #170808 22:06:31 server id 1  end_log_pos 1186  GTID 5-1-6 trans
            ### INSERT INTO `test`.`t1`
            

            The other node from cluster2 (m5):

            #170808 22:06:31 server id 5  end_log_pos 1042  GTID 5-5-6 ddl
            create table test.t2(a int)
            #170808 22:06:31 server id 1  end_log_pos 1170  GTID 5-1-7 trans
            ### INSERT INTO `test`.`t1`
            

            And content of gtid pos variables at the end:

            m1: gtid_binlog_pos 0-1-6  gtid_current_pos 0-1-6  gtid_slave_pos
            m2: gtid_binlog_pos 0-1-6  gtid_current_pos        gtid_slave_pos
            m4: gtid_binlog_pos 5-1-6  gtid_current_pos 0-1-6  gtid_slave_pos 0-1-6
            m5: gtid_binlog_pos 5-1-7  gtid_current_pos        gtid_slave_pos
            

            przemek@mysqlmaniac.com Przemek added a comment -

            I can confirm this problem with MariaDB Server 10.1.25, also in a simpler case: one async master and a two-node Galera cluster, where one of the nodes is the async slave.
            I was using the same gtid_domain_id and wsrep_gtid_domain_id on all nodes, and tried both wsrep_gtid_mode=ON and OFF; the problem happens in both cases.

            Initial state: GTID-based async replication up and running from the MariaDB node to node0. node0 and node2 are part of the same Galera cluster:

            MariaDB [(none)]> show global variables like 'gtid_bi%';
            +-------------------+-------+
            | Variable_name     | Value |
            +-------------------+-------+
            | gtid_binlog_pos   |       |
            | gtid_binlog_state |       |
            +-------------------+-------+
            2 rows in set (0.00 sec)
             
            node0>  show global variables like 'gtid_b%';
            +-------------------+-------+
            | Variable_name     | Value |
            +-------------------+-------+
            | gtid_binlog_pos   |       |
            | gtid_binlog_state |       |
            +-------------------+-------+
            2 rows in set (0.00 sec)
             
            node2>  show global variables like '%gtid_b%';
            +-------------------+-------+
            | Variable_name     | Value |
            +-------------------+-------+
            | gtid_binlog_pos   |       |
            | gtid_binlog_state |       |
            +-------------------+-------+
            2 rows in set (0.01 sec)
            

            First insert on async master:

            MariaDB [(none)]> insert into test.t1 values (null);
            Query OK, 1 row affected (0.00 sec)
             
            MariaDB [(none)]> show global variables like 'gtid_bi%';
            +-------------------+---------+
            | Variable_name     | Value   |
            +-------------------+---------+
            | gtid_binlog_pos   | 0-100-1 |
            | gtid_binlog_state | 0-100-1 |
            +-------------------+---------+
            2 rows in set (0.00 sec)
             
            node0>  show global variables like 'gtid_b%';
            +-------------------+---------+
            | Variable_name     | Value   |
            +-------------------+---------+
            | gtid_binlog_pos   | 0-100-1 |
            | gtid_binlog_state | 0-100-1 |
            +-------------------+---------+
            2 rows in set (0.00 sec)
             
            node2> show variables like 'gtid_b%';
            +-------------------+---------+
            | Variable_name     | Value   |
            +-------------------+---------+
            | gtid_binlog_pos   | 0-100-1 |
            | gtid_binlog_state | 0-100-1 |
            +-------------------+---------+
            2 rows in set (0.00 sec)
            

            So far, so good - positions in sync.
            Then, insert on second cluster node:

            node2> insert into test.t1 values (null);
            Query OK, 1 row affected (0.01 sec)
             
            node2>  show global variables like '%gtid_b%';
            +-------------------+----------------+
            | Variable_name     | Value          |
            +-------------------+----------------+
            | gtid_binlog_pos   | 0-12-2         |
            | gtid_binlog_state | 0-100-1,0-12-2 |
            +-------------------+----------------+
            2 rows in set (0.00 sec)
             
            node0>  show global variables like 'gtid_b%';
            +-------------------+----------------+
            | Variable_name     | Value          |
            +-------------------+----------------+
            | gtid_binlog_pos   | 0-12-2         |
            | gtid_binlog_state | 0-100-1,0-12-2 |
            +-------------------+----------------+
            2 rows in set (0.00 sec)
            

            Again, so far it "looks" good; the GTID is consistent with the first node.
            But let's do another insert on the async master:

            MariaDB [(none)]> insert into test.t1 values (null);
            Query OK, 1 row affected (0.00 sec)
             
            MariaDB [(none)]> show global variables like 'gtid_bi%';
            +-------------------+---------+
            | Variable_name     | Value   |
            +-------------------+---------+
            | gtid_binlog_pos   | 0-100-2 |
            | gtid_binlog_state | 0-100-2 |
            +-------------------+---------+
            2 rows in set (0.00 sec)
             
            node0>  show global variables like 'gtid_b%';
            +-------------------+----------------+
            | Variable_name     | Value          |
            +-------------------+----------------+
            | gtid_binlog_pos   | 0-100-2        |
            | gtid_binlog_state | 0-12-2,0-100-2 |
            +-------------------+----------------+
            2 rows in set (0.00 sec)
             
            node2>  show global variables like '%gtid_b%';
            +-------------------+----------------+
            | Variable_name     | Value          |
            +-------------------+----------------+
            | gtid_binlog_pos   | 0-100-3        |
            | gtid_binlog_state | 0-12-2,0-100-3 |
            +-------------------+----------------+
            2 rows in set (0.00 sec)
            

            And now, for some reason, node2's seq_no for the async master's server_id jumped by two (0-100-1 to 0-100-3) after a single transaction, while the data is the same on both nodes:

            node0> select * from test.t1;
            +----+
            | id |
            +----+
            |  1 |
            |  2 |
            |  7 |
            +----+
            3 rows in set (0.00 sec)
             
            node2> select * from test.t1;
            +----+
            | id |
            +----+
            |  1 |
            |  2 |
            |  7 |
            +----+
            3 rows in set (0.00 sec)
            

            And the more updates we do inside the cluster, the more inconsistencies are introduced into the GTID sequences.
            For example, after a few more inserts:

            node0>  show global variables like 'gtid_b%';
            +-------------------+------------------------------+
            | Variable_name     | Value                        |
            +-------------------+------------------------------+
            | gtid_binlog_pos   | 0-10-9                       |
            | gtid_binlog_state | 0-100-2,0-12-2,0-11-7,0-10-9 |
            +-------------------+------------------------------+
            2 rows in set (0.00 sec)
             
            node2>  show global variables like '%gtid_b%';
            +-------------------+-------------------------------+
            | Variable_name     | Value                         |
            +-------------------+-------------------------------+
            | gtid_binlog_pos   | 0-10-10                       |
            | gtid_binlog_state | 0-100-3,0-12-2,0-11-8,0-10-10 |
            +-------------------+-------------------------------+
            2 rows in set (0.00 sec)
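
            The drift between the two nodes can be made explicit by diffing the two gtid_binlog_state values shown above (values copied from the output; a simple set-difference sketch in the shell):

            ```shell
            # Compare the gtid_binlog_state of node0 and node2 (values from above)
            # and print the entries that exist only on node2 -- the drifted ones.
            state_node0="0-100-2,0-12-2,0-11-7,0-10-9"
            state_node2="0-100-3,0-12-2,0-11-8,0-10-10"
            for g in ${state_node2//,/ }; do
              case ",$state_node0," in
                *",$g,"*) ;;                    # same entry on both nodes
                *) echo "node2 only: $g" ;;     # seq_no drifted on node2
              esac
            done
            # prints: node2 only: 0-100-3
            #         node2 only: 0-11-8
            #         node2 only: 0-10-10
            ```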
            

            So it seems that when an async replication channel is involved inside a Galera cluster, even for a brief moment, the GTID sequences inside the cluster become quite messy.


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment

            Hi przemek@mysqlmaniac.com!

            Actually, 0-12-2,0-100-2 is also a wrong GTID, but the patch for 10715 will fix this issue.

            People

              sachin.setiya.007 Sachin Setiya (Inactive)
              GeoffMontee Geoff Montee (Inactive)