[MDEV-25709]  Replication IO_Thread Connecting Primary server handshake call "SELECT GTID_BINLOG_POS" even when MASTER_USE_GITD=NO and GTID is disabled Created: 2021-05-18  Updated: 2023-12-15

Status: In Review
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Critical
Reporter: Pramod Mahto Assignee: Andrei Elkin
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-25764 SELECT binlog_gtid_pos takes a very l... Confirmed

 Description   

Replication IO_Thread call SELECT GTID_BINLOG_POS even when MASTER_USE_GTID=NO and GTID is disabled.

Id: 49525
User: replica
Host: XXXXX.XXXX.XXXX.XXXXX
db: NULL
Command: Query
Time: 61
State: executing
Info: SELECT binlog_gtid_pos('mysql-binlog.XXXXX',XXXXXXXX)
Progress: 0.000

..
...
......
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

This unnecessary query execution can hang IO thread for significant time, Causing slave lag (MDEV-25764)



 Comments   
Comment by Sachin Setiya (Inactive) [ 2021-05-27 ]

In slave.cc this reason is given for asking binlog_gtid_pos from master

    /*
      If we are not using GTID to connect this time, then instead request
      the corresponding GTID position from the master, so that the user
      can reconnect the next time using MASTER_GTID_POS=AUTO.
    */
    char quote_buf[2*sizeof(mi->master_log_name)+1];
    char str_buf[28+2*sizeof(mi->master_log_name)+10];
    String query(str_buf, sizeof(str_buf), system_charset_info);
    query.length(0);
 
    query.append("SELECT binlog_gtid_pos('");
    escape_quotes_for_mysql(&my_charset_bin, quote_buf, sizeof(quote_buf),
                            mi->master_log_name, strlen(mi->master_log_name));
    query.append(quote_buf);
    query.append("',");
    query.append_ulonglong(mi->master_log_pos);
    query.append(")");
 

Comment by Sachin Setiya (Inactive) [ 2021-06-15 ]

Hi Elkin,
So I analyzed and side effects of the patch. So the crux of the patch is , it removes

rpl_global_gtid_slave_state->load(mi->io_thd, master_row[0],
-                                        strlen(master_row[0]), false, false);

these lines , which will make gtid_slave_pos to not update on the slave start.
So this will cause 2 error (please note these will happen in the case of MASTER_USE_GTID=NO only).
1. If gtid_slave_pos have one row per domain id (means only one transaction was executed per domain id), on the start slave rpl_global_gtid_slave_state->load will cause one more row per domain id to be added to gtid slave pos table. For example
gtid_slave_pos table entries

SELECT * FROM mysql.gtid_slave_pos order by domain_id,seq_no;
domain_id	sub_id	server_id	seq_no
0	1	1	1
1	2	1	1

After slave restart

SELECT * FROM mysql.gtid_slave_pos order by domain_id,seq_no;
domain_id	sub_id	server_id	seq_no
0	1	1	1
0	4	1	1
1	2	1	1
1	3	1	1

and with this patch these row addition will not happen

SELECT * FROM mysql.gtid_slave_pos order by domain_id,seq_no;
domain_id	sub_id	server_id	seq_no
0	1	1	1
1	2	1	1

To me it does not seem like a big issue, But this does cause rpl_gtid_stop_start test to fail.

2. Loss of update in gtid_slave_pos table on slave.
This happens in very specific case , So lets assume gtid_slave_pos is corrupted (in test case it is on the storage engine which is disabled), So we wont be able to update gtid_slave_pos with current transactions gtids. And now we stop the slave fix the gtid_slave_pos table , since there will be no call for rpl_global_gtid_slave_state->load gtid_slave_pos will not be updated , but as soon as slave receive the next transaction, gtid_slave_pos will be updated. This will not happen in unpatched mariadb. Relevant test case

SET sql_log_bin= 0;
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
SET sql_log_bin= 1;
 
--write_file $MYSQLTEST_VARDIR/tmp/mysqld.2.expect
wait
EOF
--shutdown_server
--source include/wait_until_disconnected.inc
 
--append_file $MYSQLTEST_VARDIR/tmp/mysqld.2.expect
restart: --skip-slave-start=1 --skip-innodb
EOF
--enable_reconnect
--source include/wait_until_connected_again.inc
 
--error ER_UNKNOWN_STORAGE_ENGINE
SELECT * FROM mysql.gtid_slave_pos;
SET sql_log_bin=0;
call mtr.add_suppression("Failed to load slave replication state from table");
call mtr.add_suppression("Unable to load replication GTID slave state");
SET sql_log_bin=1;
 
--source include/start_slave.inc
--connection server_1
INSERT INTO t1 VALUES (9);
--save_master_pos
 
--connection server_2
--sync_with_master
FLUSH NO_WRITE_TO_BINLOG TABLES;
SELECT * FROM t1 ORDER BY a;
 
# Put things back as they were.
--write_file $MYSQLTEST_VARDIR/tmp/mysqld.2.expect
wait
EOF
--shutdown_server
--source include/wait_until_disconnected.inc
 
--append_file $MYSQLTEST_VARDIR/tmp/mysqld.2.expect
restart:  
EOF
--enable_reconnect
--source include/wait_until_connected_again.inc
SET sql_log_bin= 0;
ALTER TABLE mysql.gtid_slave_pos ENGINE=MyISAM;
SET sql_log_bin= 1;
--source include/start_slave.inc
select * from mysql.gtid_slave_pos order by domain_id;

I am not sure if 2 is a big issue or not , it depends what user expects from gtid_slave_pos table when MASTER_USE_GTID=NO. According to me if user is not using gtid they should not be concerned with gtid_slave_pos status.

Comment by Sachin Setiya (Inactive) [ 2021-06-15 ]

Patch branch bb-10.2-25709

Comment by Kristian Nielsen [ 2023-08-02 ]

Just to make it clear, it is not a bug that non-GTID slave obtains the GTID position when connecting, it's a important feature of MariaDB GTID. For example, this is what enables to take a non-locking xtrabackup of InnoDB (which recovers the binlog-file/offset replication position in a transactional way), use that to provision a slave and connect it in non-GTID mode, and then immediately switch it to GTID mode.

The real bug is that obtaining the GTID position has poor performance. The correct fix for that is MDEV-4991, GTID binlog indexing.

Still, it's unfortunate that this performance problem has been around for so long. It can be valid to provide a way to avoid the performance overhead until the "real" solution of MDEV-4991. We should just keep in mind that this is a performance bug, not a "wrong behavior" bug, and plan so that the correct solution can be implemented without impacting upgrades and backwards compatibility.

Also, it's good to mention the work-around that is currently available, which is to reduce the size of the binlog files (--max-binlog-size).

Generated at Thu Feb 08 09:39:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.