[MXS-2643] Maxscale causal_reads and Mariadb percona cluster timeout Created: 2019-08-20  Updated: 2019-12-03  Resolved: 2019-12-02

Status: Closed
Project: MariaDB MaxScale
Component/s: galeramon, readwritesplit
Affects Version/s: 2.3.10, 2.3.11
Fix Version/s: N/A

Type: Bug Priority: Blocker
Reporter: Leon Ruumpol Assignee: markus makela
Resolution: Not a Bug Votes: 1
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-18672 MASTER_GTID_WAIT timeout on the master Open

 Description   

In a Mariadb cluster (4 node) with gtid mode on we have problems with causal_reads. After an insert, the next query waits until the causal_reads_timeout reaches its maximum value. We are not sure if this problem comes from Maxscale or Percona.

Steps to reproduce:

CREATE TABLE `test_lrl`.`test` (
`id` INT NOT NULL AUTO_INCREMENT,
`text` VARCHAR(45) NULL,
PRIMARY KEY (`id`));

SELECT id,text,@@HOSTNAME FROM test_lrl.test;
12:39:43 SELECT id,text,@@HOSTNAME FROM test_lrl.test LIMIT 0, 1000 3 row(s) returned *0,0094 *sec / 0,000017 sec

INSERT INTO `test_lrl`.`test` (`text`) VALUES ('Test');

SHOW SESSION VARIABLES WHERE Variable_name = 'last_gtid';
maxscalen1:> 'last_gtid', '1-1-480601'

Mariadb nodes: SHOW GLOBAL VARIABLES LIKE 'gtid_current_pos';
n1:> 'last_gtid', '1-1-480639'
n2:> 'last_gtid', '1-1-480639'
n3:> 'last_gtid', '1-1-480639'
n4:> 'last_gtid', '1-1-480639'

SELECT id,text,@@HOSTNAME FROM test_lrl.test;
12:41:10 SELECT id,text,@@HOSTNAME FROM test_lrl.test LIMIT 0, 1000 4 row(s) returned *20,014 *sec / 0,000015 sec
'13', 'Test', 'db-c01-n3'

Mariadb nodes:
n1:>
n2:> Waiting in MASTER_GTID_WAIT() (primary waiter) | SET @maxscale_secret_variable=(SELECT CASE WHEN MASTER_GTID_WAIT('1-1-480601', 20) = 0 THEN 1 ELSE (SELECT 1 FROM INFORMATION_SCHEMA.ENGINES) END)
n3:>
n4:>

Config Maxscale:

--------------------------------------------------------------------
Server | Address | Port | Connections | Status
--------------------------------------------------------------------
db-c01-n1 | 10.1.1.162 | 3306 | 2 | Slave, Synced, Running
db-c01-n2 | 10.1.1.163 | 3306 | 1 | Slave, Synced, Running
db-c01-n3 | 10.1.1.164 | 3306 | 4 | Master, Synced, Running
db-c01-n4 | 10.1.1.165 | 3306 | 1 | Slave, Synced, Running
--------------------------------------------------------------------

[galera-service]
type=service
router=readwritesplit
servers=db-c01-n1,db-c01-n2,db-c01-n3,db-c01-n4,
user=maxscale
password=******
localhost_match_wildcard_host=1
use_sql_variables_in=master
max_sescmd_history=0
disable_sescmd_history=true
filters=NamedServerFilter
max_slave_connections=1
slave_selection_criteria=LEAST_GLOBAL_CONNECTIONS
causal_reads=1
causal_reads_timeout=20

[NamedServerFilter]
type=filter
module=namedserverfilter
match01=.temp\..
target01=->master
match02=.`temp`\..
target02=->master

Galera cluster:
[mysqld]
binlog_format=ROW
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0

wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so

wsrep_cluster_name="mariadb-c01-test"
wsrep_cluster_address="gcomm://db-c01-n1,db-c01-n2,db-c01-n3,db-c01-n4"

wsrep_sst_method=mariabackup
wsrep_sst_auth=replication:******

wsrep_node_address="10.1.1.162"
wsrep_node_name="db-c01-n1"

  1. Settings for gtid cluster ():
    wsrep_slave_threads=8
    wsrep_gtid_mode=ON
    wsrep_gtid_domain_id=1
    log_slave_updates=ON
    log_bin=db-c01
    expire_logs_days=1
    session_track_system_variables=autocommit,character_set_client,character_set_connection,character_set_results,time_zone,last_gtid


 Comments   
Comment by markus makela [ 2019-10-16 ]

It's very likely that this is caused by the GTIDs in Galera. To my knowledge they are not guaranteed to be the same across all clusters and as such the causal_reads feature might not work.

Comment by markus makela [ 2019-12-02 ]

Turns out this is reproducible with a direct connection to a MariaDB master:

CREATE TABLE `test`.`t1` (`text` VARCHAR(45));
INSERT INTO `test`.`t1` (`text`) VALUES ('Test');
SET @a = @@last_gtid;
SELECT MASTER_GTID_WAIT(@a, 20);

The MASTER_GTID_WAIT seems to work correctly only when replication is configured. This would suggest that it's not a MaxScale problem and that the problem is in the server.

Upon further inspection of the MASTER_GTID_WAIT documentation, the reason why it doesn't work is obvious.

It waits until the value of gtid_slave_pos has the same or higher seq_no within all replication domains specified in the gtid-list; in other words, it waits until the slave has reached the specified GTID position.

The master MariaDB server and, by extension, Galera, do not update gtid_slave_pos when events are generated on that particular node.

Comment by markus makela [ 2019-12-02 ]

This will be fixedby MDEV-18672.

Generated at Thu Feb 08 04:15:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.