Summary:
When simulating a loss of network connectivity between replicating servers, if the replication channel uses SSL, then MariaDB 10.x does not detect the loss of connectivity, although 5.5 does. This happens regardless of the value of the "slave_net_timeout" variable.
Reproduced using the 10.0.14 GLIBC214 and 10.0.0 builds, with both GTID (10.0.14) and binlog-based replication (10.0.14/10.0.0). Works as expected on 5.5.40 with binlog-based replication. I am testing with the binary .tar.gz MariaDB builds downloaded from the MariaDB servers (archive.mariadb.org).
Steps to reproduce (functional case):
- Set up MariaDB with two 5.5.40 servers in master-slave configuration and ensure replication is working and SSL-encrypted.
- Start generating traffic on the master. Watch the slave status to see the traffic is being replicated successfully.
- Simulate a network failure, e.g. "iptables -I INPUT -s <master_ip> -j DROP" on the slave. This drops all network packets from the master host.
- Wait for slave_net_timeout seconds to pass. The slave will restart as documented, and the slave status will now state that it is attempting to reconnect to the master.
Steps to reproduce (broken case):
- Set up MariaDB with two 10.0.14 servers in master-slave configuration and ensure replication is working and SSL-encrypted.
- Start generating traffic on the master. Watch the slave status to see the traffic is being replicated successfully.
- Simulate a network failure, e.g. "iptables -I INPUT -s <master_ip> -j DROP" on the slave. This drops all network packets from the master host.
- Wait for slave_net_timeout seconds to pass. The slave status will continue to state "Waiting for master to send event", even though the log position counters are not advancing. The slave will remain in this state until it is stopped and restarted; it will not reconnect on its own, contrary to the documentation. This is a change in behavior from MariaDB 5.5 and appears to be incorrect.
I also tested this with 10.0.14 as the master and 5.5.40 as the slave; that works as expected. 10.0.14 as the slave does not work, which suggests that the slave code path is what changed. I also tested with 10.0.0 as the slave, and it fails in the same way.
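The expected behavior in the functional case can be sketched with a plain-socket analogue (this is illustrative Python, not MariaDB code; all names are mine): slave_net_timeout is documented to act like a read timeout on the replication connection, so when the master goes silent the slave's blocking read should time out and trigger a reconnect. The bug suggests this timeout is not taking effect on the SSL code path in 10.x.

```python
import socket
import threading
import time

SLAVE_NET_TIMEOUT = 1  # seconds; the report's config uses slave-net-timeout = 6

def silent_master(server_sock):
    """Accept a connection and then go quiet, like a master behind an
    'iptables -j DROP' rule: the TCP session stays up but nothing arrives."""
    conn, _ = server_sock.accept()
    time.sleep(5)  # never send an event within the timeout window
    conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=silent_master, args=(server,), daemon=True).start()

slave = socket.create_connection(server.getsockname())
slave.settimeout(SLAVE_NET_TIMEOUT)  # the slave_net_timeout analogue

try:
    slave.recv(4096)      # "Waiting for master to send event"
    link_dead = False
except socket.timeout:
    link_dead = True      # 5.5 behavior: detect the stall, then reconnect
finally:
    slave.close()

print("link_dead:", link_dead)
```

In this sketch the read times out and `link_dead` is True, which is the 5.5 behavior; the reported 10.x-with-SSL behavior corresponds to the read blocking forever despite the configured timeout.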
Here is the my.cnf on the slave:
[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock
ssl-ca=/etc/mysql/ssl/ca-cert.pem
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr/local/mysql
datadir = /usr/local/mysql/data
tmpdir = /tmp
skip-external-locking
key_buffer_size = 8M
myisam_sort_buffer_size = 2M
aria_pagecache_buffer_size = 64M
aria_sort_buffer_size = 32M
max_allowed_packet = 16M ## Default: 1M
max_connections = 3500 ## Default: 100
max_connect_errors = 100000 ## Default: 10 Range: 1-4294967295
table_cache = 200 ## Default: 32
thread_stack = 256K
thread_cache_size = 8
query_cache_limit = 4M ## Default: 1M
query_cache_size = 128M ## Default: 16M
log_error = /var/log/mysql/error.log
log_warnings = 0
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 10
server-id = 1 ## Should usually match gtid-domain-id.
log_bin = mysql-logs/bin-log
log-bin-index = mysql-logs/bin-log.index
master-info-file = mysql-logs/master.info
log-slave-updates
expire_logs_days = 14
max_binlog_size = 100M
auto_increment_increment = 2
auto_increment_offset = 1
slave-net-timeout = 6
slave_compressed_protocol = 1
relay-log = mysql-logs/relay-log
relay-log-index = mysql-logs/relay-log.index
relay-log-info-file = mysql-logs/relay-log.info
replicate-ignore-db = mysql
default-storage-engine = InnoDB
innodb_file_format = barracuda
innodb_file_per_table
innodb_log_group_home_dir = mysql-logs/log/
innodb_data_file_path = mysql-logs/data/ibdata1:100M:autoextend
innodb_flush_method = O_DIRECT
large_pages
innodb_buffer_pool_size = 7700M
innodb_log_file_size = 125M
innodb_log_files_in_group = 2
innodb_log_buffer_size = 8M
innodb_lock_wait_timeout = 50
innodb_flush_log_at_trx_commit = 2
innodb_thread_concurrency = 0
innodb_io_capacity = 3700
innodb_write_io_threads = 8
innodb_read_io_threads = 8
innodb_purge_threads = 1
innodb_stats_method = nulls_ignored
innodb_stats_sample_pages = 128
ssl-ca=/etc/mysql/ssl/ca-cert.pem
ssl-cert=/etc/mysql/ssl/server-cert.pem
ssl-key=/etc/mysql/ssl/server-key.pem
[mysqldump]
quick
quote-names
max_allowed_packet = 16M
[mysql]
no-auto-rehash
[isamchk]
key_buffer = 32M
sort_buffer_size = 32M
read_buffer = 4M
write_buffer = 4M
Here is the replication setup command. Note that we are using SSL ("REQUIRE SSL" is set for the replicator user on the master). I'm not sure whether that affects this behavior.
CHANGE MASTER TO
  master_host='<master>', master_port=3306,
  master_ssl=1,
  master_ssl_ca='/etc/mysql/ssl/ca-cert.pem',
  master_ssl_cert='/etc/mysql/ssl/server-cert.pem',
  master_ssl_key='/etc/mysql/ssl/server-key.pem',
  master_user='replicator', master_password='<hidden>';
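For completeness, the channel can be confirmed to actually be using SSL from the standard Master_SSL_* fields of SHOW SLAVE STATUS (a sketch; run in the mysql client on the slave):

```sql
SHOW SLAVE STATUS\G
-- Expect, among other fields:
--   Master_SSL_Allowed: Yes
--   Master_SSL_Ca_File: /etc/mysql/ssl/ca-cert.pem
```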