[MDEV-9861] Network outage can break replication Created: 2016-04-01  Updated: 2020-10-20

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1.11
Fix Version/s: 10.1

Type: Bug Priority: Minor
Reporter: Igor Pashev Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: None


 Description   

We use parallel row-based replication for a few channels (20 ¯\_(ツ)_/¯).

Almost every time we reboot a master server (MySQL 5.6 @ RDS) or have a network outage, we get some HA_ERR_FOUND_DUPP_KEY errors right after replication resumes (or "a foreign key constraint fails" in statement-based replication). Skipping those errors produces more errors, and idempotent mode leads to yet other errors.

Since a network outage is an expected kind of event, I'd consider this behaviour a bug.
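(For reference, the error-skipping and idempotent handling mentioned above correspond to the usual slave options; a minimal sketch, with values illustrative only:)

```ini
[mysqld]
# Ignore duplicate-key (1062) and FK-constraint (1452) errors on the slave.
# This masks the inconsistency rather than fixing it.
slave_skip_errors = 1062,1452
# Or: treat duplicate-key/row-not-found errors as overwrites/no-ops.
slave_exec_mode = IDEMPOTENT
```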

Our configuration:

[mysqld]
basedir = /nix/store/ja84bcvggh2aribmvjqj0rr89d066dkw-mariadb-10.1.11
init_file = /nix/store/h7qfaw1m15w5sr69xqvqm0468ag7dvd6-init
pid_file = /run/mysqld/mysqld.pid
plugin_load = unix_socket=auth_socket.so
datadir = /mariadb/db
event_scheduler = ON
group_concat_max_len = 8388608
ignore_db_dirs = lost+found
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_instances = 64
innodb_buffer_pool_load_at_startup = ON
innodb_buffer_pool_size = 19327352832
innodb_file_format = barracuda
innodb_file_per_table = ON
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_lock_wait_timeout = 1800
innodb_log_file_size = 314572800
join_buffer_size = 1048576
log_slave_updates = OFF
max_allowed_packet = 268435456
max_connections = 1000
net_read_timeout = 1000
net_write_timeout = 1000
port = 3306
query_cache_limit = 2097152
query_cache_size = 33554432
query_cache_strip_comments = ON
query_cache_type = 1
relay_log = /mariadb/relay/cat-bin
server_id = 123456
skip_log_bin
slave_compressed_protocol = ON
slave_domain_parallel_threads = 10
slave_net_timeout = 600
slave_parallel_max_queued = 8388608
slave_parallel_threads = 40
sort_buffer_size = 4194304
ssl_cert = /nix/store/rmlq71y8fnyx1k7sp3qx55fqwg45pl40-cats-cert.pem
ssl_key = /run/keys/cats-key.pem
table_open_cache = 30000
# replication options:
!include /nix/store/z3pldfvs7wpd8bhz6pbnr73zqagnxmx1-mysqld-repl.cnf



 Comments   
Comment by Elena Stepanova [ 2016-05-02 ]

Could you please attach the slave error log which covers the time interval at least from the server startup till the replication failure (which means it will also naturally include the moment of the network outage or master restart)?

By "a few channels" do you mean multi-master replication? Do you ensure that the replication channels are independent and do not conflict with each other?

Thanks.

Comment by Igor Pashev [ 2016-05-02 ]

> By "a few channels" do you mean multi-master replication?
Yes, multi-master replication.

> Do you ensure that the replication channels are independent and do not conflict with each other?
Yes, we use a "white list", i.e. zzz-xx.replicate_wild_do_table = zzz_live_xx.%, thus I'm quite sure there is no collision.
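(A per-connection filter setup like the one described would look roughly like the following; the second connection name and database are hypothetical, the first matches the 'bob-au' connection seen in the logs below:)

```ini
# One filter per named master connection; each channel applies only
# tables matching its own pattern, so channels should not overlap.
bob-au.replicate_wild_do_table = bob_live_au.%
bob-sg.replicate_wild_do_table = bob_live_sg.%
```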

I haven't seen this issue for a while (with 10.1.13), probably because there were no reboots or network outages, or the problem is gone.

Comment by Elena Stepanova [ 2016-05-02 ]

The white list might be tricky if you are getting errors specifically on foreign key constraints, since foreign key cascades are not really replicated, but executed separately on the slave.
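(To illustrate the cascade point with a hypothetical schema: the master logs only the parent-row change; the child-row changes happen via InnoDB's cascade when the slave applies that event, so a filter that replicates the parent table but not the child table can diverge.)

```sql
CREATE TABLE parent (id INT PRIMARY KEY) ENGINE=InnoDB;
CREATE TABLE child (
  id  INT PRIMARY KEY,
  pid INT,
  FOREIGN KEY (pid) REFERENCES parent (id) ON DELETE CASCADE
) ENGINE=InnoDB;

-- The binlog records only this delete; the matching child rows are
-- removed by the cascade when the slave applies the parent event.
DELETE FROM parent WHERE id = 1;
```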

Anyway, even though you haven't had the problem for a while, you probably still have the error logs from previous occasions? If so, please attach them.

Comment by Igor Pashev [ 2016-05-02 ]

No, the old logs are gone ¯\_(ツ)_/¯

> foreign key cascades are not really replicated, but executed separately on the slave
Interesting. Maybe it's related to parallel replication.

Anyway, the problem has been gone since at least the 12th of April.

Comment by Igor Pashev [ 2016-06-10 ]

It finally happened:

Jun 10 23:04:22 petcat mysqld[749]: 2016-06-10 23:04:22 139620269476608 [Note] Master 'bob-au': Slave: received end packet from server, apparent master shutdown:
Jun 10 23:04:22 petcat mysqld[749]: 2016-06-10 23:04:22 139620269476608 [Note] Master 'bob-au': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin-changelog.172110' at position 1806660
Jun 10 23:04:22 petcat mysqld[749]: 2016-06-10 23:04:22 139620269476608 [ERROR] Master 'bob-au': Slave I/O: error reconnecting to master 'cat@mysqlrds.int.live.aws.theiconic.com.au:3306' - retry-time: 60  retries: 86400  message: Can't connect to MySQL server on 'mysqlrds.int.live.aws.theiconic.com.au' (111 "Connection refused"), Internal MariaDB error code: 2003
Jun 10 23:05:09 petcat mysqld[749]: 2016-06-10 23:05:09 139620269173504 [Note] Master 'bob-au': Error reading relay log event: slave SQL thread was killed
Jun 10 23:05:09 petcat mysqld[749]: 2016-06-10 23:05:09 139620269476608 [Note] Master 'bob-au': Slave I/O thread killed during or after a reconnect done to recover from failed read
Jun 10 23:05:09 petcat mysqld[749]: 2016-06-10 23:05:09 139620269476608 [Note] Master 'bob-au': Slave I/O thread exiting, read up to log 'mysql-bin-changelog.172110', position 1806660
Jun 10 23:05:09 petcat mysqld[749]: 2016-06-10 23:05:09 139615283055360 [Note] Master 'bob-au': Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log '/mariadb/relay/cat-bin-bob@002dau.000001' position: 4
Jun 10 23:07:16 petcat mysqld[749]: 2016-06-10 23:07:16 139620269173504 [ERROR] Master 'bob-au': Slave I/O: error connecting to master 'cat@mysqlrds.int.live.aws.theiconic.com.au:3306' - retry-time: 60  retries: 86400  message: Can't connect to MySQL server on 'mysqlrds.int.live.aws.theiconic.com.au' (110 "Connection timed out"), Internal MariaDB error code: 2003
Jun 10 23:08:17 petcat mysqld[749]: 2016-06-10 23:08:17 139620269173504 [Note] Master 'bob-au': Slave I/O thread: connected to master 'cat@mysqlrds.int.live.aws.theiconic.com.au:3306',replication started in log 'FIRST' at position 4
Jun 10 23:08:19 petcat mysqld[749]: 2016-06-10 23:08:19 139615283055360 [ERROR] Master 'bob-au': Slave SQL: Error 'Duplicate entry '1427614' for key 'PRIMARY'' on query. Default database: 'bob_live_au'. Query: 'INSERT INTO `customer_wishlist` (`fk_customer`, `status`, `is_default`, `name`, `created_at`) VALUES ('1918296', 'private', '1', 'Default wishlist', '2016-06-08 00:55:00')', Internal MariaDB error code: 1062
Jun 10 23:08:19 petcat mysqld[749]: 2016-06-10 23:08:19 139615283055360 [Warning] Master 'bob-au': Slave: Field 'access_hash' doesn't have a default value Error_code: 1364
Jun 10 23:08:19 petcat mysqld[749]: 2016-06-10 23:08:19 139615283055360 [Warning] Master 'bob-au': Slave: Duplicate entry '1427614' for key 'PRIMARY' Error_code: 1062
Jun 10 23:08:19 petcat mysqld[749]: 2016-06-10 23:08:19 139615283055360 [ERROR] Master 'bob-au': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin-changelog.171242' position 416
Jun 10 23:15:10 petcat mysqld[749]: 2016-06-10 23:15:10 139613408201472 [Note] Master 'bob-au': Slave SQL thread initialized, starting replication in log 'mysql-bin-changelog.171242' at pos

Comment by Igor Pashev [ 2016-09-15 ]

I wonder if this is caused by Amazon RDS switching between availability zones.

Comment by Igor Pashev [ 2016-09-21 ]

With

slave_exec_mode = IDEMPOTENT

the issue happens with 10.1.17, but does not happen with 10.0.25.

Comment by Igor Pashev [ 2016-09-24 ]

I just saw it with MariaDB -> MariaDB replication (both 10.1.17). Rebooting the master made the slaves fail with Error 'Duplicate entry '12345' for key 'PRIMARY''.

Generated at Thu Feb 08 07:37:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.