[MDEV-23028] master info corruption IO thread OK but no events are copied Created: 2020-06-26  Updated: 2020-06-26

Status: Open
Project: MariaDB Server
Component/s: Backup, Replication
Affects Version/s: 10.3.23
Fix Version/s: None

Type: Bug Priority: Major
Reporter: VAROQUI Stephane Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

1 - Master 1-Slave /GTID conservative/semisync



 Description   

OK, the whole story starts with restoring a mysqldump that failed,
complaining that the event scheduler was off on the master:

/usr/bin/mysqldump --opt --hex-blob --events --disable-keys --master-data=1 --apply-slave-statements --gtid --single-transaction --all-databases --host=db-fr-1.mixr-dev.svc.cloud18 --port=3306 --user=root --password=XXXX --verbose

mysqldump Ver 10.17 Distrib 10.4.13-MariaDB, for Linux (x86_64)
The output is piped directly to the mysql client.

Couldn't execute 'show events': Cannot proceed, because event scheduler is disabled (1577)

Here I suppose that if the event scheduler is off, the backup should just skip events rather than fail and roll back?
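For reference, one way to avoid this particular dump failure might be the following (a sketch, an assumption on my side rather than something from the report; it presumes the master's configuration can be changed):

```sql
-- Hypothetical workaround: enable the event scheduler on the master
-- before dumping, so that the SHOW EVENTS run by mysqldump --events
-- succeeds instead of failing with error 1577.
SET GLOBAL event_scheduler = ON;
-- Alternatively, take the dump without --events when the scheduler
-- must stay disabled.
```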

On the master:

2020-06-26 16:23:00 131 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:24:00 161 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:24:00 131 [Warning] Aborted connection 131 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
2020-06-26 16:25:05 225 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:25:05 161 [Warning] Aborted connection 161 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
2020-06-26 16:26:10 257 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:26:11 225 [Warning] Aborted connection 225 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
2020-06-26 16:27:16 316 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:27:16 257 [Warning] Aborted connection 257 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)

On the slave:

2020-06-26 16:36:32 544 [Note] Slave I/O thread: connected to master 'root@db-fr-1.mixr-dev.svc.cloud18:3306',replication starts at GTID position '0-12599180-2851700'
2020-06-26 16:37:32 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:38:37 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:39:42 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:40:47 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:41:52 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:42:57 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:44:02 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'

On the slave, both the IO thread and the SQL thread are reported as running, which is already an issue, since it looks like the IO thread cannot fetch any events.

I have tried to:

  • raise CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=30;
  • change the slave server_id
  • disable semisync
  • move from conservative to optimistic parallel mode
  • restart the slave
  • restart the master
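In SQL terms, the attempts above amount to roughly the following on the slave (a sketch only; the server_id value is illustrative, and I am assuming the standard MariaDB variable names for semisync and parallel mode):

```sql
STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=30;       -- raise heartbeat period
SET GLOBAL server_id = 12345;                      -- illustrative new slave server id
SET GLOBAL rpl_semi_sync_slave_enabled = OFF;      -- disable semisync on the slave
SET GLOBAL slave_parallel_mode = 'optimistic';     -- was 'conservative'
START SLAVE;
```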

None of these so far cured the replication, which stays stuck on GTID '0-12599180-2851700'.

master.info looks strange:
33
log-bin.000011
356
db-fr-1.mixr-dev.svc.cloud18
root
mariadb
3306
5
0

0
30.000

0

0

using_gtid=2
do_domain_ids=0
ignore_domain_ids=0
END_MARKER
RKER

That RKER at the end is not a copy-paste mistake, so the file got corrupted somehow.
Indeed, STOP SLAVE; RESET SLAVE ALL; and restoring the dump fixed the issue.
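Spelled out, the recovery that worked looks roughly like this (a sketch; the host, port, and user are the ones from the report, but the exact CHANGE MASTER TO options are my assumption):

```sql
STOP SLAVE;
RESET SLAVE ALL;   -- discards master.info, removing the corrupted file
-- reload the mysqldump output here, then re-point the slave:
CHANGE MASTER TO
  MASTER_HOST='db-fr-1.mixr-dev.svc.cloud18',
  MASTER_PORT=3306,
  MASTER_USER='root',
  MASTER_PASSWORD='XXXX',
  MASTER_USE_GTID=slave_pos;
START SLAVE;
```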

I got stuck twice in such a state on that release, never on 10.2.18 in production. Can one imagine a scenario that would so badly corrupt the end of the master.info file?

Maybe a concurrent START SLAVE or CHANGE MASTER TO coming from the dump reload and replication-manager at the same time?


Generated at Thu Feb 08 09:19:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.