MariaDB Server / MDEV-23028

master info corruption IO thread OK but no events are copied


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.3.23
    • Fix Version/s: None
    • Component/s: Backup, Replication
    • Labels: None
    • Environment:
      1 - Master 1-Slave /GTID conservative/semisync

      Description

      OK, the whole story started with restoring a mysqldump that failed,
      complaining that the event scheduler was off on the master:

      /usr/bin/mysqldump --opt --hex-blob --events --disable-keys --master-data=1 --apply-slave-statements --gtid --single-transaction --all-databases --host=db-fr-1.mixr-dev.svc.cloud18 --port=3306 --user=root --password=XXXX --verbose

      mysqldump Ver 10.17 Distrib 10.4.13-MariaDB, for Linux (x86_64)
      The output is piped directly to the mysql client.

      Couldn't execute 'show events': Cannot proceed, because event scheduler is disabled (1577)

      Here I suppose that if the event scheduler is off, the backup should just skip the events and not fail with a rollback?
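      As a hypothetical workaround sketch (this helper is not part of mysqldump; the function name and structure are assumptions), the dump driver could decide whether to pass --events based on the master's event_scheduler state, so the dump does not abort with error 1577:

      ```python
      # Hypothetical sketch: build the mysqldump flag list, including --events
      # only when the master's event scheduler is enabled, to avoid
      # "Cannot proceed, because event scheduler is disabled (1577)".

      def build_dump_args(event_scheduler_on):
          """Return mysqldump flags; add --events only if the scheduler is on."""
          args = ["--opt", "--hex-blob", "--disable-keys",
                  "--master-data=1", "--apply-slave-statements",
                  "--gtid", "--single-transaction", "--all-databases"]
          if event_scheduler_on:
              args.append("--events")
          return args

      print(build_dump_args(False))  # no --events when the scheduler is off
      ```

      The caller would first query SHOW VARIABLES LIKE 'event_scheduler' on the master and feed the result in.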

      On the master:

      2020-06-26 16:23:00 131 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:24:00 161 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:24:00 131 [Warning] Aborted connection 131 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
      2020-06-26 16:25:05 225 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:25:05 161 [Warning] Aborted connection 161 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
      2020-06-26 16:26:10 257 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:26:11 225 [Warning] Aborted connection 225 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
      2020-06-26 16:27:16 316 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:27:16 257 [Warning] Aborted connection 257 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)

      On the slave:

      2020-06-26 16:36:32 544 [Note] Slave I/O thread: connected to master 'root@db-fr-1.mixr-dev.svc.cloud18:3306',replication starts at GTID position '0-12599180-2851700'
      2020-06-26 16:37:32 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:38:37 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:39:42 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:40:47 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:41:52 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:42:57 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:44:02 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'

      On the slave, both the I/O thread and the SQL thread are running, which is already an issue, since it looks like the I/O thread can't fetch any events.

      I've tried to:

      • raise CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=30;
      • change the slave server id
      • disable semisync
      • move from conservative to optimistic
      • restart the slave
      • restart the master

      None of these so far cured the replication, which stays stuck on GTID '0-12599180-2851700'.

      master.info looks strange:
      33
      log-bin.000011
      356
      db-fr-1.mixr-dev.svc.cloud18
      root
      mariadb
      3306
      5
      0

      0
      30.000

      0

      0

      using_gtid=2
      do_domain_ids=0
      ignore_domain_ids=0
      END_MARKER
      RKER

      That RKER at the end is not a copy-paste mistake, so the file got corrupted somehow.
      Indeed, STOP SLAVE; RESET SLAVE ALL; and restoring the dump fixed the issue.
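      A simple sanity check, sketched here in Python (this is an illustration, not MariaDB code; the sample content is taken from the corrupted file above), would catch a truncated or duplicated trailing marker before the slave trusts the file:

      ```python
      # Sketch: verify that a master.info file ends with exactly one
      # END_MARKER line. The corrupted file in this report had a stray
      # "RKER" line after END_MARKER, which this check flags.

      SAMPLE = """log-bin.000011
      356
      db-fr-1.mixr-dev.svc.cloud18
      using_gtid=2
      do_domain_ids=0
      ignore_domain_ids=0
      END_MARKER
      RKER
      """

      def master_info_ok(text):
          """A healthy file's last non-empty line is END_MARKER."""
          lines = [l.strip() for l in text.splitlines() if l.strip()]
          return bool(lines) and lines[-1] == "END_MARKER"

      print(master_info_ok(SAMPLE))  # the corrupted sample fails the check
      ```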

      I was stuck twice in such a state on that release, and never on 10.2.18 in production. Can one imagine a scenario that would so badly corrupt the end of the master.info file?

      Maybe a concurrent START SLAVE or CHANGE MASTER TO from the dump reload and replication-manager?
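      If two writers do race on the file, the classic defense (sketched here in Python; this is not MariaDB's actual implementation) is to write a temporary file and atomically rename it over the old one, so a reader or a crash never observes a half-old, half-new file with a stale tail like "RKER":

      ```python
      # Sketch of an atomic file rewrite: write to a temp file in the same
      # directory, flush + fsync, then os.replace() it over the target.
      # Readers then see either the complete old file or the complete new
      # one, never a mix (e.g. a shorter new version with the old file's
      # leftover bytes at the end).
      import os
      import tempfile

      def atomic_write(path, data):
          d = os.path.dirname(os.path.abspath(path))
          fd, tmp = tempfile.mkstemp(dir=d)
          try:
              with os.fdopen(fd, "w") as f:
                  f.write(data)
                  f.flush()
                  os.fsync(f.fileno())   # make sure the bytes hit disk
              os.replace(tmp, path)       # atomic rename on POSIX
          except BaseException:
              os.unlink(tmp)
              raise
      ```

      Writing in place, by contrast, can leave trailing bytes of a longer previous version if the new contents are shorter and the file is not truncated.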

            People

            Assignee:
            Unassigned
            Reporter:
            stephane@skysql.com VAROQUI Stephane
            Votes:
            0
            Watchers:
            1
