Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.3.23
Fix Version/s: None
Component/s: None
Environment: 1 Master - 1 Slave / GTID conservative / semisync
Description
OK, the whole story starts with restoring a mysqldump that failed,
complaining that the event scheduler was off on the master:
/usr/bin/mysqldump --opt --hex-blob --events --disable-keys --master-data=1 --apply-slave-statements --gtid --single-transaction --all-databases --host=db-fr-1.mixr-dev.svc.cloud18 --port=3306 --user=root --password=XXXX --verbose
mysqldump Ver 10.17 Distrib 10.4.13-MariaDB, for Linux (x86_64)
The output is piped directly to the mysql client.
Couldn't execute 'show events': Cannot proceed, because event scheduler is disabled (1577)
Here I suppose that if the event scheduler is off, the backup should just skip the events and not fail with a rollback?
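For what it's worth, a minimal sketch of how one can check the scheduler before dumping, assuming the same host and credentials as the mysqldump call above (if the variable reports DISABLED rather than OFF, it was turned off at startup and cannot be switched on at runtime):
# Check the event scheduler state on the master (ON / OFF / DISABLED)
mysql --host=db-fr-1.mixr-dev.svc.cloud18 --port=3306 --user=root -p \
  -e "SHOW GLOBAL VARIABLES LIKE 'event_scheduler';"
# If it is only OFF, it can be enabled before dumping with --events ...
mysql --host=db-fr-1.mixr-dev.svc.cloud18 --port=3306 --user=root -p \
  -e "SET GLOBAL event_scheduler = ON;"
# ... otherwise the workaround is to drop --events from the mysqldump call.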
On the master:
2020-06-26 16:23:00 131 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:24:00 161 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:24:00 131 [Warning] Aborted connection 131 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
2020-06-26 16:25:05 225 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:25:05 161 [Warning] Aborted connection 161 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
2020-06-26 16:26:10 257 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:26:11 225 [Warning] Aborted connection 225 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
2020-06-26 16:27:16 316 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
2020-06-26 16:27:16 257 [Warning] Aborted connection 257 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
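Since that warning is the usual symptom of two connections announcing the same server_id, a quick sanity check (a sketch, assuming 10.48.96.64 is the slave's address and the same root credentials work on both sides):
# The two values must differ for replication to work
mysql --host=db-fr-1.mixr-dev.svc.cloud18 --port=3306 --user=root -p \
  -e "SELECT @@global.server_id;"
mysql --host=10.48.96.64 --port=3306 --user=root -p \
  -e "SELECT @@global.server_id;"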
On the slave:
2020-06-26 16:36:32 544 [Note] Slave I/O thread: connected to master 'root@db-fr-1.mixr-dev.svc.cloud18:3306',replication starts at GTID position '0-12599180-2851700'
2020-06-26 16:37:32 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:38:37 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:39:42 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:40:47 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:41:52 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:42:57 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
2020-06-26 16:44:02 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
On the slave, both the I/O thread and the SQL thread are reported as running, which is already an issue, since the I/O thread apparently cannot fetch any events.
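For reference, one way to see the stuck state from the slave side (a sketch, run on the slave; the columns are standard SHOW SLAVE STATUS fields):
# Both thread flags say Yes while Read_Master_Log_Pos and Gtid_IO_Pos
# never move past 356 / 0-12599180-2851700
mysql -e "SHOW SLAVE STATUS\G" | \
  grep -E 'Slave_IO_Running|Slave_SQL_Running|Read_Master_Log_Pos|Gtid_IO_Pos|Last_IO_Error'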
I've tried to (the corresponding commands are sketched after this list):
- raise CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=30
- change the slave server_id
- disable semisync
- switch slave_parallel_mode from conservative to optimistic
- restart the slave
- restart the master
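Roughly, those attempts correspond to the following (a sketch, run on the slave; the semisync and parallel-mode variable names are the standard MariaDB ones, and the server_id change and restarts happen outside of SQL):
mysql -e "STOP SLAVE;
          CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=30;
          SET GLOBAL rpl_semi_sync_slave_enabled = OFF;
          SET GLOBAL slave_parallel_mode = 'optimistic';
          START SLAVE;"
# plus editing server_id in the slave's my.cnf and restarting the
# mariadb service on the slave, then on the master.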
None of these has so far cured the replication, which stays stuck on GTID '0-12599180-2851700'.
master.info looks strange:
33
log-bin.000011
356
db-fr-1.mixr-dev.svc.cloud18
root
mariadb
3306
5
0
0
30.000
0
0
using_gtid=2
do_domain_ids=0
ignore_domain_ids=0
END_MARKER
RKER
That RKER at the end is not a copy-paste mistake, so the file got corrupted somehow.
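The trailing junk is visible straight in the file; a sketch, assuming master.info sits in the default datadir (a healthy file ends right after the END_MARKER line):
# Show the last bytes of the file, including any trailing garbage
tail -c 64 /var/lib/mysql/master.info | xxd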
Indeed, STOP SLAVE; RESET SLAVE ALL; and restoring the dump fixed the issue.
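For the record, the recovery amounts to something like this on the slave (a sketch; the dump replay is the same mysqldump | mysql pipe as above, and since the dump was taken with --master-data=1 --gtid --apply-slave-statements it sets the replication position again):
# Throw away the corrupted replication state, including master.info
mysql -e "STOP SLAVE; RESET SLAVE ALL;"
# then replay the dump to rebuild the data and the replication setup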
I was stuck twice in such a state on this release, never on 10.2.18 in production. Can anyone imagine a scenario that would so badly corrupt the end of the master.info file?
Maybe a concurrent START SLAVE or CHANGE MASTER TO coming from the dump reload and replication-manager?