Devang, thanks for the more verbose data!
The first error recurrence is actually at
2018-09-13 0:54:47 139720503559936 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'XXXXX14-bin.000761' at 180213418, the last event read from 'XXXXX14-bin.000761' at 566564012, the last byte read from 'XXXXX14-bin.000761' at 566564031.', Internal MariaDB error code: 1236
2018-09-13 0:54:47 139720503559936 [Note] Slave I/O thread exiting, read up to log 'XXXXX14-bin.000761', position 566579200
2018-09-13 0:55:01 139720503256832 [Note] Error reading relay log event: slave SQL thread was killed
2018-09-13 0:55:01 139720503256832 [Note] Slave SQL thread exiting, replication stopped in log 'XXXXX14-bin.000761' at position 566579200
2018-09-13 0:55:01 139720503559936 [Note] Slave I/O thread: connected to master 'repluser@10.20.20.13:3306',replication started in log 'XXXXX14-bin.000761' at position 566579200
2018-09-13 0:55:01 139720502953728 [Note] Slave SQL thread initialized, starting replication in log 'XXXXX14-bin.000761' at position 566579200, relay log './XXXXX10-relay-bin.000136' position: 386366338
We don't know yet what caused the incorrect read on the master when it was attempting to send the last event read from 'XXXXX14-bin.000761' at 566564012. This remains our main concern.
And it apparently stands separately from another puzzling part, which is about the consistency of your data. It all looks as if, on the slave,
you jump over the event that failed to be sent. The next time the slave's I/O thread resumes, it does so not from where it should,
which is 566564012, but from the offset left by the error message. I don't have any other explanation than that CHANGE MASTER was executed
with master_log_pos := 566579200.
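For illustration only, a statement like the following would produce exactly that effect (this is purely a hypothetical reconstruction on my side, not something we know was run):
STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='XXXXX14-bin.000761', MASTER_LOG_POS=566579200;
START SLAVE;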
So please find out what commands were run on the slave that made it reconnect from a position other than the expected one.
Also notice that
@@global.gtid_strict_mode = OFF
in your settings, which does not catch this kind of skip, especially if it was not planned by you. I recommend always setting it to ON if you're concerned with
data consistency.
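For example (a sketch; SET GLOBAL only lasts until the server restarts, so the config-file line is needed to make it permanent):
SELECT @@global.gtid_strict_mode;   -- check the current value
SET GLOBAL gtid_strict_mode = ON;   -- enable at runtime
and in the [mysqld] section of the server configuration:
gtid_strict_mode = ON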
As to the main part, what is causing the "truncated" error: could you gzip and upload the
'XXXXX14-bin.000761' log to Jira? It's enough for me to have its head part, ending at least at 566564012 + 1024 bytes (the last event start + 1 KB).
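If the whole file is too big to attach, one possible way to cut out and compress just that head portion (a sketch, assuming you run it on the master where the binlog lives) is:
head -c $((566564012 + 1024)) XXXXX14-bin.000761 | gzip > XXXXX14-bin.000761.head.gz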
If you can't do that, then I will ask you to run 3 commands:
1. mysqlbinlog --start-position=start_pos --stop-position=566564012 XXXXX14-bin.000761
where you will have to find the start_pos value yourself as the offset (position) of roughly the 100th event
in the file before the failed one. Run mysqlbinlog XXXXX14-bin.000761 | less, search for '^at 566564012' to locate the event, and then search upwards for about the 100th '^at any-number', which will put you at the start-position event (best if it turns out to be a GTID event).
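If you prefer not to search interactively in less, a rough non-interactive equivalent (a sketch; it assumes mysqlbinlog can decode the file up to the failing event and that its output has the usual '# at <offset>' lines) is:
mysqlbinlog XXXXX14-bin.000761 | grep '^# at ' | grep -B 100 '^# at 566564012$' | head -n 1
The offset it prints is a suitable start_pos.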
2. mysqlbinlog --start-position=566564012 XXXXX14-bin.000761
This command will most probably fail, since the last event could be corrupted, and if that's the case then
3. dd skip=566564012 bs=1 count=1024 < XXXXX14-bin.000761 > 566564012.dump
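Optionally, to sanity-check the dump before attaching it (just a suggestion), you can peek at its first bytes and then compress it:
hexdump -C 566564012.dump | head -n 32
gzip 566564012.dump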
I am waiting for your results,
Andrei
Dear Andrei Elkin [ elkin ]
One thing I noticed which may help you further: I'm not sure how close I am, but I observed this behavior when the MASTER does not have anything left to send to the SLAVE.
I might be wrong, but I hope it helps you.
Thanks for looking into this.
Devang