Details
-
New Feature
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
Roughly once a month, on a cheap cloud server with limited disk IO, GTID replication freezes forever. It is triggered by a short network glitch: the IO thread starts reconnecting to the leader, and the reconnection fails in an infinite loop.
During the investigation, the leader's error log messages were very confusing and did not help to find the cause of the issue:
2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:26:55 1951495 [Warning] Aborted connection 1951495 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:27:55 1951557 [Warning] Aborted connection 1951557 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:29:01 1951587 [Warning] Aborted connection 1951587 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
The cause is the slow disk: sending the 0-12599180-3481944 position takes longer than slave_net_timeout, so the IO thread cancels the event reception and retries. Raising the timeout on the replica works around the issue:
set global slave_net_timeout=1200;
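Since the leader log does not name the cause, the loop itself can at least be spotted mechanically: the same slave_server id keeps requesting the same GTID position, about a minute apart in the excerpt above. The following is a minimal sketch of such a check, assuming the log line layout shown above. The function name `find_stuck_replicas` and the `min_repeats` threshold are hypothetical and not part of MariaDB.

```python
import re
from collections import Counter

# Matches the "Start binlog_dump" notes in the layout of the log excerpt above.
# Assumption: other server versions may format this line differently.
DUMP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \d+ \[Note\] "
    r"Start binlog_dump to slave_server\((?P<server_id>\d+)\), .*"
    r"gtid\('(?P<gtid>[^']*)'\)"
)


def find_stuck_replicas(log_lines, min_repeats=3):
    """Return {(server_id, gtid): count} for replicas that requested the
    same GTID position at least min_repeats times (hypothetical threshold)."""
    counts = Counter()
    for line in log_lines:
        m = DUMP_RE.match(line)
        if m:
            counts[(int(m["server_id"]), m["gtid"])] += 1
    return {key: n for key, n in counts.items() if n >= min_repeats}
```

A replica that makes progress requests a newer GTID position on each reconnect, so repeated requests for an identical position are a reasonable hint that it never gets past the first event.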
In such an infinite-loop scenario, caused by the lack of binlog indexing on GTID, the IO thread has always reported Yes, which makes monitoring proxies send traffic to severely delayed slaves. Introducing a connecting state would be more appropriate.
One can also point out the lack of a GTID function that, given a GTID, returns the binlog file and position.
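Until the IO thread state reflects this situation, a monitoring proxy could avoid trusting the Yes flag alone and also compare GTID positions. The sketch below assumes the proxy can read the leader's position (for example from `@@gtid_binlog_pos`) and the replica's `Gtid_IO_Pos` from SHOW SLAVE STATUS. The helper names and the `max_lag` threshold are hypothetical, and the leader position in the usage example is made up for illustration.

```python
def parse_gtid_pos(pos):
    """Map domain_id -> seq_no for a MariaDB GTID position string,
    e.g. '0-12599180-3481944' or a comma-separated list of such triplets."""
    result = {}
    for item in pos.split(","):
        item = item.strip()
        if not item:
            continue
        domain, _server_id, seq = item.split("-")
        result[int(domain)] = int(seq)
    return result


def gtid_lag(leader_pos, replica_pos):
    """Per-domain gap in sequence numbers between leader and replica."""
    leader = parse_gtid_pos(leader_pos)
    replica = parse_gtid_pos(replica_pos)
    return {d: seq - replica.get(d, 0) for d, seq in leader.items()}


def is_replica_usable(io_running, leader_pos, replica_pos, max_lag=1000):
    """Hypothetical health rule: the IO thread must report Yes AND the
    replica must be within max_lag sequence numbers in every domain."""
    if io_running != "Yes":
        return False
    return all(lag <= max_lag
               for lag in gtid_lag(leader_pos, replica_pos).values())


# Replica stuck at the position from the log; leader position is invented.
print(is_replica_usable("Yes", "0-12599180-3490000", "0-12599180-3481944"))
```

Sequence-number distance is only a rough proxy for delay, but unlike the IO thread flag it keeps growing while the replica is stuck in the reconnect loop.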
Issue Links
- duplicates MDEV-18142 master binlog read slow causes slave connect fail. (Closed)
- is part of MDEV-4991 GTID binlog indexing (Closed)