About once a month, on a cheap cloud server with limited disk I/O, GTID replication freezes forever. It is triggered by a short network glitch: the IO thread starts reconnecting to the leader, and every attempt fails, in an infinite loop.
During the investigation, the messages in the leader's error log were very confusing and did not help to find the cause of the issue:
2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:26:55 1951495 [Warning] Aborted connection 1951495 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:27:55 1951557 [Warning] Aborted connection 1951557 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:29:01 1951587 [Warning] Aborted connection 1951587 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
The cause is a slow disk: on the leader, locating the position 0-12599180-3481944 in the binary logs takes longer than slave_net_timeout, so the IO thread gives up waiting for events and reconnects, and the new connection starts the same search again.
Raising the timeout works around the issue:
set global slave_net_timeout=1200;
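A minimal sketch of how the workaround could be applied on the replica. The reconnects in the log above are roughly 60 seconds apart, which is consistent with a timeout of about that length. The value 1200 is the one that worked in this case, not a general recommendation, and restarting the IO thread is a cautious assumption rather than something this report confirms is required.

```sql
-- Run on the replica. Show the current timeout, in seconds.
SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';

-- Give the leader more time to locate the GTID position in its binary logs.
SET GLOBAL slave_net_timeout = 1200;

-- Assumption: restart the IO thread so the new timeout is sure to be used.
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
```

SET GLOBAL does not survive a server restart, so the value would also need to go into the server configuration file to be permanent.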
In this infinite-loop scenario, caused by the lack of binlog indexing on GTID, the IO thread has always been reported as running (Yes), so monitoring proxies keep sending traffic to severely delayed slaves. Introducing a connecting state would be more appropriate.
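Until such a state is reported, a monitor could compare GTID positions instead of trusting the IO thread flag alone. The following is only a rough sketch, assuming the monitor can query both servers; how the two positions are compared, and how much lag is acceptable, is left to the monitoring tool.

```sql
-- On the replica: Slave_IO_Running may read Yes while stuck in this loop,
-- so also look at Gtid_IO_Pos and Seconds_Behind_Master in the output.
SHOW SLAVE STATUS;

-- On the replica: the last GTID applied.
SELECT @@global.gtid_slave_pos;

-- On the leader: the last GTID written to the binary log.
SELECT @@global.gtid_binlog_pos;
```

If the replica's position stops advancing while the leader's keeps moving, the replica is stalled regardless of what the IO thread flag says.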
One can also point out the lack of a GTID function that, given a GTID, returns the binlog file and position.
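For contrast, MariaDB does provide the opposite mapping: BINLOG_GTID_POS() takes a binlog file name and an offset and returns the GTID position at that point. The file name below is a hypothetical placeholder, not taken from this report.

```sql
-- Run on the leader. 'mariadb-bin.000001' is a hypothetical binlog file name;
-- offset 4 is the first event position, as in "pos(, 4)" in the log above.
SELECT BINLOG_GTID_POS('mariadb-bin.000001', 4);
```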
Andrei Elkin
added a comment - stephane@skysql.com, thanks for the report!
Firstly, I agree the master should be faster to respond, and there seems to be nothing but the indexing to help out.
I will follow up with some more thoughts on that ticket.
The new in-CONNECTing state also makes sense.
Could you please report the server version, so we can make sure we don't already have that in higher versions?
Kristian Nielsen
added a comment - Nice analysis Stephane!
GTID indexes are now finally done, in 11.4 (MDEV-4991). This should fix the problem with slow connect to master with slow disks.
(The other points mentioned in the description remain valid, of course).