[MDEV-25392] IO thread reporting yes despite failing to fetch GTID Created: 2021-04-12  Updated: 2024-01-30

Status: Open
Project: MariaDB Server
Component/s: Replication
Fix Version/s: None

Type: New Feature Priority: Major
Reporter: VAROQUI Stephane Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
PartOf: is part of MDEV-4991 (GTID binlog indexing), status: In Testing

 Description   

Roughly once a month, on a cheap cloud server with limited disk I/O, GTID replication freezes forever. It is triggered by a short network glitch: the IO thread starts reconnecting to the leader, and the reconnection fails in an infinite loop.

During investigation, the leader's error log messages are very confusing and do not help to find the cause of the issue:

2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:26:55 1951495 [Warning] Aborted connection 1951495 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:27:55 1951557 [Warning] Aborted connection 1951557 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:29:01 1951587 [Warning] Aborted connection 1951587 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)

The issue is caused by the slow disk: serving the 0-12599180-3481944 position takes longer than slave_net_timeout, so the IO thread cancels the event reception and retries.

set global slave_net_timeout=1200;
fixes the issue.
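
To make the loop easier to spot, here is a minimal Python sketch that scans the leader error log for a replica that keeps requesting the same GTID position. It assumes the "Start binlog_dump" note format shown in the log excerpt above; the function name find_stuck_slaves and the min_repeats threshold are hypothetical and not part of MariaDB.

```python
import re
from collections import Counter

# Matches the "Start binlog_dump" notes as they appear in the excerpt above.
DUMP_RE = re.compile(
    r"Start binlog_dump to slave_server\((\d+)\).*gtid\('([^']*)'\)"
)


def find_stuck_slaves(log_lines, min_repeats=3):
    """Return {(server_id, gtid): count} for every replica that asked for
    the same GTID position at least min_repeats times (assumed threshold)."""
    counts = Counter()
    for line in log_lines:
        match = DUMP_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return {key: n for key, n in counts.items() if n >= min_repeats}


# The three binlog_dump notes from the report.
sample = [
    "2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), "
    "pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
    "2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), "
    "pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
    "2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), "
    "pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
]

stuck = find_stuck_slaves(sample)
print(stuck)
```

A replica that reconnects repeatedly without its requested GTID ever moving forward is a strong hint that each attempt times out before the first event arrives.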

In such an infinite-loop scenario, caused by the lack of binlog indexing on GTID, the IO thread always reports Yes, which makes monitoring proxies send traffic to severely delayed slaves. Introducing a Connecting state would be more appropriate.

One can also point out the lack of a GTID function that, given a GTID, returns the binlog file and position.
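
Until such a state exists, a monitoring proxy could avoid trusting the IO thread flag alone. The Python sketch below is only an illustration under stated assumptions: it takes two SHOW SLAVE STATUS rows (as dicts, using the Slave_IO_Running and Gtid_IO_Pos columns) sampled some time apart, together with the leader's @@gtid_binlog_pos at the same moments. The function name and the leader positions in the example are hypothetical.

```python
def io_thread_is_progressing(prev_status, curr_status,
                             leader_prev_pos, leader_curr_pos):
    """Heuristic health check (hypothetical helper, not a MariaDB API).

    prev_status / curr_status: SHOW SLAVE STATUS rows as dicts.
    leader_prev_pos / leader_curr_pos: leader @@gtid_binlog_pos values
    sampled at the same two moments.
    """
    if curr_status.get("Slave_IO_Running") != "Yes":
        return False
    leader_advanced = leader_prev_pos != leader_curr_pos
    replica_advanced = (
        prev_status.get("Gtid_IO_Pos") != curr_status.get("Gtid_IO_Pos")
    )
    # The leader wrote new events but the replica received none:
    # treat the replica as stalled even though the flag says Yes.
    return replica_advanced or not leader_advanced


# Replica stuck on the position from the report; leader positions are
# made-up values for illustration only.
stuck_prev = {"Slave_IO_Running": "Yes", "Gtid_IO_Pos": "0-12599180-3481944"}
stuck_curr = {"Slave_IO_Running": "Yes", "Gtid_IO_Pos": "0-12599180-3481944"}
healthy = io_thread_is_progressing(
    stuck_prev, stuck_curr, "0-12599180-3481950", "0-12599180-3482000"
)
print(healthy)
```

This check can give false alarms when the leader wrote its first new event just before the second sample, so a real proxy would probably require several consecutive stalled samples before removing a replica from rotation.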



 Comments   
Comment by Andrei Elkin [ 2021-04-17 ]

stephane@skysql.com, thanks for the report!

Firstly, I agree the master should be faster to respond, and there seems to be nothing but the indexing to help out.
I'll follow up with some more thoughts on that ticket.

The new in-CONNECTing state also makes sense.
Could you please report the server version, so that we can make sure we don't already have that in higher versions?

Comment by Kristian Nielsen [ 2024-01-30 ]

Nice analysis Stephane!

GTID indexes are now finally done, in 11.4 (MDEV-4991). This should fix the problem with slow connect to master with slow disks.

(The other points mentioned in the description remain valid, of course).

Comment by VAROQUI Stephane [ 2024-01-30 ]

Let's celebrate MDEV-4991 at FOSDEM
