[MDEV-25392] IO thread reporting yes despite failing to fetch GTID - Jira

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Replication
Labels:
None

Description

Roughly once a month, on a cheap cloud server with limited disk I/O, GTID replication freezes forever. It is triggered by a short network glitch: the IO thread starts reconnecting to the leader, and the reconnection fails in an infinite loop.

During the investigation, the leader's error log messages were very confusing and did not help to find the cause of the issue:

2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:26:55 1951495 [Warning] Aborted connection 1951495 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:27:55 1951557 [Warning] Aborted connection 1951557 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
2021-04-12 12:29:01 1951587 [Warning] Aborted connection 1951587 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
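
The loop in the log above (the same slave asking for the same GTID position again and again) can be spotted mechanically. The following is only a minimal sketch, assuming the leader's error log keeps the exact "Start binlog_dump" wording shown above; the function name `find_reconnect_loops` and the `min_repeats` threshold are made up for illustration and are not part of MariaDB.

```python
import re
from collections import Counter

# Matches the leader's "Start binlog_dump" note and captures the slave's
# server id and the GTID position it asked for.
DUMP_RE = re.compile(
    r"Start binlog_dump to slave_server\((\d+)\).*gtid\('([^']*)'\)"
)

def find_reconnect_loops(log_lines, min_repeats=3):
    """Return {(server_id, gtid): count} for dump requests that were
    repeated at least min_repeats times with an unchanged GTID."""
    counts = Counter()
    for line in log_lines:
        match = DUMP_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return {key: n for key, n in counts.items() if n >= min_repeats}

# The three "Start binlog_dump" notes from the log excerpt above.
sample = [
    "2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
    "2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
    "2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
]
loops = find_reconnect_loops(sample)
```

A slave that makes progress would request a different GTID on each reconnect, so it would not be counted here.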

The cause is the slow disk: on the leader, reaching the 0-12599180-3481944 position takes longer than slave_net_timeout, so the IO thread cancels the event reception and retries. The log above shows a new attempt roughly every 60 seconds.

The following setting works around the issue:

set global slave_net_timeout=1200;

In such an infinite-loop scenario, caused by the lack of binlog indexing on GTID, the IO thread has always been reporting Yes, which makes monitoring proxies send traffic to heavily delayed slaves. Introducing a connecting state would be more appropriate.
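
Until such a state exists, a monitoring tool can avoid trusting the Yes value on its own. The following is only a sketch of one possible heuristic, not something MariaDB or any proxy provides. It assumes the caller has already collected two consecutive SHOW SLAVE STATUS rows as dictionaries (using the Slave_IO_Running and Gtid_IO_Pos columns) and the leader's current GTID position, for example from @@gtid_binlog_pos. The function name `io_thread_stalled` is hypothetical, and positions are compared as plain strings, which is a simplification.

```python
def io_thread_stalled(prev_status, curr_status, leader_gtid_pos):
    """Heuristic: True when the IO thread reports Yes, fetched nothing
    between two samples, and the leader is at a different position."""
    if curr_status.get("Slave_IO_Running") != "Yes":
        # The status already reports a problem; nothing is hidden.
        return False
    unchanged = prev_status.get("Gtid_IO_Pos") == curr_status.get("Gtid_IO_Pos")
    behind = curr_status.get("Gtid_IO_Pos") != leader_gtid_pos
    return unchanged and behind

# Two samples taken some time apart; the slave position is the one from
# the log in this report, the leader position is invented for the example.
prev = {"Slave_IO_Running": "Yes", "Gtid_IO_Pos": "0-12599180-3481944"}
curr = {"Slave_IO_Running": "Yes", "Gtid_IO_Pos": "0-12599180-3481944"}
stalled = io_thread_stalled(prev, curr, "0-12599180-3500000")
```

An idle leader is not flagged, because the slave position then equals the leader position. The sampling interval should be longer than slave_net_timeout so that at least one full retry fits between the two samples.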

One can also point out the lack of a GTID function that, given a GTID, returns the binlog file and position.

Issue Links

duplicates

MDEV-18142 master binlog read slow causes slave connect fail.

Closed

is part of

MDEV-4991 GTID binlog indexing

Closed

Activity

Andrei Elkin added a comment - 2021-04-17 09:30

stephane@skysql.com, thanks for the report!

Firstly, I agree the master should respond faster, and there seems to be nothing but the indexing to help with that.
I'm following up with some more thoughts on that ticket.

The new in-CONNECTing state also makes sense.
Could you please report the server version, so that we can make sure we don't already have that in higher versions?

Kristian Nielsen added a comment - 2024-01-30 10:05

Nice analysis Stephane!

GTID indexes are now finally done, in 11.4 (MDEV-4991). This should fix the problem of slow connections to a master with slow disks.

(The other points mentioned in the description remain valid, of course).

VAROQUI Stephane added a comment - 2024-01-30 10:31

Let's celebrate MDEV-4991 at FOSDEM

People

Assignee: Andrei Elkin

Reporter: VAROQUI Stephane

Votes: 0

Watchers: 7

Dates

Created: 2021-04-12 11:55

Updated: 2024-09-06 10:00

MariaDB Server