  1. MariaDB Server
  2. MDEV-25392

IO thread reporting yes despite failing to fetch GTID

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • Replication
    • None

    Description

      Randomly, about once a month, on a cheap cloud server with limited disk I/O, GTID replication freezes forever. It is triggered by a short network glitch. The IO thread starts reconnecting to the leader, and the reconnection fails in an infinite loop.

      During the investigation, the leader's error log messages are very confusing and do not help to find the cause of the issue:

      2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
      2021-04-12 12:26:55 1951495 [Warning] Aborted connection 1951495 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
      2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
      2021-04-12 12:27:55 1951557 [Warning] Aborted connection 1951557 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
      2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
      2021-04-12 12:29:01 1951587 [Warning] Aborted connection 1951587 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
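
      The pattern in this log (the same replica asking again and again for the same GTID) can be detected mechanically. The following is a minimal sketch, assuming the leader's error log keeps the "Start binlog_dump" format shown above; the function name and the min_repeats threshold are hypothetical and not part of MariaDB.

```python
import re
from collections import Counter

# Matches the leader's "Start binlog_dump" notes in the format shown in the log above.
DUMP_RE = re.compile(
    r"Start binlog_dump to slave_server\((?P<server_id>\d+)\).*gtid\('(?P<gtid>[^']*)'\)"
)


def find_reconnect_loops(log_lines, min_repeats=3):
    """Return {(server_id, gtid): count} for replicas that keep requesting the same GTID.

    min_repeats is an arbitrary threshold chosen for illustration.
    """
    counts = Counter()
    for line in log_lines:
        match = DUMP_RE.search(line)
        if match:
            key = (int(match.group("server_id")), match.group("gtid"))
            counts[key] += 1
    return {key: n for key, n in counts.items() if n >= min_repeats}
```

      A replica that makes progress requests a different GTID on each reconnection, so a repeated (server_id, gtid) pair is a reasonable hint that the IO thread is stuck in the retry loop described here.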

      The cause is the slow disk: sending the 0-12599180-3481944 position takes longer than slave_net_timeout, so the IO thread cancels the event reception and retries.

      Raising the timeout fixes the issue:

      set global slave_net_timeout=1200;

      In such an infinite-loop scenario, caused by the lack of binlog indexing on GTID, the IO thread has always reported Yes, which makes monitoring proxies send traffic to some severely delayed slaves. Introducing a connecting state would be more appropriate.
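
      Until such a state exists, a monitoring check can avoid trusting the Yes value alone. The following is a minimal sketch, assuming the monitor has already read the replica's SHOW SLAVE STATUS row into a dict and the leader's @@gtid_binlog_pos into a string. The function names and the max_lag_events threshold are hypothetical, and comparing sequence numbers per domain is only a heuristic.

```python
def parse_gtid_pos(pos):
    """Parse a GTID position such as '0-12599180-3481944' into {domain_id: seq_no}."""
    result = {}
    for gtid in filter(None, (g.strip() for g in pos.split(","))):
        domain, _server_id, seq_no = gtid.split("-")
        result[int(domain)] = int(seq_no)
    return result


def replica_is_usable(status, leader_gtid_pos, max_lag_events=1000):
    """Decide whether a proxy should route traffic to a replica.

    status: dict of SHOW SLAVE STATUS fields from the replica (assumed input).
    leader_gtid_pos: value of @@gtid_binlog_pos read on the leader (assumed input).
    max_lag_events: arbitrary threshold chosen for illustration.
    """
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    leader = parse_gtid_pos(leader_gtid_pos)
    replica = parse_gtid_pos(status.get("Gtid_IO_Pos", ""))
    # A replica stuck in the reconnect loop still reports Yes, but its
    # Gtid_IO_Pos stops advancing while the leader's position moves on.
    for domain, leader_seq in leader.items():
        if leader_seq - replica.get(domain, 0) > max_lag_events:
            return False
    return True
```

      With the position from the log above, a replica whose Gtid_IO_Pos stays at 0-12599180-3481944 while the leader moves well past it would be rejected even though both threads report Yes.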

      One can also point out the lack of a GTID function that, given a GTID, returns the binlog file and position.

            People

              Elkin Andrei Elkin
              stephane@skysql.com VAROQUI Stephane