  1. MariaDB Server
  2. MDEV-25392

IO thread reporting yes despite failing to fetch GTID

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • Replication
    • None

    Description

      Randomly, about once a month, on a cheap cloud server with limited disk I/O, GTID replication freezes forever. It is triggered by a short network glitch: the IO thread starts reconnecting to the leader, and the reconnection fails in an infinite loop.

      During the investigation, the leader's error log messages are very confusing and do not help to find the cause of the issue:

      2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
      2021-04-12 12:26:55 1951495 [Warning] Aborted connection 1951495 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
      2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
      2021-04-12 12:27:55 1951557 [Warning] Aborted connection 1951557 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
      2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')
      2021-04-12 12:29:01 1951587 [Warning] Aborted connection 1951587 to db: 'unconnected' user: 'root' host: '10.48.96.84' (A slave with the same server_uuid/server_id as this slave has co)
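
      The pattern in this excerpt (the same slave_server id asking for the same GTID again and again) can be detected mechanically. The following is only a minimal sketch, assuming the log line format shown above; the function name find_stuck_replicas and the min_repeats threshold are hypothetical and not part of MariaDB.

```python
import re
from collections import Counter

# Matches the "Start binlog_dump" notes in the format shown in the excerpt above.
DUMP_RE = re.compile(
    r"Start binlog_dump to slave_server\((?P<server_id>\d+)\)"
    r".*gtid\('(?P<gtid>[^']*)'\)"
)


def find_stuck_replicas(log_lines, min_repeats=3):
    """Return {(server_id, gtid): count} for dump requests that were
    repeated at least min_repeats times with an unchanged GTID."""
    counts = Counter()
    for line in log_lines:
        match = DUMP_RE.search(line)
        if match:
            counts[(int(match.group("server_id")), match.group("gtid"))] += 1
    return {key: n for key, n in counts.items() if n >= min_repeats}


# The three "Start binlog_dump" notes from the excerpt above.
LOG_EXCERPT = [
    "2021-04-12 12:26:55 1951557 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
    "2021-04-12 12:27:55 1951587 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
    "2021-04-12 12:29:01 1951615 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('0-12599180-3481944')",
]

print(find_stuck_replicas(LOG_EXCERPT))
# prints {(2, '0-12599180-3481944'): 3}
```

      A replica that makes progress would request a different GTID on each reconnect, so it would not reach the threshold.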

      The issue is caused by the slow disk: sending the 0-12599180-3481944 position takes longer than slave_net_timeout, so the IO thread cancels the event reception and retries.

      set global slave_net_timeout=1200;
      fixes the issue.

      In such an infinite-loop scenario, caused by the lack of binlog indexing on GTID, the io_thread has always been reporting Yes, which makes monitoring proxies send traffic to severely delayed slaves. Introducing a connecting state would be more appropriate.
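
      Until such a state exists, a proxy can avoid relying on the IO thread flag alone by comparing GTID positions, for example the leader's gtid_binlog_pos against the replica's gtid_slave_pos. The sketch below is only an illustration under that assumption: the function names and the max_seq_gap threshold are hypothetical, and a sequence-number gap is only a rough proxy for replication delay.

```python
def parse_gtid_pos(pos):
    """Parse a MariaDB GTID position such as '0-12599180-3481944,1-5-10'
    into {domain_id: sequence_number}."""
    result = {}
    for gtid in filter(None, (part.strip() for part in pos.split(","))):
        domain, _server_id, seq_no = gtid.split("-")
        result[int(domain)] = int(seq_no)
    return result


def replica_is_routable(leader_pos, replica_pos, max_seq_gap=1000):
    """Return False when the replica trails the leader by more than
    max_seq_gap sequence numbers in any replication domain.

    A domain missing on the replica is treated as sequence number 0.
    """
    leader = parse_gtid_pos(leader_pos)
    replica = parse_gtid_pos(replica_pos)
    for domain, leader_seq in leader.items():
        if leader_seq - replica.get(domain, 0) > max_seq_gap:
            return False
    return True


# Hypothetical example: the replica is stuck at the GTID from the report
# while the leader has moved on.
# replica_is_routable("0-12599180-3500000", "0-12599180-3481944") is False
```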

      One can also point out the lack of a GTID function that, given a GTID, returns the binlog file and position.
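
      To illustrate what such a lookup would provide, here is a toy sketch of an index from GTID sequence number to binlog file and offset. It is not how MariaDB implements anything: the class GtidIndex, its checkpoint layout, and the binlog file names are all hypothetical, and it covers a single replication domain only.

```python
import bisect


class GtidIndex:
    """Toy index for one replication domain.

    Holds sorted (seq_no, binlog_file, offset) checkpoints so a lookup
    is a binary search instead of a scan through the binlog files.
    """

    def __init__(self, checkpoints):
        self._entries = sorted(checkpoints)
        self._seqs = [entry[0] for entry in self._entries]

    def lookup(self, seq_no):
        """Return (binlog_file, offset) of the last checkpoint at or
        before seq_no, or None if seq_no precedes every checkpoint."""
        i = bisect.bisect_right(self._seqs, seq_no) - 1
        if i < 0:
            return None
        _, binlog_file, offset = self._entries[i]
        return binlog_file, offset


# Hypothetical checkpoints: the first sequence number in each binlog file.
index = GtidIndex([
    (3480000, "binlog.000040", 4),
    (3481000, "binlog.000041", 4),
    (3482000, "binlog.000042", 4),
])
print(index.lookup(3481944))
# prints ('binlog.000041', 4)
```

      With a lookup of this kind, the leader could start reading from the returned file and offset rather than searching for the requested GTID on a slow disk.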

          Activity

            Elkin Andrei Elkin added a comment -

            stephane@skysql.com, thanks for the report!

            Firstly, I agree master should be faster to respond, and there seems to be nothing but the indexing to help out.
            I'm following with some more thought on that ticket.

            The new in-CONNECTing state also makes sense.
            Could you please report the server version, so that we can make sure we don't already have that in higher versions?

            knielsen Kristian Nielsen added a comment -

            Nice analysis Stephane!

            GTID indexes are now finally done, in 11.4 (MDEV-4991). This should fix the problem with slow connect to master with slow disks.

            (The other points mentioned in the description remain valid, of course).

            stephane@skysql.com VAROQUI Stephane added a comment -

            Let's celebrate MDEV-4991 in FOSDEM

            People

              Elkin Andrei Elkin
              stephane@skysql.com VAROQUI Stephane