MariaDB Server / MDEV-23028

master info corruption IO thread OK but no events are copied


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.3.23
    • Fix Version/s: None
    • Component/s: Backup, Replication
    • Labels: None
    • Environment:
      1 - Master 1-Slave /GTID conservative/semisync

      Description

      OK, the whole story started with restoring a mysqldump that failed,
      complaining that the event scheduler was off on the master:

      /usr/bin/mysqldump --opt --hex-blob --events --disable-keys --master-data=1 --apply-slave-statements --gtid --single-transaction --all-databases --host=db-fr-1.mixr-dev.svc.cloud18 --port=3306 --user=root --password=XXXX --verbose

      mysqldump Ver 10.17 Distrib 10.4.13-MariaDB, for Linux (x86_64)
      The output is piped directly to the mysql client.

      Couldn't execute 'show events': Cannot proceed, because event scheduler is disabled (1577)

      Here I suppose that if the event scheduler is off, the backup should just skip the events and not fail with a rollback?
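      As a hypothetical workaround sketch (this helper is not part of mysqldump; the function name and structure are assumptions), the dump driver could decide whether to pass --events based on the master's event_scheduler state, so the dump does not abort with error 1577:

      ```python
      # Hypothetical sketch: build the mysqldump flag list, including --events
      # only when the master's event scheduler is enabled, to avoid
      # "Cannot proceed, because event scheduler is disabled (1577)".

      def build_dump_args(event_scheduler_on):
          """Return mysqldump flags; add --events only if the scheduler is on."""
          args = ["--opt", "--hex-blob", "--disable-keys",
                  "--master-data=1", "--apply-slave-statements",
                  "--gtid", "--single-transaction", "--all-databases"]
          if event_scheduler_on:
              args.append("--events")
          return args

      print(build_dump_args(False))  # no --events when the scheduler is off
      ```

      The caller would first query SHOW VARIABLES LIKE 'event_scheduler' on the master and feed the result in.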

      On the master:

      2020-06-26 16:23:00 131 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:24:00 161 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:24:00 131 [Warning] Aborted connection 131 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
      2020-06-26 16:25:05 225 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:25:05 161 [Warning] Aborted connection 161 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
      2020-06-26 16:26:10 257 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:26:11 225 [Warning] Aborted connection 225 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)
      2020-06-26 16:27:16 316 [Note] Start binlog_dump to slave_server(53814828), pos(log-bin.000011, 356)
      2020-06-26 16:27:16 257 [Warning] Aborted connection 257 to db: 'unconnected' user: 'root' host: '10.48.96.64' (A slave with the same server_uuid/server_id as this slave has...)

      On the slave:

      2020-06-26 16:36:32 544 [Note] Slave I/O thread: connected to master 'root@db-fr-1.mixr-dev.svc.cloud18:3306',replication starts at GTID position '0-12599180-2851700'
      2020-06-26 16:37:32 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:38:37 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:39:42 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:40:47 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:41:52 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:42:57 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'
      2020-06-26 16:44:02 544 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log-bin.000011' at position 356; GTID position '0-12599180-2851700'

      On the slave, both the I/O thread and the SQL thread are running, which is already an issue, since it looks like the I/O thread can't fetch any events.

      I've tried to:

      • raise CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD=30;
      • change the slave server id
      • disable semisync
      • move from conservative to optimistic
      • restart the slave
      • restart the master

      None of these so far cured the replication, which stays stuck on GTID '0-12599180-2851700'.

      master.info looks strange:
      33
      log-bin.000011
      356
      db-fr-1.mixr-dev.svc.cloud18
      root
      mariadb
      3306
      5
      0

      0
      30.000

      0

      0

      using_gtid=2
      do_domain_ids=0
      ignore_domain_ids=0
      END_MARKER
      RKER

      That RKER at the end is not a copy-paste mistake, so the file got corrupted somehow.
      Indeed, STOP SLAVE; RESET SLAVE ALL; and restoring the dump fixed the issue.
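      A simple sanity check, sketched here in Python (this is an illustration, not MariaDB code; the sample content is taken from the corrupted file above), would catch a truncated or duplicated trailing marker before the slave trusts the file:

      ```python
      # Sketch: verify that a master.info file ends with exactly one
      # END_MARKER line. The corrupted file in this report had a stray
      # "RKER" line after END_MARKER, which this check flags.

      SAMPLE = """log-bin.000011
      356
      db-fr-1.mixr-dev.svc.cloud18
      using_gtid=2
      do_domain_ids=0
      ignore_domain_ids=0
      END_MARKER
      RKER
      """

      def master_info_ok(text):
          """A healthy file's last non-empty line is END_MARKER."""
          lines = [l.strip() for l in text.splitlines() if l.strip()]
          return bool(lines) and lines[-1] == "END_MARKER"

      print(master_info_ok(SAMPLE))  # the corrupted sample fails the check
      ```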

      I was stuck twice in such a state on that release, and never on 10.2.18 in production. Can one imagine a scenario that would so badly corrupt the end of the master.info file?

      Maybe a concurrent START SLAVE or CHANGE MASTER TO from the dump reload and replication-manager?
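      If two writers do race on the file, the classic defense (sketched here in Python; this is not MariaDB's actual implementation) is to write a temporary file and atomically rename it over the old one, so a reader or a crash never observes a half-old, half-new file with a stale tail like "RKER":

      ```python
      # Sketch of an atomic file rewrite: write to a temp file in the same
      # directory, flush + fsync, then os.replace() it over the target.
      # Readers then see either the complete old file or the complete new
      # one, never a mix (e.g. a shorter new version with the old file's
      # leftover bytes at the end).
      import os
      import tempfile

      def atomic_write(path, data):
          d = os.path.dirname(os.path.abspath(path))
          fd, tmp = tempfile.mkstemp(dir=d)
          try:
              with os.fdopen(fd, "w") as f:
                  f.write(data)
                  f.flush()
                  os.fsync(f.fileno())   # make sure the bytes hit disk
              os.replace(tmp, path)       # atomic rename on POSIX
          except BaseException:
              os.unlink(tmp)
              raise
      ```

      Writing in place, by contrast, can leave trailing bytes of a longer previous version if the new contents are shorter and the file is not truncated.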

            People

            Assignee:
            Unassigned
            Reporter:
            stephane@skysql.com VAROQUI Stephane
            Votes:
            0
            Watchers:
            1
