Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15972

Slave Data Mismatch

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.2.12
    • Fix Version/s: 10.2
    • Component/s: Replication
    • Labels:
    • Environment:
      Ubuntu - Xenial , 4CPU & 8GB RAM, NVME Disk100G.

      Description

      Below is the setup:

      (MUMBAI-N1->MUMBAI-N2) < = > (BLORE-N1->BLORE-N2)

      MUMBAI-N2 <=> BLORE-N2 is Master-Master to each other.

      OneCluster-Domainid-4100
      MUMBAI N1 - prd-aearo002 (E2)
      MUMBAI N2 - prd-aearo003 (E3)

      OneCluster-Domainid-4101
      BLORE N1 - prd-xaero002 (X2)
      BLORE N3 - prd-xaero001 (X1)

      We are able to achieve this by mentioning below variables:

      gtid_domain_id :
      > MUM-N1=4100
      > MUM-N2=4100
      > BLORE-N1=4101
      > BLORE-N2=4101

      set global gtid_strict_mode=0;
      set global wsrep_restart_slave = 1;
      slave-skip-errors=1062
       
      CHANGE MASTER to {ignore_server_id = 'of its Cluster group server_ids'}
      

      N1

      set global auto_increment_offset = 1;
      set global auto_increment_increment = 4;
      

      N2

      set global auto_increment_offset = 2;
      set global auto_increment_increment = 4;
      

      N3

      set global auto_increment_offset = 3;
      set global auto_increment_increment = 4;
      

      N4

      set global auto_increment_offset = 4;
      set global auto_increment_increment = 4;
      

      ON E3:

       set GLOBAL gtid_slave_pos='0-1-2,4101-855583900-13';  #--> On X2 (show variables like 'gtid_binlog_pos';)
       CHANGE MASTER 'x-02' TO  MASTER_HOST='prd-xaero002.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS = (195981401,751586059);
       
       set GLOBAL gtid_slave_pos='0-1-2,4101-855583900-13'; # On X1 (show variables like 'gtid_binlog_pos';)
       CHANGE MASTER 'x-03' TO  MASTER_HOST='prd-xaero001.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos,IGNORE_SERVER_IDS = (195981401,751586059);
      

      start slave 'x-02'; start slave 'x-03' ;

      ON X1:

       set GLOBAL gtid_slave_pos='0-195981401-6'; # On E2 (show variables like 'gtid_binlog_pos';)
       CHANGE MASTER 'e-02' TO  MASTER_HOST='prd-eaero002.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos,IGNORE_SERVER_IDS = (855583900,855583901);
       
       set GLOBAL gtid_slave_pos='0-195981401-6'; # On E3 (show variables like 'gtid_binlog_pos';)
       CHANGE MASTER 'e-03' TO  MASTER_HOST='prd-eaero003.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS = (855583900,855583901);
      

      start slave 'e-03'; start slave 'e-02';

      Slaves looks good:

      -hprd-xaero003.phonepe.nm1
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: Yes
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: Yes
       
      -hprd-eaero003.phonepe.nm1
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: Yes
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: Yes
      

      #Proceeding to the next toughness:
      #Now writing on Blore-N1 * 8TPS, while Mumbai-N1,N2 * 8TPS is writing. For 1Day.
      Issue is sometimes this goes for a GhostSQLDrop:

            2018-04-20 17:32:56 140708296464128 [Warning] Master 'e-03': Slave SQL: Could not execute Write_rows_v1 event on table MUMBAI.mumbai; Duplicate entry '21849' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000006, end_log_pos 2247180, Gtid 4100-195981401-4916, Internal MariaDB error code: 1062
            2018-04-20 17:32:56 140708296464128 [Note] Master 'e-03': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000006' at position 2246959; GTID position '4100-195981401-4914,4101-855583900-2939'
      

      We just need to start it: MariaDB [(none)]> start slave 'e-03';

      {Came up with an automated script works good for E3 and X1}

      Next Day: Stopped all writes.

      On Mum-N1:

            root@prd-eaero002:~# mysql -e "select count(*) from MUMBAI.mumbai"
      +----------+
      | count(*) |
      +----------+
      |  1011536 |
      +----------+
      

      On Blore-N1:

           root@prd-xaero003:~# mysql -e "select count(*) from MUMBAI.mumbai"
      +----------+
      | count(*) |
      +----------+
      |  1011534 |
      +----------+
      

      While checking the data:

      root@prd-eaero002:~# mysql -e "select * from MUMBAI.mumbai where uid='592457';"
      +------+------+---------------------+--------+
      | id   | node | dated               | uid    |
      +------+------+---------------------+--------+
      |  101 |    1 | 2018-04-21 18:26:42 | 592457 |
      +------+------+---------------------+--------+
      root@prd-eaero002:~# mysql -e "select * from MUMBAI.mumbai where uid='592461';"
      +------+------+---------------------+--------+
      | id   | node | dated               | uid    |
      +------+------+---------------------+--------+
      |  101 |    1 | 2018-04-21 18:26:42 | 592461 |
      +------+------+---------------------+--------+
      root@prd-eaero002:~#
      

      Only uid 592457,592461 is missing in BLORE Cluster.

      root@prd-xaero001:~#  mysql -e "select * from MUMBAI.mumbai where uid='592457';"
      root@prd-xaero001:~#  mysql -e "select * from MUMBAI.mumbai where uid='592461';"
      

      In Diff out of 10L records. Only above IDs are missing.

      root@prd-eaero002:~# diff /tmp/uid /tmp/prd-xaero002.phonepe.nm1_uid
      161937,161938d161936
      < 592457
      < 592461
      root@prd-eaero002:~# wc -l /tmp/prd-xaero002.phonepe.nm1_uid
      1011535 /tmp/prd-xaero002.phonepe.nm1_uid
      root@prd-eaero002:~# wc -l /tmp/uid
      1011537 /tmp/uid
      root@prd-eaero002:~#
      

      Looks like on BLORE-N2(Slave & Master of MUMBAI-N2) - (prd-xaero001)

      2018-04-21 18:26:42 0x7f04801e6700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.2.12/storage/innobase/trx/trx0trx.cc line 2117
      InnoDB: Failing assertion: trx->lock.n_active_thrs == 1
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mysqld startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
      InnoDB: about forcing recovery.
      180421 18:26:42 [ERROR] mysqld got signal 11 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.
      

      And BLORE-N2 went for restart and received IST.

      Now it picked the slave from where it left:

      2018-04-21 18:27:46 139643427370752 [Note] Master 'e-02': Slave I/O thread: connected to master 'repl_user@prd-eaero002.phonepe.nm1:3306',replication starts at GTID position '4101-855583900-103689,0-195981401-6,4100-195981401-58264'
      2018-04-21 18:27:41 139643218073344 [Note] Master 'e-03': Slave I/O thread: connected to master 'repl_user@prd-eaero003.phonepe.nm1:3306',replication starts at GTID position '4101-855583900-103689,0-195981401-6,4100-195981401-58264'
      

      The uid in table corresponds to 4100-195981401-58264 is from the point BLORE-Node2 crashed.

          1. @4=592449 (4100-195981401-58264) exists in all
          2. @4=592453 (4100-195981401-58265) exists in all
            Server Blore Started giving its data back to Mumbai.
          3. @4=592451 (4101-855583900-103690) exists in all
            Next ones are missing ones.
          4. @4=592457 (4100-195981401-58266) doesn't exist in Blore
          5. @4=592461 (4100-195981401-58267) doesn't exist in Blore

      Very strange: Hmmmm....

      Now I see in BLORE-N2
      It exist in Relay log but not in Binary log:

      root@prd-xaero001:/home/mannoj.kumar# mysqlbinlog /var/lib/mysql/mysqld-relay-bin-e@002d03.000002 --base64-output=DECODE-ROWS --verbose | grep -i "4=" | head -20
      ###   @4=592453
      ###   @4=592457  - Exist in Relay Log
      ###   @4=592461  - Exist in Relay Log
      ###   @4=592465
      ###   @4=592469
      ###   @4=592473
      ###   @4=592481
      ###   @4=592477
      

      Binglog Entries : Missing in Binlog

      ###   @4=592453
      ###   @4=592451
      ###   @4=592518
      ###   @4=592522
      ###   @4=592524
      ###   @4=592526
      ###   @4=592528
      ###   @4=592530
      ###   @4=592532
      ###   @4=592534
      

        Attachments

          Activity

            People

            Assignee:
            Elkin Andrei Elkin
            Reporter:
            mannoj-maria Mannoj Kumar Saravanan
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated: