[MDEV-15972] Slave Data Mismatch - Jira

XML

Word

Printable

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.2.12
Fix Version/s: 10.2(EOL)
Component/s: Replication
Labels:
Environment:
Ubuntu - Xenial , 4CPU & 8GB RAM, NVME Disk100G.

Bug Category:
Can result in data loss

Description

Below is the setup:

(MUMBAI-N1->MUMBAI-N2) < = > (BLORE-N1->BLORE-N2)

MUMBAI-N2 <=> BLORE-N2 is Master-Master to each other.

OneCluster-Domainid-4100
MUMBAI N1 - prd-aearo002 (E2)
MUMBAI N2 - prd-aearo003 (E3)

OneCluster-Domainid-4101
BLORE N1 - prd-xaero002 (X2)
BLORE N3 - prd-xaero001 (X1)

We are able to achieve this by mentioning below variables:

gtid_domain_id :
> MUM-N1=4100
> MUM-N2=4100
> BLORE-N1=4101
> BLORE-N2=4101

set global gtid_strict_mode=0;

set global wsrep_restart_slave = 1;

slave-skip-errors=1062

CHANGE MASTER to {ignore_server_id = 'of its Cluster group server_ids'}

N1
set global auto_increment_offset = 1;
set global auto_increment_increment = 4;

N2
set global auto_increment_offset = 2;
set global auto_increment_increment = 4;

N3
set global auto_increment_offset = 3;
set global auto_increment_increment = 4;

N4
set global auto_increment_offset = 4;
set global auto_increment_increment = 4;

ON E3:

 set GLOBAL gtid_slave_pos='0-1-2,4101-855583900-13';  #--> On X2 (show variables like 'gtid_binlog_pos';)

 CHANGE MASTER 'x-02' TO  MASTER_HOST='prd-xaero002.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS = (195981401,751586059);

 set GLOBAL gtid_slave_pos='0-1-2,4101-855583900-13'; # On X1 (show variables like 'gtid_binlog_pos';)

 CHANGE MASTER 'x-03' TO  MASTER_HOST='prd-xaero001.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos,IGNORE_SERVER_IDS = (195981401,751586059);

start slave 'x-02'; start slave 'x-03' ;

ON X1:

 set GLOBAL gtid_slave_pos='0-195981401-6'; # On E2 (show variables like 'gtid_binlog_pos';)

 CHANGE MASTER 'e-02' TO  MASTER_HOST='prd-eaero002.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos,IGNORE_SERVER_IDS = (855583900,855583901);

 set GLOBAL gtid_slave_pos='0-195981401-6'; # On E3 (show variables like 'gtid_binlog_pos';)

 CHANGE MASTER 'e-03' TO  MASTER_HOST='prd-eaero003.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS = (855583900,855583901);

start slave 'e-03'; start slave 'e-02';

Slaves looks good:

-hprd-xaero003.phonepe.nm1

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

-hprd-eaero003.phonepe.nm1

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

             Slave_IO_Running: Yes

            Slave_SQL_Running: Yes

#Proceeding to the next toughness:
#Now writing on Blore-N1 * 8TPS, while Mumbai-N1,N2 * 8TPS is writing. For 1Day.
Issue is sometimes this goes for a GhostSQLDrop:

      2018-04-20 17:32:56 140708296464128 [Warning] Master 'e-03': Slave SQL: Could not execute Write_rows_v1 event on table MUMBAI.mumbai; Duplicate entry '21849' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000006, end_log_pos 2247180, Gtid 4100-195981401-4916, Internal MariaDB error code: 1062

      2018-04-20 17:32:56 140708296464128 [Note] Master 'e-03': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000006' at position 2246959; GTID position '4100-195981401-4914,4101-855583900-2939'

We just need to start it: MariaDB [(none)]> start slave 'e-03';

{Came up with an automated script works good for E3 and X1}

Next Day: Stopped all writes.

On Mum-N1:

      root@prd-eaero002:~# mysql -e "select count(*) from MUMBAI.mumbai"

+----------+

| count(*) |

+----------+

|  1011536 |

+----------+

On Blore-N1:

     root@prd-xaero003:~# mysql -e "select count(*) from MUMBAI.mumbai"

+----------+

| count(*) |

+----------+

|  1011534 |

+----------+

While checking the data:

root@prd-eaero002:~# mysql -e "select * from MUMBAI.mumbai where uid='592457';"

+------+------+---------------------+--------+

| id   | node | dated               | uid    |

+------+------+---------------------+--------+

|  101 |    1 | 2018-04-21 18:26:42 | 592457 |

+------+------+---------------------+--------+

root@prd-eaero002:~# mysql -e "select * from MUMBAI.mumbai where uid='592461';"

+------+------+---------------------+--------+

| id   | node | dated               | uid    |

+------+------+---------------------+--------+

|  101 |    1 | 2018-04-21 18:26:42 | 592461 |

+------+------+---------------------+--------+

root@prd-eaero002:~#

Only uid 592457,592461 is missing in BLORE Cluster.

root@prd-xaero001:~#  mysql -e "select * from MUMBAI.mumbai where uid='592457';"

root@prd-xaero001:~#  mysql -e "select * from MUMBAI.mumbai where uid='592461';"

In Diff out of 10L records. Only above IDs are missing.

root@prd-eaero002:~# diff /tmp/uid /tmp/prd-xaero002.phonepe.nm1_uid

161937,161938d161936

< 592457

< 592461

root@prd-eaero002:~# wc -l /tmp/prd-xaero002.phonepe.nm1_uid

1011535 /tmp/prd-xaero002.phonepe.nm1_uid

root@prd-eaero002:~# wc -l /tmp/uid

1011537 /tmp/uid

root@prd-eaero002:~#

Looks like on BLORE-N2(Slave & Master of MUMBAI-N2) - (prd-xaero001)

2018-04-21 18:26:42 0x7f04801e6700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.2.12/storage/innobase/trx/trx0trx.cc line 2117

InnoDB: Failing assertion: trx->lock.n_active_thrs == 1

InnoDB: We intentionally generate a memory trap.

InnoDB: Submit a detailed bug report to https://jira.mariadb.org/

InnoDB: If you get repeated assertion failures or crashes, even

InnoDB: immediately after the mysqld startup, there may be

InnoDB: corruption in the InnoDB tablespace. Please refer to

InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html

InnoDB: about forcing recovery.

180421 18:26:42 [ERROR] mysqld got signal 11 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

And BLORE-N2 went for restart and received IST.

Now it picked the slave from where it left:

2018-04-21 18:27:46 139643427370752 [Note] Master 'e-02': Slave I/O thread: connected to master 'repl_user@prd-eaero002.phonepe.nm1:3306',replication starts at GTID position '4101-855583900-103689,0-195981401-6,4100-195981401-58264'

2018-04-21 18:27:41 139643218073344 [Note] Master 'e-03': Slave I/O thread: connected to master 'repl_user@prd-eaero003.phonepe.nm1:3306',replication starts at GTID position '4101-855583900-103689,0-195981401-6,4100-195981401-58264'

The uid in table corresponds to 4100-195981401-58264 is from the point BLORE-Node2 crashed.

1. 1. @4=592449 (4100-195981401-58264) exists in all
  2. @4=592453 (4100-195981401-58265) exists in all
    Server Blore Started giving its data back to Mumbai.
  3. @4=592451 (4101-855583900-103690) exists in all
    Next ones are missing ones.
  4. @4=592457 (4100-195981401-58266) doesn't exist in Blore
  5. @4=592461 (4100-195981401-58267) doesn't exist in Blore

Very strange: Hmmmm....

Now I see in BLORE-N2
It exist in Relay log but not in Binary log:

root@prd-xaero001:/home/mannoj.kumar# mysqlbinlog /var/lib/mysql/mysqld-relay-bin-e@002d03.000002 --base64-output=DECODE-ROWS --verbose | grep -i "4=" | head -20

###   @4=592453

###   @4=592457  - Exist in Relay Log

###   @4=592461  - Exist in Relay Log

###   @4=592465

###   @4=592469

###   @4=592473

###   @4=592481

###   @4=592477

Binglog Entries : Missing in Binlog

###   @4=592453

###   @4=592451

###   @4=592518

###   @4=592522

###   @4=592524

###   @4=592526

###   @4=592528

###   @4=592530

###   @4=592532

###   @4=592534

Attachments

Issue Links

relates to

MDEV-8870 Crash - null pointer exception

Closed

Activity

People

Assignee:: Andrei Elkin

Reporter:: Mannoj Kumar Saravanan

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2018-04-22 10:42

Updated:: 2025-10-28 20:22

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.