[MDEV-15972] Slave Data Mismatch Created: 2018-04-22  Updated: 2020-10-20

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.2.12
Fix Version/s: 10.2

Type: Bug Priority: Major
Reporter: Mannoj Kumar Saravanan Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: replication
Environment:

Ubuntu Xenial, 4 CPU & 8 GB RAM, 100 GB NVMe disk.


Issue Links:
Relates
relates to MDEV-8870 Crash - null pointer exception Closed

 Description   

Below is the setup:

(MUMBAI-N1 -> MUMBAI-N2) <=> (BLORE-N1 -> BLORE-N2)

MUMBAI-N2 and BLORE-N2 are master-master to each other.

OneCluster-Domainid-4100
MUMBAI N1 - prd-eaero002 (E2)
MUMBAI N2 - prd-eaero003 (E3)

OneCluster-Domainid-4101
BLORE N1 - prd-xaero002 (X2)
BLORE N2 - prd-xaero001 (X1)

We achieve this by setting the variables below:

gtid_domain_id :
> MUM-N1=4100
> MUM-N2=4100
> BLORE-N1=4101
> BLORE-N2=4101

set global gtid_strict_mode=0;
set global wsrep_restart_slave = 1;
slave-skip-errors=1062
 
CHANGE MASTER TO ... with IGNORE_SERVER_IDS set to the server_ids of the node's own cluster group.

N1

set global auto_increment_offset = 1;
set global auto_increment_increment = 4;

N2

set global auto_increment_offset = 2;
set global auto_increment_increment = 4;

N3

set global auto_increment_offset = 3;
set global auto_increment_increment = 4;

N4

set global auto_increment_offset = 4;
set global auto_increment_increment = 4;
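
With a shared auto_increment_increment of 4 and offsets 1 through 4, each node generates IDs in its own residue class modulo 4, so concurrent inserts on different nodes can never collide. A minimal sketch of the resulting sequences (plain Python for illustration, not MariaDB code):

```python
# Simulate the auto-increment sequences the four nodes generate.
# Each node starts at its offset and steps by the shared increment,
# so node k produces ids congruent to k modulo 4 -- the sets are disjoint.

def node_ids(offset, increment=4, count=5):
    """First `count` auto-increment values for a node with this offset."""
    return [offset + increment * i for i in range(count)]

sequences = {node: node_ids(node) for node in (1, 2, 3, 4)}
# node 1 -> [1, 5, 9, 13, 17], node 2 -> [2, 6, 10, 14, 18], ...

# No id is ever produced by two different nodes:
all_ids = [i for seq in sequences.values() for i in seq]
assert len(all_ids) == len(set(all_ids))
```

This is why a genuine duplicate-key error (1062) on the slave points at a replication problem rather than an ID collision between nodes.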

ON E3:

 set GLOBAL gtid_slave_pos='0-1-2,4101-855583900-13';  # value of gtid_binlog_pos on X2
 CHANGE MASTER 'x-02' TO  MASTER_HOST='prd-xaero002.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS = (195981401,751586059);
 
 set GLOBAL gtid_slave_pos='0-1-2,4101-855583900-13'; # value of gtid_binlog_pos on X1
 CHANGE MASTER 'x-03' TO  MASTER_HOST='prd-xaero001.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos,IGNORE_SERVER_IDS = (195981401,751586059);

start slave 'x-02'; start slave 'x-03' ;

ON X1:

 set GLOBAL gtid_slave_pos='0-195981401-6'; # value of gtid_binlog_pos on E2
 CHANGE MASTER 'e-02' TO  MASTER_HOST='prd-eaero002.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos,IGNORE_SERVER_IDS = (855583900,855583901);
 
 set GLOBAL gtid_slave_pos='0-195981401-6'; # value of gtid_binlog_pos on E3
 CHANGE MASTER 'e-03' TO  MASTER_HOST='prd-eaero003.phonepe.nm1',MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS = (855583900,855583901);

start slave 'e-03'; start slave 'e-02';

Slaves look good:

-hprd-xaero003.phonepe.nm1
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
 
-hprd-eaero003.phonepe.nm1
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

Proceeding to the next stress test:
Now writing on BLORE-N1 at 8 TPS, while MUMBAI-N1 and N2 are each writing at 8 TPS, for one day.
The issue is that this sometimes goes for a GhostSQLDrop (the slave SQL thread stops on a duplicate-key error):

      2018-04-20 17:32:56 140708296464128 [Warning] Master 'e-03': Slave SQL: Could not execute Write_rows_v1 event on table MUMBAI.mumbai; Duplicate entry '21849' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000006, end_log_pos 2247180, Gtid 4100-195981401-4916, Internal MariaDB error code: 1062
      2018-04-20 17:32:56 140708296464128 [Note] Master 'e-03': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000006' at position 2246959; GTID position '4100-195981401-4914,4101-855583900-2939'

We just need to start it: MariaDB [(none)]> start slave 'e-03';

{We came up with an automated script to restart the slave; it works well on E3 and X1.}
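
The automated restart script essentially watches the error log for the SQL-thread stop and issues START SLAVE on the affected channel. A hypothetical sketch of the log-parsing half (the regex is an assumption modelled on the log lines quoted above; this is not the actual script):

```python
import re

# Match lines like:
#   [Note] Master 'e-03': Slave SQL thread exiting, replication stopped ...
STOP_RE = re.compile(r"Master '([^']+)': Slave SQL thread exiting")

def restart_command(log_line):
    """Return the START SLAVE statement for a stopped channel, else None."""
    m = STOP_RE.search(log_line)
    if m is None:
        return None
    return "START SLAVE '%s';" % m.group(1)

line = ("2018-04-20 17:32:56 140708296464128 [Note] Master 'e-03': "
        "Slave SQL thread exiting, replication stopped in log "
        "'mariadb-bin.000006' at position 2246959")
print(restart_command(line))   # START SLAVE 'e-03';
```

The generated statement would then be fed to the mysql client on the affected node.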

Next Day: Stopped all writes.

On Mum-N1:

      root@prd-eaero002:~# mysql -e "select count(*) from MUMBAI.mumbai"
+----------+
| count(*) |
+----------+
|  1011536 |
+----------+

On Blore-N1:

     root@prd-xaero003:~# mysql -e "select count(*) from MUMBAI.mumbai"
+----------+
| count(*) |
+----------+
|  1011534 |
+----------+

While checking the data:

root@prd-eaero002:~# mysql -e "select * from MUMBAI.mumbai where uid='592457';"
+------+------+---------------------+--------+
| id   | node | dated               | uid    |
+------+------+---------------------+--------+
|  101 |    1 | 2018-04-21 18:26:42 | 592457 |
+------+------+---------------------+--------+
root@prd-eaero002:~# mysql -e "select * from MUMBAI.mumbai where uid='592461';"
+------+------+---------------------+--------+
| id   | node | dated               | uid    |
+------+------+---------------------+--------+
|  101 |    1 | 2018-04-21 18:26:42 | 592461 |
+------+------+---------------------+--------+
root@prd-eaero002:~#

Only uids 592457 and 592461 are missing in the BLORE cluster.

root@prd-xaero001:~#  mysql -e "select * from MUMBAI.mumbai where uid='592457';"
root@prd-xaero001:~#  mysql -e "select * from MUMBAI.mumbai where uid='592461';"

In a diff over ~10 lakh (1 million) records, only the above uids are missing.

root@prd-eaero002:~# diff /tmp/uid /tmp/prd-xaero002.phonepe.nm1_uid
161937,161938d161936
< 592457
< 592461
root@prd-eaero002:~# wc -l /tmp/prd-xaero002.phonepe.nm1_uid
1011535 /tmp/prd-xaero002.phonepe.nm1_uid
root@prd-eaero002:~# wc -l /tmp/uid
1011537 /tmp/uid
root@prd-eaero002:~#
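
The diff/wc check above boils down to a set difference between the master's and the slave's uid dumps. An equivalent sketch in Python (the sample lists are hypothetical, trimmed to the uids discussed above):

```python
# Compare two uid dumps and report rows present on the master (MUMBAI)
# but absent on the slave (BLORE).  The dumps are assumed to hold one
# uid per line, e.g. from `mysql -N -e "select uid from MUMBAI.mumbai"`.

def missing_uids(master_uids, slave_uids):
    """uids present on the master that never reached the slave, sorted."""
    return sorted(set(master_uids) - set(slave_uids))

master = [592449, 592451, 592453, 592457, 592461]   # sample dump
slave  = [592449, 592451, 592453]                   # 592457, 592461 lost
print(missing_uids(master, slave))                  # [592457, 592461]
```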

It looks like BLORE-N2 (slave of and master to MUMBAI-N2; prd-xaero001) crashed:

2018-04-21 18:26:42 0x7f04801e6700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.2.12/storage/innobase/trx/trx0trx.cc line 2117
InnoDB: Failing assertion: trx->lock.n_active_thrs == 1
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
180421 18:26:42 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

BLORE-N2 then restarted and received an IST (Galera incremental state transfer).

The slave then resumed from where it left off:

2018-04-21 18:27:46 139643427370752 [Note] Master 'e-02': Slave I/O thread: connected to master 'repl_user@prd-eaero002.phonepe.nm1:3306',replication starts at GTID position '4101-855583900-103689,0-195981401-6,4100-195981401-58264'
2018-04-21 18:27:41 139643218073344 [Note] Master 'e-03': Slave I/O thread: connected to master 'repl_user@prd-eaero003.phonepe.nm1:3306',replication starts at GTID position '4101-855583900-103689,0-195981401-6,4100-195981401-58264'

The uid in the table corresponding to 4100-195981401-58264 is from the point at which BLORE-N2 crashed.

      1. @4=592449 (4100-195981401-58264) exists on all nodes
      2. @4=592453 (4100-195981401-58265) exists on all nodes
        At this point the BLORE server started sending its data back to MUMBAI.
      3. @4=592451 (4101-855583900-103690) exists on all nodes
        The next ones are the missing rows:
      4. @4=592457 (4100-195981401-58266) does not exist in BLORE
      5. @4=592461 (4100-195981401-58267) does not exist in BLORE
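
MariaDB GTIDs are domain-server_id-seqno triples, and a GTID position keeps one entry per replication domain, which is how the two missing events can be pinned down by seqno. An illustrative parser for the positions quoted above (not MariaDB code, just a sketch of the format):

```python
from collections import namedtuple

GTID = namedtuple("GTID", "domain server_id seqno")

def parse_gtid(s):
    """Parse a MariaDB GTID 'domain-server_id-seqno' into integers."""
    domain, server_id, seqno = (int(p) for p in s.split("-"))
    return GTID(domain, server_id, seqno)

def parse_position(pos):
    """Parse a comma-separated GTID position into {domain: GTID}."""
    return {g.domain: g for g in (parse_gtid(p) for p in pos.split(","))}

pos = parse_position("4101-855583900-103689,0-195981401-6,4100-195981401-58264")
# The crash point in domain 4100 is seqno 58264; the missing rows carry
# seqnos 58266 and 58267, i.e. events right after the crash.
print(pos[4100].seqno)    # 58264
```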

Very strange: Hmmmm....

Now on BLORE-N2 I see the rows exist in the relay log but not in the binary log:

root@prd-xaero001:/home/mannoj.kumar# mysqlbinlog /var/lib/mysql/mysqld-relay-bin-e@002d03.000002 --base64-output=DECODE-ROWS --verbose | grep -i "4=" | head -20
###   @4=592453
###   @4=592457  <- exists in relay log
###   @4=592461  <- exists in relay log
###   @4=592465
###   @4=592469
###   @4=592473
###   @4=592481
###   @4=592477

Binlog entries (the rows above are missing from the binlog):

###   @4=592453
###   @4=592451
###   @4=592518
###   @4=592522
###   @4=592524
###   @4=592526
###   @4=592528
###   @4=592530
###   @4=592532
###   @4=592534


Generated at Thu Feb 08 08:25:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.