Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 10.2.12
Environment: Ubuntu Xenial, 4 CPU, 8 GB RAM, 100 GB NVMe disk
Description
Below is the setup:

(MUMBAI-N1 -> MUMBAI-N2) <=> (BLORE-N1 -> BLORE-N2)

MUMBAI-N2 <=> BLORE-N2 are master-master to each other.

Cluster 1, gtid_domain_id 4100:
MUMBAI N1 - prd-eaero002 (E2)
MUMBAI N2 - prd-eaero003 (E3)

Cluster 2, gtid_domain_id 4101:
BLORE N1 - prd-xaero002 (X2)
BLORE N2 - prd-xaero001 (X1)
We are able to achieve this by setting the variables below.

gtid_domain_id:
> MUM-N1   = 4100
> MUM-N2   = 4100
> BLORE-N1 = 4101
> BLORE-N2 = 4101

set global gtid_strict_mode = 0;
set global wsrep_restart_slave = 1;
slave-skip-errors = 1062
CHANGE MASTER TO ... IGNORE_SERVER_IDS = (the server_ids of its own cluster group)
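For reference, the static settings above map to server configuration roughly like this (a minimal sketch for MUM-N1 only; the file path, section layout and log_slave_updates are assumptions, not taken from the actual setup):

# /etc/mysql/conf.d/replication.cnf on MUM-N1 (hypothetical path)
[mysqld]
server_id           = 195981401   # unique per node; appears in IGNORE_SERVER_IDS on the remote side
gtid_domain_id      = 4100        # 4100 for the MUMBAI cluster, 4101 for BLORE
gtid_strict_mode    = 0
log_bin             = mariadb-bin
log_slave_updates   = 1           # assumed, so cluster writes are binlogged and reach the remote slave
slave_skip_errors   = 1062
wsrep_restart_slave = 1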
N1:
set global auto_increment_offset = 1;
set global auto_increment_increment = 4;

N2:
set global auto_increment_offset = 2;
set global auto_increment_increment = 4;

N3:
set global auto_increment_offset = 3;
set global auto_increment_increment = 4;

N4:
set global auto_increment_offset = 4;
set global auto_increment_increment = 4;
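With auto_increment_increment = 4 and offsets 1-4, each node hands out AUTO_INCREMENT values from a disjoint arithmetic sequence (N1: 1, 5, 9, ...; N2: 2, 6, 10, ...; and so on), so concurrent writes on all four nodes cannot collide on an auto-generated primary key. A quick sanity check (table name made up for illustration):

-- on N1 (offset 1, increment 4):
CREATE TABLE test.ai_check (id INT AUTO_INCREMENT PRIMARY KEY, node INT);
INSERT INTO test.ai_check (node) VALUES (1), (1), (1);
SELECT id FROM test.ai_check;   -- expect 1, 5, 9

Note that with Galera, wsrep_auto_increment_control (ON by default) adjusts these two variables automatically to the cluster size, so presumably it was disabled here for the manual values to stick.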
ON E3:

set GLOBAL gtid_slave_pos='0-1-2,4101-855583900-13';   # taken from X2: show variables like 'gtid_binlog_pos';
CHANGE MASTER 'x-02' TO MASTER_HOST='prd-xaero002.phonepe.nm1', MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS=(195981401,751586059);

set GLOBAL gtid_slave_pos='0-1-2,4101-855583900-13';   # taken from X1: show variables like 'gtid_binlog_pos';
CHANGE MASTER 'x-03' TO MASTER_HOST='prd-xaero001.phonepe.nm1', MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS=(195981401,751586059);

start slave 'x-02'; start slave 'x-03';
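For each connection, the seed value for gtid_slave_pos is read off the remote master first; something along these lines (output abbreviated):

-- on X2, before configuring E3:
SHOW GLOBAL VARIABLES LIKE 'gtid_binlog_pos';
-- gtid_binlog_pos | 0-1-2,4101-855583900-13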
ON X1:

set GLOBAL gtid_slave_pos='0-195981401-6';   # taken from E2: show variables like 'gtid_binlog_pos';
CHANGE MASTER 'e-02' TO MASTER_HOST='prd-eaero002.phonepe.nm1', MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS=(855583900,855583901);

set GLOBAL gtid_slave_pos='0-195981401-6';   # taken from E3: show variables like 'gtid_binlog_pos';
CHANGE MASTER 'e-03' TO MASTER_HOST='prd-eaero003.phonepe.nm1', MASTER_USER='repl_user', MASTER_PASSWORD='slave123', master_use_gtid=slave_pos, IGNORE_SERVER_IDS=(855583900,855583901);

start slave 'e-03'; start slave 'e-02';
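The per-connection state of a multi-source slave can then be checked in one shot; the status excerpts below were presumably gathered per host with something like this (the grep filter is an assumption):

mysql -h prd-xaero003.phonepe.nm1 -e "SHOW ALL SLAVES STATUS\G" | grep -E "Slave_(IO|SQL)_Running"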
Slaves look good:

-h prd-xaero003.phonepe.nm1
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

-h prd-eaero003.phonepe.nm1
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
# Proceeding to the next, tougher stage:
# Now writing on BLORE-N1 at 8 TPS while MUMBAI-N1 and N2 are also writing at 8 TPS each, for one day.

The issue is that the slave SQL thread sometimes silently drops out on a duplicate-key error (a "ghost SQL drop"):
2018-04-20 17:32:56 140708296464128 [Warning] Master 'e-03': Slave SQL: Could not execute Write_rows_v1 event on table MUMBAI.mumbai; Duplicate entry '21849' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000006, end_log_pos 2247180, Gtid 4100-195981401-4916, Internal MariaDB error code: 1062
2018-04-20 17:32:56 140708296464128 [Note] Master 'e-03': Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000006' at position 2246959; GTID position '4100-195981401-4914,4101-855583900-2939'
We just need to start it again:

MariaDB [(none)]> start slave 'e-03';
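The automated script itself was not captured here; a minimal sketch of the idea (hypothetical script, log path assumed; connection names are e-02/e-03 on X1 and x-02/x-03 on E3):

#!/bin/bash
# slave_watchdog.sh - hypothetical; restarts any stopped named slave connection.
# Relies on the manual-restart / slave-skip-errors approach being acceptable for this test only.
for conn in e-02 e-03; do
    running=$(mysql -N -e "SHOW SLAVE '$conn' STATUS\G" | awk '/Slave_SQL_Running:/ {print $2}')
    if [ "$running" != "Yes" ]; then
        echo "$(date) slave $conn stopped, restarting" >> /var/log/slave_watchdog.log
        mysql -e "START SLAVE '$conn';"
    fi
done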
(We came up with an automated script along the lines sketched above; it works well for E3 and X1.)

Next day: stopped all writes.
On Mum-N1:

root@prd-eaero002:~# mysql -e "select count(*) from MUMBAI.mumbai"
+----------+
| count(*) |
+----------+
|  1011536 |
+----------+

On Blore-N1:

root@prd-xaero003:~# mysql -e "select count(*) from MUMBAI.mumbai"
+----------+
| count(*) |
+----------+
|  1011534 |
+----------+
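(For completeness, a sketch of comparing the counts across all four nodes in one pass; the host list is assumed from the setup above:)

for h in prd-eaero002 prd-eaero003 prd-xaero002 prd-xaero001; do
    echo -n "$h: "
    mysql -h $h.phonepe.nm1 -N -e "SELECT COUNT(*) FROM MUMBAI.mumbai"
done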
While checking the data:

root@prd-eaero002:~# mysql -e "select * from MUMBAI.mumbai where uid='592457';"
+------+------+---------------------+--------+
| id   | node | dated               | uid    |
+------+------+---------------------+--------+
|  101 |    1 | 2018-04-21 18:26:42 | 592457 |
+------+------+---------------------+--------+

root@prd-eaero002:~# mysql -e "select * from MUMBAI.mumbai where uid='592461';"
+------+------+---------------------+--------+
| id   | node | dated               | uid    |
+------+------+---------------------+--------+
|  101 |    1 | 2018-04-21 18:26:42 | 592461 |
+------+------+---------------------+--------+
Only uids 592457 and 592461 are missing in the BLORE cluster:

root@prd-xaero001:~# mysql -e "select * from MUMBAI.mumbai where uid='592457';"
root@prd-xaero001:~# mysql -e "select * from MUMBAI.mumbai where uid='592461';"

In a diff over roughly 10 lakh (1 million) records, only the above uids are missing:

root@prd-eaero002:~# diff /tmp/uid /tmp/prd-xaero002.phonepe.nm1_uid
161937,161938d161936
< 592457
< 592461
root@prd-eaero002:~# wc -l /tmp/prd-xaero002.phonepe.nm1_uid
1011535 /tmp/prd-xaero002.phonepe.nm1_uid
root@prd-eaero002:~# wc -l /tmp/uid
1011537 /tmp/uid
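The /tmp/uid dump files above were presumably produced with the equivalent of the following (exact commands not captured; sorted output matters for diff):

mysql -N -e "SELECT uid FROM MUMBAI.mumbai ORDER BY uid" > /tmp/uid
mysql -N -h prd-xaero002.phonepe.nm1 -e "SELECT uid FROM MUMBAI.mumbai ORDER BY uid" > /tmp/prd-xaero002.phonepe.nm1_uid
diff /tmp/uid /tmp/prd-xaero002.phonepe.nm1_uid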
It looks like BLORE-N2 (prd-xaero001, slave and master of MUMBAI-N2) crashed:
2018-04-21 18:26:42 0x7f04801e6700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.2.12/storage/innobase/trx/trx0trx.cc line 2117
InnoDB: Failing assertion: trx->lock.n_active_thrs == 1
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
180421 18:26:42 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
BLORE-N2 then restarted and received an IST. It picked up the slave connections from where it left off:
2018-04-21 18:27:46 139643427370752 [Note] Master 'e-02': Slave I/O thread: connected to master 'repl_user@prd-eaero002.phonepe.nm1:3306',replication starts at GTID position '4101-855583900-103689,0-195981401-6,4100-195981401-58264'
2018-04-21 18:27:41 139643218073344 [Note] Master 'e-03': Slave I/O thread: connected to master 'repl_user@prd-eaero003.phonepe.nm1:3306',replication starts at GTID position '4101-855583900-103689,0-195981401-6,4100-195981401-58264'
The uid in the table corresponding to 4100-195981401-58264 is from the point at which BLORE-N2 crashed:

- @4=592449 (4100-195981401-58264) exists in all
- @4=592453 (4100-195981401-58265) exists in all
- @4=592451 (4101-855583900-103690) exists in all   <- here the BLORE server started giving its data back to MUMBAI
- @4=592457 (4100-195981401-58266) doesn't exist in BLORE   <- missing
- @4=592461 (4100-195981401-58267) doesn't exist in BLORE   <- missing

Very strange. Hmmmm....
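One way to corroborate where the events were lost is to compare the GTID state variables on BLORE-N2 after the restart (a sketch only; the exact values at crash time were not captured):

-- on BLORE-N2 (prd-xaero001):
SHOW GLOBAL VARIABLES LIKE 'gtid_slave_pos';    -- what the SQL threads claim to have applied
SHOW GLOBAL VARIABLES LIKE 'gtid_binlog_pos';   -- what actually made it into the binlog
-- if gtid_slave_pos in domain 4100 is ahead of gtid_binlog_pos, events were
-- applied or skipped without being binlogged (and so never replicated onward)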
Now I see on BLORE-N2 that the missing rows exist in the relay log but not in the binary log:

root@prd-xaero001:/home/mannoj.kumar# mysqlbinlog /var/lib/mysql/mysqld-relay-bin-e@002d03.000002 --base64-output=DECODE-ROWS --verbose | grep -i "4=" | head -20
### @4=592453
### @4=592457   <- exists in relay log
### @4=592461   <- exists in relay log
### @4=592465
### @4=592469
### @4=592473
### @4=592481
### @4=592477
Binlog entries (592457 and 592461 are missing from the binlog):

### @4=592453
### @4=592451
### @4=592518
### @4=592522
### @4=592524
### @4=592526
### @4=592528
### @4=592530
### @4=592532
### @4=592534
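(The command for the binlog-side listing above was not captured; presumably the equivalent of the relay-log check, pointed at the local binlog - the file name here is an assumption:)

mysqlbinlog /var/lib/mysql/mariadb-bin.000006 --base64-output=DECODE-ROWS --verbose | grep -i "4=" | head -20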
Attachments
Issue Links
- relates to MDEV-8870 Crash - null pointer exception (Closed)