[MDEV-28770] replication duplicate key with the slave_pos Created: 2022-06-08  Updated: 2022-06-16

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4.18
Fix Version/s: None

Type: Bug Priority: Major
Reporter: yongjian.wu Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: replication
Environment:

master slave


Attachments: HTML File binlog     File my.cnf.1     HTML File scripts    

 Description   

Hi guys
When using slave_pos to recover a crashed master/slave pair, like below:

CHANGE MASTER TO MASTER_HOST='172.137.254.21', MASTER_PORT=3306, MASTER_USER='replication', MASTER_PASSWORD=*****, MASTER_USE_GTID=slave_pos;

we found that replication starts from a wrong position. Although the position printed in the log is correct, the slave actually starts from a wrong position and tries to apply transactions that already exist in the database.
/////////////////////////
2022-06-08 10:58:45 1105456 [Note] Slave I/O thread: connected to master 'replication@172.137.254.21:3306',replication starts at GTID position '111-11-5933'
2022-06-08 10:58:45 1105457 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table tt.table1; Duplicate entry '5575-2021-12-30 00:00:00-m1' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log BINLOG.000001, end_log_pos 1880831, Internal MariaDB error code: 1062
2022-06-08 10:58:45 1105457 [Warning] Slave: Duplicate entry '5575-2021-12-30 00:00:00-m1' for key 'PRIMARY' Error_code: 1062
//////////////////////////////

Is this a bug? Could you help with this?
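For reference, the recovery flow described above can be sketched as follows. The host/user values are taken from the report (password elided as in the original); the verification queries before and after are my addition, not part of the original reproduction:

```sql
-- On the slave, before reattaching: record the GTID position the slave
-- believes it has applied.
SELECT @@GLOBAL.gtid_slave_pos, @@GLOBAL.gtid_current_pos;

STOP SLAVE;

-- Reattach using the GTID position rather than a file/offset pair
-- (statement as given in the report).
CHANGE MASTER TO
  MASTER_HOST='172.137.254.21',
  MASTER_PORT=3306,
  MASTER_USER='replication',
  MASTER_PASSWORD=*****,
  MASTER_USE_GTID=slave_pos;

START SLAVE;

-- Confirm where replication actually resumed.
SHOW SLAVE STATUS\G
```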



 Comments   
Comment by Angelique Sklavounos (Inactive) [ 2022-06-08 ]

Hi yongjian.wu thank you for your report.

If possible, could you please provide the output of show slave status\G, along with the binlog and error log (if it exists)?

Also, to help me try to reproduce...
1) Was the slave originally replicating with MASTER_USE_GTID=slave_pos before the crash?
2) Did both the master and slave crash and were restarted? Or just one of them crashed?

Comment by Angelique Sklavounos (Inactive) [ 2022-06-08 ]

yongjian.wu could you also please describe the replication configuration? Specifically, regarding "when use the slave_pos to recover an crashed master slave": was this CHANGE MASTER statement applied to the slave again, or did the master crash and was it changed to become the new slave using this CHANGE MASTER statement? In the latter case, how did you set the value of @@global.gtid_slave_pos?

Comment by yongjian.wu [ 2022-06-10 ]

Hi Angelique Sklavounos
sorry for the late reply; the environment disappeared and reproducing a new one took some time.
1) Was the slave originally replicating with MASTER_USE_GTID=slave_pos before the crash? --yes
2) Did both the master and slave crash and were restarted? Or just one of them crashed? --just a change of master
I uploaded a file, please kindly take a look: scripts
If you need anything else, please let me know.

thank you so much

Comment by yongjian.wu [ 2022-06-10 ]

You can see in the attached file "binlog" that the GTID at position 877 is 111-11-614; this is an old GTID that was already applied on the current node.

Comment by yongjian.wu [ 2022-06-10 ]

binlog

Comment by yongjian.wu [ 2022-06-10 ]

The error node's (db3) alert log:
2022-06-10 18:58:51 1251 [Note] Slave I/O thread exiting, read up to log 'BINLOG.000002', position 10795; GTID position 111-11-623
2022-06-10 18:58:51 1251 [Note] master was 172.137.254.21:3306
2022-06-10 18:58:52 1250 [Note] Master connection name: '' Master_info_file: 'master.info' Relay_info_file: 'relay-log.info'
2022-06-10 18:58:52 1250 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='172.137.254.21', master_port='3306', master_log_file='BINLOG.000002', master_log_pos='10795'. New state master_host='172.137.254.22', master_port='3306', master_log_file='', master_log_pos='4'.
2022-06-10 18:58:52 1250 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos
2022-06-10 18:58:55 1253 [Note] Slave I/O thread: Start semi-sync replication to master 'replication@172.137.254.22:3306' in log '' at position 4
2022-06-10 18:58:55 1254 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log '/opt/maria10.1/binlog/RELAY.000001' position: 4; GTID position '111-11-623'
2022-06-10 18:58:55 1253 [Note] Slave I/O thread: connected to master 'replication@172.137.254.22:3306',replication starts at GTID position '111-11-623'
2022-06-10 18:58:55 1254 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table tt.table1; Duplicate entry '22-2021-12-30 00:00:00-m1' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log BINLOG.000003, end_log_pos 877, Internal MariaDB error code: 1062
2022-06-10 18:58:55 1254 [Warning] Slave: Duplicate entry '22-2021-12-30 00:00:00-m1' for key 'PRIMARY' Error_code: 1062
2022-06-10 18:58:55 1254 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'BINLOG.000003' position 371; GTID position '111-11-623'
2022-06-10 18:58:55 1254 [Note] Slave SQL thread exiting, replication stopped in log 'BINLOG.000003' at position 371; GTID position '111-11-623'
2022-06-10 18:58:55 1254 [Note] master was 172.137.254.22:3306

Comment by Angelique Sklavounos (Inactive) [ 2022-06-10 ]

Great, thanks yongjian.wu, I will take a look now with the information you gave me.

Comment by Angelique Sklavounos (Inactive) [ 2022-06-10 ]

Hi yongjian.wu, I noticed wsrep_gtid_mode ON in the scripts file - is Galera being used? Could you please send the .cnf files? Thank you.

Comment by yongjian.wu [ 2022-06-13 ]

Galera is not in use; I have attached the my.cnf.
my.cnf.1

thank you

Comment by Kuijun Cui [ 2022-06-13 ]

I'm sorry, I also hit this issue: three nodes, one primary, two replica nodes, the same gtid_domain_id, different server_ids, semi-sync enabled.

When switching to a new primary node, the duplicate key error occurs. None of the nodes produce a new binlog, but the replica SQL thread starts from position 4.

When I use "change master to master_log_pos=correctPosition", the replica comes back to normal.

When I use "change master to master_log_pos=correctPosition, master_use_gtid=slave_pos", the SQL thread still starts from position 4.

And the log output is:

2022-06-13 14:56:36 788843 [Note] Error reading relay log event: slave SQL thread was killed
2022-06-13 14:56:36 788843 [Note] Slave SQL thread exiting, replication stopped in log 'BINLOG.000002' at position 17624549
2022-06-13 14:56:36 788843 [Note] master was 172.36.250.14:3306
2022-06-13 14:56:36 788842 [Note] Slave I/O thread exiting, read up to log 'BINLOG.000002', position 17624549
2022-06-13 14:56:36 788842 [Note] master was 172.36.250.14:3306
2022-06-13 14:56:50 788823 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='172.36.250.14', master_port='3306', master_log_file='BINLOG.000002', master_log_pos='17624549'. New state master_host='172.36.250.14', master_port='3306', master_log_file='BINLOG.000002', master_log_pos='17624549'.
2022-06-13 14:56:50 788823 [Note] Previous Using_Gtid=No. New Using_Gtid=Slave_Pos
2022-06-13 14:56:53 788954 [Note] Slave I/O thread: Start semi-sync replication to master 'replication@172.36.250.14:3306' in log 'BINLOG.000002' at position 17624549
2022-06-13 14:56:53 788955 [Note] Slave SQL thread initialized, starting replication in log *'BINLOG.000002' at position 17624549*, relay log '/opt/maria10.1/binlog/RELAY.000001' position: 4; GTID position '1301-303-17388'
2022-06-13 14:56:53 788954 [Note] Slave I/O thread: connected to master 'replication@172.36.250.14:3306',replication starts at GTID position '1301-303-17388'
2022-06-13 14:56:54 788955 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table sbtest.sbtest9; Duplicate entry '49807' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's *master log BINLOG.000002, end_log_pos 2021*, Internal MariaDB error code: 1062
2022-06-13 14:56:54 788955 [Warning] Slave: Duplicate entry '49807' for key 'PRIMARY' Error_code: 1062
2022-06-13 14:56:54 788955 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'BINLOG.000002' position 1046; GTID position '1301-303-17388'
2022-06-13 14:56:54 788955 [Note] Slave SQL thread exiting, replication stopped in log 'BINLOG.000002' at position 1046; GTID position '1301-303-17388'
2022-06-13 14:56:54 788955 [Note] master was 172.36.250.14:3306

The position the SQL thread started from seems not to be the real position.
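The two CHANGE MASTER variants compared above can be sketched as follows. The file name and offset are placeholders standing in for the known-good position, not values verified from this environment:

```sql
-- Variant 1: explicit file/offset -- the replica came back to normal.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_LOG_FILE='BINLOG.000002',   -- placeholder file name
  MASTER_LOG_POS=17624549;           -- placeholder known-good offset
START SLAVE;

-- Variant 2: same file/offset plus MASTER_USE_GTID=slave_pos -- in this
-- report the SQL thread still started from position 4, because with
-- slave_pos the GTID state, not the file/offset, determines the start point.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_LOG_FILE='BINLOG.000002',
  MASTER_LOG_POS=17624549,
  MASTER_USE_GTID=slave_pos;
START SLAVE;
```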

Comment by Angelique Sklavounos (Inactive) [ 2022-06-13 ]

Thanks yongjian, following your .cnf and 'scripts', I am able to reproduce this.

However, if I separate the insertions into two transactions, so

mysql -ureplication -preplication -h172.137.254.21 -P3306 tt -e " set autocommit=0;begin; insert into table1(date,info,test_info,INF) VALUES ('2021-12-30','m$i',$i,111);insert into table1(date,info,test_info,INF) VALUES ('2021-12-30','m$i',$i,111);commit;"

would become

mysql -ureplication -preplication -h172.137.254.21 -P3306 tt -e " set autocommit=0;begin; insert into table1(date,info,test_info,INF) VALUES ('2021-12-30','m$i',$i,111);commit;"
mysql -ureplication -preplication -h172.137.254.21 -P3306 tt -e " set autocommit=0;begin; insert into table1(date,info,test_info,INF) VALUES ('2021-12-30','m$i',$i,111);commit;"   

then I do not see the Duplicate error.

yongjian.wu, I have 2 more questions:
1) Do you also get the Duplicate key error when not doing two INSERTs in a single transaction?
2) One thing I also noticed was my error stated Duplicate entry '42-2021-12-30 00:00:00-m1' for key 'PRIMARY' - 42 instead of 22. yongjian, could I please verify that the table definition of AUTO_INCREMENT=41 was used? If I change it to 21, then I do indeed get Duplicate entry '22-2021-12-30 00:00:00-m1'.

cui3093 Thank you for your comment. I am confused by the log output, specifically:

2022-06-13 14:56:50 788823 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='172.36.250.14', master_port='3306', master_log_file='BINLOG.000002', master_log_pos='17624549'. New state master_host='172.36.250.14', master_port='3306', master_log_file='BINLOG.000002', master_log_pos='17624549'.
2022-06-13 14:56:50 788823 [Note] Previous Using_Gtid=No. New Using_Gtid=Slave_Pos

To me, it does not appear that a switch was made to the new primary node, since the values for master_host are the same when comparing "Previous state" and "New state".
I get the Duplicate key error when switching to a new primary that was a replica, with master_use_gtid=slave_pos (but not when reconnecting to the same primary), and I get a different error when using master_log_pos without it. So could you please attach the .cnf files and also share more info on the table and the inserts? Thank you.

Comment by yongjian.wu [ 2022-06-14 ]

Hi Angelique Sklavounos
1) Do you also get the Duplicate key error when not doing two INSERTs in a single transaction? --> I have tried with "single" inserts and there is no problem
2) One thing I also noticed was my error stated Duplicate..... --> yes, you can also check the log I uploaded before; the AUTO_INCREMENT is 41. I also ran a test like yours: I set AUTO_INCREMENT = 10 and today I got the error "Duplicate entry '11-2021-12-30 00:00:00".

FYI
(root@localhost) [tt] 10:45:37> show variables like '%increment%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| auto_increment_increment     | 1     |
| auto_increment_offset        | 1     |
| div_precision_increment      | 4     |
| innodb_autoextend_increment  | 64    |
| wsrep_auto_increment_control | ON    |
+------------------------------+-------+
5 rows in set (0.00 sec)

Is there any bug in auto-increment PK replication under slave_pos?

Comment by Angelique Sklavounos (Inactive) [ 2022-06-14 ]

1) Do you also get the Duplicate key error when not doing two INSERTs in a single transaction? --> I have tried with "single" inserts and there is no problem

Thanks, yongjian, I will look into this some more.

Is there any bug in auto-increment PK replication under slave_pos?

Not that I'm aware of... I would like to be able to reproduce your output with AUTO_INCREMENT=41 and the Duplicate entry '22-2021-12-30 00:00:00-m1'.

Comment by yongjian.wu [ 2022-06-14 ]

sorry Angelique
Maybe there was some misunderstanding: AUTO_INCREMENT=41 was not in the initial SQL I used to create the table; I think I captured it during the test via SHOW CREATE TABLE.
So today I re-tested: when creating the table I set the auto_increment to "10", and when the error occurred it said "Duplicate entry '11-2021-12-30 00:00:00".
So maybe you needn't reproduce the AUTO_INCREMENT=41 / Duplicate entry '22' case; it was a wrong case.

Comment by Angelique Sklavounos (Inactive) [ 2022-06-14 ]

Great, thanks for clarifying yongjian! I will continue to look into the Duplicate entry error.

Comment by Angelique Sklavounos (Inactive) [ 2022-06-15 ]

Hi yongjian.wu, could you please try commenting out/deleting wsrep_gtid_mode=ON in the .cnf file to see if it helps? This appears to fix the issue when I tested it.
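For reference, the suggested change is a one-line edit in the server's .cnf file (a sketch only; the exact section and surrounding options depend on your file layout):

```ini
[mysqld]
# Workaround suggested above: comment out (or disable) wsrep_gtid_mode
# when Galera is not in use.
#wsrep_gtid_mode=ON
```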

Comment by yongjian.wu [ 2022-06-15 ]

Hi Angelique
thanks a lot for your quick response; I will do some tests following your suggestion.

Comment by yongjian.wu [ 2022-06-15 ]

Hi Angelique

The error seems to have gone away. Could I know why this parameter caused this trouble?

Does enabling wsrep_gtid_mode cause replication to do some additional check?

Also, even when wsrep_on is OFF, wsrep_gtid_mode still cannot be considered disabled -- is this expected?
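To check which of these settings are actually in effect on a running server, the standard SHOW VARIABLES queries can be used (nothing assumed beyond stock MariaDB behavior; whether SET GLOBAL is allowed for wsrep_gtid_mode should be verified on your build):

```sql
-- Confirm wsrep_on is off while wsrep_gtid_mode may still be ON.
SHOW GLOBAL VARIABLES LIKE 'wsrep_on';
SHOW GLOBAL VARIABLES LIKE 'wsrep_gtid_mode';

-- wsrep_gtid_mode is documented as a dynamic variable, so it can
-- typically also be disabled at runtime without a restart:
SET GLOBAL wsrep_gtid_mode = OFF;
```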

Comment by Kuijun Cui [ 2022-06-16 ]

Hi Angelique
Thank you for your reply.
I also had wsrep_gtid_mode enabled with wsrep_on=0; after disabling this option, the issue disappeared. I'm also confused why, when wsrep_on=0, some wsrep_* options still take effect. Is there a bug or flaw here? Could you share more details with me?

Generated at Thu Feb 08 10:03:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.