[MDEV-18576] Replication issue, disconnection every minute with error: "A slave with the same server_uuid/server_id as this slave has connected to the master", with different server-id. Created: 2019-02-14 Updated: 2019-12-19 Resolved: 2019-04-24 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.3.12 |
| Fix Version/s: | 10.2.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Christophe Vigny | Assignee: | Andrei Elkin |
| Resolution: | Not a Bug | Votes: | 3 |
| Labels: | None | ||
| Environment: |
debian stretch |
||
| Attachments: |
|
| Description |
|
After upgrading a MariaDB master/slave cluster from 10.0.33 to 10.3.12, I have a replication issue. On each cluster that I have upgraded, every minute, all slave connections are closed with the message quoted in the summary. I have checked the server-id of the whole stack; none are the same. Replication was working fine with 10.0.33. All the slaves were created by cloning the full data set via rsync, then changing the server-id and restarting. I have read |
| Comments |
| Comment by Alex/AT [ 2019-02-27 ] | ||||
|
Having the same issue on 10.3.11, 10.3.12, 10.3.13 | ||||
| Comment by Johannes Jordan [ 2019-02-27 ] | ||||
|
We can confirm this problem with 10.3.13. Upon investigation, we found that when our replicating slave disconnects, the master keeps the socket open in CLOSE_WAIT state for a very long time (it has been in this state for over 15 minutes already). On the slave host, the socket is properly closed. Checked with both lsof -a -i4 -i6 -itcp and the master's processlist, which still shows the dump thread:

| 154 | replicator | …:60554 | NULL | Binlog Dump | 1834 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL | 0.000 |
| ||||
| Comment by Alex/AT [ 2019-02-27 ] | ||||
|
To me, it seems like an unexpected socket death on the slave... The slave records the following errors: | ||||
| Comment by Christophe Vigny [ 2019-04-05 ] | ||||
|
Hi, I have just upgraded to 10.3.14, and the issue is still here. On all 10.3.14 servers I get a disconnection every minute. | ||||
| Comment by Andrei Elkin [ 2019-04-08 ] | ||||
|
tophe, at the time of the description the slave error log is full of "duplicate" records.
We need to find out the reason for the failed read which happened on the master. Could you please check the mysqlbinlog output
where start-pos would be a few replication-event offsets before the reported 594190462, e.g. the binlog offset of the Gtid_log_event of 0-31-461106485 (461106486 - 1). That is, let us verify that the replication events near and at mysql-bin.000237:594190462 are really readable. | ||||
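The readability check suggested above can also be sketched directly from a SQL client; a minimal sketch, assuming the binlog file name and offset quoted in this ticket (adjust both to your own binlogs):

```sql
-- Hypothetical sketch: list a few events at/after the suspect offset.
-- 'mysql-bin.000237' and 594190462 are the values reported in this ticket;
-- FROM must point at a valid event boundary.
SHOW BINLOG EVENTS IN 'mysql-bin.000237' FROM 594190462 LIMIT 10;
```

If the events print cleanly, the binlog content itself is readable and the failed read must come from elsewhere (network, timeout, or the dump thread).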
| Comment by Christophe Vigny [ 2019-04-08 ] | ||||
|
Hi Andrei, I have extracted the mysqlbinlog output around the position. mysql_bin_log.sql | ||||
| Comment by Andrei Elkin [ 2019-04-08 ] | ||||
|
tophe The binlog entries are readable, and could well have been readable by the master at that time. | ||||
| Comment by Andrei Elkin [ 2019-04-08 ] | ||||
|
tophe, sorry to ask this late, but I could not grasp this part of the description: "I have started and run mysql_upgrade, in 10.1, 10.2, 10.3 with the last current stable. No upgrade errors." What does "the last current stable" mean? | ||||
| Comment by Christophe Vigny [ 2019-04-08 ] | ||||
|
The master is now on 10.3.13, the slave on 10.3.14; I can upgrade the master if you wish. On the master, I use mysqlbinlog from 10.3.13; the last log entry is:
there are (almost) only writes during the night | ||||
| Comment by Christophe Vigny [ 2019-04-08 ] | ||||
|
I have stopped the master to flush the writes, and then: | ||||
| Comment by Andrei Elkin [ 2019-04-08 ] | ||||
|
tophe So end_log_pos 511405928 of the last event in the 'mysql-bin.000275' binlog seems to cause the failure. I can't say anything specific now; I need a little time to analyze. As to upgrading the master to 10.3.14, it would not be a bad idea; at least we'll be sure whether the .14 release has fixed this issue. | ||||
| Comment by Christophe Vigny [ 2019-04-08 ] | ||||
|
OK, I have upgraded the master and restarted both master and slave. Slave: | ||||
| Comment by Andrei Elkin [ 2019-04-08 ] | ||||
|
tophe What is your uname -a? | ||||
| Comment by Christophe Vigny [ 2019-04-08 ] | ||||
|
There are 5 slaves; the master has no load at that time. 2 are on 10.0.33: no problem. | ||||
| Comment by Andrei Elkin [ 2019-04-08 ] | ||||
|
tophe And as for your master server platform, could you please paste the uname -a output? | ||||
| Comment by Andrei Elkin [ 2019-04-08 ] | ||||
|
tophe It would be of great help to us if you could try out mariadb-10.3.12 as the master. This will rule out a few bugfix suspects. | ||||
| Comment by Christophe Vigny [ 2019-04-09 ] | ||||
|
Andrei, the problem came with the upgrade to 10.3.12, so I am sure the issue is present in version 10.3.12. Some precisions about our system: the servers run on Debian jessie, and MariaDB is the official compiled version. | ||||
| Comment by Alex/AT [ 2019-04-09 ] | ||||
|
Same here. It seems to be binlog-agnostic: restarting replication after cleaning the binlogs does not help. It started with 10.3.11 (vanilla MariaDB RPMs for CentOS 7) for us; I can't tell whether the slave or master upgrade triggered it, as both were done the same day, and I don't know whether earlier 10.3 versions are fine. The bug is also multi-master agnostic: it behaves the same in single-master and multi-master configurations. | ||||
| Comment by Matt Harrison [ 2019-04-10 ] | ||||
|
I was having the same issue after upgrading from MariaDB 10.1.37 to 10.3.13. I no longer have the exact setup to test, as I ended up switching it out for a 4-server Galera cluster, which I had been wanting to do for a while. I've had no issues with Galera syncing the same databases so far. | ||||
| Comment by Andrei Elkin [ 2019-04-16 ] | ||||
|
tophe, I hope you can help me to reproduce it. In these slave error-log lines
I read that this slave did not receive a single event group, because the GTID remains the same after each reconnect; that is, the next one is never received (sent by the master). So your master binlog files and their content are probably ideally suited to reproduce the issue, and quickly. Also, could you please clarify whether the master was under any load when you were connecting these 5 slaves at the time? I hope for your help. With your or anybody's assistance we should resolve this case soon. Andrei | ||||
| Comment by Christophe Vigny [ 2019-04-17 ] | ||||
|
Hi Andrei, I am on holiday now. On Monday I can help you: I can remove the slaves and make a tcpdump with only one slave and send you the capture, or do any task you need. That master works only at night, for archiving purposes, but the problem remains all the time. Thanks. | ||||
| Comment by Christophe Vigny [ 2019-04-23 ] | ||||
|
Hi Andrei, I have stopped all slaves except the one on 10.3.14, and then restarted slave and master. Same problem. | ||||
| Comment by Andrei Elkin [ 2019-04-23 ] | ||||
|
tophe, hello. So we've learned that just 1 slave suffices. Also I see it happens at the very beginning of the slave session:
Could you share your binlog with us so I could try it out myself? Secondly, if I am not able to, which I will let you know shortly, I suggest you run tcpdump on the master and slave sides to capture all network packets between them. Without such a dump I still can't figure out what causes the failed reading on the slave side. | ||||
| Comment by Christophe Vigny [ 2019-04-23 ] | ||||
|
I have just restarted the master; the binlog is empty, see mysql-bin.000285. I am working on the tcpdump; I have limited it by destination and source, running: | ||||
| Comment by Andrei Elkin [ 2019-04-23 ] | ||||
|
Many thanks! I am analyzing the data now. | ||||
| Comment by Andrei Elkin [ 2019-04-23 ] | ||||
|
I think I finally understood what is going on on your system, tophe. MASTER_HEARTBEAT_PERIOD of CHANGE MASTER must have been set to 0, and then the slave reconnects because of slave_net_timeout = 60 secs. That is what the error-log lines essentially say. At the disconnecting phase the master's old dump thread manages to stay alive, which is fine, as it is going to be killed at the reconnecting phase by a new instance of the dump thread. Overall, your slaves should be doing fine, only experiencing reconnects because of the chosen value of MASTER_HEARTBEAT_PERIOD. I am right now looking into whether the old dump thread should have been closed by the disconnect, and whether that stopped working at some version. | ||||
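To check whether a slave is in this situation, the two values can be compared on the slave; a minimal sketch using standard MariaDB statements:

```sql
-- Timeout after which an idle replication connection is considered dead
-- (default 60 s since 10.2.4):
SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';
-- Heartbeat period currently in effect for the replication connection:
SHOW GLOBAL STATUS LIKE 'Slave_heartbeat_period';
-- If the heartbeat period is 0 or >= slave_net_timeout, the slave will
-- time out and reconnect whenever the master has nothing to send.
```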
| Comment by Christophe Vigny [ 2019-04-23 ] | ||||
|
I cannot read the value of the MASTER_HEARTBEAT_PERIOD, with show all slaves status. | ||||
| Comment by Andrei Elkin [ 2019-04-23 ] | ||||
|
show global status like 'Slave_heartbeat_period' | ||||
| Comment by Andrei Elkin [ 2019-04-23 ] | ||||
|
No, the default has stably been slave_net_timeout/2.0. | ||||
| Comment by Christophe Vigny [ 2019-04-24 ] | ||||
|
Hi, I checked the value on the upgraded slave:

MariaDB [(none)]> show global status like 'Slave_heartbeat_period';
| Variable_name | Value |
| Slave_heartbeat_period | 1800.000 |

On all slaves I have 1800. It is not 0, nor slave_net_timeout/2.0. | ||||
| Comment by Andrei Elkin [ 2019-04-24 ] | ||||
|
Howdy! > "It is not 0": 30 is a sensible value to avoid the 60 sec periodic reconnect when the master does not have anything to send. | ||||
| Comment by Christophe Vigny [ 2019-04-24 ] | ||||
|
Andrei: stop slave; and then the issue is resolved, no more disconnections. It seems that every host I updated from 10.0 through 10.1, 10.2, 10.3 has the same problem. Thank you a lot for your help. | ||||
| Comment by Andrei Elkin [ 2019-04-24 ] | ||||
|
tophe, to confirm now: the suspicious error logging was caused by the slave_net_timeout default change starting from 10.2.4. Yes, you need to reset the old heartbeat period with an explicit CHANGE MASTER setting the new heartbeat-period value. Many thanks to you for helping so generously! Cheers! | ||||
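The reset Andrei describes can be sketched as follows; the period of 30 s mirrors the slave_net_timeout/2 default discussed in this thread (for named multi-source connections, use STOP SLAVE 'name' and CHANGE MASTER 'name' TO ... instead):

```sql
-- Run on each affected slave: re-apply a heartbeat period below
-- slave_net_timeout (60 s default), so the connection never idles out.
STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 30;
START SLAVE;
```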
| Comment by Andrei Elkin [ 2019-04-24 ] | ||||
|
slave_net_timeout default was changed in 10.2.4. | ||||
| Comment by Christophe Vigny [ 2019-04-24 ] | ||||
|
Thanks, Andrei, for the help. I can confirm that after changing Slave_heartbeat_period to slave_net_timeout / 2, I have no more logging problems. |