[MDEV-18576] replication issue, disconnection every minute with error: "A slave with the same server_uuid/server_id as this slave has co", with different server-id. Created: 2019-02-14  Updated: 2019-12-19  Resolved: 2019-04-24

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.3.12
Fix Version/s: 10.2.4

Type: Bug Priority: Major
Reporter: Christophe Vigny Assignee: Andrei Elkin
Resolution: Not a Bug Votes: 3
Labels: None
Environment:

debian stretch


Attachments: File log.511405928     File master.err     File master.my.cnf     File master.tcpdump     File master.tcpdump     File mysql-bin.000285     File mysql_bin_log.sql     File only_one_master.err     File only_one_slave.err     Text File show_master_variables.txt     Text File show_slave_variables.txt     File slave.err     File slave.tcpdump     File slave.tcpdump    

 Description   

After upgrading a MariaDB master/slave cluster from 10.0.33 to 10.3.12, I have a replication issue. On each cluster that I have upgraded, all slave connections are closed every minute with the message:
2019-02-12 19:41:20 77017 [Warning] Aborted connection 77017 to db: 'unconnected' user: 'sa' host: 'logarchiverpl2.artprice.bil' (A slave with the same server_uuid/server_id as this slave has co)

I have checked the server-id across the whole stack; none are the same.
show slave hosts;
Server_id Host Port Master_id
29 3306 31
30 3306 31
27 3306 31
22 3311 31
25 3311 31

The replication was working fine with 10.0.33. All the slaves were created by cloning the full data set via rsync, then restarting after changing the server-id.
For the upgrade process, I started the server and ran mysql_upgrade at 10.1, 10.2, and 10.3, using the last current stable of each series. No upgrade errors.

I have read MDEV-15660, which mentions the same error message, and tried raising table_open_cache, but it doesn't help.



 Comments   
Comment by Alex/AT [ 2019-02-27 ]

Having the same issue on 10.3.11, 10.3.12, and 10.3.13.

Comment by Johannes Jordan [ 2019-02-27 ]

We can confirm this problem with 10.3.13.

Upon investigation, we found that when our replicating slave disconnects, the master keeps the socket open in CLOSE_WAIT state for a very long time (over 15 minutes so far). On the slave host, the socket is properly closed. Checked on both sides with lsof -a -i4 -i6 -itcp.
At the same time, the replication connection is still listed in the master's PROCESSLIST:

| 154 | replicator | …:60554 | NULL | Binlog Dump | 1834 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL | 0.000 |

Comment by Alex/AT [ 2019-02-27 ]

To me, it looks like an unexpected socket death on the slave... The slave records the following errors:
2019-02-27 18:25:45 1237 [Note] Master 'LXmysqldev01': Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql_binary_log.001143' at position 1927424; GTID position '0-1-156551062,1060-1060-9867482,2060-2060-9848207,40-1-435817267'

Comment by Christophe Vigny [ 2019-04-05 ]

Hi, I have just upgraded to 10.3.14, and the issue is still there. On all 10.3.14 servers I get a disconnection every minute.

Comment by Andrei Elkin [ 2019-04-08 ]

tophe, at the time covered by the description, the slave error log is full of "duplicate" records:

2019-02-14  8:00:23 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000237' at position 594190462; GTID position '0-31-461106486'

We need to find out the reason for the failed read that happened on the master. Could you please check
what is really in the binlog:

mysqlbinlog -v mysql-bin.000237 --start-position=start-pos

where start-pos is the offset of a replication event a few events before the reported 594190462, e.g. the binlog offset of the Gtid_log_event of 0-31-461106485 (461106486 - 1). That is, let us verify that the replication events near and at mysql-bin.000237:594190462 are really readable.

Comment by Christophe Vigny [ 2019-04-08 ]

Hi Andrei,

I have extracted the binlog around that position: mysql_bin_log.sql

Comment by Andrei Elkin [ 2019-04-08 ]

tophe The binlog entries are readable, and could have been readable by the master at that time.
Now I need more error log lines, particularly those that were right before the first "Slave I/O thread: Failed reading log event", plus the contemporaneous ones from the master. It would perhaps be easier if you try to reproduce the failure while I track it down closely. Thank you!

Comment by Andrei Elkin [ 2019-04-08 ]

tophe, sorry to ask this late, but I could not grasp this part of the description:

I have started and run myql_upgrade, in 10.1,10.2,10.3 with the last current stable. No upgrade errors.

What does "the last current stable" mean?
Also, did the resulting upgrade leave a 10.3.12 master with binlog files generated by 10.0.x? What if the troublesome binlog entry (quoted above in the duplicated lines) was actually created at 10.0.x time? If it was, which version of mysqlbinlog did you use to extract the recent binlog snippet? I suggest (now) that it should also be 10.3.12.

Comment by Christophe Vigny [ 2019-04-08 ]

The master is now on 10.3.13, the slave on 10.3.14; I can upgrade the master if you wish.
On the slave I see a retry every minute:
2019-04-08 16:20:20 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000275' at position 511405928; GTID position '0-31-510881608'
2019-04-08 16:21:21 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000275' at position 511405928; GTID position '0-31-510881608'
2019-04-08 16:22:21 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000275' at position 511405928; GTID position '0-31-510881608'
2019-04-08 16:23:22 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000275' at position 511405928; GTID position '0-31-510881608'

On the master, using mysqlbinlog from 10.3.13, the last event offsets in the log are:

# at 511405672
# at 511405714
# at 511405897

so there is no 511405928 written, and the last binlog entries are:

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#190407  2:05:30 server id 31  end_log_pos 256 CRC32 0xbf9ea543  Start: binlog v 4, server v 10.3.13-MariaDB-log created 190407  2:05:30
# Warning: this binlog is either in use or was not closed properly.
BINLOG '
yj6pXA8fAAAA/AAAAAABAAABAAQAMTAuMy4xMy1NYXJpYURCLWxvZwAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAA5AAEGggAAAAICAgCAAAACgoKAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAEEwQADQgICAoKCgFDpZ6/
'/*!*/;
# at 511405672
#190408 13:02:00 server id 31  end_log_pos 511405714 CRC32 0x2a8a7823  GTID 0-31-510881608 trans
/*!100101 SET @@session.skip_parallel_replication=0*//*!*/;
/*!100001 SET @@session.gtid_domain_id=0*//*!*/;
/*!100001 SET @@session.server_id=31*//*!*/;
/*!100001 SET @@session.gtid_seq_no=510881608*//*!*/;
BEGIN
/*!*/;
# at 511405714
#190408 13:02:00 server id 31  end_log_pos 511405897 CRC32 0xb3793f0d  Query  thread_id=187339  exec_time=0  error_code=0
use `walletarchive`/*!*/;
SET TIMESTAMP=1554721320/*!*/;
SET @@session.pseudo_thread_id=187339/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1, @@session.check_constraint_checks=1/*!*/;
SET @@session.sql_mode=1411383296/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C latin1 *//*!*/;
SET @@session.character_set_client=8,@@session.collation_connection=8,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
replace into logartist ( idlogs ,src ,idartist ,idartistvariant) values ( '1307681265' ,'1' ,'231782' ,'1')
/*!*/;
# at 511405897
#190408 13:02:00 server id 31  end_log_pos 511405928 CRC32 0x5b71b976  Xid = 49193413
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

there are (almost) only writes during the night

Comment by Christophe Vigny [ 2019-04-08 ]

I have stopped the master to flush the writes, and then:

log.511405928

Comment by Andrei Elkin [ 2019-04-08 ]

tophe So end_log_pos 511405928 of the last event in the 'mysql-bin.000275' binlog seems to cause the
"Failed reading log event".
And 'mysql-bin.000275' must be the latest log.

I can't say anything specific now; I need a little time to analyze. As to upgrading the master to 10.3.14, it would not be a bad idea: at least we'll be sure whether .14 has fixed this issue.

Comment by Christophe Vigny [ 2019-04-08 ]

OK, I have upgraded the master and restarted both master and slave.
After some time:
master
Version: '10.3.14-MariaDB-log' socket: '/tmp/mysql.sock' port: 3306 MariaDB Server
2019-04-08 16:58:41 0 [Note] InnoDB: Buffer pool(s) load completed at 190408 16:58:41
2019-04-08 16:59:34 24 [Note] Start binlog_dump to slave_server(25), pos(mysql-bin.000276, 342)
2019-04-08 16:59:34 23 [Note] Start binlog_dump to slave_server(22), pos(mysql-bin.000276, 342)
2019-04-08 16:59:34 25 [Note] Start binlog_dump to slave_server(27), pos(mysql-bin.000276, 342)
2019-04-08 16:59:34 22 [Note] Start binlog_dump to slave_server(30), pos(mysql-bin.000276, 342)
2019-04-08 16:59:36 26 [Note] Start binlog_dump to slave_server(29), pos(mysql-bin.000276, 342)
2019-04-08 17:00:19 29 [Note] Start binlog_dump to slave_server(25), pos(mysql-bin.000277, 342)
2019-04-08 17:00:21 30 [Note] Start binlog_dump to slave_server(22), pos(mysql-bin.000277, 342)
2019-04-08 17:00:34 31 [Note] Start binlog_dump to slave_server(30), pos(mysql-bin.000277, 342)
2019-04-08 17:00:34 22 [Warning] Aborted connection 22 to db: 'unconnected' user: 'sa' host: 'logarchiverpl1.dc4.bil' (A slave with the same server_uuid/server_id as this slave has co)
2019-04-08 17:00:34 32 [Note] Start binlog_dump to slave_server(27), pos(mysql-bin.000277, 342)
2019-04-08 17:00:34 25 [Warning] Aborted connection 25 to db: 'unconnected' user: 'sa' host: 'logarchiverpl1.artprice.bil' (A slave with the same server_uuid/server_id as this slave has co)
2019-04-08 17:00:36 33 [Note] Start binlog_dump to slave_server(29), pos(mysql-bin.000277, 342)
2019-04-08 17:00:36 26 [Warning] Aborted connection 26 to db: 'unconnected' user: 'sa' host: 'logarchiverpl2.artprice.bil' (A slave with the same server_uuid/server_id as this slave has co)
2019-04-08 17:01:34 34 [Note] Start binlog_dump to slave_server(30), pos(mysql-bin.000277, 342)
2019-04-08 17:01:34 31 [Warning] Aborted connection 31 to db: 'unconnected' user: 'sa' host: 'logarchiverpl1.dc4.bil' (A slave with the same server_uuid/server_id as this slave has co)
2019-04-08 17:01:34 35 [Note] Start binlog_dump to slave_server(27), pos(mysql-bin.000277, 342)
2019-04-08 17:01:34 32 [Warning] Aborted connection 32 to db: 'unconnected' user: 'sa' host: 'logarchiverpl1.artprice.bil' (A slave with the same server_uuid/server_id as this slave has co)
2019-04-08 17:01:36 36 [Note] Start binlog_dump to slave_server(29), pos(mysql-bin.000277, 342)
2019-04-08 17:01:36 33 [Warning] Aborted connection 33 to db: 'unconnected' user: 'sa' host: 'logarchiverpl2.artprice.bil' (A slave with the same server_uuid/server_id as this slave has co)
2019-04-08 17:02:35 37 [Note] Start binlog_dump to slave_server(27), pos(mysql-bin.000277, 342)
2019-04-08 17:02:35 35 [Warning] Aborted connection 35 to db: 'unconnected' user: 'sa' host: 'logarchiverpl1.artprice.bil' (A slave with the same server_uuid/server_id as this slave has co)
2019-04-08 17:02:35 38 [Note] Start binlog_dump to slave_server(30), pos(mysql-bin.000277, 342)
2019-04-08 17:02:35 34 [Warning] Aborted connection 34 to db: 'unconnected' user: 'sa' host: 'logarchiverpl1.dc4.bil' (A slave with the same server_uuid/server_id as this slave has co)
2019-04-08 17:02:36 39 [Note] Start binlog_dump to slave_server(29), pos(mysql-bin.000277, 342)
2019-04-08 17:02:36 36 [Warning] Aborted connection 36 to db: 'unconnected' user: 'sa' host: 'logarchiverpl2.artprice.bil' (A slave with the same server_uuid/server_id as this slave has co)

slave:
2019-04-08 16:59:36 10 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000276' at position 342, relay log './mysql-relay-bin.000001' position: 4; GTID position '0-31-510881608'
2019-04-08 16:59:36 0 [Note] /opt/mariadb/server-10.3/bin/mysqld: ready for connections.
Version: '10.3.14-MariaDB' socket: '/tmp/mysql.sock' port: 3306 MariaDB Server
2019-04-08 16:59:36 9 [Note] Slave I/O thread: connected to master 'sa@artdbdata.artprice.bil:3306',replication starts at GTID position '0-31-510881608'
2019-04-08 16:59:40 0 [Note] InnoDB: Buffer pool(s) load completed at 190408 16:59:40
2019-04-08 17:00:36 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000277' at position 342; GTID position '0-31-510881608'
2019-04-08 17:01:36 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000277' at position 342; GTID position '0-31-510881608'
2019-04-08 17:02:36 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000277' at position 342; GTID position '0-31-510881608'

Comment by Andrei Elkin [ 2019-04-08 ]

tophe What is your uname -a?
There were 3 slaves reconnecting in the last master error log. Did all 3 get the "Failed reading"?
Did you have any load on the master at the time of the last test? (Looks like no load.)

Comment by Christophe Vigny [ 2019-04-08 ]

There are 5 slaves; the master had no load at that time. Two are on 10.0.33: no problem.
Three are on 10.3.14 and 10.3.13: all have the problem.

Comment by Andrei Elkin [ 2019-04-08 ]

tophe And about your master server platform properties: could you please paste the uname -a output?

Comment by Andrei Elkin [ 2019-04-08 ]

tophe It would be of great help to us if you could try out mariadb-10.3.12 as the master. This will clear a few bugfix suspects.

Comment by Christophe Vigny [ 2019-04-09 ]

Andrei, the problem came with the upgrade to 10.3.12; I am sure the issue is present in the 10.3.12 version.

Some precisions about our system: the servers run on Debian jessie, and the MySQL binaries are the official compiled version.
I have other clusters with the issue, but they are multi-master configurations on the same platform.

Comment by Alex/AT [ 2019-04-09 ]

Same here. It seems to be binlog-agnostic: restarting replication after cleaning the binlogs does not help. For us it started with 10.3.11 (vanilla MariaDB RPMs for CentOS 7); I can't tell whether it came with the slave or the master upgrades, as both were done the same day, and I don't know whether earlier 10.3 versions are fine. The bug is also multi-master agnostic: it behaves the same in single-master and multi-master configurations.

Comment by Matt Harrison [ 2019-04-10 ]

I was having the same issue after upgrading from MariaDB 10.1.37 to 10.3.13.
This was with 1 master and 3 slaves. I tried completely wiping all the slaves and rebuilding them as fresh servers from a fresh copy of the master, but the problem still happened.

I don't have the exact setup to test anymore, as I ended up switching it out for a 4-server Galera cluster, which I had been wanting to do for a while. I've had no issues with Galera syncing so far on the same databases.

Comment by Andrei Elkin [ 2019-04-16 ]

tophe, I hope you can help me reproduce it. In these slave error log lines

2019-04-08 16:59:36 9 [Note] Slave I/O thread: connected to master 'sa@artdbdata.artprice.bil:3306',replication starts at GTID position '0-31-510881608'
2019-04-08 17:00:36 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000277' at position 342; GTID position '0-31-510881608'

I read that this slave did not receive a single event group, because the GTID position remains the same after each reconnect, that is, the next group is never received (or sent by the master). So your master binlog files and their content are probably ideally suited to reproduce the issue, and quickly.
Could you please tell me whether I can count on such a favor to run it again? If YES, I will provide you with further details (at the moment I am considering running tcpdump to see what comes from the master that makes the slave stumble at receiving and go to reconnect).

Also, could you please clarify: was the master under any load when you were connecting these 5 slaves at the time
of the timestamps? Have you tried running a single slave server? It would be good to try this out.

I am counting on your help. With your or anybody's assistance we should resolve the case soon.
Thanks.

Andrei

Comment by Christophe Vigny [ 2019-04-17 ]

Hi Andrei,

I am on holiday now. On Monday I can help you: I can remove the slaves, make a tcpdump with only one slave and send you the capture, or run any task you need.

That master works only at night, for archiving purposes, but the problem remains all the time.
I have other configurations with multi-master having the same error; I got the problem on each host that I upgraded to the 10.3 series.

thanks.

Comment by Christophe Vigny [ 2019-04-23 ]

hi Andrei,

I have stopped all slaves except the one on 10.3.14, and then restarted slave and master.
The logs are in the only_one files.

Same problem.

Comment by Andrei Elkin [ 2019-04-23 ]

tophe, hello.

So we've learned that just 1 slave suffices. Also, I see it happens at the very beginning of the slave session:

2019-04-23 15:02:45 9 [Note] Slave I/O thread: connected to master 'sa@artdbdata.artprice.bil:3306',replication starts at GTID position '0-31-523872748'
...
2019-04-23 15:03:45 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysql-bin.000285' at position 342; GTID position '0-31-523872748'

Could you share your binlog with us so I can try it out myself?

Secondly, if I can't reproduce it (I will let you know shortly), I suggest you run tcpdump on the master and slave sides to capture all network packets between them. Without such a dump I still can't figure out what causes the failed reading on the slave side.

Comment by Christophe Vigny [ 2019-04-23 ]

I have just restarted the master, so the binlog is empty; see mysql-bin.000285.

I am working on the tcpdump. I have limited the destination and source; running:
on slave
tcpdump dst nas2 -w slave.tcpdump
on master
tcpdump dst logarchiverpl2 -w master.tcpdump

Comment by Andrei Elkin [ 2019-04-23 ]

Many thanks! I am analyzing the data now.
Now the first question about it: could you also print out the last event group from
the preceding log? That is:
mysqlbinlog -v --start-position=`the-offset-of-the-last-GTID-event` mysql-bin.000284

Comment by Andrei Elkin [ 2019-04-23 ]

I think I finally understood what is going on in your system, tophe. MASTER_HEARTBEAT_PERIOD of CHANGE MASTER must be effectively set to 0, and then the slave reconnects because of slave_net_timeout = 60 secs. That's how the error log lines essentially read. At the disconnecting phase, the master's old dump thread manages to stay alive, which is fine, as it's going to be killed at the reconnecting phase by a new instance of the dump thread.
The fact of that killing is what is reported in the synopsis message.

Overall, your slaves should be doing fine, only experiencing reconnects because of the chosen value of MASTER_HEARTBEAT_PERIOD.

I am looking right now into whether the old dump thread should have been closed by the disconnect, and whether that stopped working in some version.
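[Editor's note] A quick way to see this mechanism at work is to watch the heartbeat counters on the slave (these are standard MariaDB replication status variables; the interpretation below is a sketch of the diagnosis described in this comment):

```sql
-- Run on the slave. If Slave_received_heartbeats stays at 0 while
-- Slave_heartbeat_period (e.g. 1800.000) exceeds slave_net_timeout
-- (60 by default since 10.2.4), no heartbeat can arrive before the
-- timeout fires, so the I/O thread reconnects every slave_net_timeout
-- seconds, and the master logs the "same server_uuid/server_id" abort.
SHOW GLOBAL STATUS LIKE 'Slave_received_heartbeats';
SHOW GLOBAL STATUS LIKE 'Slave_heartbeat_period';
SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';
```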

Comment by Christophe Vigny [ 2019-04-23 ]

I cannot read the value of MASTER_HEARTBEAT_PERIOD with show all slaves status.
I didn't set it, and the server is an upgrade from the 10.0 series; perhaps the default value in case of an upgrade is set to 0.

Comment by Andrei Elkin [ 2019-04-23 ]

show global status like 'Slave_heartbeat_period'

Comment by Andrei Elkin [ 2019-04-23 ]

No, the default has stably been slave_net_timeout/2.0.
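[Editor's note] This default relationship can be checked directly; a sketch, assuming a freshly configured 10.3 slave where no old heartbeat period was carried over:

```sql
-- On a fresh 10.3 slave one would expect:
SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';   -- 60 (default since 10.2.4)
SHOW GLOBAL STATUS LIKE 'Slave_heartbeat_period'; -- 30.000 (= 60 / 2)
-- A value of 1800.000 instead points to a period inherited from an
-- older setup, where slave_net_timeout defaulted to 3600 (3600/2 = 1800).
```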

Comment by Christophe Vigny [ 2019-04-24 ]

Hi, I checked the value on the upgraded slave:
logarchiverpl2.artprice.bil:~$ mysql -S /tmp/mysql.sock
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 484
Server version: 10.3.14-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show global status like 'Slave_heartbeat_period';
+------------------------+----------+
| Variable_name          | Value    |
+------------------------+----------+
| Slave_heartbeat_period | 1800.000 |
+------------------------+----------+
1 row in set (0.001 sec)

On all slaves I have 1800. It is not 0, nor slave_net_timeout/2.0.
Do I have to update it to 30?

Comment by Andrei Elkin [ 2019-04-24 ]

Howdy!

> It is not 0
When it's greater than slave_net_timeout it is effectively 0: the heartbeat event won't be generated and
sent.

30 is a sensible value to avoid the 60-second periodic reconnect when the master has nothing to send.

Comment by Christophe Vigny [ 2019-04-24 ]

Andrei,
I ran:

stop slave;
CHANGE MASTER TO
MASTER_HEARTBEAT_PERIOD=30;
start slave;

and the issue is resolved: no more disconnections.
I am updating the other clusters now.

It seems that every host I updated from 10.0 through 10.1, 10.2, 10.3 has the same problem:
SHOW VARIABLES like 'slave_net_timeout' : 60
and
Slave_heartbeat_period: 1800

Thanks a lot for your help.

Comment by Andrei Elkin [ 2019-04-24 ]

tophe, to confirm now: the suspicious error logging was caused by the slave_net_timeout default change starting from 10.2.4. Yes, you need to reset the old heartbeat period with an explicit CHANGE MASTER ... MASTER_HEARTBEAT_PERIOD=<new value>.

Many thanks to you for helping so generously!

Cheers!

Comment by Andrei Elkin [ 2019-04-24 ]

slave_net_timeout default was changed in 10.2.4.

Comment by Christophe Vigny [ 2019-04-24 ]

Thanks, Andrei, for the help.

I can confirm that after changing Slave_heartbeat_period to slave_net_timeout / 2, I have no more logging problem.

Generated at Thu Feb 08 08:45:09 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.