[MDEV-6188] master_retry_count (ignored if disconnect happens on SET master_heartbeat_period) Created: 2014-04-30  Updated: 2014-10-11  Resolved: 2014-06-17

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 5.5.38, 10.0.10
Fix Version/s: 5.5.39, 10.0.13

Type: Bug
Priority: Critical
Reporter: BELUGABEHR
Assignee: Kristian Nielsen
Resolution: Fixed
Votes: 0
Labels: replication, slave


 Description   

Hello,

The documentation for the "master_retry_count" variable marks it as deprecated:

https://mariadb.com/kb/en/replication-and-binary-log-server-system-variables/#master_retry_count

However, there is no indication on how to go about enabling this feature. It appears that there may be a larger issue here:

http://serverfault.com/questions/522207/mariadb-replication-not-auto-reconnecting

I'm using 10.0.10

Thanks!



 Comments   
Comment by Elena Stepanova [ 2014-04-30 ]

Hi.

Thanks for noticing. I've updated the page: in MariaDB as of 10.0.10, the option is not deprecated.
In MySQL 5.6 it is, and there is new syntax in CHANGE MASTER to set the value, but it is not in MariaDB 10.0 (yet).

Regarding the question at serverfault.com (unrelated to the documentation issue):

The problem here is that master_retry_count works most of the time (errors upon reading packets, connecting to the master, etc.). But in some special cases, for example while trying to execute a query on the master, such as SET @master_heartbeat_period, if the query fails for whatever reason, the slave considers it a fatal error and gives up.

From all I see, it's a bug present in MariaDB, including current 10.0, and MySQL up to 5.5. It was fixed in MySQL 5.6: the error code returned upon query execution is checked against the list of "network failures", and if it's one of those, the connection retry happens as usual.
I think we need to merge the fix into 10.0, since it was committed as a part of a seemingly unrelated task, and thus can easily be missed.
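
To illustrate the retry logic described above, here is a minimal, self-contained sketch. It is not the server code: the error numbers are the values I recall from MySQL's errmsg.h and mysqld_error.h, the exact membership of the "network failures" list is an assumption, and is_network_error_sketch, Outcome and run_with_retries are hypothetical names made up for this example. The point is only that a failed query is classified first, and only non-network errors stop the I/O thread immediately.

```cpp
#include <functional>

// Error numbers copied locally so the sketch compiles on its own
// (values as recalled from errmsg.h / mysqld_error.h).
constexpr unsigned CR_CONNECTION_ERROR  = 2002;
constexpr unsigned CR_CONN_HOST_ERROR   = 2003;
constexpr unsigned CR_SERVER_GONE_ERROR = 2006;
constexpr unsigned CR_SERVER_LOST       = 2013;  // "Lost connection ... during query"
constexpr unsigned ER_CON_COUNT_ERROR   = 1040;
constexpr unsigned ER_SERVER_SHUTDOWN   = 1053;

// Assumed shape of the "is this a network failure?" check.
bool is_network_error_sketch(unsigned err) {
  switch (err) {
    case CR_CONNECTION_ERROR:
    case CR_CONN_HOST_ERROR:
    case CR_SERVER_GONE_ERROR:
    case CR_SERVER_LOST:
    case ER_CON_COUNT_ERROR:
    case ER_SERVER_SHUTDOWN:
      return true;
    default:
      return false;
  }
}

enum class Outcome { Connected, FatalError, RetriesExhausted };

// Hypothetical driver: `attempt` returns 0 on success or an error number.
// Simplification: one initial attempt plus up to `master_retry_count`
// retries, with no sleep between attempts.
Outcome run_with_retries(const std::function<unsigned()>& attempt,
                         unsigned long master_retry_count) {
  for (unsigned long tries = 0;; ++tries) {
    const unsigned err = attempt();
    if (err == 0) return Outcome::Connected;
    // Non-network errors are fatal: give up without retrying.
    if (!is_network_error_sketch(err)) return Outcome::FatalError;
    // Network error: retry until the budget is used up.
    if (tries >= master_retry_count) return Outcome::RetriesExhausted;
  }
}
```

With this shape, an attempt that fails with 2013 (CR_SERVER_LOST) is retried, while a non-network error such as a syntax error ends the loop on the first failure.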

Meanwhile, if you are concerned about this issue, configuring MASTER_HEARTBEAT_PERIOD = 0 should make it go away, although of course disabling heartbeats can have other consequences: if the slave has idle periods longer than slave_net_timeout, it will increase the number of reconnects. It's probably not a big deal when the connection is that poor, but it should be taken into account anyway.

Comment by Elena Stepanova [ 2014-04-30 ]

Here is the patch in MySQL 5.6 tree where (I think) the problem was fixed:

                    revno: 2661.723.1
                    revision-id: aelkin@mysql.com-20100528094719-b7n5o9oei91y89uu
                    parent: li-bing.song@sun.com-20100402070652-3sc1o9xna50htmfv
                    committer: Andrei Elkin <aelkin@mysql.com>
                    branch nick: rep2-wl2540-checksum
                    timestamp: Fri 2010-05-28 12:47:19 +0300
                    message:
                      wl#2540 replication checksum
                      
                      intermediate changeset implements the task w/o relying on FD (to be refined by the following patch) as well as with per-event (A) (should be removed from all but FD). The 3rd todo will be correct affected tests because of FD is going to be extended by (A) size of 1 bytte. Finally, merging with fixes for bug#49741 shall complete the show

It sounds unrelated, but it contains, among other things, this diff (it's not the full hunk, just the part that seems most relevant):

--- sql/slave.cc        2010-02-12 23:30:44 +0000
+++ sql/slave.cc        2010-05-28 09:47:19 +0000
@@ -1456,19 +1456,59 @@
     llstr((ulonglong) (mi->heartbeat_period*1000000000UL), llbuf);
     my_sprintf(query, (query, query_format, llbuf));
 
-    if (mysql_real_query(mysql, query, strlen(query))
-        && !check_io_slave_killed(mi->io_thd, mi, NULL))
+    if (mysql_real_query(mysql, query, strlen(query)))
     {
-      errmsg= "The slave I/O thread stops because SET @master_heartbeat_period "
-        "on master failed.";
-      err_code= ER_SLAVE_FATAL_ERROR;
-      sprintf(err_buff, "%s Error: %s", errmsg, mysql_error(mysql));
+      if (is_network_error(mysql_errno(mysql)))
+      {
+        mi->report(WARNING_LEVEL, mysql_errno(mysql),
+                   "SET @master_heartbeat_period to master failed with error: %s",
+                   mysql_error(mysql));
+        mysql_free_result(mysql_store_result(mysql));
+        goto network_err;
+      }
+      else
+      {
+        /* Fatal error */
+        errmsg= "The slave I/O thread stops because a fatal error is encountered "
+          " when it tries to SET @master_heartbeat_period on master.";
+        err_code= ER_SLAVE_FATAL_ERROR;
+        sprintf(err_buff, "%s Error: %s", errmsg, mysql_error(mysql));
+        mysql_free_result(mysql_store_result(mysql));
+        goto err;
+      }
+    }
+    mysql_free_result(mysql_store_result(mysql));

Comment by BELUGABEHR [ 2014-05-01 ]

Thanks for the great insight.

How do I interrogate this variable on my system?

It is not listed in "SHOW SLAVE STATUS".

https://mariadb.com/kb/en/show-slave-status/

Comment by Elena Stepanova [ 2014-05-01 ]

If you mean master-retry-count, currently you can't. There was an upstream bug report about it, http://bugs.mysql.com/bug.php?id=44486 . Since this bugfix was a separate commit, hopefully it will make it into the 10.0 tree along with other 5.6 bugfixes as part of the MDEV-5242 activity. I'll add a note to the latter task, just in case (although it won't make it into the 10.0.11 release, which is due any day; it will happen later).

Comment by BELUGABEHR [ 2014-05-01 ]

Many thanks!

Not sure how you want to handle this ticket, but I will monitor for the changes on MDEV-5242.

Comment by Kristian Nielsen [ 2014-06-17 ]

Pushed into 5.5.39

Comment by Anton Avramov [ 2014-10-02 ]

This bug states it was fixed in 10.0.13, however I keep experiencing it in
Server version: 10.0.13-MariaDB-1~squeeze-log mariadb.org binary distribution
I see in the logs:
Oct 2 13:33:26 taisho mysqld: 141002 13:33:26 [ERROR] Master 'geo-milev.taisho': Slave I/O: Setting @mariadb_slave_capability failed with error: Lost connection to MySQL server during query, Internal MariaDB error code: 2013
Oct 2 13:33:26 taisho mysqld: 141002 13:33:26 [Note] Master 'geo-milev.taisho': Slave I/O thread exiting, read up to log 'mysql-bin.000562', position 140594

It seems that this is one of those network issues that is not in the list.
