[MDEV-6188] master_retry_count (ignored if disconnect happens on SET master_heartbeat_period) - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 5.5.38, 10.0.10
Fix Version/s: 5.5.39, 10.0.13
Component/s: Replication
Labels:
- replication
- slave

Description

Hello,

The documentation for the "master_retry_count" variable marks it as deprecated:

https://mariadb.com/kb/en/replication-and-binary-log-server-system-variables/#master_retry_count

However, there is no indication on how to go about enabling this feature. It appears that there may be a larger issue here:

http://serverfault.com/questions/522207/mariadb-replication-not-auto-reconnecting

I'm using 10.0.10

Thanks!

Issue Links

relates to: MDEV-25674 No SQL variable for master_retry_count setting (Closed)

Activity

Elena Stepanova added a comment - 2014-04-30 19:31

Hi.

Thanks for noticing. I've updated the page: in MariaDB as of 10.0.10, the option is not deprecated.
In MySQL 5.6 it is, and there is new syntax in CHANGE MASTER to set the value, but it is not in MariaDB 10.0 (yet).

Regarding the question at serverfault.com (unrelated to the documentation issue):

The problem here is that master_retry_count works most of the time (errors upon reading packets, connecting to the master, etc.). But in some special cases, for example while trying to execute a query on the master, such as SET master_heartbeat_period, if the query fails for whatever reason, the slave considers it a fatal error and gives up.

From all I see, it's a bug present in MariaDB, including current 10.0, and MySQL up to 5.5. It was fixed in MySQL 5.6: the error code returned upon query execution is checked against the list of "network failures", and if it's one of those, the connection retry happens as usual.
I think we need to merge the fix into 10.0, since it was committed as a part of a seemingly unrelated task, and thus can easily be missed.
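
To make the difference concrete, here is a minimal sketch of the retry policy described above. It is not server code: `QueryResult`, `run_with_retries`, and the exact meaning given to `retry_count` (that many retries after the first attempt) are assumptions made for illustration only.

```cpp
#include <functional>

// Outcome of one attempt to connect and run the setup queries on the master.
enum class QueryResult { Ok, NetworkError, FatalError };

// Hypothetical model of the I/O thread's reconnect policy.
// `attempt` stands in for "connect and run the setup queries".
// `retry_count` plays the role of master_retry_count; in this sketch it is
// the number of retries allowed after the first attempt (an assumption).
// Returns true if some attempt succeeds, false if the thread gives up.
bool run_with_retries(const std::function<QueryResult()>& attempt,
                      unsigned long retry_count,
                      unsigned long* attempts_made = nullptr)
{
    unsigned long attempts = 0;
    bool ok = false;
    for (;;) {
        ++attempts;
        const QueryResult r = attempt();
        if (r == QueryResult::Ok) { ok = true; break; }
        // Fatal errors stop the thread at once. Before the fix, a failed
        // SET @master_heartbeat_period always took this path.
        if (r == QueryResult::FatalError) break;
        // Network errors are retried until the budget is used up.
        if (attempts > retry_count) break;
        // A real server would wait for the connect-retry interval here.
    }
    if (attempts_made != nullptr) *attempts_made = attempts;
    return ok;
}
```

With this policy, a query that fails twice with a network error and then succeeds costs three attempts, while a single fatal error ends the loop after one attempt regardless of the retry budget.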

Meanwhile, if you are concerned about this issue, configuring MASTER_HEARTBEAT_PERIOD = 0 should make it go away, although of course disabling heartbeats can have other consequences: if the slave has idle periods longer than slave_net_timeout, it will increase the number of reconnects. It's probably not a big deal when the connection is that poor, but should be taken into account anyway.
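
The trade-off can be sketched as a small predicate. This is a simplified model under stated assumptions, not server code: `idle_triggers_reconnect` is a hypothetical helper, a heartbeat period of 0 is taken to mean heartbeats are disabled, and all values are in seconds.

```cpp
// Hypothetical model: does an idle period make the slave treat the
// connection as broken and reconnect?
// heartbeat_period == 0 is assumed to mean "heartbeats disabled".
bool idle_triggers_reconnect(double idle_seconds,
                             double slave_net_timeout,
                             double heartbeat_period)
{
    // If heartbeats arrive more often than the timeout, an idle master
    // still produces traffic, so the timeout does not fire in this model.
    if (heartbeat_period > 0.0 && heartbeat_period < slave_net_timeout)
        return false;
    // Otherwise the slave hears nothing while the master is idle and
    // gives up on the connection once the silence exceeds the timeout.
    return idle_seconds > slave_net_timeout;
}
```

For example, with slave_net_timeout at 60, a 600-second idle period causes no reconnect when the heartbeat period is 30, but does cause one when heartbeats are disabled.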

Elena Stepanova added a comment - 2014-04-30 19:39

Here is the patch in MySQL 5.6 tree where (I think) the problem was fixed:

    revno: 2661.723.1
    revision-id: aelkin@mysql.com-20100528094719-b7n5o9oei91y89uu
    parent: li-bing.song@sun.com-20100402070652-3sc1o9xna50htmfv
    committer: Andrei Elkin <aelkin@mysql.com>
    branch nick: rep2-wl2540-checksum
    timestamp: Fri 2010-05-28 12:47:19 +0300
    message:
      wl#2540 replication checksum
      intermediate changeset implements the task w/o relying on FD (to be refined by the following patch) as well as with per-event (A) (should be removed from all but FD). The 3rd todo will be correct affected tests because of FD is going to be extended by (A) size of 1 bytte. Finally, merging with fixes for bug#49741 shall complete the show

It sounds unrelated, but it contains, among other things, this diff (it's not the full hunk, just the part that seems most relevant):

--- sql/slave.cc        2010-02-12 23:30:44 +0000
+++ sql/slave.cc        2010-05-28 09:47:19 +0000
@@ -1456,19 +1456,59 @@
     llstr((ulonglong) (mi->heartbeat_period*1000000000UL), llbuf);
     my_sprintf(query, (query, query_format, llbuf));
-    if (mysql_real_query(mysql, query, strlen(query))
-        && !check_io_slave_killed(mi->io_thd, mi, NULL))
+    if (mysql_real_query(mysql, query, strlen(query)))
     {
-      errmsg= "The slave I/O thread stops because SET @master_heartbeat_period "
-        "on master failed.";
-      err_code= ER_SLAVE_FATAL_ERROR;
-      sprintf(err_buff, "%s Error: %s", errmsg, mysql_error(mysql));
+      if (is_network_error(mysql_errno(mysql)))
+      {
+        mi->report(WARNING_LEVEL, mysql_errno(mysql),
+                   "SET @master_heartbeat_period to master failed with error: %s",
+                   mysql_error(mysql));
+        mysql_free_result(mysql_store_result(mysql));
+        goto network_err;
+      }
+      else
+      {
+        /* Fatal error */
+        errmsg= "The slave I/O thread stops because a fatal error is encountered "
+          " when it tries to SET @master_heartbeat_period on master.";
+        err_code= ER_SLAVE_FATAL_ERROR;
+        sprintf(err_buff, "%s Error: %s", errmsg, mysql_error(mysql));
+        mysql_free_result(mysql_store_result(mysql));
+        goto err;
+      }
+    }
+    mysql_free_result(mysql_store_result(mysql));
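
The diff calls is_network_error() but does not show its body. As a rough idea of what such a check might look like, here is a minimal sketch. The function name `looks_like_network_error` and the constant names are hypothetical, and the set of codes is an assumption limited to four well-known client-side connection errors; the server's real list may be different or longer.

```cpp
// Client library error numbers for connection-level failures.
// The numeric values correspond to the client error codes
// CR_CONNECTION_ERROR, CR_CONN_HOST_ERROR, CR_SERVER_GONE_ERROR
// and CR_SERVER_LOST; the constant names here are illustrative.
constexpr unsigned int kConnectionError = 2002;  // cannot connect via local socket
constexpr unsigned int kConnHostError   = 2003;  // cannot connect to host
constexpr unsigned int kServerGoneError = 2006;  // server has gone away
constexpr unsigned int kServerLost      = 2013;  // lost connection during query

// Hypothetical stand-in for is_network_error(): returns true when the
// error number belongs to the (assumed) list of transient network failures,
// meaning the caller should reconnect instead of stopping.
bool looks_like_network_error(unsigned int errorno)
{
    switch (errorno) {
    case kConnectionError:
    case kConnHostError:
    case kServerGoneError:
    case kServerLost:
        return true;
    default:
        return false;
    }
}
```

Under this sketch, error 2013 ("Lost connection to MySQL server during query") counts as a network failure and leads to a retry, while an unrelated error such as 1064 (syntax error) stays fatal.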

BELUGABEHR added a comment - 2014-05-01 02:22

Thanks for the great insight.

How do I interrogate this variable on my system?

It is not listed in "SHOW SLAVE STATUS".

https://mariadb.com/kb/en/show-slave-status/

Elena Stepanova added a comment - 2014-05-01 02:51

If you mean master-retry-count, currently you can't. There was an upstream bug report about it, http://bugs.mysql.com/bug.php?id=44486 . Since this bugfix was a separate commit, hopefully it will make it into the 10.0 tree along with other 5.6 bugfixes as a part of the MDEV-5242 activity. I'll add a note to the latter task, just in case (although it won't make it into the 10.0.11 release, which is due any day; it will happen later).

BELUGABEHR added a comment - 2014-05-01 02:57

Many thanks!

Not sure how you want to handle this ticket, but I will monitor for the changes on MDEV-5242.

Kristian Nielsen added a comment - 2014-06-17 16:05

Pushed into 5.5.39

Anton Avramov added a comment - 2014-10-02 15:38

This bug states it was fixed in 10.0.13, however I keep experiencing it in
Server version: 10.0.13-MariaDB-1~squeeze-log mariadb.org binary distribution
I see in the logs:
Oct 2 13:33:26 taisho mysqld: 141002 13:33:26 [ERROR] Master 'geo-milev.taisho': Slave I/O: Setting @mariadb_slave_capability failed with error: Lost connection to MySQL server during query, Internal MariaDB error code: 2013
Oct 2 13:33:26 taisho mysqld: 141002 13:33:26 [Note] Master 'geo-milev.taisho': Slave I/O thread exiting, read up to log 'mysql-bin.000562', position 140594

It seems that this is one of those network issues that is not in the list.

People

Assignee: Kristian Nielsen

Reporter: BELUGABEHR

Votes: 0

Watchers: 4

Dates

Created: 2014-04-30 01:02

Updated: 2024-03-07 09:47

Resolved: 2014-06-17 16:05

MariaDB Server