  1. MariaDB Server
  2. MDEV-6188

master_retry_count (ignored if disconnect happens on SET master_heartbeat_period)

Details

    Description

      Hello,

      The documentation for the "master_retry_count" variable marks it as deprecated:

      https://mariadb.com/kb/en/replication-and-binary-log-server-system-variables/#master_retry_count

      However, there is no indication of how to enable this feature. It appears that there may be a larger issue here:

      http://serverfault.com/questions/522207/mariadb-replication-not-auto-reconnecting

      I'm using 10.0.10

      Thanks!

          Activity

            elenst Elena Stepanova added a comment -

            Hi.

            Thanks for noticing. I've updated the page: in MariaDB as of 10.0.10, the option is not deprecated.
            In MySQL 5.6 it is deprecated, and there is new syntax in CHANGE MASTER to set the value, but that syntax is not in MariaDB 10.0 (yet).

            Regarding the question at serverfault.com (unrelated to the documentation issue):

            The problem here is that master_retry_count works most of the time (errors upon reading packets, connecting to the master, etc.). But in some special cases, for example while the slave is executing a query on the master, such as SET master_heartbeat_period, if the query fails for whatever reason, the slave considers it a fatal error and gives up.

            From all I see, it's a bug present in MariaDB, including current 10.0, and in MySQL up to 5.5. It was fixed in MySQL 5.6: the error code returned upon query execution is checked against the list of "network failures", and if it's one of those, the connection retry happens as usual.
            I think we need to merge the fix into 10.0; since it was committed as a part of a seemingly unrelated task, it can easily be missed.

            Meanwhile, if you are concerned about this issue, configuring MASTER_HEARTBEAT_PERIOD = 0 (which disables heartbeats) should make it go away. Of course, disabling heartbeats can have other consequences: if the slave has idle periods longer than slave_net_timeout, the number of reconnects will increase. It's probably not a big deal when the connection is that poor, but it should be taken into account anyway.

            elenst Elena Stepanova added a comment -

            Here is the patch in MySQL 5.6 tree where (I think) the problem was fixed:

                                revno: 2661.723.1
                                revision-id: aelkin@mysql.com-20100528094719-b7n5o9oei91y89uu
                                parent: li-bing.song@sun.com-20100402070652-3sc1o9xna50htmfv
                                committer: Andrei Elkin <aelkin@mysql.com>
                                branch nick: rep2-wl2540-checksum
                                timestamp: Fri 2010-05-28 12:47:19 +0300
                                message:
                                  wl#2540 replication checksum
                                  
                                  intermediate changeset implements the task w/o relying on FD (to be refined by the following patch) as well as with per-event (A) (should be removed from all but FD). The 3rd todo will be correct affected tests because of FD is going to be extended by (A) size of 1 bytte. Finally, merging with fixes for bug#49741 shall complete the show

            It sounds unrelated, but it contains, among other things, this diff (it's not the full hunk, just the part that seems most relevant):

            --- sql/slave.cc        2010-02-12 23:30:44 +0000
            +++ sql/slave.cc        2010-05-28 09:47:19 +0000
            @@ -1456,19 +1456,59 @@
                 llstr((ulonglong) (mi->heartbeat_period*1000000000UL), llbuf);
                 my_sprintf(query, (query, query_format, llbuf));
             
            -    if (mysql_real_query(mysql, query, strlen(query))
            -        && !check_io_slave_killed(mi->io_thd, mi, NULL))
            +    if (mysql_real_query(mysql, query, strlen(query)))
                 {
            -      errmsg= "The slave I/O thread stops because SET @master_heartbeat_period "
            -        "on master failed.";
            -      err_code= ER_SLAVE_FATAL_ERROR;
            -      sprintf(err_buff, "%s Error: %s", errmsg, mysql_error(mysql));
            +      if (is_network_error(mysql_errno(mysql)))
            +      {
            +        mi->report(WARNING_LEVEL, mysql_errno(mysql),
            +                   "SET @master_heartbeat_period to master failed with error: %s",
            +                   mysql_error(mysql));
            +        mysql_free_result(mysql_store_result(mysql));
            +        goto network_err;
            +      }
            +      else
            +      {
            +        /* Fatal error */
            +        errmsg= "The slave I/O thread stops because a fatal error is encountered "
            +          " when it tries to SET @master_heartbeat_period on master.";
            +        err_code= ER_SLAVE_FATAL_ERROR;
            +        sprintf(err_buff, "%s Error: %s", errmsg, mysql_error(mysql));
            +        mysql_free_result(mysql_store_result(mysql));
            +        goto err;
            +      }
            +    }
            +    mysql_free_result(mysql_store_result(mysql));

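
            The decision the diff introduces can be sketched in isolation. This is a minimal illustration, not the server code: the function names is_network_error_sketch and after_setup_query, the IoThreadAction enum, and the exact membership of the "network failures" list are assumptions made for the example (the list actually used in the server may differ). The error numbers are the published client/server codes, defined locally so the sketch compiles without the MySQL headers. The retry count is treated as a plain upper bound; any special-case values of master_retry_count are deliberately left out.

```cpp
// Published error numbers (errmsg.h / mysqld_error.h), redefined locally
// so that this sketch has no dependency on the MySQL client headers.
constexpr unsigned int CR_CONNECTION_ERROR  = 2002;
constexpr unsigned int CR_CONN_HOST_ERROR   = 2003;
constexpr unsigned int CR_SERVER_GONE_ERROR = 2006;
constexpr unsigned int CR_SERVER_LOST       = 2013;
constexpr unsigned int ER_CON_COUNT_ERROR   = 1040;
constexpr unsigned int ER_SERVER_SHUTDOWN   = 1053;

// What the slave I/O thread does next (hypothetical type for this sketch).
enum class IoThreadAction { Continue, Reconnect, Stop };

// ASSUMPTION: membership of the "network failures" list is illustrative;
// the real is_network_error() in the server may check other codes.
constexpr bool is_network_error_sketch(unsigned int errorno)
{
  return errorno == CR_CONNECTION_ERROR  ||
         errorno == CR_CONN_HOST_ERROR   ||
         errorno == CR_SERVER_GONE_ERROR ||
         errorno == CR_SERVER_LOST       ||
         errorno == ER_CON_COUNT_ERROR   ||
         errorno == ER_SERVER_SHUTDOWN;
}

// Hypothetical helper: decide what to do after a setup query such as
// SET @master_heartbeat_period returned `errorno` (0 means success).
//  - success                        -> carry on
//  - non-network error              -> fatal, stop the I/O thread
//  - network error, retries remain  -> reconnect and try again
//  - network error, retries used up -> stop
constexpr IoThreadAction after_setup_query(unsigned int errorno,
                                           unsigned long retries_done,
                                           unsigned long retry_limit)
{
  return errorno == 0                        ? IoThreadAction::Continue
       : !is_network_error_sketch(errorno)   ? IoThreadAction::Stop
       : retries_done < retry_limit          ? IoThreadAction::Reconnect
                                             : IoThreadAction::Stop;
}
```

            Before the fix, the equivalent of after_setup_query returned Stop for every non-zero error, which is why a dropped connection during SET @master_heartbeat_period bypassed the retry logic entirely.
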
            belugabehr BELUGABEHR added a comment -

            Thanks for the great insight.

            How do I interrogate this variable on my system?

            It is not listed in "SHOW SLAVE STATUS".

            https://mariadb.com/kb/en/show-slave-status/


            elenst Elena Stepanova added a comment -

            If you mean master-retry-count, currently you can't. There was an upstream bug report about it, http://bugs.mysql.com/bug.php?id=44486. Since this bugfix was a separate commit, hopefully it will make it into the 10.0 tree along with other 5.6 bugfixes as part of the MDEV-5242 activity. I'll add a note to the latter task, just in case (although it won't make it into the 10.0.11 release, which is due any day; it will happen later).

            belugabehr BELUGABEHR added a comment -

            Many thanks!

            Not sure how you want to handle this ticket, but I will monitor for the changes on MDEV-5242.


            knielsen Kristian Nielsen added a comment -

            Pushed into 5.5.39

            lukav Anton Avramov added a comment -

            This bug states it was fixed in 10.0.13; however, I keep experiencing it in
            Server version: 10.0.13-MariaDB-1~squeeze-log mariadb.org binary distribution
            I see in the logs:
            Oct 2 13:33:26 taisho mysqld: 141002 13:33:26 [ERROR] Master 'geo-milev.taisho': Slave I/O: Setting @mariadb_slave_capability failed with error: Lost connection to MySQL server during query, Internal MariaDB error code: 2013
            Oct 2 13:33:26 taisho mysqld: 141002 13:33:26 [Note] Master 'geo-milev.taisho': Slave I/O thread exiting, read up to log 'mysql-bin.000562', position 140594

            It seems that this is one of those network issues that is not in the list.


            People

              knielsen Kristian Nielsen
              belugabehr BELUGABEHR