Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30232

rpl.rpl_gtid_crash fails sporadically in BB with Timeout wait for SQL thread to catch up with IO thread

Details

    Description

      https://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/34174

      10.8 e8a2a70cf

      rpl.rpl_gtid_crash 'innodb,row'          w4 [ fail ]
              Test ended at 2022-03-08 06:14:34
       
      CURRENT_TEST: rpl.rpl_gtid_crash
      mysqltest: At line 367: Timeout wait for SQL thread to catch up with IO thread
       
      The result from queries just before the failure was:
      < snip >
      call mtr.add_suppression("Unexpected change of master binlog file name in the middle of GTID");
      set sql_log_bin= 1;
      connection server_1;
      SET GLOBAL debug_dbug="+d,inject_error_writing_xid";
      BEGIN;
      INSERT INTO t1 VALUES (11);
      COMMIT;
      ERROR HY000: Error writing file 'master-bin' (errno: 28 "No space left on device")
      SET GLOBAL debug_dbug="+d,crash_dispatch_command_before";
      COMMIT;
      Got one of the listed errors
      SELECT @@GLOBAL.server_id;
      @@GLOBAL.server_id
      3
      SELECT * from t1 WHERE a > 10 ORDER BY a;
      a
      gtid_check
      Binlog pos ok
      # Wait 30 seconds for SQL thread to catch up with IO thread
      connection server_2;
      

      Also seen to have failed at the second wait for the SQL thread:

      10.6 c4ce012e4

      rpl.rpl_gtid_crash 'innodb,row'          w4 [ fail ]
              Test ended at 2022-11-07 19:05:14
       
      CURRENT_TEST: rpl.rpl_gtid_crash
      mysqltest: At line 464: Timeout wait for SQL thread to catch up with IO thread
       
      The result from queries just before the failure was:
      < snip >
      BEGIN;
      INSERT INTO t1 VALUES (21);
      COMMIT;
      ERROR HY000: Error writing file 'master-bin' (errno: 28 "No space left on device")
      SET GLOBAL debug_dbug="+d,crash_dispatch_command_before";
      COMMIT;
      Got one of the listed errors
      SELECT @@GLOBAL.server_id;
      @@GLOBAL.server_id
      1
      SELECT * from t1 WHERE a > 10 ORDER BY a;
      a
      13
      14
      gtid_check
      Binlog pos ok
      gtid_check
      Current pos ok
      # Wait 30 seconds for SQL thread to catch up with IO thread
      connection server_2;
      

      Attachments

        Issue Links

          Activity

            Elkin Andrei Elkin added a comment -

            A warning needs suppression:

            https://buildbot.mariadb.org/#/builders/572/builds/4473/steps/8/logs/stdio

            rpl.rpl_gtid_crash 'innodb,row' w14 [ fail ] Found warnings/errors in server log file!
            Test ended at 2023-11-22 14:23:59
            line
            2023-11-22 14:23:50 5 [Warning] Slave I/O: SET @master_heartbeat_period to master failed with error: Lost connection to server during query, Internal MariaDB error code: 2013
            ^ Found warnings in /home/buildbot/amd64-debian-11-msan/build/mysql-test/var/14/log/mysqld.2.err

            According to the slave side error log it may happened.

            Elkin Andrei Elkin added a comment - A warning needs suppression: https://buildbot.mariadb.org/#/builders/572/builds/4473/steps/8/logs/stdio rpl.rpl_gtid_crash 'innodb,row' w14 [ fail ] Found warnings/errors in server log file! Test ended at 2023-11-22 14:23:59 line 2023-11-22 14:23:50 5 [Warning] Slave I/O: SET @master_heartbeat_period to master failed with error: Lost connection to server during query, Internal MariaDB error code: 2013 ^ Found warnings in /home/buildbot/amd64-debian-11-msan/build/mysql-test/var/14/log/mysqld.2.err According to the slave side error log it may happened.

            The root cause of this appears to be a problem in the linux kernel networking code:

            https://lore.kernel.org/netdev/87sf0ldk41.fsf@urd.knielsen-hq.org/T/#u

            When the race is hit, in the test case rpl.rpl_gtid_crash, the slave gets a successful socket reconnection to the (old, crashed) master, but then times out in poll() during the connection setup because of the bug/race in the kernel. Since the connect timeout happens to be longer than the 60 seconds timeout coded in the test case, this ends up as a test case timeout failure.

            Given that the root cause is a problem in the OS, a work-around could be implemented in the test to simply reduce the connection timeout so that the test case will complete even when it gets a stale reconnect.

            knielsen Kristian Nielsen added a comment - The root cause of this appears to be a problem in the linux kernel networking code: https://lore.kernel.org/netdev/87sf0ldk41.fsf@urd.knielsen-hq.org/T/#u When the race is hit, in the test case rpl.rpl_gtid_crash, the slave gets a successful socket reconnection to the (old, crashed) master, but then times out in poll() during the connection setup because of the bug/race in the kernel. Since the connect timeout happens to be longer than the 60 seconds timeout coded in the test case, this ends up as a test case timeout failure. Given that the root cause is a problem in the OS, a work-around could be implemented in the test to simply reduce the connection timeout so that the test case will complete even when it gets a stale reconnect.

            I think it is --slave-net-timeout that gets triggered by the Linux kernel bug and needs to be reduced (from the default 120 currently).

            The failure is quite tricky to reproduce, but does occur relatively frequently in buildbot.

            Need to decide if the value should be reduced only in this rpl.rpl_gtid_crash test case, or should be reduced generally for all tests, in case others are affected by this as well.

            knielsen Kristian Nielsen added a comment - I think it is --slave-net-timeout that gets triggered by the Linux kernel bug and needs to be reduced (from the default 120 currently). The failure is quite tricky to reproduce, but does occur relatively frequently in buildbot. Need to decide if the value should be reduced only in this rpl.rpl_gtid_crash test case, or should be reduced generally for all tests, in case others are affected by this as well.

            Fix for this in st-10.5-knielsen, but I'm prevented from pushing by broken branch protection.
            Feel free to pull if desired.

            knielsen Kristian Nielsen added a comment - Fix for this in st-10.5-knielsen, but I'm prevented from pushing by broken branch protection. Feel free to pull if desired.

            Pushed to 10.5.

            knielsen Kristian Nielsen added a comment - Pushed to 10.5.

            People

              knielsen Kristian Nielsen
              angelique.sklavounos Angelique Sklavounos (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.