Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30214

rpl.rpl_manual_change_index_file occasionally fails in BB with Result length mismatch

Details

    Description

      https://buildbot.mariadb.org/#/builders/208/builds/9538

      10.11 12786f0e7

      rpl.rpl_manual_change_index_file 'row'   w30 [ fail ]
              Test ended at 2022-12-12 10:53:42
       
      CURRENT_TEST: rpl.rpl_manual_change_index_file
      --- /home/buildbot/amd64-ubuntu-1804-bigtest/build/mysql-test/suite/rpl/r/rpl_manual_change_index_file.result	2022-12-12 07:52:58.000000000 +0000
      +++ /home/buildbot/amd64-ubuntu-1804-bigtest/build/mysql-test/suite/rpl/r/rpl_manual_change_index_file.reject	2022-12-12 10:53:42.390734033 +0000
      @@ -13,7 +13,7 @@
       call mtr.add_suppression('Got fatal error 1236 from master when reading data from binary log: .*could not find next log');
       connection slave;
       include/wait_for_slave_io_error.inc [errno=1236]
      -Last_IO_Error = 'Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event 'master-bin.000001' at XXX, the last event read from 'master-bin.000002' at XXX, the last byte read from 'master-bin.000002' at XXX.''
      +Last_IO_Error = 'Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event '.' at XXX, the last event read from 'master-bin.000002' at XXX, the last byte read from 'master-bin.000002' at XXX.''
       connection master;
       CREATE TABLE t2(c1 INT);
       FLUSH LOGS;
       
      mysqltest: Result length mismatch
      

      Error logs and var dir attached (mysqld.x.err.30, var_30214.tar.gz).
      Error logs for passing equivalent test in mysqld.x.15. (https://buildbot.mariadb.org/#/builders/208/builds/12052/)

      Attachments

        1. mysqld.1.err.15
          445 kB
        2. mysqld.1.err.30
          458 kB
        3. mysqld.2.err.15
          558 kB
        4. mysqld.2.err.30
          650 kB
        5. var_30214.tar.gz
          1.66 MB

        Activity

          The test appears to always be retry-pass. The result file expects the error message to contain ‘master-bin.000001’, but instead the error log shows the IO thread with an empty string or 'FIRST' for the log file:

          2022-12-12 10:53:42 241 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16540' in log '' at position 4
          2022-12-12 10:53:42 242 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
          2022-12-12 10:53:42 241 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16540',replication started in log 'FIRST' at position 4
          2022-12-12 10:53:42 241 [ERROR] Error reading packet from server: could not find next log; the first event '.' at 4, the last event read from 'master-bin.000002' at 514, the last byte read from 'master-bin.000002' at 562. (server_errno=1236)
          

          This failure was first seen with: https://buildbot.mariadb.org/#/builders/121/builds/6504 and is only seen on 10.10 and higher.

          The error log for the failure shows (attached mysqld.2.err.30) that the CHANGE MASTER statement at the start of the test does not contain the filename for the New state:

          2022-12-12 10:53:42 239 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16540', master_log_file='master-bin.000001', master_log_pos='329'. New state master_host='127.0.0.1', master_port='16540', master_log_file='', master_log_pos='4'.
          2022-12-12 10:53:42 239 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No
          

          before this, when the slave is stopped so CM can be changed to USING_GTID=NO, the IO thread knows of the master log file but not the SQL thread (let’s call this Case A):

          2022-12-12 10:53:42 235 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos
          2022-12-12 10:53:42 237 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16540' in log '' at position 4
          2022-12-12 10:53:42 237 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16540',replication starts at GTID position ''
          2022-12-12 10:53:42 238 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave-relay-bin.000001' position: 4; GTID position ''
          2022-12-12 10:53:42 238 [Note] Slave SQL thread exiting, replication stopped in log 'FIRST' at position 4; GTID position '', master: 127.0.0.1:16540
          2022-12-12 10:53:42 237 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16540
          

          while for the passing test in another build (https://buildbot.mariadb.org/#/builders/208) the error log shows that the CHANGE MASTER statement contains the master_log_file filename:

          2023-04-14 16:13:00 239 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16240', master_log_file='master-bin.000001', master_log_pos='329'. New state master_host='127.0.0.1', master_port='16240', master_log_file='master-bin.000001', master_log_pos='329'.
          2023-04-14 16:13:00 239 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No
          

          and both IO and SQL threads know of the log file:

          2023-04-14 16:13:00 235 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos
          2023-04-14 16:13:00 237 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16240' in log '' at position 4
          2023-04-14 16:13:00 238 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave-relay-bin.000001' position: 4; GTID position ''
          2023-04-14 16:13:00 237 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16240',replication starts at GTID position ''
          2023-04-14 16:13:00 238 [Note] Error reading relay log event: slave SQL thread was killed
          2023-04-14 16:13:00 238 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 329; GTID position '', master: 127.0.0.1:16240
          2023-04-14 16:13:00 237 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16240
          

          Notably, the error log for the failure does NOT show the message “Error reading relay log event: slave SQL thread was killed”

          Elkin I am unsure if there is a timing bug with Case A to investigate. Otherwise, the expected error message can simply be made more general so that it would accept either ‘.’ or ‘master-bin.000001’, or the "USING_GTID=NO" could be removed as the 1236 error is still achievable when the index file is manually changed.

          angelique.sklavounos Angelique Sklavounos (Inactive) added a comment - The test appears to always be retry-pass. The result file expects the error message to contain ‘master-bin.000001’, but instead the error log shows the IO thread with an empty string or 'FIRST' for the log file: 2022-12-12 10:53:42 241 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16540' in log '' at position 4 2022-12-12 10:53:42 242 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4 2022-12-12 10:53:42 241 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16540',replication started in log 'FIRST' at position 4 2022-12-12 10:53:42 241 [ERROR] Error reading packet from server: could not find next log; the first event '.' at 4, the last event read from 'master-bin.000002' at 514, the last byte read from 'master-bin.000002' at 562. (server_errno=1236) This failure was first seen with: https://buildbot.mariadb.org/#/builders/121/builds/6504 and is only seen on 10.10 and higher. The error log for the failure shows (attached mysqld.2.err.30) that the CHANGE MASTER statement at the start of the test does not contain the filename for the New state: 2022-12-12 10:53:42 239 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16540', master_log_file='master-bin.000001', master_log_pos='329'. New state master_host='127.0.0.1', master_port='16540', master_log_file='', master_log_pos='4'. 2022-12-12 10:53:42 239 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No before this, when the slave is stopped so CM can be changed to USING_GTID=NO, the IO thread knows of the master log file but not the SQL thread (let’s call this Case A): 2022-12-12 10:53:42 235 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos 2022-12-12 10:53:42 237 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16540' in log '' at position 4 2022-12-12 10:53:42 237 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16540',replication starts at GTID position '' 2022-12-12 10:53:42 238 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave-relay-bin.000001' position: 4; GTID position '' 2022-12-12 10:53:42 238 [Note] Slave SQL thread exiting, replication stopped in log 'FIRST' at position 4; GTID position '', master: 127.0.0.1:16540 2022-12-12 10:53:42 237 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16540 while for the passing test in another build ( https://buildbot.mariadb.org/#/builders/208 ) the error log shows that the CHANGE MASTER statement contains the master_log_file filename: 2023-04-14 16:13:00 239 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='127.0.0.1', master_port='16240', master_log_file='master-bin.000001', master_log_pos='329'. New state master_host='127.0.0.1', master_port='16240', master_log_file='master-bin.000001', master_log_pos='329'. 2023-04-14 16:13:00 239 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=No and both IO and SQL threads know of the log file: 2023-04-14 16:13:00 235 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos 2023-04-14 16:13:00 237 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16240' in log '' at position 4 2023-04-14 16:13:00 238 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './slave-relay-bin.000001' position: 4; GTID position '' 2023-04-14 16:13:00 237 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16240',replication starts at GTID position '' 2023-04-14 16:13:00 238 [Note] Error reading relay log event: slave SQL thread was killed 2023-04-14 16:13:00 238 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 329; GTID position '', master: 127.0.0.1:16240 2023-04-14 16:13:00 237 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329; GTID position , master 127.0.0.1:16240 Notably, the error log for the failure does NOT show the message “Error reading relay log event: slave SQL thread was killed” Elkin I am unsure if there is a timing bug with Case A to investigate. Otherwise, the expected error message can simply be made more general so that it would accept either ‘.’ or ‘master-bin.000001’, or the "USING_GTID=NO" could be removed as the 1236 error is still achievable when the index file is manually changed.
          Elkin Andrei Elkin added a comment - - edited

          angelique.sklavounos, I agree with making 'the expected error message ... more general'. After all the test verifies what belongs to
          a class of unspecified behavior (thankfully a way was found to avoid that endless looping).

          If you have it reproducible with rr we could gain some better understanding. But no need to persevere.

          Elkin Andrei Elkin added a comment - - edited angelique.sklavounos , I agree with making 'the expected error message ... more general'. After all the test verifies what belongs to a class of unspecified behavior (thankfully a way was found to avoid that endless looping). If you have it reproducible with rr we could gain some better understanding. But no need to persevere.

          People

            angelique.sklavounos Angelique Sklavounos (Inactive)
            angelique.sklavounos Angelique Sklavounos (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.