Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30929

Sporadic failure in spider.spider_fixes_part

Details

    Description

      While preparing MariaDB 10.11.2 for Debian I noticed in https://launchpadlibrarian.net/657664639/buildlog_ubuntu-lunar-ppc64el.mariadb_1%3A10.11.2-2~ubuntu23.04.1~1679807146.b55a2e269b8.dev.otto_BUILDING.txt.gz that test spider.spider_fixes_part failed with "mysqltest: At line 492: sync_with_master failed: 'select master_pos_wait('master-bin.000003', 692, 300, '')' returned NULL indicating slave SQL thread failure":

      spider.spider_fixes_part                 w2 [ fail ]
              Test ended at 2023-03-26 05:57:41
       
      CURRENT_TEST: spider.spider_fixes_part
      analyze: sync_with_master
      mysqltest: At line 492: sync_with_master failed: 'select master_pos_wait('master-bin.000003', 692, 300, '')' returned NULL indicating slave SQL thread failure
       
      The result from queries just before the failure was:
      < snip >
      KEY idx1(b),
      KEY idx2(c)
      ) MASTER_1_ENGINE MASTER_1_CHARSET MASTER_1_COMMENT3_P_2_1
      INSERT INTO ta_l_int (a, b, c) VALUES (1, 2, 3);
      INSERT INTO ta_l_int (a, b, c) SELECT a + 1, b + 1, c + 1 FROM ta_l_int;
      INSERT INTO ta_l_int (a, b, c) SELECT a + 2, b + 2, c + 2 FROM ta_l_int;
      INSERT INTO ta_l_int (a, b, c) SELECT a + 4, b + 4, c + 4 FROM ta_l_int;
      INSERT INTO ta_l_int (a, b, c) SELECT a + 8, b + 8, c + 8 FROM ta_l_int;
      connection master_1;
      SELECT a, b, c FROM ta_l_int force index(primary, idx1, idx2)
      WHERE a = 5 OR b = 5 OR c = 5 ORDER BY a;
      a	b	c
      3	4	5
      4	5	6
      5	6	7
       
      2.26
      auto_increment with partition
      connection master_1;
      connection slave1_1;
       
      More results from queries before failure can be found in /<<PKGBUILDDIR>>/builddir/mysql-test/var/2/log/spider_fixes_part.log
       
       
       == /<<PKGBUILDDIR>>/builddir/mysql-test/var/2/tmp/analyze-sync_with_master-mysqld.1.1.err ==
       
      ############################## default ##############################
       
      **** SHOW WARNINGS on default ****
      SHOW WARNINGS;
      Level	Code	Message
       
      **** SELECT replication-related variables on default ****
      SELECT NOW(), @@SERVER_ID;
      NOW()	@@SERVER_ID
      2023-03-26 05:57:41	1
       
      **** SHOW SLAVE STATUS on default ****
      SHOW SLAVE STATUS;
       
      **** SHOW MASTER STATUS on default ****
      SHOW MASTER STATUS;
      File	master-bin.000003
      Position	692
      Binlog_Do_DB	
      Binlog_Ignore_DB	
       
      **** SHOW SLAVE HOSTS on default ****
      SHOW SLAVE HOSTS;
      Server_id	8
      Host	localhost
      Port	16007
      Master_id	1
       
      **** SHOW PROCESSLIST on default ****
      SHOW PROCESSLIST;
      Id	User	Host	db	Command	Time	State	Info	Progress
      92	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      93	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      94	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      95	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      96	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      97	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      98	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      99	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      100	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      101	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      102	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      103	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      104	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      105	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      106	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      107	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      108	system user		NULL	Sleep	1	Reset for next command	NULL	0.000
      109	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      110	system user		NULL	Daemon	NULL	Spider table background statistics action handler	NULL	0.000
      111	system user		NULL	Daemon	NULL	Spider table background cardinality action handler	NULL	0.000
      112	root	localhost:57406	NULL	Binlog Dump	0	Master has sent all binlog to slave; waiting for more updates	NULL	0.000
      113	root	localhost	NULL	Query	0	starting	SHOW PROCESSLIST	0.000
       
      **** SHOW BINARY LOGS on default ****
      SHOW BINARY LOGS;
      Log_name	File_size
      master-bin.000001	8704
      master-bin.000002	419
      master-bin.000003	692
       
      **** SHOW BINLOG EVENTS on default ****
      binlog_name = 'master-bin.000003'
      SHOW BINLOG EVENTS IN 'master-bin.000003';
      Log_name	Pos	Event_type	Server_id	End_log_pos	Info
      master-bin.000003	4	Format_desc	1	256	Server ver: 10.11.2-MariaDB-2~ubuntu23.04.1~1679807146.b55a2e, Binlog ver: 4
      master-bin.000003	256	Gtid_list	1	299	[0-1-39]
      master-bin.000003	299	Binlog_checkpoint	1	343	master-bin.000002
      master-bin.000003	343	Binlog_checkpoint	1	387	master-bin.000003
      master-bin.000003	387	Gtid	1	429	BEGIN GTID 0-1-40
      master-bin.000003	429	Query	1	620	use `mtr`; INSERT INTO test_suppressions (pattern) VALUES ( NAME_CONST('pattern',_latin1'unknown variable' COLLATE 'latin1_swedish_ci'))
      master-bin.000003	620	Query	1	692	COMMIT
       
      **** SHOW RELAYLOG EVENTS on default ****
      relaylog_name = 'No such row'
      SHOW RELAYLOG EVENTS IN 'No such row';
      Log_name	Pos	Event_type	Server_id	End_log_pos	Info
      connection default;
      

      Attachments

        1. master-bin.000001
          24 kB
        2. master-bin.000002
          1 kB
        3. master-bin.000003
          3 kB
        4. mysqld-relay-bin.000001
          2 kB
        5. spider_fixes_part-1.out
          87 kB
        6. stdout
          2.24 MB

        Issue Links

          Activity

            ycp Yuchen Pei added a comment -

            Hi Elkin, following up on the discussion on slack, can you approve this patch before I push it? Thanks.

            upstream/bb-10.11-mdev-30929 a50417b7ac83940fb542e56f7db343b01750748f
            MDEV-30929 spider.spider_fixes_part: wait and restart slave
             
            In the absence of insight of the cause of spider.spider_fixes_part
            failure as described in MDEV-30929, This is a workaround, which could
            help narrow the possibility down to whether slave SQL thread attempts
            to read from file that maybe not yet on disk. It does not otherwise
            affect the coverage of the test.
             
            I have pushed this commit 4 times, but have yet to encounter the
            failure as described in MDEV-30929, so it could also fix the test and
            stop the CI pollution.
            

            ycp Yuchen Pei added a comment - Hi Elkin , following up on the discussion on slack, can you approve this patch before I push it? Thanks. upstream/bb-10.11-mdev-30929 a50417b7ac83940fb542e56f7db343b01750748f MDEV-30929 spider.spider_fixes_part: wait and restart slave   In the absence of insight of the cause of spider.spider_fixes_part failure as described in MDEV-30929, This is a workaround, which could help narrow the possibility down to whether slave SQL thread attempts to read from file that maybe not yet on disk. It does not otherwise affect the coverage of the test.   I have pushed this commit 4 times, but have yet to encounter the failure as described in MDEV-30929, so it could also fix the test and stop the CI pollution.
            Elkin Andrei Elkin added a comment - - edited

            ycp howdy. As we discussed on slack, there's a pending refinement there. It needs to be made into the main branches where the test fails regularly so we learned necessary details sooner.
            Also let me copy-paste a recommendation

            Considering the relay-log contained just 4 bytes, it could be that slave was not initialized very early, that is here https://github.com/MariaDB/server/blob/10.11/storage/spider/mysql-test/spider/t/slave_test_init.inc#L22
            In the replication suite we do --source include/start_slave.inc
            $SOURCE_DIR/mysql-test/include/start_slave.inc

            that makes sure the slave is really up before to proceed. The spider suite should follow that.

            Elkin Andrei Elkin added a comment - - edited ycp howdy. As we discussed on slack, there's a pending refinement there. It needs to be made into the main branches where the test fails regularly so we learned necessary details sooner. Also let me copy-paste a recommendation Considering the relay-log contained just 4 bytes, it could be that slave was not initialized very early, that is here https://github.com/MariaDB/server/blob/10.11/storage/spider/mysql-test/spider/t/slave_test_init.inc#L22 In the replication suite we do --source include/start_slave.inc $SOURCE_DIR/mysql-test/include/start_slave.inc that makes sure the slave is really up before to proceed. The spider suite should follow that.
            ycp Yuchen Pei added a comment - - edited

            Thanks Elkin, I've updated the patch (see below), please take a look and let me know if it is what you meant, and if so, then I'd like to move the --source include/start_slave.inc change to another MDEV with an earlier fixversion, assuming it is a good change in general (not just for this bug).

            bb-10.11-mdev-30929 2f3a7f0bb99440d0a3e9218128e65c6d7a04f71a
            MDEV-30929 spider.spider_fixes_part: wait and restart slave
             
            In the absence of insight of the cause of spider.spider_fixes_part
            failure as described in MDEV-30929, This is a workaround, which could
            help narrow the possibility down to whether slave SQL thread attempts
            to read from file that maybe not yet on disk. It does not otherwise
            affect the coverage of the test.
             
            I have pushed this commit 4 times, but have yet to encounter the
            failure as described in MDEV-30929, so it could also fix the test and
            stop the CI pollution.
             
            Also replaced START SLAVE; with --source include/start_slave.inc
            inside the slave_test_init.inc files.
            

            ycp Yuchen Pei added a comment - - edited Thanks Elkin , I've updated the patch (see below), please take a look and let me know if it is what you meant, and if so, then I'd like to move the --source include/start_slave.inc change to another MDEV with an earlier fixversion, assuming it is a good change in general (not just for this bug). bb-10.11-mdev-30929 2f3a7f0bb99440d0a3e9218128e65c6d7a04f71a MDEV-30929 spider.spider_fixes_part: wait and restart slave   In the absence of insight of the cause of spider.spider_fixes_part failure as described in MDEV-30929, This is a workaround, which could help narrow the possibility down to whether slave SQL thread attempts to read from file that maybe not yet on disk. It does not otherwise affect the coverage of the test.   I have pushed this commit 4 times, but have yet to encounter the failure as described in MDEV-30929, so it could also fix the test and stop the CI pollution.   Also replaced START SLAVE; with --source include/start_slave.inc inside the slave_test_init.inc files.
            Elkin Andrei Elkin added a comment -

            Looks good enough for me, ycp. Thanks for your patience and for steady progress !

            Elkin Andrei Elkin added a comment - Looks good enough for me, ycp . Thanks for your patience and for steady progress !
            ycp Yuchen Pei added a comment -

            Thanks for the review Elkin, pushing 64314d3094ef5eb52223c613632292be96bd7732 to 10.11.

            > move the --source include/start_slave.inc change to another MDEV with an earlier fixversion,

            Given that there's no way to block this ticket with the propagation of that change from 10.5 to 10.11, we should just push this change including the start slave part, and backport the start slave to 10.5 - opened MDEV-34098.

            ycp Yuchen Pei added a comment - Thanks for the review Elkin , pushing 64314d3094ef5eb52223c613632292be96bd7732 to 10.11. > move the --source include/start_slave.inc change to another MDEV with an earlier fixversion, Given that there's no way to block this ticket with the propagation of that change from 10.5 to 10.11, we should just push this change including the start slave part, and backport the start slave to 10.5 - opened MDEV-34098 .

            People

              ycp Yuchen Pei
              otto Otto Kekäläinen
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.