Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22513

main.processlist_notembedded fails in buildbot with Timeout in wait_until_count_sessions

Details

    Description

      http://buildbot.askmonty.org/buildbot/builders/kvm-rpm-centos74-amd64-debug/builds/4005

      bb-10.2-release 985f63cce1057d7b3fd6e86e65a2a2ee 2020-05-08 19:07:46

      main.processlist_notembedded             w1 [ fail ]
              Test ended at 2020-05-08 17:33:56
       
      CURRENT_TEST: main.processlist_notembedded
      mysqltest: In included file "/usr/share/mysql-test/include/wait_until_count_sessions.inc": 
      included from /usr/share/mysql-test/t/processlist_notembedded.test at line 37:
      At line 125: Timeout in wait_until_count_sessions.inc
       
      The result from queries just before the failure was:
      #
      # MDEV-20466: SHOW PROCESSLIST truncates query text on \0 bytes
      #
      connect  con1,localhost,root,,;
      connection con1;
      SET DEBUG_SYNC= 'before_join_optimize SIGNAL in_sync WAIT_FOR go';
      connection default;
      SET DEBUG_SYNC= 'now WAIT_FOR in_sync';
      FOUND 1 /sleep/ in MDEV-20466.text
      SET DEBUG_SYNC= 'now SIGNAL go';
      disconnect con1;
      SET DEBUG_SYNC = 'RESET';
      # Timeout in wait_until_count_sessions.inc
      # Number of sessions   expected: <= 1   found: 2
      SHOW PROCESSLIST;
      Id	User	Host	db	Command	Time	State	Info	Progress
      11071	root	localhost	test	Query	0	init	SHOW PROCESSLIST	0.000
      11072	root	localhost	test	Query	10	debug sync point: before_join_optimize	SELECT user FROM mysql.user WHERE user ='some
      

      Attachments

        Issue Links

          Activity

            As reported in MDEV-21462, this test was ruining subsequent tests by failing to ensure that all connections terminated.

            My change was merged by the original test author sanja upward. The bb-10.2-release branch where the test failed is a merge from bb-10.1-release. There, that test did not fail anywhere, but notably, kvm-rpm-centos74-amd64 did not run at all on bb-10.1-release.

            I tried and failed to reproduce the failure locally, despite 1,200 repeats. The [ retry-fail ] in the buildbot log looks promising. I did not find anything useful in the server error log.

            I am reassigning this back to the original test author, who also did the merge to bb-10.2-release. I hope that this is well repeatable on the buildbot VM.

            I only fixed MDEV-21462 because the badly written test was causing a potentially large number of subsequent tests to fail. Luckily, this failure appears to be the failure of a single test for a single bug.

            marko Marko Mäkelä added a comment - As reported in MDEV-21462 , this test was ruining subsequent tests by failing to ensure that all connections terminated. My change was merged by the original test author sanja upward. The bb-10.2-release branch where the test failed is a merge from bb-10.1-release. There, that test did not fail anywhere, but notably, kvm-rpm-centos74-amd64 did not run at all on bb-10.1-release. I tried and failed to reproduce the failure locally, despite 1,200 repeats. The [ retry-fail ] in the buildbot log looks promising. I did not find anything useful in the server error log. I am reassigning this back to the original test author, who also did the merge to bb-10.2-release. I hope that this is well repeatable on the buildbot VM. I only fixed MDEV-21462 because the badly written test was causing a potentially large number of subsequent tests to fail. Luckily, this failure appears to be the failure of a single test for a single bug.

            I repeated the hang once locally:

            10.5 0186b0a07760b734af61a9e45ebdb5a077d1872d

            CURRENT_TEST: main.processlist_notembedded
            mysqltest: In included file "./include/wait_until_count_sessions.inc": 
            included from /mariadb/10.5-merge/mysql-test/main/processlist_notembedded.test at line 37:
            At line 125: Timeout in wait_until_count_sessions.inc
             
            The result from queries just before the failure was:
            #
            # MDEV-20466: SHOW PROCESSLIST truncates query text on \0 bytes
            #
            connect  con1,localhost,root,,;
            connection con1;
            SET DEBUG_SYNC= 'before_join_optimize SIGNAL in_sync WAIT_FOR go';
            connection default;
            SET DEBUG_SYNC= 'now WAIT_FOR in_sync';
            FOUND 1 /sleep/ in MDEV-20466.text
            SET DEBUG_SYNC= 'now SIGNAL go';
            disconnect con1;
            SET DEBUG_SYNC = 'RESET';
            # Timeout in wait_until_count_sessions.inc
            # Number of sessions   expected: <= 1   found: 2
            SHOW PROCESSLIST;
            Id	User	Host	db	Command	Time	State	Info	Progress
            122	root	localhost	test	Query	0	starting	SHOW PROCESSLIST	0.000
            123	root	localhost	test	Query	10	debug sync point: before_join_optimize	SELECT user FROM mysql.user WHERE user ='some\0sleep'	0.000
            

            I tried to repeat by running the same sequence of tests that the hung worker was running, since the latest server restart. But, the test would fail to time out:

            while ./mtr --verbose-restart --no-reorder main.opt_trace_security main.opt_trace_ucs2 main.order_by-mdev-10122 main.order_by_optimizer main.order_by_sortkey main.order_by_zerolength-4285 main.order_fill_sortbuf main.outfile main.outfile_loaddata main.overflow main.packet main.parser main.parser_not_embedded main.parser_precedence main.parser_stack main.password_expiration main.perror main.perror-win main.plugin_vars main.preload main.prepare main.processlist main.processlist_notembedded
            do :
            done
            

            Possibly related note: The tests main.partition_debug_sync and main.query_cache_debug occasionally spend 300 extra seconds (5 minutes) real time, apparently due to something DEBUG_SYNC related. The tests would not fail, but they would run for 300xxx milliseconds instead of some tens or hundreds of milliseconds. I have seen this starting with 10.2. It might also be that I have not run tests on 10.1 frequently enough to witness such ‘hang’.

            marko Marko Mäkelä added a comment - I repeated the hang once locally: 10.5 0186b0a07760b734af61a9e45ebdb5a077d1872d CURRENT_TEST: main.processlist_notembedded mysqltest: In included file "./include/wait_until_count_sessions.inc": included from /mariadb/10.5-merge/mysql-test/main/processlist_notembedded.test at line 37: At line 125: Timeout in wait_until_count_sessions.inc   The result from queries just before the failure was: # # MDEV-20466: SHOW PROCESSLIST truncates query text on \0 bytes # connect con1,localhost,root,,; connection con1; SET DEBUG_SYNC= 'before_join_optimize SIGNAL in_sync WAIT_FOR go'; connection default; SET DEBUG_SYNC= 'now WAIT_FOR in_sync'; FOUND 1 /sleep/ in MDEV-20466.text SET DEBUG_SYNC= 'now SIGNAL go'; disconnect con1; SET DEBUG_SYNC = 'RESET'; # Timeout in wait_until_count_sessions.inc # Number of sessions expected: <= 1 found: 2 SHOW PROCESSLIST; Id User Host db Command Time State Info Progress 122 root localhost test Query 0 starting SHOW PROCESSLIST 0.000 123 root localhost test Query 10 debug sync point: before_join_optimize SELECT user FROM mysql.user WHERE user ='some\0sleep' 0.000 I tried to repeat by running the same sequence of tests that the hung worker was running, since the latest server restart. But, the test would fail to time out: while ./mtr --verbose-restart --no-reorder main.opt_trace_security main.opt_trace_ucs2 main.order_by-mdev-10122 main.order_by_optimizer main.order_by_sortkey main.order_by_zerolength-4285 main.order_fill_sortbuf main.outfile main.outfile_loaddata main.overflow main.packet main.parser main.parser_not_embedded main.parser_precedence main.parser_stack main.password_expiration main.perror main.perror-win main.plugin_vars main.preload main.prepare main.processlist main.processlist_notembedded do : done Possibly related note: The tests main.partition_debug_sync and main.query_cache_debug occasionally spend 300 extra seconds (5 minutes) real time, apparently due to something DEBUG_SYNC related. The tests would not fail, but they would run for 300xxx milliseconds instead of some tens or hundreds of milliseconds. I have seen this starting with 10.2. It might also be that I have not run tests on 10.1 frequently enough to witness such ‘hang’.

            I think that the signal can be lost because we blindly disconnect the paused connection, and then discard the signals by SET DEBUG_SYNC = 'RESET'. A reap before the disconnect should fix this.

            marko Marko Mäkelä added a comment - I think that the signal can be lost because we blindly disconnect the paused connection, and then discard the signals by SET DEBUG_SYNC = 'RESET' . A reap before the disconnect should fix this.

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.