MariaDB Server
MDEV-33489

The test atomic.alter_table occasionally times out

Details

    Description

      The test atomic.alter_table occasionally hangs on our CI systems. Here is an example from the oldest affected major version:

      10.6 80ae69c8bc948382920b083ba203369e

      atomic.alter_table 'innodb'              w3 [ fail ]  timeout after 900 seconds
              Test ended at 2023-02-06 20:57:42
       
      Test case timeout after 900 seconds
       
      == /dev/shm/var/3/log/alter_table.log == 
        `c` int(11) DEFAULT NULL,
        KEY `a` (`a`)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci COMMENT='new'
      count(*)
      2
      master-bin.000002	#	Query	#	#	use `test`; ALTER TABLE t1 ADD COLUMN c INT, COMMENT "new", rename test2.t2
      crash point: ddl_log_alter_after_rename_triggers
      t2.frm
      t2.ibd
      "Table is in test2"
      Table	Create Table
      t2	CREATE TABLE `t2` (
        `a` int(11) DEFAULT NULL,
        `b` int(11) DEFAULT NULL,
        `c` int(11) DEFAULT NULL,
        KEY `a` (`a`)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci COMMENT='new'
      count(*)
      2
      master-bin.000002	#	Query	#	#	use `test`; ALTER TABLE t1 ADD COLUMN c INT, COMMENT "new", rename test2.t2
       
       == /dev/shm/var/3/tmp/analyze-timeout-mysqld.1.err ==
      mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/dev/shm/var/tmp/3/mysqld.1.sock' (111)
      

      Stack traces of the apparently hung server process are available. As far as I can tell, all threads are idle in all the stack traces that I have checked; at least InnoDB should not be to blame for the hangs. This could be a bug in mtr itself.

          Activity

            For the record, this 10.11-based hang of a MemorySanitizer build from today does not show up in the cross-reference. The log ends like this:

            refs/pull/3045 3dd7b0a80cb674d0773fd0d25ceb8a5183798597

            Completed: Failed 2/6053 tests, 99.97% were successful.
            Failing test(s): perfschema.table_lock_aggregate_thread_4u_2t atomic.alter_table
            

            Again, all threads in the server appear to be idle, waiting for a connection to be established.

            marko Marko Mäkelä added a comment -

            There appear to be more test failures findable via https://buildbot.mariadb.org/cr/ than through the interface that I was using.

            marko Marko Mäkelä added a comment -
            danblack Daniel Black added a comment -

            Looking at the logs on the MSAN builder, this isn't hung; there is simply so much in the test case that it is taking over 900 seconds (especially with MSAN, and occasionally with ASAN).

            CURRENT_TEST: atomic.alter_table
            $ /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd --defaults-group-suffix=.1 --defaults-file=/home/buildbot/amd64-debian-11-msan/build/mysql-test/var/31/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --log-bin=master-bin --innodb-max-dirty-pages-pct=0 --core-file --loose-debug-sync-timeout=300
            2024-03-27 10:37:11 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32246)
            ...
             
            2024-03-27 10:52:12 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
            2024-03-27 10:52:12 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd: ready for connections.
            Version: '11.5.0-MariaDB-debug-log'  socket: '/home/buildbot/amd64-debian-11-msan/build/mysql-test/var/tmp/31/mysqld.1.sock'  port: 16580  Source distribution
            240327 10:52:13 [ERROR] mysqld got signal 6 ;
            Sorry, we probably made a mistake, and this is a bug.
            

            Exactly 900 seconds.


            The failures are apparently becoming more frequent, especially on some debug-instrumented builders. It looks like the test is not actually hanging but only taking an excessive amount of time. The server error log for a recent failure shows that the server was not hung at the time the test was interrupted:

            11.4 9da7b41151bfeb1b4554f0280a3afe69532d39ed

            CURRENT_TEST: atomic.alter_table
            $ /home/buildbot/s390x-ubuntu-2004-debug/build/sql/mariadbd --defaults-group-suffix=.1 --defaults-file=/home/buildbot/s390x-ubuntu-2004-debug/build/mysql-test/var/2/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --log-bin=master-bin --innodb-max-dirty-pages-pct=0 --core-file --loose-debug-sync-timeout=300
            2024-11-18  4:43:58 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32190)
            2024-11-18  4:43:58 0 [Warning] Changed limits: max_open_files: 1024  max_connections: 151 (was 151)  table_cache: 421 (was 2000)
            2024-11-18  4:43:58 0 [Note] Starting MariaDB 11.4.5-MariaDB-debug-log source revision 9da7b41151bfeb1b4554f0280a3afe69532d39ed server_uid aGXWEBIdjhTxAOY5ySL2hEM4Yyw= as process 174512
            …
            2024-11-18  4:59:00 0 [Note] DDL_LOG: Crash recovery executed 1 entries
            2024-11-18  4:59:00 0 [Note] Server socket created on IP: '127.0.0.1'.
            2024-11-18  4:59:00 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
            2024-11-18  4:59:00 0 [Note] /home/buildbot/s390x-ubuntu-2004-debug/build/sql/mariadbd: ready for connections.
            Version: '11.4.5-MariaDB-debug-log'  socket: '/home/buildbot/s390x-ubuntu-2004-debug/build/mysql-test/var/tmp/2/mysqld.1.sock'  port: 19060  Source distribution
            241118  4:59:06 [ERROR] mysqld got signal 6 ;
            …
            sql/signal_handler.cc:247(handle_fatal_signal)[0x2aa1e7bbadc]
            [0x3ffcf7fb6f6]
            /lib/s390x-linux-gnu/libc.so.6(__poll+0x62)[0x3ffa9cf1c92]
            sql/mysqld.cc:6380(handle_connections_sockets())[0x2aa1e0fe118]
            sql/mysqld.cc:5664(run_main_loop())[0x2aa1e0fc734]
            sql/mysqld.cc:6087(mysqld_main(int, char**))[0x2aa1e0fd72a]
            sql/main.cc:34(main)[0x2aa1e0eee64]
            

            Between the start of the test at 4:43:58 and the externally sent SIGABRT at 4:59:06, 15 minutes and 8 seconds had elapsed. The test case timeout is 900 seconds (15 minutes). Later during this test run, the test did pass on retry:

            11.4 9da7b41151bfeb1b4554f0280a3afe69532d39ed

            atomic.alter_table                       w2 [ retry-pass ]  681758
            

            Other runs of this test, especially starting with the 11.7 branch, are timing out on retry as well.

            I think that this test needs to be split into one combination per storage engine, and if possible, the test needs to be refactored to use multiple connections, so that several ALTER TABLE operations can be interrupted with a single kill+restart cycle. That should improve test coverage as well, because currently it appears to test the recovery of at most one in-progress DDL operation at a time.
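            A sketch of the suggested refactoring, in mysqltest style. Everything here is hypothetical — the connection names, tables, and the crash-handling details are illustrative, not taken from the current test:

            ```
            # Hypothetical sketch: start several ALTER TABLE operations on
            # separate connections, then exercise DDL log recovery for all
            # of them with a single kill+restart instead of one restart each.
            --connect (con1,localhost,root,,test)
            --send ALTER TABLE t1 ADD COLUMN c1 INT
            --connect (con2,localhost,root,,test)
            --send ALTER TABLE t2 ADD COLUMN c2 INT
            --connection default
            # Kill the server while both ALTER TABLE operations are in
            # progress; on restart, DDL_LOG recovery has to roll back or
            # complete both of them at once.
            --source include/kill_and_restart_mysqld.inc
            ```

            With N operations pending per restart, the number of kill+restart cycles — the dominant cost on debug-instrumented builders — would drop by roughly a factor of N, and each restart would cover the recovery of several in-progress DDL operations instead of one.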

            marko Marko Mäkelä added a comment -

            wlad suggested that the test be run with --skip-ssl. I will try that.

            marko Marko Mäkelä added a comment -

            Between 11.4 and 11.8, the test atomic.alter_table was made slower by MDEV-19123 (changing the default collation from latin1) and by the addition of vector indexes (MDEV-34939). I’ll explicitly set the collation to latin1 in this test, because even with --skip-ssl the test would otherwise be very close to the border of timing out on MemorySanitizer.
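            For illustration, the collation can be pinned at table level; the exact statements used in the test may differ:

            ```sql
            -- Hypothetical example: name the old default explicitly, so the
            -- test does not exercise the slower utf8mb4 collation paths
            -- that became the default with MDEV-19123.
            CREATE TABLE t1 (a INT, b INT)
            ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;
            ```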

            marko Marko Mäkelä added a comment -

            Unfortunately, the test still occasionally times out on some IBM System Z (s390x) targets. I believe that the number of server restarts in the test needs to be minimized. Currently, it is using frequent kill+restart cycles for testing only one recovery step at a time.

            marko Marko Mäkelä added a comment -

            People

              marko Marko Mäkelä
              marko Marko Mäkelä

