MariaDB Server
MDEV-33489

The test atomic.alter_table occasionally times out

Details

    Description

      The test atomic.alter_table occasionally hangs on our CI systems. Here is an example from the oldest affected major version:

      10.6 80ae69c8bc948382920b083ba203369e

      atomic.alter_table 'innodb'              w3 [ fail ]  timeout after 900 seconds
              Test ended at 2023-02-06 20:57:42
       
      Test case timeout after 900 seconds
       
      == /dev/shm/var/3/log/alter_table.log == 
        `c` int(11) DEFAULT NULL,
        KEY `a` (`a`)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci COMMENT='new'
      count(*)
      2
      master-bin.000002	#	Query	#	#	use `test`; ALTER TABLE t1 ADD COLUMN c INT, COMMENT "new", rename test2.t2
      crash point: ddl_log_alter_after_rename_triggers
      t2.frm
      t2.ibd
      "Table is in test2"
      Table	Create Table
      t2	CREATE TABLE `t2` (
        `a` int(11) DEFAULT NULL,
        `b` int(11) DEFAULT NULL,
        `c` int(11) DEFAULT NULL,
        KEY `a` (`a`)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci COMMENT='new'
      count(*)
      2
      master-bin.000002	#	Query	#	#	use `test`; ALTER TABLE t1 ADD COLUMN c INT, COMMENT "new", rename test2.t2
       
       == /dev/shm/var/3/tmp/analyze-timeout-mysqld.1.err ==
      mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/dev/shm/var/tmp/3/mysqld.1.sock' (111)
      

      Stack traces of the apparently hung server process are available. As far as I can tell, all threads are idle in all the stack traces that I have checked; at least InnoDB should not be to blame for the hangs. This could be a bug in mtr itself.

          Activity

            For the record, this 10.11-based hang of a MemorySanitizer build from today does not show up in the cross-reference. The log ends like this:

            refs/pull/3045 3dd7b0a80cb674d0773fd0d25ceb8a5183798597

            Completed: Failed 2/6053 tests, 99.97% were successful.
            Failing test(s): perfschema.table_lock_aggregate_thread_4u_2t atomic.alter_table
            

            Again, all threads in the server appear to be idle, waiting for a connection to be established.

            marko Marko Mäkelä added a comment -

            There appear to be more test failures findable via https://buildbot.mariadb.org/cr/ than through the interface that I was using.

            marko Marko Mäkelä added a comment -
            danblack Daniel Black added a comment -

            Looking at the logs on the MSAN builder, this isn't hung; there is simply so much in the test case that it is taking over 900 seconds (especially with MSAN, and occasionally with ASAN).

            CURRENT_TEST: atomic.alter_table
            $ /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd --defaults-group-suffix=.1 --defaults-file=/home/buildbot/amd64-debian-11-msan/build/mysql-test/var/31/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --log-bin=master-bin --innodb-max-dirty-pages-pct=0 --core-file --loose-debug-sync-timeout=300
            2024-03-27 10:37:11 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32246)
            ...
             
            2024-03-27 10:52:12 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
            2024-03-27 10:52:12 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd: ready for connections.
            Version: '11.5.0-MariaDB-debug-log'  socket: '/home/buildbot/amd64-debian-11-msan/build/mysql-test/var/tmp/31/mysqld.1.sock'  port: 16580  Source distribution
            240327 10:52:13 [ERROR] mysqld got signal 6 ;
            Sorry, we probably made a mistake, and this is a bug.
            

            Exactly 900 seconds.


            The failures are apparently becoming more frequent, especially on some debug-instrumented builders. It looks like the test is not actually hanging but only taking an excessive amount of time. The server error log for a recent failure shows that the server was not hung at the time the test was interrupted:

            11.4 9da7b41151bfeb1b4554f0280a3afe69532d39ed

            CURRENT_TEST: atomic.alter_table
            $ /home/buildbot/s390x-ubuntu-2004-debug/build/sql/mariadbd --defaults-group-suffix=.1 --defaults-file=/home/buildbot/s390x-ubuntu-2004-debug/build/mysql-test/var/2/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --log-bin=master-bin --innodb-max-dirty-pages-pct=0 --core-file --loose-debug-sync-timeout=300
            2024-11-18  4:43:58 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32190)
            2024-11-18  4:43:58 0 [Warning] Changed limits: max_open_files: 1024  max_connections: 151 (was 151)  table_cache: 421 (was 2000)
            2024-11-18  4:43:58 0 [Note] Starting MariaDB 11.4.5-MariaDB-debug-log source revision 9da7b41151bfeb1b4554f0280a3afe69532d39ed server_uid aGXWEBIdjhTxAOY5ySL2hEM4Yyw= as process 174512
            …
            2024-11-18  4:59:00 0 [Note] DDL_LOG: Crash recovery executed 1 entries
            2024-11-18  4:59:00 0 [Note] Server socket created on IP: '127.0.0.1'.
            2024-11-18  4:59:00 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
            2024-11-18  4:59:00 0 [Note] /home/buildbot/s390x-ubuntu-2004-debug/build/sql/mariadbd: ready for connections.
            Version: '11.4.5-MariaDB-debug-log'  socket: '/home/buildbot/s390x-ubuntu-2004-debug/build/mysql-test/var/tmp/2/mysqld.1.sock'  port: 19060  Source distribution
            241118  4:59:06 [ERROR] mysqld got signal 6 ;
            …
            sql/signal_handler.cc:247(handle_fatal_signal)[0x2aa1e7bbadc]
            [0x3ffcf7fb6f6]
            /lib/s390x-linux-gnu/libc.so.6(__poll+0x62)[0x3ffa9cf1c92]
            sql/mysqld.cc:6380(handle_connections_sockets())[0x2aa1e0fe118]
            sql/mysqld.cc:5664(run_main_loop())[0x2aa1e0fc734]
            sql/mysqld.cc:6087(mysqld_main(int, char**))[0x2aa1e0fd72a]
            sql/main.cc:34(main)[0x2aa1e0eee64]
            

            Between the start of the test at 4:43:58 and the externally sent SIGABRT at 4:59:06, 15 minutes and 8 seconds had elapsed. The test case timeout is 900 seconds (15 minutes). Later during this test run, the test did pass on retry:

            11.4 9da7b41151bfeb1b4554f0280a3afe69532d39ed

            atomic.alter_table                       w2 [ retry-pass ]  681758
            

            Other runs of this test, especially starting with the 11.7 branch, are timing out on retry as well.

            I think that this test needs to be split into one combination per storage engine, and if possible, the test needs to be refactored to use multiple connections, so that several ALTER TABLE operations can be interrupted with a single kill+restart cycle. That should improve test coverage as well, because currently it appears to test the recovery of at most one in-progress DDL operation at a time.
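            A sketch of the suggested refactoring, in mysqltest style. Everything here is hypothetical — the connection names, tables, and the crash-handling details are illustrative, not taken from the current test:

            ```
            # Hypothetical sketch: start several ALTER TABLE operations on
            # separate connections, then exercise DDL log recovery for all
            # of them with a single kill+restart instead of one restart each.
            --connect (con1,localhost,root,,test)
            --send ALTER TABLE t1 ADD COLUMN c1 INT
            --connect (con2,localhost,root,,test)
            --send ALTER TABLE t2 ADD COLUMN c2 INT
            --connection default
            # Kill the server while both ALTER TABLE operations are in
            # progress; on restart, DDL_LOG recovery has to roll back or
            # complete both of them at once.
            --source include/kill_and_restart_mysqld.inc
            ```

            With N operations pending per restart, the number of kill+restart cycles — the dominant cost on debug-instrumented builders — would drop by roughly a factor of N, and each restart would cover the recovery of several in-progress DDL operations instead of one.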

            marko Marko Mäkelä added a comment -

            wlad suggested that the test be run with --skip-ssl. I will try that.

            marko Marko Mäkelä added a comment -

            Between 11.4 and 11.8, the test atomic.alter_table was made slower by MDEV-19123 (changing the default collation from latin1) and by the addition of vector indexes (MDEV-34939). I’ll explicitly set the collation to latin1 in this test, because even with --skip-ssl the test would otherwise be very close to the border of timing out on MemorySanitizer.
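            For illustration, the collation can be pinned at table level; the exact statements used in the test may differ:

            ```sql
            -- Hypothetical example: name the old default explicitly, so the
            -- test does not exercise the slower utf8mb4 collation paths
            -- that became the default with MDEV-19123.
            CREATE TABLE t1 (a INT, b INT)
            ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci;
            ```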

            marko Marko Mäkelä added a comment -

            Unfortunately, the test still occasionally times out on some IBM System Z (s390x) targets. I believe that the number of server restarts in the test needs to be minimized. Currently, it is using frequent kill+restart cycles for testing only one recovery step at a time.

            marko Marko Mäkelä added a comment -

            People

              marko Marko Mäkelä
              marko Marko Mäkelä

