[MDEV-36666] atomic.alter_table still times out often - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6, 10.11, 11.4, 11.8, 12.0
Fix Version/s: 10.6.22, 10.11.12, 11.4.6, 11.8.2, 12.0.1
Component/s: Tests
Labels:
None

Description

Test is heavy on server restarts, it became even worse after tests for VECTOR were added. Either split the test or optimise it somehow so that restarts don't take that much time.

Attachments

Issue Links

is part of

MDEV-36647 No red leaves in the forest

Open

relates to

MDEV-33489 The test atomic.alter_table occasionally times out

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2025-04-23 04:49

From its beginning, this test has been constructed as some nested loops that will kill and restart the server at each test step. Furthermore, some of the test step combinations could be entirely redundant, that is, they are only applicable to some particular storage engines.

As far as I understand, currently the test atomic.alter_table is using 2×13×16=416 kill+restart cycles because it is covering both InnoDB and MyISAM.

For MyISAM, most kill+restart steps could be completely unnecessary. In any case, a very low-hanging fruit might be to split MyISAM into a separate atomic.alter_table_myisam test (similar to atomic.alter_table_aria and atomic.alter_table_rocksdb), to halve the number of kill+restart cycles in atomic.alter_table itself.

Furthermore, I would suggest a canonical (music) approach where instead of testing multiple kill+restart cycles across one table definition at time, we would create multiple tables at different stages, and let most kill+restart cycles cover multiple different ALTER TABLE steps.

In this way, instead of using 1 table at a time with 13×16 kill+restart cycles per storage engine, we might create 13 tables and use 16 restarts, or create 13×16=208 tables in equally many concurrent connections, plus one more connection that would kill+restart the server (only one time).

The kill mechanism would have to be rewritten too, replacing things like debug_crash_here("ddl_log_alter_after_create_frm") with DEBUG_SYNC points, so that each crash recovery step would be blocked by their own DEBUG_SYNC point in their own concurrent connection, and the server would be killed and restarted in one more connection.

An early example of this approach is the test innodb.autoinc_persistent, which ~~MDEV-6076~~ introduced in MariaDB Server 10.2. That test is restarting the server 6 times, while the original test in MySQL 8.0 would restart the server 10 times.

Marko Mäkelä added a comment - 2025-04-23 04:49 From its beginning, this test has been constructed as some nested loops that will kill and restart the server at each test step. Furthermore, some of the test step combinations could be entirely redundant, that is, they are only applicable to some particular storage engines. As far as I understand, currently the test atomic.alter_table is using 2×13×16=416 kill+restart cycles because it is covering both InnoDB and MyISAM. For MyISAM, most kill+restart steps could be completely unnecessary. In any case, a very low-hanging fruit might be to split MyISAM into a separate atomic.alter_table_myisam test (similar to atomic.alter_table_aria and atomic.alter_table_rocksdb ), to halve the number of kill+restart cycles in atomic.alter_table itself. Furthermore, I would suggest a canonical (music) approach where instead of testing multiple kill+restart cycles across one table definition at time, we would create multiple tables at different stages, and let most kill+restart cycles cover multiple different ALTER TABLE steps. In this way, instead of using 1 table at a time with 13×16 kill+restart cycles per storage engine, we might create 13 tables and use 16 restarts, or create 13×16=208 tables in equally many concurrent connections, plus one more connection that would kill+restart the server (only one time). The kill mechanism would have to be rewritten too, replacing things like debug_crash_here("ddl_log_alter_after_create_frm") with DEBUG_SYNC points, so that each crash recovery step would be blocked by their own DEBUG_SYNC point in their own concurrent connection, and the server would be killed and restarted in one more connection. An early example of this approach is the test innodb.autoinc_persistent , which MDEV-6076 introduced in MariaDB Server 10.2. That test is restarting the server 6 times, while the original test in MySQL 8.0 would restart the server 10 times.

Sergey Vojtovich added a comment - 6 days ago

I have created https://github.com/MariaDB/server/pull/4013, which, expectedly, takes simpler approach, that is splits MyISAM and InnoDB tests. Also disabled InnoDB where it wasn't required, it makes atomic.alter_table_myisam 3x faster than atomic.alter_table.

I expect these steps to reduce failure frequency 10x at the very least. We can look into those once again in another iteration of "No read leafs in the forest" effort.

I'm unsure how MyISAM can avoid kill+restart steps, but I like the idea with multiple connections.

Sergey Vojtovich added a comment - 6 days ago I have created https://github.com/MariaDB/server/pull/4013 , which, expectedly, takes simpler approach, that is splits MyISAM and InnoDB tests. Also disabled InnoDB where it wasn't required, it makes atomic.alter_table_myisam 3x faster than atomic.alter_table. I expect these steps to reduce failure frequency 10x at the very least. We can look into those once again in another iteration of "No read leafs in the forest" effort. I'm unsure how MyISAM can avoid kill+restart steps, but I like the idea with multiple connections.

Sergey Vojtovich added a comment - 6 days ago

Since you analysed it so deeply, it'd be nice if you could review PR#4013. Feel free to decline review request if you're busy.

Sergey Vojtovich added a comment - 6 days ago Since you analysed it so deeply, it'd be nice if you could review PR#4013 . Feel free to decline review request if you're busy.

Marko Mäkelä added a comment - 6 days ago

The simple approach looks OK. But, I would suggest to apply the change to the earliest applicable branch (10.6) already, so that the test names and their scope will be identical between branches.

Marko Mäkelä added a comment - 6 days ago The simple approach looks OK. But, I would suggest to apply the change to the earliest applicable branch (10.6) already, so that the test names and their scope will be identical between branches.

People

Assignee:: Sergey Vojtovich

Reporter:: Sergey Vojtovich

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2025-04-22 20:47

Updated:: 6 days ago 07:09

Resolved:: 6 days ago 07:09

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.