[MDEV-16136] Various ASAN failures when testing 10.2/10.3

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2.15, 10.3.6
Fix Version/s: 10.2.18, 10.3.10
Component/s: Storage Engine - InnoDB
Labels:
- affects-tests
Environment:
Ubuntu 17.04 but I assume this is not important.

Description

MariaDB 10.3 commit 8b087c63b56408edfae21f3234bae0b5391759b6 (2018-05-09)
compiled with ASAN.

I have a rather simple RQG test containing mostly DDL.
When this test is executed in parallel via combinations.pl (which puts the box under high load) with many trials, a significant share of the test runs fail with ASAN failures like
SUMMARY: AddressSanitizer: use-after-poison .../storage/innobase/row/row0upd.cc:3422 in row_upd_step(que_thr_t*)
SUMMARY: AddressSanitizer: use-after-poison .../storage/innobase/trx/trx0purge.cc:224 in trx_purge_add_undo_to_history(trx_t const*, trx_undo_t*&, mtr_t*)
SUMMARY: AddressSanitizer: use-after-poison .../storage/innobase/trx/trx0purge.cc:226 in trx_purge_add_undo_to_history(trx_t const*, trx_undo_t*&, mtr_t*)

There were at least 52 distinct ASAN summary lines.
(grep -h 'SUMMARY: AddressSanitizer: ' last_comb_workdir/trial*.log | sort -u)

I am aware that

- a significant fraction of these ASAN failures are already reported, but these reports often lack a fast replay testcase.
- no clear decision about which part of MariaDB is "guilty" (InnoDB, the server, or both) can be made based on the information currently available.
- there is some significant but not big likelihood that the failures reported during testing are caused by
  - exceeding OS/testing box resources -> server/InnoDB meet conditions they cannot currently handle well enough -> ....
    There are at least no signs that the OS starts to "attack" the mass of perl processes because of resource shortages or similar.
  - weaknesses in RQG mechanics.
    Basically, RQG also sometimes has problems handling slowly reacting servers/processes.

Sorry in case that is valid.
The dilemma is that we need extreme CPU and memory IO load to get a short bug replay time etc. On a system with low load the test passes nearly all the time.

Issue Links

is blocked by: MDEV-16063 [Draft] ASAN use-after-poison in row_sel / row_sel_step / que_thr_step (Closed)
is caused by: MDEV-15030 Add ASAN instrumentation (Closed)
relates to: MDEV-16781 InnoDB: AddressSanitizer: use-after-poison during DDL (Closed)

Activity

Matthias Leich added a comment - 2018-05-15 12:38

No replay within ~100 RQG test runs after adding innodb_stats_persistent=off to the server startup options.
Usually I get many replays in that number of test runs.
So innodb_stats_persistent seems to be guilty.

Marko Mäkelä added a comment - 2018-07-26 06:12

In MDEV-16781 there is a failure reported during a DDL operation (not during the update of persistent statistics, or dict_stats_exec_sql()). I hope we can repeat that while setting innodb_stats_persistent=off.

Marko Mäkelä added a comment - 2018-08-15 15:58

The Pool poisoning added in MDEV-15030 was not thread-safe. After I protected the poisoning and unpoisoning with a common mutex, I was unable to repeat failures with a test case of MDEV-16781.
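
For readers unfamiliar with manual ASAN poisoning, the idea of serializing poison/unpoison calls can be sketched roughly as follows. This is a minimal illustration and not the actual InnoDB patch: `SlotPool`, `get()`, `put()`, `free_slots()` and `demo_concurrent_use()` are hypothetical names, and the real Pool code may be structured quite differently. Only `ASAN_POISON_MEMORY_REGION` and `ASAN_UNPOISON_MEMORY_REGION` come from the sanitizer's `asan_interface.h`; they are replaced by no-ops here when the code is built without ASAN.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Use the real ASAN macros when building with -fsanitize=address;
// otherwise fall back to no-ops so the sketch compiles anywhere.
#if defined(__SANITIZE_ADDRESS__)
# include <sanitizer/asan_interface.h>
#elif defined(__has_feature)
# if __has_feature(address_sanitizer)
#  include <sanitizer/asan_interface.h>
# endif
#endif
#ifndef ASAN_POISON_MEMORY_REGION
# define ASAN_POISON_MEMORY_REGION(addr, size) ((void) (addr), (void) (size))
# define ASAN_UNPOISON_MEMORY_REGION(addr, size) ((void) (addr), (void) (size))
#endif

// Hypothetical fixed-size slot pool (illustrative only, not InnoDB's Pool).
class SlotPool {
public:
  SlotPool(std::size_t slot_size, std::size_t n_slots)
      : slot_size_(slot_size), storage_(slot_size * n_slots) {
    for (std::size_t i = 0; i < n_slots; i++) {
      free_.push_back(storage_.data() + i * slot_size);
    }
    // All slots start out free, so nobody may touch them yet.
    ASAN_POISON_MEMORY_REGION(storage_.data(), storage_.size());
  }

  ~SlotPool() {
    // Unpoison everything before the vector releases the memory.
    ASAN_UNPOISON_MEMORY_REGION(storage_.data(), storage_.size());
  }

  // Hand out a slot, or nullptr if the pool is exhausted.
  void* get() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (free_.empty()) return nullptr;
    char* slot = free_.back();
    free_.pop_back();
    // Unpoisoning happens under the same mutex as poisoning in put().
    ASAN_UNPOISON_MEMORY_REGION(slot, slot_size_);
    return slot;
  }

  // Return a slot to the pool and make any later access an ASAN error.
  void put(void* slot) {
    std::lock_guard<std::mutex> guard(mutex_);
    ASAN_POISON_MEMORY_REGION(slot, slot_size_);
    free_.push_back(static_cast<char*>(slot));
  }

  std::size_t free_slots() {
    std::lock_guard<std::mutex> guard(mutex_);
    return free_.size();
  }

private:
  std::size_t slot_size_;
  std::vector<char> storage_;
  std::vector<char*> free_;
  std::mutex mutex_;  // the "common mutex" guarding poison and unpoison
};

// Exercise the pool from several threads and return the number of free
// slots left afterwards (every slot taken is put back).
std::size_t demo_concurrent_use() {
  SlotPool pool(64, 8);
  std::vector<std::thread> workers;
  for (int t = 0; t < 4; t++) {
    workers.emplace_back([&pool] {
      for (int i = 0; i < 10000; i++) {
        if (void* slot = pool.get()) {
          static_cast<char*>(slot)[0] = 1;  // legal: slot is unpoisoned
          pool.put(slot);
        }
      }
    });
  }
  for (auto& w : workers) w.join();
  return pool.free_slots();
}
```

In this sketch the slot size is a multiple of 8 bytes because ASAN tracks poisoning at 8-byte granularity. Building with `-fsanitize=address` and touching a slot after `put()` should produce a use-after-poison report of the kind quoted in the description.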

Marko Mäkelä added a comment - 2018-08-16 03:49

There was a race condition in the AddressSanitizer instrumentation that I introduced in MDEV-15030.

Also, I believe that there was a race condition between trx_reference() and trx_free()/trx_create_low(), but I do not think that it would have triggered ASAN. Furthermore, related to this work, I simplified the memory management of trx->lock.rec_pool and trx->lock.table_pool.

Marko Mäkelä added a comment - 2019-08-27 13:54

I had added some unreachable code to trx_reference() in an early attempt at fixing this. That will be reverted in 10.2.27, 10.3.18, 10.4.8 onwards.

People

Assignee: Marko Mäkelä
Reporter: Matthias Leich
Votes: 0
Watchers: 4

Dates

Created: 2018-05-10 13:00
Updated: 2019-08-27 13:54
Resolved: 2018-08-16 03:49
