MDEV-16136 (MariaDB Server)

Various ASAN failures when testing 10.2/10.3

Details

    Description

      MariaDB 10.3 commit 8b087c63b56408edfae21f3234bae0b5391759b6 (2018-05-09)
      compiled with ASAN.

      I have a fairly simple RQG test that consists mostly of DDL.
      When this test is executed in parallel via combinations.pl with many trials (which puts the box under high load), a significant share of the test runs fail with ASAN errors such as
      SUMMARY: AddressSanitizer: use-after-poison .../storage/innobase/row/row0upd.cc:3422 in row_upd_step(que_thr_t*)
      SUMMARY: AddressSanitizer: use-after-poison .../storage/innobase/trx/trx0purge.cc:224 in trx_purge_add_undo_to_history(trx_t const*, trx_undo_t*&, mtr_t*)
      SUMMARY: AddressSanitizer: use-after-poison .../storage/innobase/trx/trx0purge.cc:226 in trx_purge_add_undo_to_history(trx_t const*, trx_undo_t*&, mtr_t*)

      There were at least 52 distinct ASAN summary lines.
      (grep -h 'SUMMARY: AddressSanitizer: ' last_comb_workdir/trial*.log | sort -u)

      I am aware that

      • a significant fraction of these ASAN failures have already been reported.
        However, those reports often lack a test case that replays the failure quickly.
      • no clear decision about which part of MariaDB is "guilty" (InnoDB, the server, or both) can be made based on the information currently available.
      • there is a significant, though not large, likelihood that the failures seen during testing are caused by
          • exceeding the resources of the OS/testing box -> the server/InnoDB meet conditions that they cannot currently handle well enough -> ....
            At least there are no signs that the OS starts to "attack" the mass of perl processes because of resource shortages or similar.
          • weaknesses in RQG mechanics.
            RQG itself sometimes has problems handling slow-reacting servers/processes.
            Sorry if that turns out to be the case.

      The dilemma is that we need extreme CPU and memory I/O load to get a short bug replay time. On a system with low load, the test passes nearly all the time.

          Activity

            mleich Matthias Leich added a comment:
            No replay within ~100 RQG test runs after adding innodb_stats_persistent=off to the server startup options.
            Usually I get a lot of replays in that number of test runs.
            So innodb_stats_persistent seems to be guilty.

            marko Marko Mäkelä added a comment:
            In MDEV-16781 there is a failure reported during a DDL operation (not during the update of persistent statistics, or dict_stats_exec_sql()). I hope we can repeat that while setting innodb_stats_persistent=off.

            marko Marko Mäkelä added a comment:
            The Pool poisoning added in MDEV-15030 was not thread-safe. After I protected the poisoning and unpoisoning with a common mutex, I was unable to repeat failures with a test case of MDEV-16781.
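
As an illustration of the fix described in this comment, here is a minimal sketch of a fixed-size object pool in which poisoning and unpoisoning happen under the same mutex as the free-list update. It is not the InnoDB Pool code: the class `SketchPool` and the `SKETCH_*` macros are hypothetical names, and the sketch assumes trivially destructible element types. Only `ASAN_POISON_MEMORY_REGION` and `ASAN_UNPOISON_MEMORY_REGION` from `<sanitizer/asan_interface.h>` are real; they are used only when the build is instrumented with AddressSanitizer.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Detect an AddressSanitizer build (GCC defines __SANITIZE_ADDRESS__,
// Clang exposes __has_feature(address_sanitizer)).
#if defined(__SANITIZE_ADDRESS__)
# define SKETCH_ASAN 1
#elif defined(__has_feature)
# if __has_feature(address_sanitizer)
#  define SKETCH_ASAN 1
# endif
#endif

#ifdef SKETCH_ASAN
# include <sanitizer/asan_interface.h>
# define SKETCH_POISON(p, n)   ASAN_POISON_MEMORY_REGION(p, n)
# define SKETCH_UNPOISON(p, n) ASAN_UNPOISON_MEMORY_REGION(p, n)
#else
// Without ASAN the macros are no-ops, so the sketch builds anywhere.
# define SKETCH_POISON(p, n)   ((void) (p), (void) (n))
# define SKETCH_UNPOISON(p, n) ((void) (p), (void) (n))
#endif

// Hypothetical pool; T is assumed to be trivially destructible.
template <typename T>
class SketchPool {
public:
  explicit SketchPool(std::size_t n) : slots_(n) {
    for (T& slot : slots_) {
      free_.push_back(&slot);
      SKETCH_POISON(&slot, sizeof slot);  // free slots must not be touched
    }
  }

  ~SketchPool() {
    // Unpoison everything before the vector releases its storage.
    for (T& slot : slots_) SKETCH_UNPOISON(&slot, sizeof slot);
  }

  // Returns nullptr when the pool is exhausted.
  T* acquire() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (free_.empty()) return nullptr;
    T* slot = free_.back();
    free_.pop_back();
    // Unpoison under the mutex, so no other thread can hand out or
    // re-poison this slot in between.
    SKETCH_UNPOISON(slot, sizeof *slot);
    return slot;
  }

  void release(T* slot) {
    std::lock_guard<std::mutex> guard(mutex_);
    // Poison before the slot becomes visible on the free list.
    SKETCH_POISON(slot, sizeof *slot);
    free_.push_back(slot);
  }

  std::size_t available() {
    std::lock_guard<std::mutex> guard(mutex_);
    return free_.size();
  }

private:
  std::mutex mutex_;
  std::vector<T> slots_;
  std::vector<T*> free_;
};
```

The point of the design is that a slot's poison state and its membership in the free list change atomically with respect to other threads. If the two steps were done outside a common lock, one thread could poison a slot that another thread has just acquired, which ASAN would then report as use-after-poison.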

            marko Marko Mäkelä added a comment:
            There was a race condition in the AddressSanitizer instrumentation that I introduced in MDEV-15030.

            Also, I believe that there was a race condition between trx_reference() and trx_free()/trx_create_low(), but I do not think that it would have triggered ASAN. Furthermore, related to this work, I simplified the memory management of trx->lock.rec_pool and trx->lock.table_pool.

            marko Marko Mäkelä added a comment:
            I had added some unreachable code to trx_reference() in an early attempt at fixing this. That will be reverted in 10.2.27, 10.3.18, and 10.4.8 onwards.

            People

              marko Marko Mäkelä
              mleich Matthias Leich