  1. MariaDB Server
  2. MDEV-35225

Bogus debug assertion failures in innodb.innodb-32k-crash

    Description

      sanja noted that while running the test innodb.innodb-32k-crash, some debug assertions in the function log_sort_flush_list(), which had been added together with the fix of MDEV-31354, fail rather often, like this:

      10.6 753e7d6d7ce7770d3c98beb6fdcb97e0e8d1ec9f

      innodb.innodb-32k-crash                  w18 [ fail ]
              Test ended at 2024-10-01 10:31:25
      2024-10-01 10:31:25 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1299959,3973825
      2024-10-01 10:31:25 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 3 row operations to undo
      2024-10-01 10:31:25 0 [Note] InnoDB: Trx id counter is 225
      2024-10-01 10:31:25 0 [Note] InnoDB: To recover: 658 pages
      mariadbd: /home/buildbot/amd64-ubuntu-2204-debug-ps/build/storage/innobase/log/log0recv.cc:3658: log_sort_flush_list()::<lambda(const buf_page_t*, const buf_page_t*)>: Assertion `l > 2' failed.
      

      I was able to reproduce this. In the core dump that I analyzed, all 7 members of buf_pool.flush_list carried oldest_modification()==1, that is, the pages had been written back to the file system.

      As noted in MDEV-31354, starting with MDEV-25113 it is possible that buf_page_t::oldest_modification() will be updated to 1 by a thread that is not holding buf_pool.flush_list_mutex. The debug assertions that require the LSN to be above 2 must be revised accordingly. As a slight optimization, when we are copying the sorted list back to buf_pool.flush_list, we can omit such blocks.
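
The idea in the paragraph above can be sketched in standalone C++. This is only an illustration, not the actual InnoDB code: `page_sketch` and `sort_flush_list_sketch()` are hypothetical names, the descending sort order is an assumption of the sketch, and a single-threaded example cannot reproduce the concurrent update itself. It shows a comparator whose assertion tolerates the value 1, followed by a copy-back step that omits blocks that were already written back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical stand-in for buf_page_t; only the LSN matters in this sketch.
struct page_sketch {
  lsn_t oldest_modification; // 1 = already written back, >2 = still dirty
};

// Sort pages by descending oldest_modification (an assumption of this
// sketch), then copy back only the pages that are still dirty.
std::vector<page_sketch> sort_flush_list_sketch(std::vector<page_sketch> list)
{
  std::sort(list.begin(), list.end(),
            [](const page_sketch &lhs, const page_sketch &rhs) {
              const lsn_t l = lhs.oldest_modification;
              const lsn_t r = rhs.oldest_modification;
              // Relaxed check: 1 is acceptable, because another thread
              // may have marked the page as written back.
              assert(l == 1 || l > 2);
              assert(r == 1 || r > 2);
              return r < l;
            });

  // Copy-back step: omit blocks whose oldest_modification is 1.
  std::vector<page_sketch> result;
  for (const page_sketch &p : list)
    if (p.oldest_modification != 1)
      result.push_back(p);
  return result;
}
```

With input LSNs 1, 50, 1, 7, 300, the sketch returns three pages in the order 300, 50, 7. If all pages carry the value 1, as in the analyzed core dump, the result is empty and no assertion fails.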

      The test innodb.innodb-32k-crash also started to fail in another way in 10.6, but not in later versions, due to a bogus debug assertion that was added to recv_recovery_from_checkpoint_start() in MDEV-34830:

      ut_ad(log_sys.get_lsn() >= recv_sys.scanned_lsn);
      

      This assertion may fail when the last mini-transaction in the log was not completely written. In that case, recv_sys.scanned_lsn could be a few 512-byte blocks ahead of recv_sys.recovered_lsn, and it is recv_sys.recovered_lsn that matters. In MDEV-14425, these fields were replaced by recv_sys.lsn and the log block layer was removed; each mini-transaction is a logical log block on its own.
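
The gap between the two values can be illustrated with a simplified model. Everything in it is an assumption made for illustration: `recv_sketch` and `scan_log_sketch()` are hypothetical names, the 512-byte block headers and trailers are ignored, and the real recovery code is considerably more involved. The sketch treats the scanned position as the end of all bytes found in the log, and the recovered position as the end of the last mini-transaction that is completely present.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical stand-in for the two recv_sys fields discussed above.
struct recv_sketch {
  lsn_t scanned_lsn;   // how far the log bytes were scanned
  lsn_t recovered_lsn; // end of the last complete mini-transaction
};

// mtr_sizes: sizes in bytes of the mini-transactions that were intended
// to be written after start_lsn; bytes_written: how many of those bytes
// actually reached the log before the crash.
recv_sketch scan_log_sketch(lsn_t start_lsn,
                            const std::vector<lsn_t> &mtr_sizes,
                            lsn_t bytes_written)
{
  recv_sketch r{start_lsn + bytes_written, start_lsn};
  lsn_t remaining = bytes_written;
  for (lsn_t size : mtr_sizes) {
    if (size > remaining)
      break; // torn mini-transaction: it must not be applied
    r.recovered_lsn += size;
    remaining -= size;
  }
  // The invariant that does hold in this model.
  assert(r.recovered_lsn <= r.scanned_lsn);
  return r;
}
```

For example, starting at LSN 1000 with mini-transactions of 300, 200 and 1500 bytes, of which only 1700 bytes were written, the sketch yields scanned_lsn 2700 and recovered_lsn 1500. A check of the form "current LSN >= scanned_lsn" would then fail if the current LSN was set from the recovered position, even though recovery is correct.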

            People

              marko Marko Mäkelä