  1. MariaDB Server
  2. MDEV-33213

History list is not shrunk unless there is a pause in the workload

Details

    Description

      While testing MDEV-32050 and its follow-up changes MDEV-33009 and MDEV-33112, axel pointed out that the InnoDB history list is not being shrunk during a workload.

      Some initial debugging indicates that the purge_truncation_task (which was added in MDEV-32050) is being invoked but it is not doing anything. Also, the purge_sys.view is advancing; there are no old read views that would block the purge of history. Throttling the Sysbench workload for a few seconds would allow the history list to shrink immediately.

      Attachments

        1. 10.6-MDEV-33213.pdf
          107 kB
        2. 12x5.pdf
          58 kB
        3. 24x5_high_threads_pausing.pdf
          59 kB
        4. 24x5_high_threads.pdf
          56 kB
        5. 24x5.pdf
          59 kB
        6. no_pause.png
          36 kB
        7. with_pause.png
          47 kB

          Activity

            TheWitness Larry Adams added a comment -

            Marko,

            I've upgraded to 10.6.17-13 this morning. I'm not seeing any increased history flushing, but at least it's flushing now. I've opened a ticket with support to see if there is a way to speed it up though as the number keeps going up incrementally.

            Larry

            TheWitness Larry Adams added a comment -

            I tried

            SET GLOBAL innodb_max_purge_lag_wait=0;

            But that command just hung for almost 10 minutes before I killed it.


            marko Marko Mäkelä added a comment -

            Hi Larry, let’s follow up on this in the support ticket. One thing that we need to be aware of is that when SHOW ENGINE INNODB STATUS shows any ACTIVE or PREPARED transactions with a transaction identifier or with a read view, then purge will be unable to proceed beyond that, and SET GLOBAL innodb_max_purge_lag_wait=0 will hang. In the worst case, someone starts a connection with START TRANSACTION WITH CONSISTENT SNAPSHOT and walks away.

            That said, it is possible that there are no active transactions or read views, and the purge is really taking extremely long to clean up the queue. Implementing MDEV-17598 should help in case the reason is secondary indexes.
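The check described above (look for ACTIVE or PREPARED transactions before running SET GLOBAL innodb_max_purge_lag_wait=0) can be sketched as a small text filter over saved SHOW ENGINE INNODB STATUS output. This is only an illustration and not part of the ticket: the function name summarize_purge_blockers is hypothetical, and the line formats it matches ("History list length N", "---TRANSACTION <id>, ACTIVE N sec", "ACTIVE (PREPARED) N sec") are assumptions about the status output that should be verified against your server version.

```python
import re

# Assumed format of the history list line, e.g. "History list length 1234".
_HISTORY_RE = re.compile(r"^History list length (\d+)\s*$", re.MULTILINE)

# Assumed format of a transaction header line, e.g.
#   "---TRANSACTION 1001, ACTIVE 3 sec"
#   "---TRANSACTION 900, ACTIVE (PREPARED) 120 sec"
# Lines such as "---TRANSACTION ..., not started" do not match.
_TRX_RE = re.compile(
    r"^---TRANSACTION ([^,\s]+), ACTIVE (\(PREPARED\) )?(\d+) sec",
    re.MULTILINE,
)


def summarize_purge_blockers(status_text):
    """Return (history_list_length, blockers) from status output text.

    history_list_length is None if no matching line is found.
    blockers is a list of dicts for ACTIVE / PREPARED transactions,
    sorted so that the longest-running transaction comes first.
    """
    match = _HISTORY_RE.search(status_text)
    history = int(match.group(1)) if match else None

    blockers = [
        {
            "trx_id": trx.group(1),            # kept as text; may not be numeric
            "prepared": trx.group(2) is not None,
            "age_sec": int(trx.group(3)),
        }
        for trx in _TRX_RE.finditer(status_text)
    ]
    blockers.sort(key=lambda b: b["age_sec"], reverse=True)
    return history, blockers
```

If the returned list is non-empty and its first entry has a large age_sec, that transaction is the first candidate to investigate before concluding that purge itself is slow. An empty list corresponds to the second case in the comment above, where purge is simply taking long to work through the queue.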

            ch701 chad ambrosius added a comment -

            I recently upgraded many servers from 10.3.12 to 10.6.15, at which point I noticed that the history list was not shrinking as it always had on 10.3.12. I found this bug, which seems to describe the situation pretty well. I upgraded to 10.6.17 and tried many variations of the innodb_* parameters described above and in other tickets. None seemed to have any effect. I also tried versions 10.11.8 and 11.4.2. The InnoDB undo logs still do not seem to be getting purged in those versions. If I stop non-replication traffic, the history list length drops very rapidly. Only the busiest hosts in my infrastructure can't keep up with the undo log purging. However, on 10.3.12 this was never an issue. It feels like the rows in the undo log must already be delete-marked, or query performance would be really bad at such large history lengths. So the only impact I'm feeling right now is unbounded disk space growth (or the hassle of juggling hosts in and out of rotation, including the primary).

            marko, are you sure this issue is fully resolved? Thank you!

            marko Marko Mäkelä added a comment -

            ch701, unfortunately MariaDB 10.6.16 and 10.6.17 are affected by MDEV-33819, which was introduced in MDEV-32050. However, the fix for MDEV-33819 should be present in the 10.11.8 and 11.4.2 releases, which you also tested.

            MariaDB Server 10.3 is missing some correctness fixes such as MDEV-31355. It is possible that history is purged prematurely, while some old transactions could still access the history. I believe that the bug could lead to a situation where the undo log page has been reused and trx_undo_rec_copy() would attempt to copy data at an incorrect offset, and ultimately cause a crash. MDEV-11044 is an example of that, although this particular problematic undo page access was not in an MVCC read but in the purge of history itself. Perhaps that bug was actually fixed by MDEV-22388.

            I am currently working on a performance regression in this area, MDEV-34515. The test case is artificial: a frequently updated single-row table with a secondary index. But there clearly is some room for improvement.

            People

              marko Marko Mäkelä