Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-29593

Purge misses a chance to free not-yet-reused undo pages

Details

    Description

      While reviewing some code to prepare for MDEV-24402, I came across the following.

      In the case where the purge coordinator task invokes trx_purge_remove_log_hdr() outside of trx_purge_free_segment(), it would be possible to free the undo log page, provided that nothing was appended to the page after the purged records.

      Undo log pages may contain records for many unrelated transactions. If no "reuse of cached undo pages" took place, we could simply free the undo page. This could avoid some page writes and improve the performance of backup and recovery. It could also avoid unnecessary growth of the system tablespace or the temporary tablespace when they store undo log pages.

      Attachments

        Issue Links

          Activity

            The logic for this is a little tricky. I think that we can do this in the case when we cannot invoke trx_purge_free_segment() because the state is not TRX_UNDO_TO_PURGE but TRX_UNDO_CACHED and all other conditions (which were recently revised in MDEV-30671) hold.

            In this case, we should find exactly one entry in rseg.undo_cached that points to the last record in the undo log page. That entry as well as the TRX_RSEG_UNDO_SLOTS pointer to the cached undo page will have to be removed.

            The rseg.undo_cached is also being used for temporary undo logs (see trx_undo_assign_low() and trx_undo_reuse_cached()), but there is no purge of history or MVCC for temporary tables. For temporary tables, we could modify trx_undo_commit_cleanup() so that it will never add anything to rseg.undo_cached and instead just free the undo page after commit.

            marko Marko Mäkelä added a comment - The logic for this is a little tricky. I think that we can do this in the case when we cannot invoke trx_purge_free_segment() because the state is not TRX_UNDO_TO_PURGE but TRX_UNDO_CACHED and all other conditions (which were recently revised in MDEV-30671 ) hold. In this case, we should find exactly one entry in rseg.undo_cached that points to the last record in the undo log page. That entry as well as the TRX_RSEG_UNDO_SLOTS pointer to the cached undo page will have to be removed. The rseg.undo_cached is also being used for temporary undo logs (see trx_undo_assign_low() and trx_undo_reuse_cached() ), but there is no purge of history or MVCC for temporary tables. For temporary tables, we could modify trx_undo_commit_cleanup() so that it will never add anything to rseg.undo_cached and instead just free the undo page after commit.

            I figured out that we can actually free the entire segment and not just the last page.

            My attempts at reproducing the impact of this change on the size of the undo log tablespaces have failed so far. For MDEV-26782 it was straightforward, because I tested speed and not the file size. I did measure some speed improvement for this change, but more testing will be needed.

            marko Marko Mäkelä added a comment - I figured out that we can actually free the entire segment and not just the last page. My attempts at reproducing the impact of this change on the size of the undo log tablespaces have failed so far. For MDEV-26782 it was straightforward, because I tested speed and not the file size. I did measure some speed improvement for this change, but more testing will be needed.

            According to one brave user who deployed on some replicas the version that I posted above, this seems to help prevent the undo tablespace growth and the need of innodb_undo_log_truncate=ON.

            In the early patch that I linked in my previous comment, there are two problems. It occasionally detects a missed opportunity to free undo pages:

                case TRX_UNDO_CACHED:
                  /* rseg.undo_cached must point to this page */
                  trx_undo_t *undo= UT_LIST_GET_FIRST(rseg.undo_cached);
                  for (; undo; undo= UT_LIST_GET_NEXT(undo_list, undo))
                    if (undo->top_page_no == hdr_addr.page &&
                        undo->hdr_page_no == hdr_addr.page &&
                        undo->hdr_offset == hdr_addr.boffset)
                      goto found_cached;
                  ut_ad("inconsistent undo logs" == 0);
                  break;
                found_cached:
                  UT_LIST_REMOVE(rseg.undo_cached, undo);
            …
                  goto free_segment;
            

            This is a debug assertion, and it is there on purpose so that I would get examples of that. mleich provided a simple one. At the time the assertion fails, everything else except undo->hdr_offset matches. In the cached undo log, we have an undo->hdr_offset of 2796. The hdr_addr.boffset as well as the undo page header contents shows 120 at this time. When the 2796 was stored in undo->hdr_offset, the undo log records in the page had not been purged yet. The transaction for which the tentative reservation was made, was rolled back later, maybe due to a deadlock or a duplicate key error.

            I think that we must remove the condition undo->hdr_offset == hdr_addr.boffset. I need to think about the conditions on the page number. Maybe one of them would suffice.

            The debug assertion failure itself is ‘harmless’ here, that is, if we do not goto free_segment, the undo log segment will not be freed, just like the logic was before this fix.

            Another problem with the experimental patch was that the rseg_hdr at the end of the function is being restored in an unsafe way. If the block had been evicted from the buffer pool between the mtr.commit() and the reacquisition of the rseg_hdr->page.lock, we would continue operating on a totally wrong page. I think that this problem only affects systems with an extremely small buffer pool.

            marko Marko Mäkelä added a comment - According to one brave user who deployed on some replicas the version that I posted above, this seems to help prevent the undo tablespace growth and the need of innodb_undo_log_truncate=ON . In the early patch that I linked in my previous comment, there are two problems. It occasionally detects a missed opportunity to free undo pages: case TRX_UNDO_CACHED: /* rseg.undo_cached must point to this page */ trx_undo_t *undo= UT_LIST_GET_FIRST(rseg.undo_cached); for (; undo; undo= UT_LIST_GET_NEXT(undo_list, undo)) if (undo->top_page_no == hdr_addr.page && undo->hdr_page_no == hdr_addr.page && undo->hdr_offset == hdr_addr.boffset) goto found_cached; ut_ad( "inconsistent undo logs" == 0); break ; found_cached: UT_LIST_REMOVE(rseg.undo_cached, undo); … goto free_segment; This is a debug assertion, and it is there on purpose so that I would get examples of that. mleich provided a simple one. At the time the assertion fails, everything else except undo->hdr_offset matches. In the cached undo log, we have an undo->hdr_offset of 2796. The hdr_addr.boffset as well as the undo page header contents shows 120 at this time. When the 2796 was stored in undo->hdr_offset , the undo log records in the page had not been purged yet. The transaction for which the tentative reservation was made, was rolled back later, maybe due to a deadlock or a duplicate key error. I think that we must remove the condition undo->hdr_offset == hdr_addr.boffset . I need to think about the conditions on the page number. Maybe one of them would suffice. The debug assertion failure itself is ‘harmless’ here, that is, if we do not goto free_segment , the undo log segment will not be freed, just like the logic was before this fix. Another problem with the experimental patch was that the rseg_hdr at the end of the function is being restored in an unsafe way. If the block had been evicted from the buffer pool between the mtr.commit() and the reacquisition of the rseg_hdr->page.lock , we would continue operating on a totally wrong page. I think that this problem only affects systems with an extremely small buffer pool.

            After analyzing a little more, I believe that the following check should be sufficient:

                  trx_undo_t *undo= UT_LIST_GET_FIRST(rseg.undo_cached);
                  for (; undo; undo= UT_LIST_GET_NEXT(undo_list, undo))
                    if (undo->hdr_page_no == hdr_addr.page)
                      goto found_cached;
                  ut_ad("inconsistent undo logs" == 0);
                  break;
                found_cached:
            

            I analyzed a trace where undo->top_page_no was different. The last access to that page was in purge_coordinator_callback() in the same thread, in trx_purge_get_next_rec(). Before this part of the code, the MDEV-30671 logic is already checking that the entire undo log segment is safe to remove. Basically the only difference in this fix is that we allow not only fully processed TRX_UNDO_TO_PURGE but also fully-processed-and-not-yet-reused TRX_UNDO_CACHED undo log segments to be freed.

            marko Marko Mäkelä added a comment - After analyzing a little more, I believe that the following check should be sufficient: trx_undo_t *undo= UT_LIST_GET_FIRST(rseg.undo_cached); for (; undo; undo= UT_LIST_GET_NEXT(undo_list, undo)) if (undo->hdr_page_no == hdr_addr.page) goto found_cached; ut_ad( "inconsistent undo logs" == 0); break ; found_cached: I analyzed a trace where undo->top_page_no was different. The last access to that page was in purge_coordinator_callback() in the same thread, in trx_purge_get_next_rec() . Before this part of the code, the MDEV-30671 logic is already checking that the entire undo log segment is safe to remove. Basically the only difference in this fix is that we allow not only fully processed TRX_UNDO_TO_PURGE but also fully-processed-and-not-yet-reused TRX_UNDO_CACHED undo log segments to be freed.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              2 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.