Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31487

Recovery or backup failure after innodb_undo_log_truncate=ON

Details

    Description

      origin/preview-11.2-preview 79f9e94195dc8a007debd27ec4e8d2c15f094a97 2023-06-13T17:23:05+10:00
      Scenario:
      1. Start source server, generate some initial data
      2. Several sessions run concurrent SQL
      3. During 2. is ongoing
          mariabackup --backup data from source server to some other location
          mariabackup --prepare on that other location
          This fails with
          # 2023-06-13T20:43:22 [73003] | [rr 136876 18773]2023-06-13 20:41:29 0 [Note] InnoDB: Last binlog file './mysql-bin.000001', position 11543118
         # 2023-06-13T20:43:22 [73003] | [rr 136876 18786]2023-06-13 20:41:29 0 [ERROR] InnoDB: Space id and page no stored in the page, read in are [page id: space=50528773, page number=1165950976], should be [page id: space=3, page number=0]
         # 2023-06-13T20:43:22 [73003] | [rr 136876 18810][00] 2023-06-13 20:41:29 Last binlog file ./mysql-bin.000001, position 11543118
         # 2023-06-13T20:43:22 [73003] | [rr 136876 18849]2023-06-13 20:41:29 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './undo001' page [page id: space=1, page number=4]. You may have to recover from a backup.
       
      pluto:/data/results/1686674959/TBR-1929--TBR-1928$ _RR_TRACE_DIR=./1_clone/rr/ rr replay --mark-stdio
       
      RQG
      ====
      # git clone https://github.com/mleich1/rqg --branch <pick the right branch> RQG
      #
      # GIT_SHOW: HEAD -> master, origin/master, origin/HEAD a8c8cd1021480e64ef979d326a2543c577d1068a 2023-06-09T18:22:34+02:00
      # rqg.pl  : Version 4.2.1 (2022-12)
      #
      # $RQG_HOME/rqg.pl \
      # --grammar=conf/engines/many_indexes.yy \
      # --gendata=conf/engines/many_indexes.zz \
      # --max_gd_duration=900 \
      # --reporters=Mariabackup_linux \
      # --mysqld=--loose-innodb-log-file-size=200M \
      # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
      # --mysqld=--loose-idle_write_transaction_timeout=0 \
      # --mysqld=--loose-idle_transaction_timeout=0 \
      # --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      # --mysqld=--connect_timeout=60 \
      # --mysqld=--interactive_timeout=28800 \
      # --mysqld=--slave_net_timeout=60 \
      # --mysqld=--net_read_timeout=30 \
      # --mysqld=--net_write_timeout=60 \
      # --mysqld=--loose-table_lock_wait_timeout=50 \
      # --mysqld=--wait_timeout=28800 \
      # --no_mask \
      # --queries=10000000 \
      # --seed=random \
      # --reporters=None \
      # --reporters=ErrorLog \
      # --reporters=Deadlock \
      # --validators=None \
      # --mysqld=--log_output=none \
      # --mysqld=--log_bin_trust_function_creators=1 \
      # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      # --engine=InnoDB \
      # --restart_timeout=240 \
      # --mysqld=--plugin-load-add=file_key_management.so \
      # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
      # --mysqld=--plugin-load-add=provider_lzo.so \
      # --mysqld=--plugin-load-add=provider_bzip2.so \
      # --mysqld=--plugin-load-add=provider_lzma.so \
      # --mysqld=--plugin-load-add=provider_snappy.so \
      # --mysqld=--plugin-load-add=provider_lz4.so \
      # --mysqld=--loose-innodb_compression_level=1 \
      # --duration=300 \
      # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      # --mysqld=--lock-wait-timeout=86400 \
      # --mysqld=--innodb-lock-wait-timeout=50 \
      # --mysqld=--loose-innodb_read_only_compressed=OFF \
      # --mysqld=--innodb_stats_persistent=on \
      # --mysqld=--innodb_adaptive_hash_index=on \
      # --mysqld=--innodb_random_read_ahead=OFF \
      # --mysqld=--log-bin \
      # --mysqld=--sync-binlog=1 \
      # --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
      # --mysqld=--loose-max-statement-time=30 \
      # --threads=9 \
      # --mysqld=--innodb_use_native_aio=1 \
      # --mysqld=--innodb_undo_log_truncate=ON \
      # --mysqld=--innodb_rollback_on_timeout=OFF \
      # --mysqld=--innodb_page_size=8K \
      # --mysqld=--innodb-buffer-pool-size=256M \
      # <local settings>
      

      Attachments

        1. data.tar.xz
          3.03 MB
          Marko Mäkelä

        Issue Links

          Activity

            I can also reproduce this locally with the data.tar.xz that I copied from the test location. The following patch fixes this:

            diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
            index 82d4ba3afde..9643e029f74 100644
            --- a/storage/innobase/log/log0recv.cc
            +++ b/storage/innobase/log/log0recv.cc
            @@ -1841,7 +1841,7 @@ inline bool page_recv_t::trim(lsn_t start_lsn)
             {
               while (log.head)
               {
            -    if (log.head->lsn > start_lsn) return false;
            +    if (log.head->lsn >= start_lsn) return false;
                 last_offset= 1; /* the next record must not be same_page */
                 log_rec_t *next= log.head->next;
                 recv_sys.free(log.head);
            

            This would fully revert an earlier fix of MDEV-30479.

            I think that instead of applying the above change, we must ensure the correct recovery of the truncation LSN.

            marko Marko Mäkelä added a comment - I can also reproduce this locally with the data.tar.xz that I copied from the test location. The following patch fixes this: diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc index 82d4ba3afde..9643e029f74 100644 --- a/storage/innobase/log/log0recv.cc +++ b/storage/innobase/log/log0recv.cc @@ -1841,7 +1841,7 @@ inline bool page_recv_t::trim(lsn_t start_lsn) { while (log.head) { - if (log.head->lsn > start_lsn) return false; + if (log.head->lsn >= start_lsn) return false; last_offset= 1; /* the next record must not be same_page */ log_rec_t *next= log.head->next; recv_sys.free(log.head); This would fully revert an earlier fix of MDEV-30479 . I think that instead of applying the above change, we must ensure the correct recovery of the truncation LSN.

            Fortunately, the rr replay traces of MDEV-30479 are still accessible. I will reanalyze that and think of a better fix.

            marko Marko Mäkelä added a comment - Fortunately, the rr replay traces of MDEV-30479 are still accessible. I will reanalyze that and think of a better fix.

            thiru’s observations from debugging a different case led me to the following patch:

            diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
            index 82d4ba3afde..b4a5957bb5e 100644
            --- a/storage/innobase/log/log0recv.cc
            +++ b/storage/innobase/log/log0recv.cc
            @@ -2673,9 +2673,9 @@ recv_sys_t::parse_mtr_result recv_sys_t::parse(source &l, bool if_exists)
                                     TRX_SYS_MAX_UNDO_SPACES, "compatibility");
                       /* The entire undo tablespace will be reinitialized by
                       innodb_undo_log_truncate=ON. Discard old log for all pages. */
            -          trim({space_id, 0}, lsn);
            +          trim({space_id, 0}, start_lsn);
                       truncated_undo_spaces[space_id - srv_undo_space_id_start]=
            -            { lsn, page_no };
            +            { start_lsn, page_no };
                       if (!store && undo_space_trunc)
                         undo_space_trunc(space_id);
             #endif
            

            This follows the idea of the MDEV-30479 fix. We must keep track of the start LSN of the mini-transaction that truncated the undo tablespace. The start LSN is also being used for handling mlog_init (MDEV-12699).

            marko Marko Mäkelä added a comment - thiru ’s observations from debugging a different case led me to the following patch: diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc index 82d4ba3afde..b4a5957bb5e 100644 --- a/storage/innobase/log/log0recv.cc +++ b/storage/innobase/log/log0recv.cc @@ -2673,9 +2673,9 @@ recv_sys_t::parse_mtr_result recv_sys_t::parse(source &l, bool if_exists) TRX_SYS_MAX_UNDO_SPACES, "compatibility"); /* The entire undo tablespace will be reinitialized by innodb_undo_log_truncate=ON. Discard old log for all pages. */ - trim({space_id, 0}, lsn); + trim({space_id, 0}, start_lsn); truncated_undo_spaces[space_id - srv_undo_space_id_start]= - { lsn, page_no }; + { start_lsn, page_no }; if (!store && undo_space_trunc) undo_space_trunc(space_id); #endif This follows the idea of the MDEV-30479 fix. We must keep track of the start LSN of the mini-transaction that truncated the undo tablespace. The start LSN is also being used for handling mlog_init ( MDEV-12699 ).

            Hitting that problem was not rare. The corresponding error patterns
            'ERROR: Reporter Mariabackup : First prepare returned.+ Executing backup: .+ Executing first prepare: .+\[ERROR\] InnoDB: Database page corruption on disk or a failed read of file \'\.\/undo.+RESULT: The RQG run ended with status STATUS_BACKUP_FAILURE' ],
             'ERROR: Reporter Mariabackup : First prepare returned.+ Executing backup: .+ Executing first prepare: .+\[ERROR\] InnoDB: Space id and page no stored in the page,.+RESULT: The RQG run ended with status STATUS_BACKUP_FAILURE' ],
             showed usually up together.
            This happened last time on origin/11.2, 11.2 dbc342959233abd0ee70dace85409d042684162d 2023-06-26T16:22:27+02:00 and
            preview-11.2-preview.
            The excessive RQG test runs on
            - 11.2 2867894ac6ca23fc2f82cfad2dd510351f597325 2023-06-28T09:39:28+03:00
            - origin/bb-11.2-MDEV-14795 f4ae37836c3a6632a4284be853675bcbe3f565d5 2023-06-28T15:48:11+05:30
            did not replay these error patterns again.
            So we can assume that the reason for the problem is most probably fixed.
            

            mleich Matthias Leich added a comment - Hitting that problem was not rare. The corresponding error patterns 'ERROR: Reporter Mariabackup : First prepare returned.+ Executing backup: .+ Executing first prepare: .+\[ERROR\] InnoDB: Database page corruption on disk or a failed read of file \'\.\/undo.+RESULT: The RQG run ended with status STATUS_BACKUP_FAILURE' ], 'ERROR: Reporter Mariabackup : First prepare returned.+ Executing backup: .+ Executing first prepare: .+\[ERROR\] InnoDB: Space id and page no stored in the page,.+RESULT: The RQG run ended with status STATUS_BACKUP_FAILURE' ], showed usually up together. This happened last time on origin/11.2, 11.2 dbc342959233abd0ee70dace85409d042684162d 2023-06-26T16:22:27+02:00 and preview-11.2-preview. The excessive RQG test runs on - 11.2 2867894ac6ca23fc2f82cfad2dd510351f597325 2023-06-28T09:39:28+03:00 - origin/bb-11.2-MDEV-14795 f4ae37836c3a6632a4284be853675bcbe3f565d5 2023-06-28T15:48:11+05:30 did not replay these error patterns again. So we can assume that the reason for the problem is most probably fixed.

            Thank you, mleich. I marked the bug MDEV-31488 that thiru analyzed as a duplicate of this one.

            marko Marko Mäkelä added a comment - Thank you, mleich . I marked the bug MDEV-31488 that thiru analyzed as a duplicate of this one.

            People

              marko Marko Mäkelä
              mleich Matthias Leich
              Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.