Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30069

InnoDB: Trying to write ... bytes at ... outside the bounds of the file ...

Details

    Description

      For the dataset in MDEV-30068, a failure occurred during crash recovery with 10.6-based branch:

      2022-11-21  8:59:32 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd
      

      Before the failure, there was one call to fil_system_t::extend_to_recv_size() from recv_sys.apply_log(false).
      The failure would occur during "background recovery":

      #5  0x00005640dd514e7d in fil_invalid_page_access_msg (name=<optimized out>, 
          offset=offset@entry=2359296, len=len@entry=65536, is_read=<optimized out>)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/fil/fil0fil.cc:2639
      #6  0x00005640dd5287e6 in fil_space_t::io (this=this@entry=0x6120000034c0, 
                type=@0x7ffc0c342410: {bpage = 0x45cb7349a870, slot = 0x631000028820, node = 0x0, type = IORequest::WRITE_LRU}, offset=2359296, len=<optimized out>, 
          buf=buf@entry=0x6f6f7afb0000, bpage=bpage@entry=0x45cb7349a870)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/fil/fil0fil.cc:2715
      #7  0x00005640dd41fdf2 in buf_page_t::flush (this=this@entry=0x45cb7349a870, 
          lru=lru@entry=true, space=space@entry=0x6120000034c0)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:900
      #8  0x00005640dd4152a1 in buf_flush_LRU_list_batch (max=max@entry=32, 
          n=n@entry=0x7ffc0c342600)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1299
      #9  0x00005640dd416295 in buf_do_LRU_batch (max=max@entry=32)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1337
      #10 0x00005640dd416590 in buf_flush_LRU (max_n=32)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1678
      #11 0x00005640dd42f58d in buf_LRU_get_free_block (
          have_mutex=have_mutex@entry=false)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0lru.cc:488
      #12 0x00005640dcf678d0 in recv_sys_t::recover_low (
          this=0x5640dec48760 <recv_sys>, page_id=<optimized out>)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/log/log0recv.cc:3155
      #13 0x00005640dd3e4508 in recv_sys_t::recover (page_id=<optimized out>, 
          this=<optimized out>)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/include/log0recv.h:415
      #14 buf_page_get_gen (page_id={m_id = 303}, zip_size=zip_size@entry=0, 
          rw_latch=rw_latch@entry=1, guess=guess@entry=0x45cb7349a870, 
      …
      #25 0x00005640dd2cc19a in trx_resurrect (undo=undo@entry=0x611000043fc0, rseg=rseg@entry=0x5640decf2340 <trx_sys+30080>, start_time=start_time@entry=1669049972, start_time_micro=start_time_micro@entry=284020700207, rows_to_undo=rows_to_undo@entry=0x7ffc0c346840) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/trx/trx0trx.cc:701
      #26 0x00005640dd2cc843 in trx_lists_init_at_db_start () at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/trx/trx0trx.cc:752
      #27 0x00005640dd23d5f0 in srv_start (create_new_db=<optimized out>) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/srv/srv0start.cc:1485
      #28 0x00005640dcdb1952 in innodb_init (p=<optimized out>) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/handler/ha_innodb.cc:4293
      #29 0x00005640dc30fd21 in ha_initialize_handlerton (plugin=0x62100002e010) at /data/Server/bb-10.6-MDEV-29835extra/sql/handler.cc:659
      

      As far as I can tell, the problem should be fixed by extending the tablespace before invoking buf_LRU_get_free_block(), at least in recv_sys_t::recover_low() and possibly elsewhere (definitely when MDEV-14481 is implemented).

      I was not able to reproduce the error for a copy of the data directory, so some additional testing will be needed.

      Attachments

        Issue Links

          Activity

            RQG
            ===
            # git clone https://github.com/mleich1/rqg --branch experimental RQG
            #
            # GIT_SHOW: HEAD -> experimental, origin/experimental ec83f640bab8f5ea31f07b9bc118d508abae3b35 2022-11-17T15:46:44+01:00
            # rqg.pl  : Version 4.0.6 (2022-09)
            #
            # $RQG_HOME/rqg.pl \
            # --grammar=conf/mariadb/table_stress_innodb_nocopy1.yy \
            # --gendata=conf/mariadb/table_stress.zz \
            # --gendata_sql=conf/mariadb/table_stress.sql \
            # --reporters=CrashRecovery1 \
            # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
            # --mysqld=--loose-idle_write_transaction_timeout=0 \
            # --mysqld=--loose-idle_transaction_timeout=0 \
            # --mysqld=--loose-idle_readonly_transaction_timeout=0 \
            # --mysqld=--connect_timeout=60 \
            # --mysqld=--interactive_timeout=28800 \
            # --mysqld=--slave_net_timeout=60 \
            # --mysqld=--net_read_timeout=30 \
            # --mysqld=--net_write_timeout=60 \
            # --mysqld=--loose-table_lock_wait_timeout=50 \
            # --mysqld=--wait_timeout=28800 \
            # --mysqld=--lock-wait-timeout=86400 \
            # --mysqld=--innodb-lock-wait-timeout=50 \
            # --no-mask \
            # --queries=10000000 \
            # --seed=random \
            # --reporters=Backtrace \
            # --reporters=ErrorLog \
            # --reporters=Deadlock1 \
            # --validators=None \
            # --mysqld=--log_output=none \
            # --mysqld=--log_bin_trust_function_creators=1 \
            # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
            # --engine=InnoDB \
            # --restart_timeout=240 \
            # --mysqld=--plugin-load-add=file_key_management.so \
            # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
            # --mysqld=--plugin-load-add=provider_lzo.so \
            # --mysqld=--plugin-load-add=provider_bzip2.so \
            # --mysqld=--plugin-load-add=provider_lzma.so \
            # --mysqld=--plugin-load-add=provider_snappy.so \
            # --mysqld=--plugin-load-add=provider_lz4.so \
            # --duration=300 \
            # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
            # --mysqld=--innodb_file_per_table=1 \
            # --mysqld=--loose-innodb_read_only_compressed=OFF \
            # --mysqld=--innodb_stats_persistent=off \
            # --mysqld=--innodb_adaptive_hash_index=off \
            # --mysqld=--log-bin \
            # --mysqld=--sync-binlog=1 \
            # --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \
            # --mysqld=--loose-max-statement-time=30 \
            # --threads=33 \
            # --mysqld=--innodb-use-native-aio=0 \
            # --mysqld=--loose-gdb \
            # --mysqld=--loose-debug-gdb \
            # --rr=Extended \
            # --rr_options=--chaos --wait \
            # --mysqld=--innodb_rollback_on_timeout=ON \
            # --mysqld=--innodb_page_size=64K \
            # --mysqld=--innodb-buffer-pool-size=24M \
            # <local settings>
             
            sdp:/data/results/1669039949/TBR-1663
            Error pattern
            [ 'TBR-1663', 'INFO: Reporter \'CrashRecovery.\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] \[FATAL\] InnoDB: Trying to write .{1,20} bytes at .{1,20} outside the bounds of the file: .{1,250}\.ibd.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],
            

            mleich Matthias Leich added a comment - RQG === # git clone https://github.com/mleich1/rqg --branch experimental RQG # # GIT_SHOW: HEAD -> experimental, origin/experimental ec83f640bab8f5ea31f07b9bc118d508abae3b35 2022-11-17T15:46:44+01:00 # rqg.pl : Version 4.0.6 (2022-09) # # $RQG_HOME/rqg.pl \ # --grammar=conf/mariadb/table_stress_innodb_nocopy1.yy \ # --gendata=conf/mariadb/table_stress.zz \ # --gendata_sql=conf/mariadb/table_stress.sql \ # --reporters=CrashRecovery1 \ # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \ # --mysqld=--loose-idle_write_transaction_timeout=0 \ # --mysqld=--loose-idle_transaction_timeout=0 \ # --mysqld=--loose-idle_readonly_transaction_timeout=0 \ # --mysqld=--connect_timeout=60 \ # --mysqld=--interactive_timeout=28800 \ # --mysqld=--slave_net_timeout=60 \ # --mysqld=--net_read_timeout=30 \ # --mysqld=--net_write_timeout=60 \ # --mysqld=--loose-table_lock_wait_timeout=50 \ # --mysqld=--wait_timeout=28800 \ # --mysqld=--lock-wait-timeout=86400 \ # --mysqld=--innodb-lock-wait-timeout=50 \ # --no-mask \ # --queries=10000000 \ # --seed=random \ # --reporters=Backtrace \ # --reporters=ErrorLog \ # --reporters=Deadlock1 \ # --validators=None \ # --mysqld=--log_output=none \ # --mysqld=--log_bin_trust_function_creators=1 \ # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \ # --engine=InnoDB \ # --restart_timeout=240 \ # --mysqld=--plugin-load-add=file_key_management.so \ # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \ # --mysqld=--plugin-load-add=provider_lzo.so \ # --mysqld=--plugin-load-add=provider_bzip2.so \ # --mysqld=--plugin-load-add=provider_lzma.so \ # --mysqld=--plugin-load-add=provider_snappy.so \ # --mysqld=--plugin-load-add=provider_lz4.so \ # --duration=300 \ # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \ # --mysqld=--innodb_file_per_table=1 \ # --mysqld=--loose-innodb_read_only_compressed=OFF \ # --mysqld=--innodb_stats_persistent=off \ # --mysqld=--innodb_adaptive_hash_index=off \ # --mysqld=--log-bin \ # --mysqld=--sync-binlog=1 \ # --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \ # --mysqld=--loose-max-statement-time=30 \ # --threads=33 \ # --mysqld=--innodb-use-native-aio=0 \ # --mysqld=--loose-gdb \ # --mysqld=--loose-debug-gdb \ # --rr=Extended \ # --rr_options=--chaos --wait \ # --mysqld=--innodb_rollback_on_timeout=ON \ # --mysqld=--innodb_page_size=64K \ # --mysqld=--innodb-buffer-pool-size=24M \ # <local settings>   sdp:/data/results/1669039949/TBR-1663 Error pattern [ 'TBR-1663', 'INFO: Reporter \'CrashRecovery.\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] \[FATAL\] InnoDB: Trying to write .{1,20} bytes at .{1,20} outside the bounds of the file: .{1,250}\.ibd.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],

            The fix that I initially had in mind is too complex, could cause a performance regression, and actually cannot fix the failure.

            In the rr replay trace, fil_ibd_load() is being invoked multiple times on the file test/t5.ibd, and each time it returns FIL_LOAD_DEFER, which the caller does not make note of (MDEV-25909). Each of those calls also invokes os_file_get_size() but throws away the result.

            The data set that I uploaded to MDEV-30068 actually does recover, because the first page of the file test/t5.ibd had already been written by recovery.

            I should have copied the data directory before the restart attempt. In that data directory, the file test/t5.ibd consists of 448*65536 NUL bytes.

            This problem was caused by MDEV-24626 in MariaDB Server 10.6.

            As far as I can tell, a second attempt to recover the data directory should succeed.

            The FSP_SIZE in page 0 of the file will be recovered to 0x1c0 (448) pages on my system, while in the data directory that I attached to MDEV-30068, it was only recovered to 10 pages. That is what recv_sys_t::recover_deferred() recovered as the file size in the rr replay traces.

            Finally, I was able to reproduce this with data.tar.xz (before the recovery attempt) and encryption_keys.txt and a little cheat:

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 9b81aa9307c..73d29692907 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -3821,6 +3821,7 @@ static int innodb_init_params()
             	pages, even for larger pages */
             	MYSQL_SYSVAR_NAME(buffer_pool_size).min_val= min_buffer_pool_size();
             
            +#if 0
             	if (innobase_buffer_pool_size < MYSQL_SYSVAR_NAME(buffer_pool_size).min_val) {
             		ib::error() << "innodb_page_size="
             			<< srv_page_size << " requires "
            @@ -3830,6 +3831,7 @@ static int innodb_init_params()
             			<< "MiB";
             		DBUG_RETURN(HA_ERR_INITIALIZATION);
             	}
            +#endif
             
             #ifndef HAVE_LZ4
             	if (innodb_compression_algorithm == PAGE_LZ4_ALGORITHM) {
            

            mariadbd --plugin-dir=$BUILD_DIR/plugin/file_key_management --plugin-load-add=file_key_management.so --file-key-management-filename=$(pwd)/encryption_keys.txt --innodb_page_size=64K --innodb-log-file-size=96M --innodb-buffer-pool-size=12M --datadir $(pwd)/data
            

            10.6 377ec1b5c906ea55fa2af9ae048a14eb9e1cf201 (patched)

            2022-11-22 12:54:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=48561117,51775692
            2022-11-22 12:54:38 0 [Note] InnoDB: Starting a batch to recover 503 pages from redo log.
            2022-11-22 12:54:38 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd
            

            Like in the rr replay trace, the tablespace size is 10 and the recv_size is 448 pages at this point.

            I think that the problem occurs when a created data file is recovered during an earlier batch of recovery. At that point, we fail to read the file size metadata from the file system.

            marko Marko Mäkelä added a comment - The fix that I initially had in mind is too complex, could cause a performance regression, and actually cannot fix the failure. In the rr replay trace, fil_ibd_load() is being invoked multiple times on the file test/t5.ibd , and each time it returns FIL_LOAD_DEFER , which the caller does not make note of ( MDEV-25909 ). Each of those calls also invokes os_file_get_size() but throws away the result. The data set that I uploaded to MDEV-30068 actually does recover, because the first page of the file test/t5.ibd had already been written by recovery. I should have copied the data directory before the restart attempt. In that data directory, the file test/t5.ibd consists of 448*65536 NUL bytes. This problem was caused by MDEV-24626 in MariaDB Server 10.6. As far as I can tell, a second attempt to recover the data directory should succeed. The FSP_SIZE in page 0 of the file will be recovered to 0x1c0 (448) pages on my system, while in the data directory that I attached to MDEV-30068 , it was only recovered to 10 pages. That is what recv_sys_t::recover_deferred() recovered as the file size in the rr replay traces. Finally, I was able to reproduce this with data.tar.xz (before the recovery attempt) and encryption_keys.txt and a little cheat: diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 9b81aa9307c..73d29692907 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -3821,6 +3821,7 @@ static int innodb_init_params() pages, even for larger pages */ MYSQL_SYSVAR_NAME(buffer_pool_size).min_val= min_buffer_pool_size(); +#if 0 if (innobase_buffer_pool_size < MYSQL_SYSVAR_NAME(buffer_pool_size).min_val) { ib::error() << "innodb_page_size=" << srv_page_size << " requires " @@ -3830,6 +3831,7 @@ static int innodb_init_params() << "MiB"; DBUG_RETURN(HA_ERR_INITIALIZATION); } +#endif #ifndef HAVE_LZ4 if (innodb_compression_algorithm == PAGE_LZ4_ALGORITHM) { mariadbd --plugin-dir=$BUILD_DIR/plugin/file_key_management --plugin-load-add=file_key_management.so --file-key-management-filename=$(pwd)/encryption_keys.txt --innodb_page_size=64K --innodb-log-file-size=96M --innodb-buffer-pool-size=12M --datadir $(pwd)/data 10.6 377ec1b5c906ea55fa2af9ae048a14eb9e1cf201 (patched) 2022-11-22 12:54:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=48561117,51775692 2022-11-22 12:54:38 0 [Note] InnoDB: Starting a batch to recover 503 pages from redo log. 2022-11-22 12:54:38 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd Like in the rr replay trace, the tablespace size is 10 and the recv_size is 448 pages at this point. I think that the problem occurs when a created data file is recovered during an earlier batch of recovery. At that point, we fail to read the file size metadata from the file system.

            I saw this on s390x builder on Launchpad for MariaDB 10.11.7:

            innodb.undo_truncate '4k,innodb'         w2 [ fail ]  Found warnings/errors in server log file!
                    Test ended at 2024-05-08 03:54:42
            line
            2024-05-08  3:54:18 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 2621440 outside the bounds of the file: .//undo001
            Attempting backtrace. You can use the following information to find out
            ^ Found warnings in /<<PKGBUILDDIR>>/builddir/mysql-test/var/2/log/mysqld.1.err
            ok
            

            Full log: https://launchpadlibrarian.net/728605875/buildlog_ubuntu-noble-s390x.mariadb_1%3A10.11.7-5~bpo24.04.1~1715057361.1833c7608a3+debian.latest_BUILDING.txt.gz

            otto Otto Kekäläinen added a comment - I saw this on s390x builder on Launchpad for MariaDB 10.11.7: innodb.undo_truncate '4k,innodb' w2 [ fail ] Found warnings/errors in server log file! Test ended at 2024-05-08 03:54:42 line 2024-05-08 3:54:18 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 2621440 outside the bounds of the file: .//undo001 Attempting backtrace. You can use the following information to find out ^ Found warnings in /<<PKGBUILDDIR>>/builddir/mysql-test/var/2/log/mysqld.1.err ok Full log: https://launchpadlibrarian.net/728605875/buildlog_ubuntu-noble-s390x.mariadb_1%3A10.11.7-5~bpo24.04.1~1715057361.1833c7608a3+debian.latest_BUILDING.txt.gz

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.