Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30069

InnoDB: Trying to write ... bytes at ... outside the bounds of the file ...

Details

    Description

      For the dataset in MDEV-30068, a failure occurred during crash recovery with 10.6-based branch:

      2022-11-21  8:59:32 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd
      

      Before the failure, there was one call to fil_system_t::extend_to_recv_size() from recv_sys.apply_log(false).
      The failure would occur during "background recovery":

      #5  0x00005640dd514e7d in fil_invalid_page_access_msg (name=<optimized out>, 
          offset=offset@entry=2359296, len=len@entry=65536, is_read=<optimized out>)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/fil/fil0fil.cc:2639
      #6  0x00005640dd5287e6 in fil_space_t::io (this=this@entry=0x6120000034c0, 
                type=@0x7ffc0c342410: {bpage = 0x45cb7349a870, slot = 0x631000028820, node = 0x0, type = IORequest::WRITE_LRU}, offset=2359296, len=<optimized out>, 
          buf=buf@entry=0x6f6f7afb0000, bpage=bpage@entry=0x45cb7349a870)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/fil/fil0fil.cc:2715
      #7  0x00005640dd41fdf2 in buf_page_t::flush (this=this@entry=0x45cb7349a870, 
          lru=lru@entry=true, space=space@entry=0x6120000034c0)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:900
      #8  0x00005640dd4152a1 in buf_flush_LRU_list_batch (max=max@entry=32, 
          n=n@entry=0x7ffc0c342600)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1299
      #9  0x00005640dd416295 in buf_do_LRU_batch (max=max@entry=32)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1337
      #10 0x00005640dd416590 in buf_flush_LRU (max_n=32)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1678
      #11 0x00005640dd42f58d in buf_LRU_get_free_block (
          have_mutex=have_mutex@entry=false)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0lru.cc:488
      #12 0x00005640dcf678d0 in recv_sys_t::recover_low (
          this=0x5640dec48760 <recv_sys>, page_id=<optimized out>)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/log/log0recv.cc:3155
      #13 0x00005640dd3e4508 in recv_sys_t::recover (page_id=<optimized out>, 
          this=<optimized out>)
          at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/include/log0recv.h:415
      #14 buf_page_get_gen (page_id={m_id = 303}, zip_size=zip_size@entry=0, 
          rw_latch=rw_latch@entry=1, guess=guess@entry=0x45cb7349a870, 
      …
      #25 0x00005640dd2cc19a in trx_resurrect (undo=undo@entry=0x611000043fc0, rseg=rseg@entry=0x5640decf2340 <trx_sys+30080>, start_time=start_time@entry=1669049972, start_time_micro=start_time_micro@entry=284020700207, rows_to_undo=rows_to_undo@entry=0x7ffc0c346840) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/trx/trx0trx.cc:701
      #26 0x00005640dd2cc843 in trx_lists_init_at_db_start () at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/trx/trx0trx.cc:752
      #27 0x00005640dd23d5f0 in srv_start (create_new_db=<optimized out>) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/srv/srv0start.cc:1485
      #28 0x00005640dcdb1952 in innodb_init (p=<optimized out>) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/handler/ha_innodb.cc:4293
      #29 0x00005640dc30fd21 in ha_initialize_handlerton (plugin=0x62100002e010) at /data/Server/bb-10.6-MDEV-29835extra/sql/handler.cc:659
      

      As far as I can tell, the problem should be fixed by extending the tablespace before invoking buf_LRU_get_free_block(), at least in recv_sys_t::recover_low() and possibly elsewhere (definitely when MDEV-14481 is implemented).

      I was not able to reproduce the error for a copy of the data directory, so some additional testing will be needed.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            RQG
            ===
            # git clone https://github.com/mleich1/rqg --branch experimental RQG
            #
            # GIT_SHOW: HEAD -> experimental, origin/experimental ec83f640bab8f5ea31f07b9bc118d508abae3b35 2022-11-17T15:46:44+01:00
            # rqg.pl  : Version 4.0.6 (2022-09)
            #
            # $RQG_HOME/rqg.pl \
            # --grammar=conf/mariadb/table_stress_innodb_nocopy1.yy \
            # --gendata=conf/mariadb/table_stress.zz \
            # --gendata_sql=conf/mariadb/table_stress.sql \
            # --reporters=CrashRecovery1 \
            # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
            # --mysqld=--loose-idle_write_transaction_timeout=0 \
            # --mysqld=--loose-idle_transaction_timeout=0 \
            # --mysqld=--loose-idle_readonly_transaction_timeout=0 \
            # --mysqld=--connect_timeout=60 \
            # --mysqld=--interactive_timeout=28800 \
            # --mysqld=--slave_net_timeout=60 \
            # --mysqld=--net_read_timeout=30 \
            # --mysqld=--net_write_timeout=60 \
            # --mysqld=--loose-table_lock_wait_timeout=50 \
            # --mysqld=--wait_timeout=28800 \
            # --mysqld=--lock-wait-timeout=86400 \
            # --mysqld=--innodb-lock-wait-timeout=50 \
            # --no-mask \
            # --queries=10000000 \
            # --seed=random \
            # --reporters=Backtrace \
            # --reporters=ErrorLog \
            # --reporters=Deadlock1 \
            # --validators=None \
            # --mysqld=--log_output=none \
            # --mysqld=--log_bin_trust_function_creators=1 \
            # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
            # --engine=InnoDB \
            # --restart_timeout=240 \
            # --mysqld=--plugin-load-add=file_key_management.so \
            # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
            # --mysqld=--plugin-load-add=provider_lzo.so \
            # --mysqld=--plugin-load-add=provider_bzip2.so \
            # --mysqld=--plugin-load-add=provider_lzma.so \
            # --mysqld=--plugin-load-add=provider_snappy.so \
            # --mysqld=--plugin-load-add=provider_lz4.so \
            # --duration=300 \
            # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
            # --mysqld=--innodb_file_per_table=1 \
            # --mysqld=--loose-innodb_read_only_compressed=OFF \
            # --mysqld=--innodb_stats_persistent=off \
            # --mysqld=--innodb_adaptive_hash_index=off \
            # --mysqld=--log-bin \
            # --mysqld=--sync-binlog=1 \
            # --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \
            # --mysqld=--loose-max-statement-time=30 \
            # --threads=33 \
            # --mysqld=--innodb-use-native-aio=0 \
            # --mysqld=--loose-gdb \
            # --mysqld=--loose-debug-gdb \
            # --rr=Extended \
            # --rr_options=--chaos --wait \
            # --mysqld=--innodb_rollback_on_timeout=ON \
            # --mysqld=--innodb_page_size=64K \
            # --mysqld=--innodb-buffer-pool-size=24M \
            # <local settings>
             
            sdp:/data/results/1669039949/TBR-1663
            Error pattern
            [ 'TBR-1663', 'INFO: Reporter \'CrashRecovery.\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] \[FATAL\] InnoDB: Trying to write .{1,20} bytes at .{1,20} outside the bounds of the file: .{1,250}\.ibd.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],
            

            mleich Matthias Leich added a comment - RQG === # git clone https://github.com/mleich1/rqg --branch experimental RQG # # GIT_SHOW: HEAD -> experimental, origin/experimental ec83f640bab8f5ea31f07b9bc118d508abae3b35 2022-11-17T15:46:44+01:00 # rqg.pl : Version 4.0.6 (2022-09) # # $RQG_HOME/rqg.pl \ # --grammar=conf/mariadb/table_stress_innodb_nocopy1.yy \ # --gendata=conf/mariadb/table_stress.zz \ # --gendata_sql=conf/mariadb/table_stress.sql \ # --reporters=CrashRecovery1 \ # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \ # --mysqld=--loose-idle_write_transaction_timeout=0 \ # --mysqld=--loose-idle_transaction_timeout=0 \ # --mysqld=--loose-idle_readonly_transaction_timeout=0 \ # --mysqld=--connect_timeout=60 \ # --mysqld=--interactive_timeout=28800 \ # --mysqld=--slave_net_timeout=60 \ # --mysqld=--net_read_timeout=30 \ # --mysqld=--net_write_timeout=60 \ # --mysqld=--loose-table_lock_wait_timeout=50 \ # --mysqld=--wait_timeout=28800 \ # --mysqld=--lock-wait-timeout=86400 \ # --mysqld=--innodb-lock-wait-timeout=50 \ # --no-mask \ # --queries=10000000 \ # --seed=random \ # --reporters=Backtrace \ # --reporters=ErrorLog \ # --reporters=Deadlock1 \ # --validators=None \ # --mysqld=--log_output=none \ # --mysqld=--log_bin_trust_function_creators=1 \ # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \ # --engine=InnoDB \ # --restart_timeout=240 \ # --mysqld=--plugin-load-add=file_key_management.so \ # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \ # --mysqld=--plugin-load-add=provider_lzo.so \ # --mysqld=--plugin-load-add=provider_bzip2.so \ # --mysqld=--plugin-load-add=provider_lzma.so \ # --mysqld=--plugin-load-add=provider_snappy.so \ # --mysqld=--plugin-load-add=provider_lz4.so \ # --duration=300 \ # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \ # --mysqld=--innodb_file_per_table=1 \ # --mysqld=--loose-innodb_read_only_compressed=OFF \ # --mysqld=--innodb_stats_persistent=off \ # --mysqld=--innodb_adaptive_hash_index=off \ # --mysqld=--log-bin \ # --mysqld=--sync-binlog=1 \ # --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \ # --mysqld=--loose-max-statement-time=30 \ # --threads=33 \ # --mysqld=--innodb-use-native-aio=0 \ # --mysqld=--loose-gdb \ # --mysqld=--loose-debug-gdb \ # --rr=Extended \ # --rr_options=--chaos --wait \ # --mysqld=--innodb_rollback_on_timeout=ON \ # --mysqld=--innodb_page_size=64K \ # --mysqld=--innodb-buffer-pool-size=24M \ # <local settings>   sdp:/data/results/1669039949/TBR-1663 Error pattern [ 'TBR-1663', 'INFO: Reporter \'CrashRecovery.\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] \[FATAL\] InnoDB: Trying to write .{1,20} bytes at .{1,20} outside the bounds of the file: .{1,250}\.ibd.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],

            The fix that I initially had in mind is too complex, could cause a performance regression, and actually cannot fix the failure.

            In the rr replay trace, fil_ibd_load() is being invoked multiple times on the file test/t5.ibd, and each time it returns FIL_LOAD_DEFER, which the caller does not make note of (MDEV-25909). Each of those calls also invokes os_file_get_size() but throws away the result.

            The data set that I uploaded to MDEV-30068 actually does recover, because the first page of the file test/t5.ibd had already been written by recovery.

            I should have copied the data directory before the restart attempt. In that data directory, the file test/t5.ibd consists of 448*65536 NUL bytes.

            This problem was caused by MDEV-24626 in MariaDB Server 10.6.

            As far as I can tell, a second attempt to recover the data directory should succeed.

            The FSP_SIZE in page 0 of the file will be recovered to 0x1c0 (448) pages on my system, while in the data directory that I attached to MDEV-30068, it was only recovered to 10 pages. That is what recv_sys_t::recover_deferred() recovered as the file size in the rr replay traces.

            Finally, I was able to reproduce this with data.tar.xz (before the recovery attempt) and encryption_keys.txt and a little cheat:

            diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
            index 9b81aa9307c..73d29692907 100644
            --- a/storage/innobase/handler/ha_innodb.cc
            +++ b/storage/innobase/handler/ha_innodb.cc
            @@ -3821,6 +3821,7 @@ static int innodb_init_params()
             	pages, even for larger pages */
             	MYSQL_SYSVAR_NAME(buffer_pool_size).min_val= min_buffer_pool_size();
             
            +#if 0
             	if (innobase_buffer_pool_size < MYSQL_SYSVAR_NAME(buffer_pool_size).min_val) {
             		ib::error() << "innodb_page_size="
             			<< srv_page_size << " requires "
            @@ -3830,6 +3831,7 @@ static int innodb_init_params()
             			<< "MiB";
             		DBUG_RETURN(HA_ERR_INITIALIZATION);
             	}
            +#endif
             
             #ifndef HAVE_LZ4
             	if (innodb_compression_algorithm == PAGE_LZ4_ALGORITHM) {
            

            mariadbd --plugin-dir=$BUILD_DIR/plugin/file_key_management --plugin-load-add=file_key_management.so --file-key-management-filename=$(pwd)/encryption_keys.txt --innodb_page_size=64K --innodb-log-file-size=96M --innodb-buffer-pool-size=12M --datadir $(pwd)/data
            

            10.6 377ec1b5c906ea55fa2af9ae048a14eb9e1cf201 (patched)

            2022-11-22 12:54:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=48561117,51775692
            2022-11-22 12:54:38 0 [Note] InnoDB: Starting a batch to recover 503 pages from redo log.
            2022-11-22 12:54:38 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd
            

            Like in the rr replay trace, the tablespace size is 10 and the recv_size is 448 pages at this point.

            I think that the problem occurs when a created data file is recovered during an earlier batch of recovery. At that point, we fail to read the file size metadata from the file system.

            marko Marko Mäkelä added a comment - The fix that I initially had in mind is too complex, could cause a performance regression, and actually cannot fix the failure. In the rr replay trace, fil_ibd_load() is being invoked multiple times on the file test/t5.ibd , and each time it returns FIL_LOAD_DEFER , which the caller does not make note of ( MDEV-25909 ). Each of those calls also invokes os_file_get_size() but throws away the result. The data set that I uploaded to MDEV-30068 actually does recover, because the first page of the file test/t5.ibd had already been written by recovery. I should have copied the data directory before the restart attempt. In that data directory, the file test/t5.ibd consists of 448*65536 NUL bytes. This problem was caused by MDEV-24626 in MariaDB Server 10.6. As far as I can tell, a second attempt to recover the data directory should succeed. The FSP_SIZE in page 0 of the file will be recovered to 0x1c0 (448) pages on my system, while in the data directory that I attached to MDEV-30068 , it was only recovered to 10 pages. That is what recv_sys_t::recover_deferred() recovered as the file size in the rr replay traces. Finally, I was able to reproduce this with data.tar.xz (before the recovery attempt) and encryption_keys.txt and a little cheat: diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 9b81aa9307c..73d29692907 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -3821,6 +3821,7 @@ static int innodb_init_params() pages, even for larger pages */ MYSQL_SYSVAR_NAME(buffer_pool_size).min_val= min_buffer_pool_size(); +#if 0 if (innobase_buffer_pool_size < MYSQL_SYSVAR_NAME(buffer_pool_size).min_val) { ib::error() << "innodb_page_size=" << srv_page_size << " requires " @@ -3830,6 +3831,7 @@ static int innodb_init_params() << "MiB"; DBUG_RETURN(HA_ERR_INITIALIZATION); } +#endif #ifndef HAVE_LZ4 if (innodb_compression_algorithm == PAGE_LZ4_ALGORITHM) { mariadbd --plugin-dir=$BUILD_DIR/plugin/file_key_management --plugin-load-add=file_key_management.so --file-key-management-filename=$(pwd)/encryption_keys.txt --innodb_page_size=64K --innodb-log-file-size=96M --innodb-buffer-pool-size=12M --datadir $(pwd)/data 10.6 377ec1b5c906ea55fa2af9ae048a14eb9e1cf201 (patched) 2022-11-22 12:54:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=48561117,51775692 2022-11-22 12:54:38 0 [Note] InnoDB: Starting a batch to recover 503 pages from redo log. 2022-11-22 12:54:38 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd Like in the rr replay trace, the tablespace size is 10 and the recv_size is 448 pages at this point. I think that the problem occurs when a created data file is recovered during an earlier batch of recovery. At that point, we fail to read the file size metadata from the file system.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Attachment data.tar.xz [ 67012 ]
            Attachment encryption_keys.txt [ 67013 ]
            Fix Version/s 10.5 [ 23123 ]
            Affects Version/s 10.5 [ 23123 ]
            Labels rr-profile-analyzed regression-10.6 rr-profile-analyzed
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2022-11-22 13:19:30.0 2022-11-22 13:19:30.642
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6.12 [ 28513 ]
            Fix Version/s 10.7.8 [ 28515 ]
            Fix Version/s 10.8.7 [ 28517 ]
            Fix Version/s 10.9.5 [ 28519 ]
            Fix Version/s 10.10.3 [ 28521 ]
            Fix Version/s 10.11.2 [ 28523 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.11 [ 27614 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            I saw this on s390x builder on Launchpad for MariaDB 10.11.7:

            innodb.undo_truncate '4k,innodb'         w2 [ fail ]  Found warnings/errors in server log file!
                    Test ended at 2024-05-08 03:54:42
            line
            2024-05-08  3:54:18 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 2621440 outside the bounds of the file: .//undo001
            Attempting backtrace. You can use the following information to find out
            ^ Found warnings in /<<PKGBUILDDIR>>/builddir/mysql-test/var/2/log/mysqld.1.err
            ok
            

            Full log: https://launchpadlibrarian.net/728605875/buildlog_ubuntu-noble-s390x.mariadb_1%3A10.11.7-5~bpo24.04.1~1715057361.1833c7608a3+debian.latest_BUILDING.txt.gz

            otto Otto Kekäläinen added a comment - I saw this on s390x builder on Launchpad for MariaDB 10.11.7: innodb.undo_truncate '4k,innodb' w2 [ fail ] Found warnings/errors in server log file! Test ended at 2024-05-08 03:54:42 line 2024-05-08 3:54:18 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 2621440 outside the bounds of the file: .//undo001 Attempting backtrace. You can use the following information to find out ^ Found warnings in /<<PKGBUILDDIR>>/builddir/mysql-test/var/2/log/mysqld.1.err ok Full log: https://launchpadlibrarian.net/728605875/buildlog_ubuntu-noble-s390x.mariadb_1%3A10.11.7-5~bpo24.04.1~1715057361.1833c7608a3+debian.latest_BUILDING.txt.gz

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.