[MDEV-30069] InnoDB: Trying to write ... bytes at ... outside the bounds of the file ... - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
Fix Version/s: 10.11.2, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3
Component/s: Storage Engine - InnoDB
Labels:
- regression-10.6
- rr-profile-analyzed

Description

For the dataset in ~~MDEV-30068~~, a failure occurred during crash recovery with 10.6-based branch:

2022-11-21  8:59:32 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd

Before the failure, there was one call to fil_system_t::extend_to_recv_size() from recv_sys.apply_log(false).
The failure would occur during "background recovery":

#5  0x00005640dd514e7d in fil_invalid_page_access_msg (name=<optimized out>,

    offset=offset@entry=2359296, len=len@entry=65536, is_read=<optimized out>)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/fil/fil0fil.cc:2639

#6  0x00005640dd5287e6 in fil_space_t::io (this=this@entry=0x6120000034c0,

          type=@0x7ffc0c342410: {bpage = 0x45cb7349a870, slot = 0x631000028820, node = 0x0, type = IORequest::WRITE_LRU}, offset=2359296, len=<optimized out>,

    buf=buf@entry=0x6f6f7afb0000, bpage=bpage@entry=0x45cb7349a870)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/fil/fil0fil.cc:2715

#7  0x00005640dd41fdf2 in buf_page_t::flush (this=this@entry=0x45cb7349a870,

    lru=lru@entry=true, space=space@entry=0x6120000034c0)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:900

#8  0x00005640dd4152a1 in buf_flush_LRU_list_batch (max=max@entry=32,

    n=n@entry=0x7ffc0c342600)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1299

#9  0x00005640dd416295 in buf_do_LRU_batch (max=max@entry=32)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1337

#10 0x00005640dd416590 in buf_flush_LRU (max_n=32)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0flu.cc:1678

#11 0x00005640dd42f58d in buf_LRU_get_free_block (

    have_mutex=have_mutex@entry=false)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/buf/buf0lru.cc:488

#12 0x00005640dcf678d0 in recv_sys_t::recover_low (

    this=0x5640dec48760 <recv_sys>, page_id=<optimized out>)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/log/log0recv.cc:3155

#13 0x00005640dd3e4508 in recv_sys_t::recover (page_id=<optimized out>,

    this=<optimized out>)

    at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/include/log0recv.h:415

#14 buf_page_get_gen (page_id={m_id = 303}, zip_size=zip_size@entry=0,

    rw_latch=rw_latch@entry=1, guess=guess@entry=0x45cb7349a870,

…

#25 0x00005640dd2cc19a in trx_resurrect (undo=undo@entry=0x611000043fc0, rseg=rseg@entry=0x5640decf2340 <trx_sys+30080>, start_time=start_time@entry=1669049972, start_time_micro=start_time_micro@entry=284020700207, rows_to_undo=rows_to_undo@entry=0x7ffc0c346840) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/trx/trx0trx.cc:701

#26 0x00005640dd2cc843 in trx_lists_init_at_db_start () at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/trx/trx0trx.cc:752

#27 0x00005640dd23d5f0 in srv_start (create_new_db=<optimized out>) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/srv/srv0start.cc:1485

#28 0x00005640dcdb1952 in innodb_init (p=<optimized out>) at /data/Server/bb-10.6-MDEV-29835extra/storage/innobase/handler/ha_innodb.cc:4293

#29 0x00005640dc30fd21 in ha_initialize_handlerton (plugin=0x62100002e010) at /data/Server/bb-10.6-MDEV-29835extra/sql/handler.cc:659

As far as I can tell, the problem should be fixed by extending the tablespace before invoking buf_LRU_get_free_block(), at least in recv_sys_t::recover_low() and possibly elsewhere (definitely when ~~MDEV-14481~~ is implemented).

I was not able to reproduce the error for a copy of the data directory, so some additional testing will be needed.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

encryption_keys.txt
0.4 kB
2022-11-22 10:54
data.tar.xz
4.54 MB
2022-11-22 10:54

Issue Links

is caused by

MDEV-24626 Remove synchronous write of page0 and flushing file during file creation

Closed

relates to

MDEV-30132 Crash after recovery, with InnoDB: Tried to read ... bytes at offset

Closed

MDEV-11556 InnoDB redo log apply fails to adjust data file sizes

Closed

MDEV-14481 Execute InnoDB crash recovery in the background

Closed

MDEV-30068 Confusing error message when encryption is not available on recovery

Closed

Activity

Ascending order - Click to sort in descending order

Matthias Leich added a comment - 2022-11-22 09:32

RQG

===

# git clone https://github.com/mleich1/rqg --branch experimental RQG

# GIT_SHOW: HEAD -> experimental, origin/experimental ec83f640bab8f5ea31f07b9bc118d508abae3b35 2022-11-17T15:46:44+01:00

# rqg.pl  : Version 4.0.6 (2022-09)

# $RQG_HOME/rqg.pl \

# --grammar=conf/mariadb/table_stress_innodb_nocopy1.yy \

# --gendata=conf/mariadb/table_stress.zz \

# --gendata_sql=conf/mariadb/table_stress.sql \

# --reporters=CrashRecovery1 \

# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \

# --mysqld=--loose-idle_write_transaction_timeout=0 \

# --mysqld=--loose-idle_transaction_timeout=0 \

# --mysqld=--loose-idle_readonly_transaction_timeout=0 \

# --mysqld=--connect_timeout=60 \

# --mysqld=--interactive_timeout=28800 \

# --mysqld=--slave_net_timeout=60 \

# --mysqld=--net_read_timeout=30 \

# --mysqld=--net_write_timeout=60 \

# --mysqld=--loose-table_lock_wait_timeout=50 \

# --mysqld=--wait_timeout=28800 \

# --mysqld=--lock-wait-timeout=86400 \

# --mysqld=--innodb-lock-wait-timeout=50 \

# --no-mask \

# --queries=10000000 \

# --seed=random \

# --reporters=Backtrace \

# --reporters=ErrorLog \

# --reporters=Deadlock1 \

# --validators=None \

# --mysqld=--log_output=none \

# --mysqld=--log_bin_trust_function_creators=1 \

# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \

# --engine=InnoDB \

# --restart_timeout=240 \

# --mysqld=--plugin-load-add=file_key_management.so \

# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \

# --mysqld=--plugin-load-add=provider_lzo.so \

# --mysqld=--plugin-load-add=provider_bzip2.so \

# --mysqld=--plugin-load-add=provider_lzma.so \

# --mysqld=--plugin-load-add=provider_snappy.so \

# --mysqld=--plugin-load-add=provider_lz4.so \

# --duration=300 \

# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \

# --mysqld=--innodb_file_per_table=1 \

# --mysqld=--loose-innodb_read_only_compressed=OFF \

# --mysqld=--innodb_stats_persistent=off \

# --mysqld=--innodb_adaptive_hash_index=off \

# --mysqld=--log-bin \

# --mysqld=--sync-binlog=1 \

# --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \

# --mysqld=--loose-max-statement-time=30 \

# --threads=33 \

# --mysqld=--innodb-use-native-aio=0 \

# --mysqld=--loose-gdb \

# --mysqld=--loose-debug-gdb \

# --rr=Extended \

# --rr_options=--chaos --wait \

# --mysqld=--innodb_rollback_on_timeout=ON \

# --mysqld=--innodb_page_size=64K \

# --mysqld=--innodb-buffer-pool-size=24M \

# <local settings>

sdp:/data/results/1669039949/TBR-1663

Error pattern

[ 'TBR-1663', 'INFO: Reporter \'CrashRecovery.\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] \[FATAL\] InnoDB: Trying to write .{1,20} bytes at .{1,20} outside the bounds of the file: .{1,250}\.ibd.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],

Matthias Leich added a comment - 2022-11-22 09:32 RQG === # git clone https://github.com/mleich1/rqg --branch experimental RQG # # GIT_SHOW: HEAD -> experimental, origin/experimental ec83f640bab8f5ea31f07b9bc118d508abae3b35 2022-11-17T15:46:44+01:00 # rqg.pl : Version 4.0.6 (2022-09) # # $RQG_HOME/rqg.pl \ # --grammar=conf/mariadb/table_stress_innodb_nocopy1.yy \ # --gendata=conf/mariadb/table_stress.zz \ # --gendata_sql=conf/mariadb/table_stress.sql \ # --reporters=CrashRecovery1 \ # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \ # --mysqld=--loose-idle_write_transaction_timeout=0 \ # --mysqld=--loose-idle_transaction_timeout=0 \ # --mysqld=--loose-idle_readonly_transaction_timeout=0 \ # --mysqld=--connect_timeout=60 \ # --mysqld=--interactive_timeout=28800 \ # --mysqld=--slave_net_timeout=60 \ # --mysqld=--net_read_timeout=30 \ # --mysqld=--net_write_timeout=60 \ # --mysqld=--loose-table_lock_wait_timeout=50 \ # --mysqld=--wait_timeout=28800 \ # --mysqld=--lock-wait-timeout=86400 \ # --mysqld=--innodb-lock-wait-timeout=50 \ # --no-mask \ # --queries=10000000 \ # --seed=random \ # --reporters=Backtrace \ # --reporters=ErrorLog \ # --reporters=Deadlock1 \ # --validators=None \ # --mysqld=--log_output=none \ # --mysqld=--log_bin_trust_function_creators=1 \ # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \ # --engine=InnoDB \ # --restart_timeout=240 \ # --mysqld=--plugin-load-add=file_key_management.so \ # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \ # --mysqld=--plugin-load-add=provider_lzo.so \ # --mysqld=--plugin-load-add=provider_bzip2.so \ # --mysqld=--plugin-load-add=provider_lzma.so \ # --mysqld=--plugin-load-add=provider_snappy.so \ # --mysqld=--plugin-load-add=provider_lz4.so \ # --duration=300 \ # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \ # --mysqld=--innodb_file_per_table=1 \ # --mysqld=--loose-innodb_read_only_compressed=OFF \ # --mysqld=--innodb_stats_persistent=off \ # --mysqld=--innodb_adaptive_hash_index=off \ # --mysqld=--log-bin \ # --mysqld=--sync-binlog=1 \ # --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \ # --mysqld=--loose-max-statement-time=30 \ # --threads=33 \ # --mysqld=--innodb-use-native-aio=0 \ # --mysqld=--loose-gdb \ # --mysqld=--loose-debug-gdb \ # --rr=Extended \ # --rr_options=--chaos --wait \ # --mysqld=--innodb_rollback_on_timeout=ON \ # --mysqld=--innodb_page_size=64K \ # --mysqld=--innodb-buffer-pool-size=24M \ # <local settings> sdp:/data/results/1669039949/TBR-1663 Error pattern [ 'TBR-1663', 'INFO: Reporter \'CrashRecovery.\': Attempting database recovery using the server.+\[Note\] InnoDB: Starting crash recovery from checkpoint LSN.+\[ERROR\] \[FATAL\] InnoDB: Trying to write .{1,20} bytes at .{1,20} outside the bounds of the file: .{1,250}\.ibd.+RESULT: The RQG run ended with status STATUS_RECOVERY_FAILURE' ],

Marko Mäkelä added a comment - 2022-11-22 11:00

The fix that I initially had in mind is too complex, could cause a performance regression, and actually cannot fix the failure.

In the rr replay trace, fil_ibd_load() is being invoked multiple times on the file test/t5.ibd, and each time it returns FIL_LOAD_DEFER, which the caller does not make note of (MDEV-25909). Each of those calls also invokes os_file_get_size() but throws away the result.

The data set that I uploaded to ~~MDEV-30068~~ actually does recover, because the first page of the file test/t5.ibd had already been written by recovery.

I should have copied the data directory before the restart attempt. In that data directory, the file test/t5.ibd consists of 448*65536 NUL bytes.

This problem was caused by ~~MDEV-24626~~ in MariaDB Server 10.6.

As far as I can tell, a second attempt to recover the data directory should succeed.

The FSP_SIZE in page 0 of the file will be recovered to 0x1c0 (448) pages on my system, while in the data directory that I attached to ~~MDEV-30068~~, it was only recovered to 10 pages. That is what recv_sys_t::recover_deferred() recovered as the file size in the rr replay traces.

Finally, I was able to reproduce this with data.tar.xz (before the recovery attempt) and encryption_keys.txt and a little cheat:

diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc

index 9b81aa9307c..73d29692907 100644

--- a/storage/innobase/handler/ha_innodb.cc

+++ b/storage/innobase/handler/ha_innodb.cc

@@ -3821,6 +3821,7 @@ static int innodb_init_params()

 	pages, even for larger pages */

 	MYSQL_SYSVAR_NAME(buffer_pool_size).min_val= min_buffer_pool_size();

+#if 0

 	if (innobase_buffer_pool_size < MYSQL_SYSVAR_NAME(buffer_pool_size).min_val) {

 		ib::error() << "innodb_page_size="

 			<< srv_page_size << " requires "

@@ -3830,6 +3831,7 @@ static int innodb_init_params()

 			<< "MiB";

 		DBUG_RETURN(HA_ERR_INITIALIZATION);

+#endif

 #ifndef HAVE_LZ4

 	if (innodb_compression_algorithm == PAGE_LZ4_ALGORITHM) {

mariadbd --plugin-dir=$BUILD_DIR/plugin/file_key_management --plugin-load-add=file_key_management.so --file-key-management-filename=$(pwd)/encryption_keys.txt --innodb_page_size=64K --innodb-log-file-size=96M --innodb-buffer-pool-size=12M --datadir $(pwd)/data

10.6 377ec1b5c906ea55fa2af9ae048a14eb9e1cf201 (patched)
2022-11-22 12:54:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=48561117,51775692
2022-11-22 12:54:38 0 [Note] InnoDB: Starting a batch to recover 503 pages from redo log.
2022-11-22 12:54:38 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd

Like in the rr replay trace, the tablespace size is 10 and the recv_size is 448 pages at this point.

I think that the problem occurs when a created data file is recovered during an earlier batch of recovery. At that point, we fail to read the file size metadata from the file system.

Marko Mäkelä added a comment - 2022-11-22 11:00 The fix that I initially had in mind is too complex, could cause a performance regression, and actually cannot fix the failure. In the rr replay trace, fil_ibd_load() is being invoked multiple times on the file test/t5.ibd , and each time it returns FIL_LOAD_DEFER , which the caller does not make note of ( MDEV-25909 ). Each of those calls also invokes os_file_get_size() but throws away the result. The data set that I uploaded to MDEV-30068 actually does recover, because the first page of the file test/t5.ibd had already been written by recovery. I should have copied the data directory before the restart attempt. In that data directory, the file test/t5.ibd consists of 448*65536 NUL bytes. This problem was caused by MDEV-24626 in MariaDB Server 10.6. As far as I can tell, a second attempt to recover the data directory should succeed. The FSP_SIZE in page 0 of the file will be recovered to 0x1c0 (448) pages on my system, while in the data directory that I attached to MDEV-30068 , it was only recovered to 10 pages. That is what recv_sys_t::recover_deferred() recovered as the file size in the rr replay traces. Finally, I was able to reproduce this with data.tar.xz (before the recovery attempt) and encryption_keys.txt and a little cheat: diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 9b81aa9307c..73d29692907 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -3821,6 +3821,7 @@ static int innodb_init_params() pages, even for larger pages */ MYSQL_SYSVAR_NAME(buffer_pool_size).min_val= min_buffer_pool_size(); +#if 0 if (innobase_buffer_pool_size < MYSQL_SYSVAR_NAME(buffer_pool_size).min_val) { ib::error() << "innodb_page_size=" << srv_page_size << " requires " @@ -3830,6 +3831,7 @@ static int innodb_init_params() << "MiB"; DBUG_RETURN(HA_ERR_INITIALIZATION); } +#endif #ifndef HAVE_LZ4 if (innodb_compression_algorithm == PAGE_LZ4_ALGORITHM) { mariadbd --plugin-dir=$BUILD_DIR/plugin/file_key_management --plugin-load-add=file_key_management.so --file-key-management-filename=$(pwd)/encryption_keys.txt --innodb_page_size=64K --innodb-log-file-size=96M --innodb-buffer-pool-size=12M --datadir $(pwd)/data 10.6 377ec1b5c906ea55fa2af9ae048a14eb9e1cf201 (patched) 2022-11-22 12:54:38 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=48561117,51775692 2022-11-22 12:54:38 0 [Note] InnoDB: Starting a batch to recover 503 pages from redo log. 2022-11-22 12:54:38 0 [ERROR] [FATAL] InnoDB: Trying to write 65536 bytes at 2359296 outside the bounds of the file: ./test/t5.ibd Like in the rr replay trace, the tablespace size is 10 and the recv_size is 448 pages at this point. I think that the problem occurs when a created data file is recovered during an earlier batch of recovery. At that point, we fail to read the file size metadata from the file system.

Otto Kekäläinen added a comment - 2024-05-09 04:40

I saw this on s390x builder on Launchpad for MariaDB 10.11.7:

innodb.undo_truncate '4k,innodb'         w2 [ fail ]  Found warnings/errors in server log file!

        Test ended at 2024-05-08 03:54:42

line

2024-05-08  3:54:18 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 2621440 outside the bounds of the file: .//undo001

Attempting backtrace. You can use the following information to find out

^ Found warnings in /<<PKGBUILDDIR>>/builddir/mysql-test/var/2/log/mysqld.1.err

ok

Full log: https://launchpadlibrarian.net/728605875/buildlog_ubuntu-noble-s390x.mariadb_1%3A10.11.7-5~bpo24.04.1~1715057361.1833c7608a3+debian.latest_BUILDING.txt.gz

Otto Kekäläinen added a comment - 2024-05-09 04:40 I saw this on s390x builder on Launchpad for MariaDB 10.11.7: innodb.undo_truncate '4k,innodb' w2 [ fail ] Found warnings/errors in server log file! Test ended at 2024-05-08 03:54:42 line 2024-05-08 3:54:18 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 2621440 outside the bounds of the file: .//undo001 Attempting backtrace. You can use the following information to find out ^ Found warnings in /<<PKGBUILDDIR>>/builddir/mysql-test/var/2/log/mysqld.1.err ok Full log: https://launchpadlibrarian.net/728605875/buildlog_ubuntu-noble-s390x.mariadb_1%3A10.11.7-5~bpo24.04.1~1715057361.1833c7608a3+debian.latest_BUILDING.txt.gz

MariaDB Server

InnoDB: Trying to write ... bytes at ... outside the bounds of the file ...

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration