MariaDB Server / MDEV-28845

InnoDB: Failing assertion: bpage->can_relocate() in buf0lru.cc

Details

    Description

      origin/10.6 6c82ab4f726b83f09646c22afddd4c102e60607b 2022-06-14T15:33:11+03:00
       
      pluto:/data/results/1655224930/TBR-1523
      _RR_TRACE_DIR="./1/rr" rr replay --mark-stdio
       
      [rr 1613722 294754]2022-06-14 18:13:34 0x7f16761f4700[rr 1613722 294757]  InnoDB: Assertion failure in file /data/Server/10.6R/storage/innobase/buf/buf0lru.cc line 1060
      [rr 1613722 294759]InnoDB: Failing assertion: bpage->can_relocate()
      (rr) bt
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #1  0x00007f1687466859 in __GI_abort () at abort.c:79
      #2  0x000055b3ec40533c in ut_dbg_assertion_failed (expr=expr@entry=0x55b3ed333320 "bpage->can_relocate()", file=file@entry=0x55b3ed32fb80 "/data/Server/10.6R/storage/innobase/buf/buf0lru.cc", line=line@entry=1060)
          at /data/Server/10.6R/storage/innobase/ut/ut0dbg.cc:60
      #3  0x000055b3ec563c99 in buf_LRU_block_remove_hashed (bpage=bpage@entry=0x7f1679f9ed60, id=..., chain=..., zip=zip@entry=true) at /data/Server/10.6R/storage/innobase/buf/buf0lru.cc:1060
      #4  0x000055b3ec56d32f in buf_pool_t::corrupted_evict (this=this@entry=0x55b3edeb84c0 <buf_pool>, bpage=bpage@entry=0x7f1679f9ed60, state=state@entry=2147483648) at /data/Server/10.6R/storage/innobase/buf/buf0lru.cc:1236
      #5  0x000055b3ec4ff50b in buf_page_t::read_complete (this=0x7f1679f9ed60, node=...) at /data/Server/10.6R/storage/innobase/buf/buf0buf.cc:3693
      #6  0x000055b3ec686276 in fil_aio_callback (request=...) at /data/Server/10.6R/storage/innobase/fil/fil0fil.cc:2945
      #7  0x000055b3ec0a4669 in io_callback (cb=0x7f1684151508) at /data/Server/10.6R/storage/innobase/os/os0file.cc:3533
      #8  0x000055b3ec800143 in tpool::task_group::execute (this=0x612000022ed0, t=t@entry=0x7f1684151560) at /data/Server/10.6R/tpool/task_group.cc:55
      #9  0x000055b3ec8003b7 in tpool::task::execute (this=0x7f1684151560) at /data/Server/10.6R/tpool/task.cc:32
      #10 0x000055b3ec7f9094 in tpool::thread_pool_generic::worker_main (this=0x618000000880, thread_var=0x63000001fd80) at /data/Server/10.6R/tpool/tpool_generic.cc:584
      #11 0x000055b3ec7f9dae in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>)
          at /usr/include/c++/9/bits/invoke.h:89
      #12 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
      #13 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
      #14 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
      #15 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>)
          at /usr/include/c++/9/thread:195
      #16 0x00007f1687873de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
      #17 0x00007f168798e609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #18 0x00007f1687563293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      (rr)
       
      RQG
      ===
      # git clone https://github.com/mleich1/rqg --branch experimental RQG
      #
      # GIT_SHOW: HEAD -> experimental, origin/experimental a32979c6d789ddf0982e6a56af687565be00590c 2022-06-14T17:21:56+02:00
      # rqg.pl  : Version 4.0.6 (2022-05)
      #
      # $RQG_HOME/rqg.pl \
      # --grammar=conf/mariadb/table_stress_innodb_nocopy2.yy \
      # --gendata=conf/mariadb/table_stress.zz \
      # --gendata_sql=conf/mariadb/table_stress.sql \
      # --reporters=RestartConsistency \
      # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
      # --mysqld=--loose-idle_write_transaction_timeout=0 \
      # --mysqld=--loose-idle_transaction_timeout=0 \
      # --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      # --mysqld=--connect_timeout=60 \
      # --mysqld=--interactive_timeout=28800 \
      # --mysqld=--slave_net_timeout=60 \
      # --mysqld=--net_read_timeout=30 \
      # --mysqld=--net_write_timeout=60 \
      # --mysqld=--loose-table_lock_wait_timeout=50 \
      # --mysqld=--wait_timeout=28800 \
      # --mysqld=--lock-wait-timeout=86400 \
      # --mysqld=--innodb-lock-wait-timeout=50 \
      # --no-mask \
      # --queries=10000000 \
      # --seed=random \
      # --reporters=Backtrace \
      # --reporters=ErrorLog \
      # --reporters=Deadlock1 \
      # --validators=None \
      # --mysqld=--log_output=none \
      # --mysqld=--log_bin_trust_function_creators=1 \
      # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      # --engine=InnoDB \
      # --restart_timeout=240 \
      # --mysqld=--plugin-load-add=file_key_management.so \
      # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
      # --mysqld=--plugin-load-add=provider_lzo.so \
      # --mysqld=--plugin-load-add=provider_bzip2.so \
      # --mysqld=--plugin-load-add=provider_lzma.so \
      # --mysqld=--plugin-load-add=provider_snappy.so \
      # --mysqld=--plugin-load-add=provider_lz4.so \
      # --duration=300 \
      # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      # --mysqld=--innodb_file_per_table=1 \
      # --mysqld=--loose-innodb_read_only_compressed=OFF \
      # --mysqld=--innodb_stats_persistent=on \
      # --mysqld=--innodb_adaptive_hash_index=on \
      # --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
      # --mysqld=--loose-max-statement-time=30 \
      # --threads=9 \
      # --mysqld=--innodb-use-native-aio=0 \
      # --mysqld=--loose-gdb \
      # --mysqld=--loose-debug-gdb \
      # --rr=Extended \
      # --rr_options=--wait \
      # --mysqld=--loose_innodb_change_buffering=deletes \
      # --mysqld=--innodb_rollback_on_timeout=OFF \
      # --vardir_type=fast \
      # --mysqld=--innodb_page_size=64K \
      # --mysqld=--innodb-buffer-pool-size=24M \
      # --no_mask \
      # --batch \
      # <local settings>
       
      Error pattern for RQG
      [ 'TBR-1523', 'InnoDB: Assertion failure in file .{1,250}buf0lru.cc.+InnoDB: Failing assertion: bpage->can_relocate\(\).+RESULT: The RQG run ended with status STATUS_SERVER_CRASHED' ],
      
      

          Activity

            The assertion fails because there is a waiting request for the buffer pool block:

            10.6 6c82ab4f726b83f09646c22afddd4c102e60607b

            Thread 18 hit Hardware watchpoint 2: -location lock.lock.writer.lock._M_i
            …
            #4  sux_lock<ssux_lock_impl<true> >::x_lock (this=this@entry=0x7f1679f9ed78, 
                for_io=for_io@entry=false)
                at /data/Server/10.6R/storage/innobase/include/sux_lock.h:395
            #5  0x000055b3ec507d41 in buf_page_create_low (page_id=..., 
                zip_size=zip_size@entry=0, mtr=mtr@entry=0x61a000449a90, 
                free_block=free_block@entry=0x7f1679fa7c60)
                at /data/Server/10.6R/storage/innobase/buf/buf0buf.cc:3230
            #6  0x000055b3ec5092a5 in buf_page_create (space=space@entry=0x6120000817c0, 
                offset=offset@entry=9, zip_size=<optimized out>, 
                mtr=mtr@entry=0x61a000449a90, free_block=free_block@entry=0x7f1679fa7c60)
                at /data/Server/10.6R/storage/innobase/include/buf0types.h:95
            #7  0x000055b3ec6c8a64 in fsp_page_create (space=space@entry=0x6120000817c0, 
                offset=offset@entry=9, mtr=mtr@entry=0x61a000449a90)
                at /data/Server/10.6R/storage/innobase/fsp/fsp0fsp.cc:1047
            

            A little later, a read of the same block (which is not actually needed, because the page would be overwritten by the above thread) would complete:

            10.6 6c82ab4f726b83f09646c22afddd4c102e60607b

            (rr) 
            Continuing.
            [Switching to Thread 1613722.1614200]
             
            Thread 3 hit Hardware watchpoint 3: -location bpage.id_
             
            Old value = {m_id = 47244640265}
            New value = {m_id = 18446744073709551615}
            0x000055b3ec56d6e4 in buf_page_t::set_corrupt_id (
                this=this@entry=0x7f1679f9ed60)
                at /data/Server/10.6R/storage/innobase/include/buf0types.h:98
            98	  constexpr page_id_t(uint64_t id) : m_id(id) {}
            (rr) bt
            #0  0x000055b3ec56d6e4 in buf_page_t::set_corrupt_id (
                this=this@entry=0x7f1679f9ed60)
                at /data/Server/10.6R/storage/innobase/include/buf0types.h:98
            #1  0x000055b3ec56d196 in buf_pool_t::corrupted_evict (
                this=this@entry=0x55b3edeb84c0 <buf_pool>, 
                bpage=bpage@entry=0x7f1679f9ed60, state=state@entry=2147483648)
                at /data/Server/10.6R/storage/innobase/buf/buf0lru.cc:1220
            #2  0x000055b3ec4ff50b in buf_page_t::read_complete (this=0x7f1679f9ed60, 
                node=...) at /data/Server/10.6R/storage/innobase/buf/buf0buf.cc:3693
            #3  0x000055b3ec686276 in fil_aio_callback (request=...)
                at /data/Server/10.6R/storage/innobase/fil/fil0fil.cc:2945
            #4  0x000055b3ec0a4669 in io_callback (cb=0x7f1684151508)
                at /data/Server/10.6R/storage/innobase/os/os0file.cc:3533
            

            The reason why the block was marked as corrupted is that its contents are all-zero. Before MDEV-13542, we would allow such blocks to enter the buffer pool. All-zero pages are not really corrupted; unused pages in a tablespace are initially zero-initialized. But such pages must not be accessed, neither by recovery (when applying log records) nor by any caller of buf_page_get_low().
            The assertion will fail a few steps later in:

            10.6 6c82ab4f726b83f09646c22afddd4c102e60607b

            #3  0x000055b3ec563c99 in buf_LRU_block_remove_hashed (bpage=bpage@entry=0x7f1679f9ed60, id=..., chain=..., zip=zip@entry=true) at /data/Server/10.6R/storage/innobase/buf/buf0lru.cc:1060
            #4  0x000055b3ec56d32f in buf_pool_t::corrupted_evict (this=this@entry=0x55b3edeb84c0 <buf_pool>, bpage=bpage@entry=0x7f1679f9ed60, state=state@entry=2147483648)
                at /data/Server/10.6R/storage/innobase/buf/buf0lru.cc:1236
            

            At this point, the buf_page_create() thread is still waiting.

            Perhaps the simplest fix would be to not flag all-zero pages as corrupted at this level, but instead simply set the state to buf_page_t::FREED, so that access via buf_page_get_low() would still fail.

            By the way, as expected, the page was read due to read-ahead heuristics; the contents of the page were not actually needed:

            10.6 6c82ab4f726b83f09646c22afddd4c102e60607b

            #7  0x000055b3ec57a182 in buf_read_ahead_linear (page_id=..., zip_size=zip_size@entry=0, ibuf=<optimized out>) at /data/Server/10.6R/storage/innobase/buf/buf0rea.cc:639
            #8  0x000055b3ec516173 in buf_page_get_low (page_id=<error reading variable: Cannot access memory at address 0xffffffffffffffa0>, zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=1, guess=<optimized out>, 
                guess@entry=0x0, mode=mode@entry=10, mtr=mtr@entry=0x7f16643c2e00, err=<optimized out>, allow_ibuf_merge=<optimized out>) at /data/Server/10.6R/storage/innobase/include/buf0buf.h:1035
            #9  0x000055b3ec516539 in buf_page_get_gen (page_id=..., zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=1, guess=guess@entry=0x0, mode=mode@entry=10, mtr=mtr@entry=0x7f16643c2e00, err=0x0, 
                allow_ibuf_merge=false) at /data/Server/10.6R/storage/innobase/buf/buf0buf.cc:3057
            #10 0x000055b3ec46f971 in btr_copy_blob_prefix (buf=<optimized out>, len=<optimized out>, id=..., offset=38) at /data/Server/10.6R/storage/innobase/btr/btr0cur.cc:7518
            …
            #26 0x000055b3eab8334d in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x62b00015e218, 
                packet=packet@entry=0x629000c8f219 " ALTER IGNORE TABLE t7 ADD INDEX IF NOT EXISTS `Marvão_idx1` ( col_int /*!100800 ASC */, col_text(9) /*!100800 DESC */ ), ALGORITHM = NOCOPY, LOCK = DEFAULT  /* E_R Thread7 QNO 1069 CON_ID 78 */ ", packet_length=packet_length@entry=196, blocking=blocking@entry=true) at /data/Server/10.6R/sql/sql_class.h:1355
            

            marko Marko Mäkelä added a comment

            Come to think of it, we could have a scenario where multiple threads are concurrently requesting access to a page that needs to be read into the buffer pool. If the page read failed, we would crash while attempting to evict the page, because waiting lock requests exist. If we handled the all-zero page as a special case to fix this regression, it would not help in such a corrupted-read scenario. MDEV-13542 aimed to remove all crashes related to reading a corrupted page.

            marko Marko Mäkelä added a comment

            origin/bb-10.6-MDEV-28845 253806dffc6e80a3bebb1d7fa2fa4b1899252306 2022-06-15T17:00:05+03:00
            performed well during RQG testing.

            mleich Matthias Leich added a comment

            People

              marko Marko Mäkelä
              mleich Matthias Leich