[MDEV-27891] Delayed SIGSEGV in buf_pool_t::resize on InnoDB buffer pool resize after or during DROP TABLE - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Blocker
Resolution: Fixed
Affects Version/s: 10.6.7, 10.8(EOL), 10.9(EOL)
Fix Version/s: 10.6.8, 10.7.4, 10.8.3
Component/s: Storage Engine - InnoDB
Labels:
- not-10.2
- not-10.3
- not-10.4
- not-10.5

Description

SET sql_mode='';

CREATE TABLE t1 (a INT NOT NULL, b INT, PRIMARY KEY(a)) ENGINE=InnoDB;

SET GLOBAL innodb_buffer_pool_size=21474836480;

INSERT INTO t1 VALUES (0,0,0);

DROP TABLE t1;

SET GLOBAL innodb_buffer_pool_size=@@innodb_buffer_pool_size + 1048576;

SELECT SLEEP (3);

Leads to:

10.9.0 b5852ffbeebc3000982988383daeefb0549e058a (Debug)
Core was generated by `/test/MD140222-mariadb-10.9.0-linux-x86_64-dbg/bin/mysqld --no-defaults --core-'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 buf_pool_t::resize (this=this@entry=0x55764535bc40 <buf_pool>)
at /test/10.9_dbg/storage/innobase/buf/buf0buf.cc:1683
[Current thread is 1 (Thread 0x150c29af5700 (LWP 1052367))]
(gdb) bt
#0 buf_pool_t::resize (this=this@entry=0x55764535bc40 <buf_pool>) at /test/10.9_dbg/storage/innobase/buf/buf0buf.cc:1683
#1 0x000055764471d763 in buf_resize_callback () at /test/10.9_dbg/storage/innobase/buf/buf0buf.cc:1992
#2 0x0000557644849100 in tpool::task_group::execute (this=0x557645cb6080 <single_threaded_group>, t=t@entry=0x557645cb5fe0 <buf_resize_task>) at /test/10.9_dbg/tpool/task_group.cc:55
#3 0x0000557644849189 in tpool::task::execute (this=0x557645cb5fe0 <buf_resize_task>) at /test/10.9_dbg/tpool/task.cc:32
#4 0x0000557644847e1f in tpool::thread_pool_generic::worker_main (this=0x5576465a91d0, thread_var=0x5576465b8470) at /test/10.9_dbg/tpool/tpool_generic.cc:549
#5 0x0000557644848156 in std::__invoke_impl<void, void (tpool::thread_pool_generic::)(tpool::worker_data), tpool::thread_pool_generic, tpool::worker_data> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
#6 std::__invoke<void (tpool::thread_pool_generic::)(tpool::worker_data), tpool::thread_pool_generic, tpool::worker_data> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
#7 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::)(tpool::worker_data), tpool::thread_pool_generic, tpool::worker_data> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
#8 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::)(tpool::worker_data), tpool::thread_pool_generic, tpool::worker_data> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#9 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::)(tpool::worker_data), tpool::thread_pool_generic, tpool::worker_data> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
#10 0x0000150c5515fde4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x0000150c5527a609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#12 0x0000150c54e68293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Bug confirmed present in:
MariaDB: 10.8.1 (dbg), 10.9.0 (dbg)

Bug (or feature/syntax) confirmed not present in:
MariaDB: 10.2.42 (dbg), 10.2.42 (opt), 10.3.33 (dbg), 10.3.33 (opt), 10.4.23 (dbg), 10.4.23 (opt), 10.5.14 (dbg), 10.5.14 (opt), 10.6.6 (dbg), 10.6.6 (opt), 10.7.2 (dbg), 10.7.2 (opt), 10.8.1 (opt), 10.9.0 (opt)
MySQL: 5.5.62 (dbg), 5.5.62 (opt), 5.6.51 (dbg), 5.6.51 (opt), 5.7.36 (dbg), 5.7.36 (opt), 8.0.27 (dbg), 8.0.27 (opt)

Attachments

Issue Links

causes

MDEV-34167 We fail to terminate transaction early with ER_LOCK_TABLE_FULL when lock memory is growing

Closed

is caused by

MDEV-25342 autosize innodb_buffer_pool_chunk_size

Closed

relates to

MDEV-24659 Assertion `!fsp_is_system_temporary(bpage->id().space())' failed in buf_flush_relocate_on_flush_list

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2022-02-21 09:19

Here is a simpler test case:

--source include/have_innodb.inc

CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;

SET GLOBAL innodb_buffer_pool_size=21474836480;

DROP TABLE t1;

# The previous buffer pool resize may still be in progress.

--error 0,ER_WRONG_ARGUMENTS

SET GLOBAL innodb_buffer_pool_size=@@innodb_buffer_pool_size + 1048576;

SELECT SLEEP (3);

If I essentially revert ~~MDEV-25342~~ with the following, the test case will not crash for me:

diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc

index 1ca4ce9cbf8..cb66f5d8c22 100644

--- a/storage/innobase/handler/ha_innodb.cc

+++ b/storage/innobase/handler/ha_innodb.cc

@@ -3732,11 +3732,12 @@ static void innodb_buffer_pool_size_init()

   else if (srv_buf_pool_chunk_unit == 0)

     srv_buf_pool_chunk_unit = srv_buf_pool_size / 64;

-    my_large_page_truncate(&srv_buf_pool_chunk_unit);

+    // my_large_page_truncate(&srv_buf_pool_chunk_unit);

+#if 0

   if (srv_buf_pool_chunk_unit < buf_pool_chunk_min_size)

     srv_buf_pool_chunk_unit = buf_pool_chunk_min_size;

+#endif

   srv_buf_pool_size = buf_pool_size_align(srv_buf_pool_size);

   innobase_buffer_pool_size = srv_buf_pool_size;

For me, the code crashed like this:

#3  <signal handler called>

#4  buf_pool_t::resize (this=this@entry=0x558839a38bc0 <buf_pool>) at /mariadb/10.8/storage/innobase/buf/buf0buf.cc:1683

#5  0x0000558838eefe59 in buf_resize_callback () at /mariadb/10.8/storage/innobase/buf/buf0buf.cc:1992

The crashing statement is in the loop body (dereferencing an invalid chunk):

		for (const chunk_t* chunk = chunks + n_chunks_new,

		     * const echunk = chunks + n_chunks;

		     chunk != echunk; chunk++)

			w += chunk->size;

Marko Mäkelä added a comment - 2022-02-21 09:19 Here is a simpler test case: --source include/have_innodb.inc CREATE TABLE t1 (a INT PRIMARY KEY ) ENGINE=InnoDB; SET GLOBAL innodb_buffer_pool_size=21474836480; DROP TABLE t1; # The previous buffer pool resize may still be in progress. --error 0,ER_WRONG_ARGUMENTS SET GLOBAL innodb_buffer_pool_size=@@innodb_buffer_pool_size + 1048576; SELECT SLEEP (3); If I essentially revert MDEV-25342 with the following, the test case will not crash for me: diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 1ca4ce9cbf8..cb66f5d8c22 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -3732,11 +3732,12 @@ static void innodb_buffer_pool_size_init() else if (srv_buf_pool_chunk_unit == 0) { srv_buf_pool_chunk_unit = srv_buf_pool_size / 64; - my_large_page_truncate(&srv_buf_pool_chunk_unit); + // my_large_page_truncate(&srv_buf_pool_chunk_unit); } - +#if 0 if (srv_buf_pool_chunk_unit < buf_pool_chunk_min_size) srv_buf_pool_chunk_unit = buf_pool_chunk_min_size; +#endif srv_buf_pool_size = buf_pool_size_align(srv_buf_pool_size); innobase_buffer_pool_size = srv_buf_pool_size; For me, the code crashed like this: #3 <signal handler called> #4 buf_pool_t::resize (this=this@entry=0x558839a38bc0 <buf_pool>) at /mariadb/10.8/storage/innobase/buf/buf0buf.cc:1683 #5 0x0000558838eefe59 in buf_resize_callback () at /mariadb/10.8/storage/innobase/buf/buf0buf.cc:1992 The crashing statement is in the loop body (dereferencing an invalid chunk ): for ( const chunk_t* chunk = chunks + n_chunks_new, * const echunk = chunks + n_chunks; chunk != echunk; chunk++) w += chunk->size;

Daniel Black added a comment - 2022-02-22 06:06

The above crashing loop is in a withdrawing blocks loop. Given we've increased the buffer pool size (twice), this shouldn't be happening.

At the crash location:

(gdb) info locals

chunk = 0x7f2058282018

echunk = 0x7f20580f1f68

w = 2383958886547318777

warning = false

new_instance_size = 1310784

str_old_size = {<std::basic_ostream<char, std::char_traits<char> >> = {<std::basic_ios<char, std::char_traits<char> >> = {<No data fields>}, _vptr$basic_ostream = 0x7f207ca035f0 <vtable for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >+24>}, _M_stringbuf = <incomplete type>}

str_new_size = {<std::basic_ostream<char, std::char_traits<char> >> = {<std::basic_ios<char, std::char_traits<char> >> = {<No data fields>}, _vptr$basic_ostream = 0x7f207ca035f0 <vtable for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >+24>}, _M_stringbuf = <incomplete type>}

str_chunk_size = {<std::basic_ostream<char, std::char_traits<char> >> = {<std::basic_ios<char, std::char_traits<char> >> = {<No data fields>}, _vptr$basic_ostream = 0x7f207ca035f0 <vtable for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >+24>}, _M_stringbuf = <incomplete type>}

btr_search_disabled = false

withdraw_started = 99758171203

message_interval = 60000000

retry_interval = 1

should_retry_withdraw = false

current_time = 99758171203

new_size = 1290202

chunk_map_old = 0x42e98e8

s = 40318

new_size_too_diff = true

(gdb) p curr_size

$6 = {m_counter = {<std::__atomic_base<unsigned long>> = {_M_i = 1269822}, <No data fields>}}

(gdb) p old_size

$7 = {m_counter = {<std::__atomic_base<unsigned long>> = {_M_i = 1290202}, <No data fields>}}

(gdb) p withdraw_target

$8 = 0

(gdb) p n_chunks

$9 = 20480

(gdb) p n_chunks_new

$10 = 20481

(gdb) p chunks->size

$11 = 62

(gdb) p echunk

$12 = (const buf_pool_t::chunk_t * const) 0x7f20580f1f68

(gdb) p chunk

$13 = (const buf_pool_t::chunk_t *) 0x7f2058282018

(gdb) p chunks

$14 = (buf_pool_t::chunk_t *) 0x7f2058001f68

(gdb) p srv_buf_pool_old_size

$15 = 21474836480

(gdb) p srv_buf_pool_size

$16 = 21475885056

cur_size (the new instance size) is < old_size hence entering this for loop. With n_chunks_new > n_chunks this loop was only going to terminate on segfault.

The below resolves the test case, but not why old_size is out of sync.

patch
diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index dbe7f9cbd2d..c9e2ee48314 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -1673,7 +1673,7 @@ inline void buf_pool_t::resize()
curr_size = n_chunks_new * chunks->size;
mysql_mutex_unlock(&mutex);

- if (curr_size < old_size) {
+ if (n_chunks_new < n_chunks) {
/* set withdraw target */
size_t w = 0;

@@ -1694,7 +1694,7 @@ inline void buf_pool_t::resize()

withdraw_retry:
/* wait for the number of blocks fit to the new size (if needed)*/
- bool should_retry_withdraw = curr_size < old_size
+ bool should_retry_withdraw = n_chunks_new < n_chunks
&& withdraw_blocks();

if (srv_shutdown_state != SRV_SHUTDOWN_NONE) {

As curr_size vs old_size comparisons and changes are always under the pool mutex, can we replace the curr_size -> n_chunks_new and old_size -> n_chunks for relative operations, and for running_out (only absolute reference to both) multiply by the chunk_size (and hold the mutex during calculation).

n_chunks and n_chunks_new should loose their volatile compiler optimization too.

Daniel Black added a comment - 2022-02-22 06:06 The above crashing loop is in a withdrawing blocks loop. Given we've increased the buffer pool size (twice), this shouldn't be happening. At the crash location: (gdb) info locals chunk = 0x7f2058282018 echunk = 0x7f20580f1f68 w = 2383958886547318777 warning = false new_instance_size = 1310784 str_old_size = {<std::basic_ostream<char, std::char_traits<char> >> = {<std::basic_ios<char, std::char_traits<char> >> = {<No data fields>}, _vptr$basic_ostream = 0x7f207ca035f0 <vtable for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >+24>}, _M_stringbuf = <incomplete type>} str_new_size = {<std::basic_ostream<char, std::char_traits<char> >> = {<std::basic_ios<char, std::char_traits<char> >> = {<No data fields>}, _vptr$basic_ostream = 0x7f207ca035f0 <vtable for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >+24>}, _M_stringbuf = <incomplete type>} str_chunk_size = {<std::basic_ostream<char, std::char_traits<char> >> = {<std::basic_ios<char, std::char_traits<char> >> = {<No data fields>}, _vptr$basic_ostream = 0x7f207ca035f0 <vtable for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >+24>}, _M_stringbuf = <incomplete type>} btr_search_disabled = false withdraw_started = 99758171203 message_interval = 60000000 retry_interval = 1 should_retry_withdraw = false current_time = 99758171203 new_size = 1290202 chunk_map_old = 0x42e98e8 s = 40318 new_size_too_diff = true (gdb) p curr_size $6 = {m_counter = {<std::__atomic_base<unsigned long>> = {_M_i = 1269822}, <No data fields>}} (gdb) p old_size $7 = {m_counter = {<std::__atomic_base<unsigned long>> = {_M_i = 1290202}, <No data fields>}} (gdb) p withdraw_target $8 = 0 (gdb) p n_chunks $9 = 20480 (gdb) p n_chunks_new $10 = 20481 (gdb) p chunks->size $11 = 62 (gdb) p echunk $12 = (const buf_pool_t::chunk_t * const) 0x7f20580f1f68 (gdb) p chunk $13 = (const buf_pool_t::chunk_t *) 0x7f2058282018 (gdb) p chunks $14 = (buf_pool_t::chunk_t *) 0x7f2058001f68 (gdb) p srv_buf_pool_old_size $15 = 21474836480 (gdb) p srv_buf_pool_size $16 = 21475885056 cur_size (the new instance size) is < old_size hence entering this for loop. With n_chunks_new > n_chunks this loop was only going to terminate on segfault. The below resolves the test case, but not why old_size is out of sync. patch diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc index dbe7f9cbd2d..c9e2ee48314 100644 --- a/storage/innobase/buf/buf0buf.cc +++ b/storage/innobase/buf/buf0buf.cc @@ -1673,7 +1673,7 @@ inline void buf_pool_t::resize() curr_size = n_chunks_new * chunks->size; mysql_mutex_unlock(&mutex); - if (curr_size < old_size) { + if (n_chunks_new < n_chunks) { /* set withdraw target */ size_t w = 0; @@ -1694,7 +1694,7 @@ inline void buf_pool_t::resize() withdraw_retry: /* wait for the number of blocks fit to the new size (if needed)*/ - bool should_retry_withdraw = curr_size < old_size + bool should_retry_withdraw = n_chunks_new < n_chunks && withdraw_blocks(); if (srv_shutdown_state != SRV_SHUTDOWN_NONE) { As curr_size vs old_size comparisons and changes are always under the pool mutex, can we replace the curr_size -> n_chunks_new and old_size -> n_chunks for relative operations, and for running_out (only absolute reference to both) multiply by the chunk_size (and hold the mutex during calculation). n_chunks and n_chunks_new should loose their volatile compiler optimization too.

Daniel Black added a comment - 2022-02-22 10:23

With an explict innodb-buffer-pool-chunk-size=1M, like ~~MDEV-27467~~, is producible on older versions. I haven't triggered a 10.5 case yet.

10.6-4030a9fb2eb699361c58d71878e97b282647319a
2022-02-22 21:15:30 0 [Note] InnoDB: Resized hash tables at lock_sys, adaptive hash index, dictionary.
2022-02-22 21:15:30 0 [Note] InnoDB: Completed to resize buffer pool from 8388608 to 21474836480.
2022-02-22 21:15:30 0 [Note] InnoDB: Completed resizing buffer pool at 220222 21:15:30.
2022-02-22 21:15:30 4 [Note] InnoDB: Requested to resize buffer pool. (new size: 21475885056 bytes)
2022-02-22 21:15:30 0 [Note] InnoDB: Resizing buffer pool from 21474836480 to 21475885056 (unit=1048576).
2022-02-22 21:15:30 0 [Note] InnoDB: Disabling adaptive hash index.
220222 21:15:30 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.6.8-MariaDB-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63733 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mysys/stacktrace.c:212(my_print_stacktrace)[0xed4a5e]
sql/signal_handler.cc:226(handle_fatal_signal)[0x91aea4]
??:0(__restore_rt)[0x7f9d87cd0750]
buf/buf0buf.cc:1688(buf_pool_t::resize())[0xddcdf0]
buf0buf.cc:0(buf_resize_callback(void*))[0xdd5b84]
tpool/task_group.cc:56(tpool::task_group::execute(tpool::task*))[0xe73158]
tpool/tpool_generic.cc:547(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0xe70fb6]
??:0(std::error_code::default_error_condition() const)[0x7f9d87f8c5c4]
??:0(start_thread)[0x7f9d87d1bb1a]

Daniel Black added a comment - 2022-02-22 10:23 With an explict innodb-buffer-pool-chunk-size=1M, like MDEV-27467 , is producible on older versions. I haven't triggered a 10.5 case yet. 10.6-4030a9fb2eb699361c58d71878e97b282647319a 2022-02-22 21:15:30 0 [Note] InnoDB: Resized hash tables at lock_sys, adaptive hash index, dictionary. 2022-02-22 21:15:30 0 [Note] InnoDB: Completed to resize buffer pool from 8388608 to 21474836480. 2022-02-22 21:15:30 0 [Note] InnoDB: Completed resizing buffer pool at 220222 21:15:30. 2022-02-22 21:15:30 4 [Note] InnoDB: Requested to resize buffer pool. (new size: 21475885056 bytes) 2022-02-22 21:15:30 0 [Note] InnoDB: Resizing buffer pool from 21474836480 to 21475885056 (unit=1048576). 2022-02-22 21:15:30 0 [Note] InnoDB: Disabling adaptive hash index. 220222 21:15:30 [ERROR] mysqld got signal 11 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. To report this bug, see https://mariadb.com/kb/en/reporting-bugs We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. Server version: 10.6.8-MariaDB-log key_buffer_size=1048576 read_buffer_size=131072 max_used_connections=1 max_threads=153 thread_count=1 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63733 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. Thread pointer: 0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... stack_bottom = 0x0 thread_stack 0x49000 mysys/stacktrace.c:212(my_print_stacktrace)[0xed4a5e] sql/signal_handler.cc:226(handle_fatal_signal)[0x91aea4] ??:0(__restore_rt)[0x7f9d87cd0750] buf/buf0buf.cc:1688(buf_pool_t::resize())[0xddcdf0] buf0buf.cc:0(buf_resize_callback(void*))[0xdd5b84] tpool/task_group.cc:56(tpool::task_group::execute(tpool::task*))[0xe73158] tpool/tpool_generic.cc:547(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0xe70fb6] ??:0(std::error_code::default_error_condition() const)[0x7f9d87f8c5c4] ??:0(start_thread)[0x7f9d87d1bb1a]

Daniel Black added a comment - 2022-02-28 07:05

update https://github.com/MariaDB/server/pull/2025

biggest question is a simple running_out implementation based on resized size acceptable.

Daniel Black added a comment - 2022-02-28 07:05 update https://github.com/MariaDB/server/pull/2025 biggest question is a simple running_out implementation based on resized size acceptable.

Marko Mäkelä added a comment - 2022-03-04 16:48

Thank you, the revised fix looks acceptable to me.

Marko Mäkelä added a comment - 2022-03-04 16:48 Thank you, the revised fix looks acceptable to me.

Daniel Black added a comment - 2022-03-07 02:41

Thanks marko. Merged.

Existing n_chunks_new/n_chunks usage outside of mutex doesn't seem as safe as I originally suspected. Functions noted in commit message.

Daniel Black added a comment - 2022-03-07 02:41 Thanks marko . Merged. Existing n_chunks_new/n_chunks usage outside of mutex doesn't seem as safe as I originally suspected. Functions noted in commit message.

People

Assignee:: Daniel Black

Reporter:: Roel Van de Paar

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2022-02-19 00:27

Updated:: 2024-05-21 03:03

Resolved:: 2022-03-07 02:41

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server