[Current thread is 1 (Thread 0x14ac32811700 (LWP 3928898))]
(gdb) bt
#0 buf_flush_LRU_list_batch (n=<synthetic pointer>, n=<synthetic pointer>, max=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1280
#1 buf_do_LRU_batch (max=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1374
#2 buf_flush_LRU (max_n=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1715
#3 0x00005563ba5a8f23 in buf_pool_t::withdraw_blocks (this=this@entry=0x5563bb708440 <buf_pool>) at /usr/include/c++/9/bits/stl_algobase.h:222
#4 0x00005563bad0512d in buf_pool_t::resize (this=this@entry=0x5563bb708440 <buf_pool>) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/include/buf0buf.h:1546
#5 0x00005563bad0321c in buf_resize_callback () at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0buf.cc:1972
#6 0x00005563bad9d019 in tpool::task_group::execute (this=0x5563bc006c20 <single_threaded_group>, t=0x5563bc006b80 <buf_resize_task>) at /test/preview-10.10-MDEV-16329-online-alter_opt/tpool/task_group.cc:55
#7 0x00005563bad9bda7 in tpool::thread_pool_generic::worker_main (this=0x5563bcb26740, thread_var=0x5563bcb36260) at /test/preview-10.10-MDEV-16329-online-alter_opt/tpool/tpool_generic.cc:588
#8 0x000014ac55999de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x000014ac55ab3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x000014ac5569f133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
However, all attempts at reproduction have so far failed. This ticket in the hope that there is sufficient information to find the issue.
Added all threads bt (gdb_threads.txt) and all threads full bt (gdb_full.txt).
Also added full original SQL but it is 18k lines (and does not reproduce as mentioned)
And linking with MDEV-26662 given the stack.
Roel Van de Paar
added a comment - - edited A C-based client (pquery) was in use when the issue happened, if it matters.
Options used for the trial were:
--no-defaults --max_allowed_packet=33554432 --maximum-bulk_insert_buffer_size=1M --maximum-join_buffer_size=1M --maximum-max_heap_table_size=1M --maximum-max_join_size=1M --maximum-myisam_max_sort_file_size=1M --maximum-myisam_mmap_size=1M --maximum-myisam_sort_buffer_size=1M --maximum-optimizer_trace_max_mem_size=1M --maximum-preload_buffer_size=1M --maximum-query_alloc_block_size=1M --maximum-query_prealloc_size=1M --maximum-range_alloc_block_size=1M --maximum-read_buffer_size=1M --maximum-read_rnd_buffer_size=1M --maximum-sort_buffer_size=1M --maximum-tmp_table_size=1M --maximum-transaction_alloc_block_size=1M --maximum-transaction_prealloc_size=1M --log-output=none --sql_mode= --old_alter_table=COPY --sql_mode= --deadlock-timeout-short=10 --deadlock-timeout-long=10 --deadlock-search-depth-short=10 --deadlock-search-depth-long=33 --innodb_change_buffering=inserts --innodb_fast_shutdown=1 --innodb_file_per_table=1 --log-bin --binlog_format=ROW --log_bin_trust_function_creators=1 --innodb-buffer-pool-size=300M --loose-debug_assert_on_not_freed_memory=1 --innodb_use_native_aio=1
SET @@global.relay_log_purge = '';#ERROR: 2013 - Lost connectionto MySQL server during query
Roel Van de Paar
added a comment - The failure seems to have been on this command:
SET @@ global .relay_log_purge = '' ;#ERROR: 2013 - Lost connection to MySQL server during query
0x00005563bad0be63 <+515>: je 0x5563bad0be6c <_Z13buf_flush_LRUm+524>
0x00005563bad0be65 <+517>: add %r15,0x2208(%rbx)
0x00005563bad0be6c <+524>: mov 0x4430(%rbx),%rdi
0x00005563bad0be73 <+531>: mov %r13d,%r15d
=> 0x00005563bad0be76 <+534>: mov 0x58(%rdi),%rax
…
The indicated source code line is the following:
retry:
buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage);
The value of bpage is not directly available. rbx contains &buf_pool. The offset 0x4430 refers to buf_pool.lru_hp.m_hp, which is nullptr in the core dump. But this does not match the above source code line. The generated code perfectly matches the following code (the goto was optimized away):
auto p= buf_flush_space(space_id);
space= p.first;
last_space_id= space_id;
mysql_mutex_lock(&buf_pool.mutex);
if (p.second)
buf_pool.stat.n_pages_written+= p.second;
bpage= buf_pool.lru_hp.get();
goto retry;
Side note: This branch preview-10.10-MDEV-16329-online-alter is several commits behind the latest 10.10 branch. This code has been refactored a little since then, but that should be unrelated to this crash.
It looks like the code was compiled with PLUGIN_PERFSCHEMA=NO. The code is indeed blindly dereferencing the null pointer that was returned by buf_pool.lru_hp.get(). Before MDEV-27985, the loop was correctly terminated on bpage=nullptr.
Marko Mäkelä
added a comment - The crash occurs due to dereferencing a null pointer:
(gdb) p server_version_source_revision
$1 = 0x5563bafc3f80 "d7a7d16713070c7c2902c3df4b4f6d024cd0320f"
(gdb) display/i $pc
1: x/i $pc
=> 0x5563bad0be76 <_Z13buf_flush_LRUm+534>: mov 0x58(%rdi),%rax
(gdb) disassemble
…
0x00005563bad0be50 <+496>: call 0x5563bad0b590 <buf_flush_freed_pages(fil_space_t*, bool)>
0x00005563bad0be55 <+501>: mov %rbx,%rdi
0x00005563bad0be58 <+504>: mov %eax,%r15d
0x00005563bad0be5b <+507>: call 0x5563ba5664c0 <pthread_mutex_lock@plt>
0x00005563bad0be60 <+512>: test %r15d,%r15d
0x00005563bad0be63 <+515>: je 0x5563bad0be6c <_Z13buf_flush_LRUm+524>
0x00005563bad0be65 <+517>: add %r15,0x2208(%rbx)
0x00005563bad0be6c <+524>: mov 0x4430(%rbx),%rdi
0x00005563bad0be73 <+531>: mov %r13d,%r15d
=> 0x00005563bad0be76 <+534>: mov 0x58(%rdi),%rax
…
The indicated source code line is the following:
retry:
buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage);
The value of bpage is not directly available. rbx contains &buf_pool . The offset 0x4430 refers to buf_pool.lru_hp.m_hp , which is nullptr in the core dump. But this does not match the above source code line. The generated code perfectly matches the following code (the goto was optimized away):
auto p= buf_flush_space(space_id);
space= p.first;
last_space_id= space_id;
mysql_mutex_lock(&buf_pool.mutex);
if (p.second)
buf_pool.stat.n_pages_written+= p.second;
bpage= buf_pool.lru_hp.get();
goto retry;
Side note: This branch preview-10.10- MDEV-16329 -online-alter is several commits behind the latest 10.10 branch. This code has been refactored a little since then, but that should be unrelated to this crash.
It looks like the code was compiled with PLUGIN_PERFSCHEMA=NO . The code is indeed blindly dereferencing the null pointer that was returned by buf_pool.lru_hp.get() . Before MDEV-27985 , the loop was correctly terminated on bpage=nullptr .
A C-based client (pquery) was in use when the issue happened, if it matters.
Options used for the trial were:
--no-defaults --max_allowed_packet=33554432 --maximum-bulk_insert_buffer_size=1M --maximum-join_buffer_size=1M --maximum-max_heap_table_size=1M --maximum-max_join_size=1M --maximum-myisam_max_sort_file_size=1M --maximum-myisam_mmap_size=1M --maximum-myisam_sort_buffer_size=1M --maximum-optimizer_trace_max_mem_size=1M --maximum-preload_buffer_size=1M --maximum-query_alloc_block_size=1M --maximum-query_prealloc_size=1M --maximum-range_alloc_block_size=1M --maximum-read_buffer_size=1M --maximum-read_rnd_buffer_size=1M --maximum-sort_buffer_size=1M --maximum-tmp_table_size=1M --maximum-transaction_alloc_block_size=1M --maximum-transaction_prealloc_size=1M --log-output=none --sql_mode= --old_alter_table=COPY --sql_mode= --deadlock-timeout-short=10 --deadlock-timeout-long=10 --deadlock-search-depth-short=10 --deadlock-search-depth-long=33 --innodb_change_buffering=inserts --innodb_fast_shutdown=1 --innodb_file_per_table=1 --log-bin --binlog_format=ROW --log_bin_trust_function_creators=1 --innodb-buffer-pool-size=300M --loose-debug_assert_on_not_freed_memory=1 --innodb_use_native_aio=1