[MDEV-28828] SIGSEGV in buf_flush_LRU_list_batch Created: 2022-06-14  Updated: 2022-06-16  Resolved: 2022-06-14

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.16, 10.6.8, 10.7.4, 10.8.3, 10.9.1, 10.10
Fix Version/s: 10.5.17, 10.6.9, 10.7.5, 10.8.4, 10.9.2

Type: Bug Priority: Blocker
Reporter: Roel Van de Paar Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: crash, regression-10.5

Attachments: File MDEV-28828.sql     Text File gdb_full.txt     Text File gdb_threads.txt    
Issue Links:
Problem/Incident
is caused by MDEV-27985 buf_flush_freed_pages() causes InnoDB... Closed
Relates
relates to MDEV-26662 buf_read_page_background() is blockin... Closed

 Description   

This following bug was observed:

preview-10.10-MDEV-16329-online-alter 10.10.0 d7a7d16713070c7c2902c3df4b4f6d024cd0320f (Optimized)

Core was generated by `/test/online-alter_MD080622-mariadb-10.10.0-linux-x86_64-opt/bin/mysqld --no-de'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  buf_flush_LRU_list_batch (n=<synthetic pointer>, n=<synthetic pointer>, 
    max=1536)
    at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1280
1280        buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage);
[Current thread is 1 (Thread 0x14ac32811700 (LWP 3928898))]
(gdb) bt
#0  buf_flush_LRU_list_batch (n=<synthetic pointer>, n=<synthetic pointer>, max=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1280
#1  buf_do_LRU_batch (max=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1374
#2  buf_flush_LRU (max_n=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1715
#3  0x00005563ba5a8f23 in buf_pool_t::withdraw_blocks (this=this@entry=0x5563bb708440 <buf_pool>) at /usr/include/c++/9/bits/stl_algobase.h:222
#4  0x00005563bad0512d in buf_pool_t::resize (this=this@entry=0x5563bb708440 <buf_pool>) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/include/buf0buf.h:1546
#5  0x00005563bad0321c in buf_resize_callback () at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0buf.cc:1972
#6  0x00005563bad9d019 in tpool::task_group::execute (this=0x5563bc006c20 <single_threaded_group>, t=0x5563bc006b80 <buf_resize_task>) at /test/preview-10.10-MDEV-16329-online-alter_opt/tpool/task_group.cc:55
#7  0x00005563bad9bda7 in tpool::thread_pool_generic::worker_main (this=0x5563bcb26740, thread_var=0x5563bcb36260) at /test/preview-10.10-MDEV-16329-online-alter_opt/tpool/tpool_generic.cc:588
#8  0x000014ac55999de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x000014ac55ab3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x000014ac5569f133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

However, all attempts at reproduction have so far failed. This ticket in the hope that there is sufficient information to find the issue.
Added all threads bt (gdb_threads.txt) and all threads full bt (gdb_full.txt).
Also added full original SQL but it is 18k lines (and does not reproduce as mentioned)
And linking with MDEV-26662 given the stack.



 Comments   
Comment by Roel Van de Paar [ 2022-06-14 ]

A C-based client (pquery) was in use when the issue happened, if it matters.
Options used for the trial were:

--no-defaults --max_allowed_packet=33554432 --maximum-bulk_insert_buffer_size=1M --maximum-join_buffer_size=1M --maximum-max_heap_table_size=1M --maximum-max_join_size=1M --maximum-myisam_max_sort_file_size=1M --maximum-myisam_mmap_size=1M --maximum-myisam_sort_buffer_size=1M --maximum-optimizer_trace_max_mem_size=1M --maximum-preload_buffer_size=1M --maximum-query_alloc_block_size=1M --maximum-query_prealloc_size=1M --maximum-range_alloc_block_size=1M --maximum-read_buffer_size=1M --maximum-read_rnd_buffer_size=1M --maximum-sort_buffer_size=1M --maximum-tmp_table_size=1M --maximum-transaction_alloc_block_size=1M --maximum-transaction_prealloc_size=1M --log-output=none --sql_mode= --old_alter_table=COPY --sql_mode= --deadlock-timeout-short=10 --deadlock-timeout-long=10 --deadlock-search-depth-short=10 --deadlock-search-depth-long=33 --innodb_change_buffering=inserts --innodb_fast_shutdown=1 --innodb_file_per_table=1 --log-bin --binlog_format=ROW --log_bin_trust_function_creators=1 --innodb-buffer-pool-size=300M --loose-debug_assert_on_not_freed_memory=1 --innodb_use_native_aio=1

Comment by Roel Van de Paar [ 2022-06-14 ]

The failure seems to have been on this command:

SET @@global.relay_log_purge = '';#ERROR: 2013 - Lost connection to MySQL server during query

Comment by Marko Mäkelä [ 2022-06-14 ]

The crash occurs due to dereferencing a null pointer:

(gdb) p server_version_source_revision 
$1 = 0x5563bafc3f80 "d7a7d16713070c7c2902c3df4b4f6d024cd0320f"
(gdb) display/i $pc
1: x/i $pc
=> 0x5563bad0be76 <_Z13buf_flush_LRUm+534>:	mov    0x58(%rdi),%rax
(gdb) disassemble
   0x00005563bad0be50 <+496>:	call   0x5563bad0b590 <buf_flush_freed_pages(fil_space_t*, bool)>
   0x00005563bad0be55 <+501>:	mov    %rbx,%rdi
   0x00005563bad0be58 <+504>:	mov    %eax,%r15d
   0x00005563bad0be5b <+507>:	call   0x5563ba5664c0 <pthread_mutex_lock@plt>
   0x00005563bad0be60 <+512>:	test   %r15d,%r15d
   0x00005563bad0be63 <+515>:	je     0x5563bad0be6c <_Z13buf_flush_LRUm+524>
   0x00005563bad0be65 <+517>:	add    %r15,0x2208(%rbx)
   0x00005563bad0be6c <+524>:	mov    0x4430(%rbx),%rdi
   0x00005563bad0be73 <+531>:	mov    %r13d,%r15d
=> 0x00005563bad0be76 <+534>:	mov    0x58(%rdi),%rax

The indicated source code line is the following:

  retry:
    buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage);

The value of bpage is not directly available. rbx contains &buf_pool. The offset 0x4430 refers to buf_pool.lru_hp.m_hp, which is nullptr in the core dump. But this does not match the above source code line. The generated code perfectly matches the following code (the goto was optimized away):

          auto p= buf_flush_space(space_id);
          space= p.first;
          last_space_id= space_id;
          mysql_mutex_lock(&buf_pool.mutex);
          if (p.second)
            buf_pool.stat.n_pages_written+= p.second;
          bpage= buf_pool.lru_hp.get();
          goto retry;

Side note: This branch preview-10.10-MDEV-16329-online-alter is several commits behind the latest 10.10 branch. This code has been refactored a little since then, but that should be unrelated to this crash.

It looks like the code was compiled with PLUGIN_PERFSCHEMA=NO. The code is indeed blindly dereferencing the null pointer that was returned by buf_pool.lru_hp.get(). Before MDEV-27985, the loop was correctly terminated on bpage=nullptr.

Comment by Roel Van de Paar [ 2022-06-14 ]

Confirming CMAKE command used -DPLUGIN_PERFSCHEMA=NO

cmake . -DWITH_SSL=bundled -DBUILD_CONFIG=mysql_release -DWITH_JEMALLOC=no -DWITH_TOKUDB=0 -DFEATURE_SET=community -DDEBUG_EXTNAME=OFF -DWITH_EMBEDDED_SERVER=0 -DENABLE_DOWNLOADS=1 -DDOWNLOAD_BOOST=1 -DWITH_BOOST=/tmp/boost_886596 -DENABLED_LOCAL_INFILE=1 -DENABLE_DTRACE=0 -DPLUGIN_PERFSCHEMA=NO -DWITH_DBUG_TRACE=OFF -DWITH_ZLIB=bundled -DWITH_ROCKSDB=1 -DWITH_PAM=ON -DFORCE_INSOURCE_BUILD=1 -DCMAKE_C_FLAGS='-Og -march=native -mtune=native' -DCMAKE_CXX_FLAGS='-Og -march=native -mtune=native'

Generated at Thu Feb 08 10:03:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.