Details

    Description

      This following bug was observed:

      preview-10.10-MDEV-16329-online-alter 10.10.0 d7a7d16713070c7c2902c3df4b4f6d024cd0320f (Optimized)

      Core was generated by `/test/online-alter_MD080622-mariadb-10.10.0-linux-x86_64-opt/bin/mysqld --no-de'.
      Program terminated with signal SIGSEGV, Segmentation fault.
      #0  buf_flush_LRU_list_batch (n=<synthetic pointer>, n=<synthetic pointer>, 
          max=1536)
          at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1280
      1280        buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage);
      [Current thread is 1 (Thread 0x14ac32811700 (LWP 3928898))]
      (gdb) bt
      #0  buf_flush_LRU_list_batch (n=<synthetic pointer>, n=<synthetic pointer>, max=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1280
      #1  buf_do_LRU_batch (max=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1374
      #2  buf_flush_LRU (max_n=1536) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0flu.cc:1715
      #3  0x00005563ba5a8f23 in buf_pool_t::withdraw_blocks (this=this@entry=0x5563bb708440 <buf_pool>) at /usr/include/c++/9/bits/stl_algobase.h:222
      #4  0x00005563bad0512d in buf_pool_t::resize (this=this@entry=0x5563bb708440 <buf_pool>) at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/include/buf0buf.h:1546
      #5  0x00005563bad0321c in buf_resize_callback () at /test/preview-10.10-MDEV-16329-online-alter_opt/storage/innobase/buf/buf0buf.cc:1972
      #6  0x00005563bad9d019 in tpool::task_group::execute (this=0x5563bc006c20 <single_threaded_group>, t=0x5563bc006b80 <buf_resize_task>) at /test/preview-10.10-MDEV-16329-online-alter_opt/tpool/task_group.cc:55
      #7  0x00005563bad9bda7 in tpool::thread_pool_generic::worker_main (this=0x5563bcb26740, thread_var=0x5563bcb36260) at /test/preview-10.10-MDEV-16329-online-alter_opt/tpool/tpool_generic.cc:588
      #8  0x000014ac55999de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
      #9  0x000014ac55ab3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #10 0x000014ac5569f133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      However, all attempts at reproduction have so far failed. This ticket in the hope that there is sufficient information to find the issue.
      Added all threads bt (gdb_threads.txt) and all threads full bt (gdb_full.txt).
      Also added full original SQL but it is 18k lines (and does not reproduce as mentioned)
      And linking with MDEV-26662 given the stack.

      Attachments

        1. gdb_full.txt
          348 kB
          Roel Van de Paar
        2. gdb_threads.txt
          250 kB
          Roel Van de Paar
        3. MDEV-28828.sql
          3.22 MB
          Roel Van de Paar

        Issue Links

          Activity

            Roel Roel Van de Paar added a comment - - edited

            A C-based client (pquery) was in use when the issue happened, if it matters.
            Options used for the trial were:

            --no-defaults --max_allowed_packet=33554432 --maximum-bulk_insert_buffer_size=1M --maximum-join_buffer_size=1M --maximum-max_heap_table_size=1M --maximum-max_join_size=1M --maximum-myisam_max_sort_file_size=1M --maximum-myisam_mmap_size=1M --maximum-myisam_sort_buffer_size=1M --maximum-optimizer_trace_max_mem_size=1M --maximum-preload_buffer_size=1M --maximum-query_alloc_block_size=1M --maximum-query_prealloc_size=1M --maximum-range_alloc_block_size=1M --maximum-read_buffer_size=1M --maximum-read_rnd_buffer_size=1M --maximum-sort_buffer_size=1M --maximum-tmp_table_size=1M --maximum-transaction_alloc_block_size=1M --maximum-transaction_prealloc_size=1M --log-output=none --sql_mode= --old_alter_table=COPY --sql_mode= --deadlock-timeout-short=10 --deadlock-timeout-long=10 --deadlock-search-depth-short=10 --deadlock-search-depth-long=33 --innodb_change_buffering=inserts --innodb_fast_shutdown=1 --innodb_file_per_table=1 --log-bin --binlog_format=ROW --log_bin_trust_function_creators=1 --innodb-buffer-pool-size=300M --loose-debug_assert_on_not_freed_memory=1 --innodb_use_native_aio=1
            

            Roel Roel Van de Paar added a comment - - edited A C-based client (pquery) was in use when the issue happened, if it matters. Options used for the trial were: --no-defaults --max_allowed_packet=33554432 --maximum-bulk_insert_buffer_size=1M --maximum-join_buffer_size=1M --maximum-max_heap_table_size=1M --maximum-max_join_size=1M --maximum-myisam_max_sort_file_size=1M --maximum-myisam_mmap_size=1M --maximum-myisam_sort_buffer_size=1M --maximum-optimizer_trace_max_mem_size=1M --maximum-preload_buffer_size=1M --maximum-query_alloc_block_size=1M --maximum-query_prealloc_size=1M --maximum-range_alloc_block_size=1M --maximum-read_buffer_size=1M --maximum-read_rnd_buffer_size=1M --maximum-sort_buffer_size=1M --maximum-tmp_table_size=1M --maximum-transaction_alloc_block_size=1M --maximum-transaction_prealloc_size=1M --log-output=none --sql_mode= --old_alter_table=COPY --sql_mode= --deadlock-timeout-short=10 --deadlock-timeout-long=10 --deadlock-search-depth-short=10 --deadlock-search-depth-long=33 --innodb_change_buffering=inserts --innodb_fast_shutdown=1 --innodb_file_per_table=1 --log-bin --binlog_format=ROW --log_bin_trust_function_creators=1 --innodb-buffer-pool-size=300M --loose-debug_assert_on_not_freed_memory=1 --innodb_use_native_aio=1

            The failure seems to have been on this command:

            SET @@global.relay_log_purge = '';#ERROR: 2013 - Lost connection to MySQL server during query
            

            Roel Roel Van de Paar added a comment - The failure seems to have been on this command: SET @@ global .relay_log_purge = '' ;#ERROR: 2013 - Lost connection to MySQL server during query

            The crash occurs due to dereferencing a null pointer:

            (gdb) p server_version_source_revision 
            $1 = 0x5563bafc3f80 "d7a7d16713070c7c2902c3df4b4f6d024cd0320f"
            (gdb) display/i $pc
            1: x/i $pc
            => 0x5563bad0be76 <_Z13buf_flush_LRUm+534>:	mov    0x58(%rdi),%rax
            (gdb) disassemble
            …
               0x00005563bad0be50 <+496>:	call   0x5563bad0b590 <buf_flush_freed_pages(fil_space_t*, bool)>
               0x00005563bad0be55 <+501>:	mov    %rbx,%rdi
               0x00005563bad0be58 <+504>:	mov    %eax,%r15d
               0x00005563bad0be5b <+507>:	call   0x5563ba5664c0 <pthread_mutex_lock@plt>
               0x00005563bad0be60 <+512>:	test   %r15d,%r15d
               0x00005563bad0be63 <+515>:	je     0x5563bad0be6c <_Z13buf_flush_LRUm+524>
               0x00005563bad0be65 <+517>:	add    %r15,0x2208(%rbx)
               0x00005563bad0be6c <+524>:	mov    0x4430(%rbx),%rdi
               0x00005563bad0be73 <+531>:	mov    %r13d,%r15d
            => 0x00005563bad0be76 <+534>:	mov    0x58(%rdi),%rax
            …
            

            The indicated source code line is the following:

              retry:
                buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage);
            

            The value of bpage is not directly available. rbx contains &buf_pool. The offset 0x4430 refers to buf_pool.lru_hp.m_hp, which is nullptr in the core dump. But this does not match the above source code line. The generated code perfectly matches the following code (the goto was optimized away):

                      auto p= buf_flush_space(space_id);
                      space= p.first;
                      last_space_id= space_id;
                      mysql_mutex_lock(&buf_pool.mutex);
                      if (p.second)
                        buf_pool.stat.n_pages_written+= p.second;
                      bpage= buf_pool.lru_hp.get();
                      goto retry;
            

            Side note: This branch preview-10.10-MDEV-16329-online-alter is several commits behind the latest 10.10 branch. This code has been refactored a little since then, but that should be unrelated to this crash.

            It looks like the code was compiled with PLUGIN_PERFSCHEMA=NO. The code is indeed blindly dereferencing the null pointer that was returned by buf_pool.lru_hp.get(). Before MDEV-27985, the loop was correctly terminated on bpage=nullptr.

            marko Marko Mäkelä added a comment - The crash occurs due to dereferencing a null pointer: (gdb) p server_version_source_revision $1 = 0x5563bafc3f80 "d7a7d16713070c7c2902c3df4b4f6d024cd0320f" (gdb) display/i $pc 1: x/i $pc => 0x5563bad0be76 <_Z13buf_flush_LRUm+534>: mov 0x58(%rdi),%rax (gdb) disassemble … 0x00005563bad0be50 <+496>: call 0x5563bad0b590 <buf_flush_freed_pages(fil_space_t*, bool)> 0x00005563bad0be55 <+501>: mov %rbx,%rdi 0x00005563bad0be58 <+504>: mov %eax,%r15d 0x00005563bad0be5b <+507>: call 0x5563ba5664c0 <pthread_mutex_lock@plt> 0x00005563bad0be60 <+512>: test %r15d,%r15d 0x00005563bad0be63 <+515>: je 0x5563bad0be6c <_Z13buf_flush_LRUm+524> 0x00005563bad0be65 <+517>: add %r15,0x2208(%rbx) 0x00005563bad0be6c <+524>: mov 0x4430(%rbx),%rdi 0x00005563bad0be73 <+531>: mov %r13d,%r15d => 0x00005563bad0be76 <+534>: mov 0x58(%rdi),%rax … The indicated source code line is the following: retry: buf_page_t *prev= UT_LIST_GET_PREV(LRU, bpage); The value of bpage is not directly available. rbx contains &buf_pool . The offset 0x4430 refers to buf_pool.lru_hp.m_hp , which is nullptr in the core dump. But this does not match the above source code line. The generated code perfectly matches the following code (the goto was optimized away): auto p= buf_flush_space(space_id); space= p.first; last_space_id= space_id; mysql_mutex_lock(&buf_pool.mutex); if (p.second) buf_pool.stat.n_pages_written+= p.second; bpage= buf_pool.lru_hp.get(); goto retry; Side note: This branch preview-10.10- MDEV-16329 -online-alter is several commits behind the latest 10.10 branch. This code has been refactored a little since then, but that should be unrelated to this crash. It looks like the code was compiled with PLUGIN_PERFSCHEMA=NO . The code is indeed blindly dereferencing the null pointer that was returned by buf_pool.lru_hp.get() . Before MDEV-27985 , the loop was correctly terminated on bpage=nullptr .

            Confirming CMAKE command used -DPLUGIN_PERFSCHEMA=NO

            cmake . -DWITH_SSL=bundled -DBUILD_CONFIG=mysql_release -DWITH_JEMALLOC=no -DWITH_TOKUDB=0 -DFEATURE_SET=community -DDEBUG_EXTNAME=OFF -DWITH_EMBEDDED_SERVER=0 -DENABLE_DOWNLOADS=1 -DDOWNLOAD_BOOST=1 -DWITH_BOOST=/tmp/boost_886596 -DENABLED_LOCAL_INFILE=1 -DENABLE_DTRACE=0 -DPLUGIN_PERFSCHEMA=NO -DWITH_DBUG_TRACE=OFF -DWITH_ZLIB=bundled -DWITH_ROCKSDB=1 -DWITH_PAM=ON -DFORCE_INSOURCE_BUILD=1 -DCMAKE_C_FLAGS='-Og -march=native -mtune=native' -DCMAKE_CXX_FLAGS='-Og -march=native -mtune=native'
            

            Roel Roel Van de Paar added a comment - Confirming CMAKE command used -DPLUGIN_PERFSCHEMA=NO cmake . -DWITH_SSL=bundled -DBUILD_CONFIG=mysql_release -DWITH_JEMALLOC=no -DWITH_TOKUDB=0 -DFEATURE_SET=community -DDEBUG_EXTNAME=OFF -DWITH_EMBEDDED_SERVER=0 -DENABLE_DOWNLOADS=1 -DDOWNLOAD_BOOST=1 -DWITH_BOOST=/tmp/boost_886596 -DENABLED_LOCAL_INFILE=1 -DENABLE_DTRACE=0 -DPLUGIN_PERFSCHEMA=NO -DWITH_DBUG_TRACE=OFF -DWITH_ZLIB=bundled -DWITH_ROCKSDB=1 -DWITH_PAM=ON -DFORCE_INSOURCE_BUILD=1 -DCMAKE_C_FLAGS='-Og -march=native -mtune=native' -DCMAKE_CXX_FLAGS='-Og -march=native -mtune=native'

            People

              marko Marko Mäkelä
              Roel Roel Van de Paar
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.