MariaDB Server
MDEV-35154

dict_sys_t::load_table() is holding exclusive dict_sys.latch for unnecessarily long time

Details

    Description

      Server crashes when using more than 8K partitions in a table. Startup options used to reproduce the issue: --sql_mode= --deadlock-timeout-short=10 --deadlock-timeout-long=10 --deadlock-search-depth-short=10 --deadlock-search-depth-long=33 --innodb-fatal-semaphore-wait-threshold=2 --innodb-read-io-threads=1

      Could not reproduce the issue in `rr`. The issue seems to be system specific and was reproduced only on a busy server.

      Attached: test case in.sql and full backtrace full_back_trace.log.

      CS 11.6.2 ba7088d462c326c9df7de97a46fe69419cd7e116 (Debug)

      Core was generated by `/test/MD031024-mariadb-11.6.2-linux-x86_64-dbg/bin/mariadbd --no-defaults --max'.
      Program terminated with signal SIGABRT, Aborted.
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      [Current thread is 1 (Thread 0x14bf685f2700 (LWP 1603475))]
      (gdb) bt
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #1  0x000014bf92473859 in __GI_abort () at abort.c:79
      #2  0x000055a9eebdf701 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/11.6_dbg/storage/innobase/ut/ut0ut.cc:488
      #3  0x000055a9eec95a58 in dict_sys_t::lock_wait (this=this@entry=0x55a9efa07d40 <dict_sys>, file=file@entry=0x55a9ef2c8d40 "/test/11.6_dbg/storage/innobase/trx/trx0purge.cc", line=line@entry=1135) at /test/11.6_dbg/storage/innobase/include/ut0ut.h:323
      #4  0x000055a9eeb9cabb in dict_sys_t::lock (line=1135, file=0x55a9ef2c8d40 "/test/11.6_dbg/storage/innobase/trx/trx0purge.cc", this=<optimized out>) at /test/11.6_dbg/storage/innobase/include/dict0dict.h:1494
      #5  trx_purge_table_open (table_id=18, mdl_context=mdl_context@entry=0x55a9f1f49ee0, mdl=mdl@entry=0x14bf685f1aa8) at /test/11.6_dbg/storage/innobase/trx/trx0purge.cc:1135
      #6  0x000055a9eeba2ea2 in trx_purge_attach_undo_recs (n_work_items=<synthetic pointer>, thd=<optimized out>) at /test/11.6_dbg/storage/innobase/trx/trx0purge.cc:1250
      #7  trx_purge (n_tasks=n_tasks@entry=4, history_size=<optimized out>) at /test/11.6_dbg/storage/innobase/trx/trx0purge.cc:1368
      #8  0x000055a9eeb8c1b1 in purge_coordinator_state::do_purge (this=0x55a9f036cc20 <purge_state>) at /test/11.6_dbg/storage/innobase/srv/srv0srv.cc:1426
      #9  purge_coordinator_callback () at /test/11.6_dbg/storage/innobase/srv/srv0srv.cc:1510
      #10 0x000055a9eed65e6d in tpool::task_group::execute (this=0x55a9f036ca80 <purge_coordinator_task_group>, t=t@entry=0x55a9f036c9e0 <purge_coordinator_task>) at /test/11.6_dbg/tpool/task_group.cc:73
      #11 0x000055a9eed65ef5 in tpool::task::execute (this=0x55a9f036c9e0 <purge_coordinator_task>) at /test/11.6_dbg/tpool/task.cc:32
      #12 0x000055a9eed6433d in tpool::thread_pool_generic::worker_main (this=0x55a9f1772930, thread_var=0x55a9f1ad2e20) at /test/11.6_dbg/tpool/tpool_generic.cc:583
      #13 0x000055a9eed64f78 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
      #14 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
      #15 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
      #16 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
      #17 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
      #18 0x000014bf9286ade4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
      #19 0x000014bf92984609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #20 0x000014bf92570133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      Attachments

      Issue Links

      Activity

            As far as I can tell, this is a performance bug, which was recently pointed out by maxk. During the CREATE OR REPLACE TABLE of a partitioned table, each partition is treated as a separate InnoDB table, in its own separate transaction. This should be roughly equivalent to executing the operation on a non-partitioned table as many times as there are partitions.

            In full_back_trace.log, the following thread is holding an exclusive dict_sys.latch:

            Thread 6 (Thread 0x14bf900f6700 (LWP 1595731)):
            #0  unlink_chunk (p=p@entry=0x14bf47480790, av=0x14bf50000020) at malloc.c:1462
            #1  0x000014bf924e6a7c in malloc_consolidate (av=av@entry=0x14bf50000020) at malloc.c:4494
            #2  0x000014bf924e8c83 in _int_malloc (av=av@entry=0x14bf50000020, bytes=bytes@entry=1816) at malloc.c:3699
            #3  0x000014bf924eb299 in __GI___libc_malloc (bytes=bytes@entry=1816) at malloc.c:3066
            #4  0x000055a9eea53245 in ut_allocator<unsigned char, true>::allocate (throw_on_error=false, set_to_zero=false, autoevent_idx=35, n_elements=<optimized out>, this=<optimized out>) at /test/11.6_dbg/storage/innobase/include/ut0new.h:336
            #5  mem_heap_create_block_func (heap=heap@entry=0x14bf47c393c8, n=<optimized out>, file_name=file_name@entry=0x14bf47c393c8 "0mem.cc", line=145, type=0) at /test/11.6_dbg/storage/innobase/mem/mem0mem.cc:277
            #6  0x000055a9eea53959 in mem_heap_add_block (heap=heap@entry=0x14bf47c393c8, n=n@entry=128) at /test/11.6_dbg/storage/innobase/mem/mem0mem.cc:378
            #7  0x000055a9eecb62a4 in mem_heap_alloc (n=128, heap=0x14bf47c393c8) at /test/11.6_dbg/storage/innobase/include/mem0mem.inl:193
            #8  dict_table_t::create (name=..., space=space@entry=0x0, n_cols=n_cols@entry=1, n_v_cols=n_v_cols@entry=0, flags=flags@entry=33, flags2=flags2@entry=80) at /test/11.6_dbg/storage/innobase/dict/dict0mem.cc:173
            #9  0x000055a9eeca7cb2 in dict_load_table_low (mtr=mtr@entry=0x14bf900ef7d0, uncommitted=uncommitted@entry=false, rec=rec@entry=0x14bf74ed82bc "test/t#P#p0", table=table@entry=0x14bf900ef3c8) at /test/11.6_dbg/include/span.h:66
            #10 0x000055a9eecac5a3 in dict_load_table_one (name=..., ignore_err=ignore_err@entry=DICT_ERR_IGNORE_DROP, fk_tables=std::deque with 0 elements) at /test/11.6_dbg/storage/innobase/dict/dict0load.cc:2382
            #11 0x000055a9eecaf05f in dict_sys_t::load_table (this=0x55a9efa07d40 <dict_sys>, name=..., ignore=ignore@entry=DICT_ERR_IGNORE_DROP) at /test/11.6_dbg/storage/innobase/dict/dict0load.cc:2563
            #12 0x000055a9ee9a7895 in ha_innobase::delete_table (this=<optimized out>, name=0x14bf900f0550 "./test/t#P#p0") at /test/11.6_dbg/storage/innobase/handler/ha_innodb.cc:13426
            …
            #24 0x000055a9ee2bab2b in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x14bf500008e8, packet=packet@entry=0x14bf50026249 "CREATE OR REPLACE TABLE t (i INT)", packet_length=packet_length@entry=33, blocking=blocking@entry=true) at /test/11.6_dbg/sql/sql_class.h:1640
            

            11.6 ba7088d462c326c9df7de97a46fe69419cd7e116 ha_innodb.cc

            13425    dict_sys.lock(SRW_LOCK_CALL);
            13426    table= dict_sys.load_table(n, DICT_ERR_IGNORE_DROP);

            Because MDEV-17805 has not been fixed, ha_innobase::delete_table() is forced to look up the table by name, instead of operating on an already opened table handle. In this case, the first partition of the table does not exist in the dict_sys cache and therefore needs to be loaded.

            Loading a table definition into the cache must involve acquiring an exclusive dict_sys.latch at some point. Currently, the exclusive latch is acquired before any I/O that may be needed to bring multiple index pages of the data dictionary tables into the buffer pool so that the table definition can be loaded. In the stack trace above, the thread happens to be allocating some memory, which should be fast, but that allocation would have been preceded by some buffer pool accesses.

            We can reduce the exclusive latch hold time by first inserting a table ‘stub’ into the dict_sys cache and marking it as incomplete, which in turn would force any thread that wishes to access the table to wait until the definition becomes fully available. Special care would be needed in scenarios where the table definition turns out to be corrupted. In this way, only those threads that need to access this particular table or partition definition would have to wait for the I/O.
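
            A minimal standalone sketch of this idea, using plain std:: synchronization primitives instead of the real dict_sys, dict_table_t and srw_lock code; every name in it (DictCache, CachedTable, read_definition_from_disk) is hypothetical and only illustrates the intended locking pattern:

            #include <condition_variable>
            #include <mutex>
            #include <string>
            #include <unordered_map>

            struct CachedTable
            {
              bool complete= false;   // definition fully loaded?
              bool corrupted= false;  // loading failed; waiters must not use this entry
            };

            struct DictCache
            {
              std::mutex latch;                  // stands in for the exclusive dict_sys.latch
              std::condition_variable loaded;
              std::unordered_map<std::string, CachedTable> tables;

              // Look up a table, loading it on demand. The latch is dropped while the
              // (potentially slow) dictionary reads run, so unrelated lookups can proceed.
              CachedTable *load_table(const std::string &name)
              {
                std::unique_lock<std::mutex> lk(latch);
                auto it= tables.find(name);
                if (it != tables.end())
                {
                  // Either fully cached already, or another thread is loading it: wait.
                  loaded.wait(lk, [&] { return it->second.complete || it->second.corrupted; });
                  return it->second.corrupted ? nullptr : &it->second;
                }
                // Insert an incomplete "stub" so that concurrent lookups of this name
                // block on it, then release the latch before doing any I/O.
                CachedTable &stub= tables[name];
                lk.unlock();

                bool ok= read_definition_from_disk(name);   // no cache latch held here

                lk.lock();
                if (ok)
                  stub.complete= true;
                else
                  stub.corrupted= true;                     // "special care" for corruption
                loaded.notify_all();
                return ok ? &stub : nullptr;
              }

              // Placeholder for the SYS_TABLES/SYS_COLUMNS/... reads that may do I/O.
              static bool read_definition_from_disk(const std::string &) { return true; }
            };

            The cache latch is held only around the lookup, the stub insertion and the final state change, never across the dictionary I/O; concurrent lookups of the same name block on the stub, while lookups of other tables proceed unhindered.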

            InnoDB has always used something similar for buffer pool pages. We first allocate and read-fix a buffer page descriptor and acquire an exclusive latch on it, then release the buf_pool.mutex. While the page is being read or created, other threads may concurrently access any other page in the buffer pool, because we are not hogging the buf_pool.mutex nor a latch on the buf_pool.page_hash shard that maps a page identifier to the block descriptor.
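
            The buffer pool analogy, reduced to the same kind of standalone sketch (a single map in place of the buf_pool.page_hash shards; PageBlock, PageHash and read_from_disk are made-up names):

            #include <cstdint>
            #include <memory>
            #include <mutex>
            #include <shared_mutex>
            #include <unordered_map>
            #include <vector>

            struct PageBlock
            {
              std::shared_mutex page_latch;      // stands in for the block's rw-latch
              std::vector<unsigned char> frame;  // page contents, filled in by the read
            };

            struct PageHash
            {
              std::mutex shard_latch;            // stands in for one page_hash shard latch
              std::unordered_map<std::uint64_t, std::shared_ptr<PageBlock>> map;

              std::shared_ptr<PageBlock> read_page(std::uint64_t page_id)
              {
                std::unique_lock<std::mutex> hash_lk(shard_latch);
                if (auto it= map.find(page_id); it != map.end())
                  return it->second;             // caller takes page_latch before using frame

                auto block= std::make_shared<PageBlock>();
                block->page_latch.lock();        // "read-fix": exclusive latch before the read
                map.emplace(page_id, block);
                hash_lk.unlock();                // other pages stay accessible during the I/O

                block->frame= read_from_disk(page_id);  // the slow part, no shard latch held
                block->page_latch.unlock();      // waiters on this page may now proceed
                return block;
              }

              static std::vector<unsigned char> read_from_disk(std::uint64_t)
              { return std::vector<unsigned char>(16384, 0); }
            };

            The shard latch protects only the hash lookup and insertion; the block's own latch is what serializes access to the page contents while the read is in flight.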

            marko Marko Mäkelä added a comment

            In MDEV-35424 there is an example where dict_load_foreigns() is blocked on a pread() deep inside btr_pcur_open_on_user_rec(). We really must not load anything into the buffer pool while holding an exclusive dict_sys.latch. We should also avoid doing that when holding a shared dict_sys.latch.
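
            One way such regressions could be caught in debug builds is to track latch ownership in a thread-local flag and assert on it where a synchronous page read is submitted. This is a hypothetical sketch, not existing code: the dbg namespace, dict_latch_guard and assert_no_dict_latch_on_read are made up.

            #include <cassert>

            namespace dbg
            {
              thread_local bool dict_latch_held= false;

              // RAII wrapper that would surround the real dict_sys.lock()/unlock() calls.
              struct dict_latch_guard
              {
                dict_latch_guard()  { /* dict_sys.lock(SRW_LOCK_CALL); */ dict_latch_held= true; }
                ~dict_latch_guard() { dict_latch_held= false; /* dict_sys.unlock(); */ }
              };

              // To be called from the code path that submits a synchronous buffer pool read.
              inline void assert_no_dict_latch_on_read()
              {
                assert(!dict_latch_held && "page read while holding dict_sys.latch");
              }
            }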

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              ramesh Ramesh Sivaraman
              Votes: 1
              Watchers: 4

