MariaDB Server / MDEV-38800

dict_sys.latch contention causes fatal semaphore wait timeout under high-concurrency multi-tenant workloads

Details

    • Bug
    • Status: Needs Feedback
    • Critical
    • Resolution: Unresolved
    • 10.6.24, 10.11.14, 10.11.15, 11.8.5
    • None
    • None

    Description

      Under a heavy connection load, the server crashes because waits on dict_sys.latch exceed innodb_fatal_semaphore_wait_threshold. The core dump was captured on MariaDB 11.8.5. The same workload was stable on 10.6.18 but crashes on every later version tested (10.6.24, 10.11.14, 10.11.15, 11.8.5), which suggests a regression introduced after 10.6.18.

      Error Log (MariaDB 11.8.5)

      [Warning] InnoDB: A long wait (xxx seconds) was observed for dict_sys.latch
      [Warning] InnoDB: A long wait (xxx seconds) was observed for dict_sys.latch
      ...
      (repeated many times)
      ...
      [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. 
      Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
      [ERROR] mysqld got signal 6 ;
      Sorry, we probably made a mistake, and this is a bug.
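
The log above shows an escalation: repeated long-wait warnings, then a fatal abort once innodb_fatal_semaphore_wait_threshold is exceeded. The following is a minimal sketch of such a two-level policy, not MariaDB's actual implementation. The names WaitAction and classify_wait, and the separate warn_after parameter, are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical outcome of inspecting how long a thread has waited on a latch.
enum class WaitAction { None, Warn, Fatal };

// Hypothetical policy function: both thresholds are caller-supplied.
// This models only the escalation visible in the error log; the real
// server logic may choose its warning points differently.
inline WaitAction classify_wait(std::chrono::seconds waited,
                                std::chrono::seconds warn_after,
                                std::chrono::seconds fatal_after) {
  if (waited >= fatal_after)
    return WaitAction::Fatal;  // corresponds to the [FATAL] line and abort
  if (waited >= warn_after)
    return WaitAction::Warn;   // corresponds to the "long wait" warnings
  return WaitAction::None;
}
```

With illustrative values of 150 seconds for warn_after and 600 seconds for fatal_after, a 200-second wait would be classified as Warn and a 600-second wait as Fatal.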
      

      Core Dump Analysis

      The core dump from MariaDB 11.8.5 contains 8445 threads, of which 190 were trying to open the same table via ha_innobase::open_dict_table(). The backtraces show many threads waiting to acquire dict_sys.latch, and some waiting to lock the table.
      Two code paths were observed:

      Read lock (dict0dict.cc:1027 in 11.8.5), 156 threads waiting (1 of them raised the signal):
      These threads call dict_sys.freeze(), which acquires a shared (read) lock on the latch:

      dict/dict0dict.cc#1027:
          dict_sys.freeze(SRW_LOCK_CALL);
       
      dict_sys_t::freeze:
          ATTRIBUTE_NOINLINE void dict_sys_t::freeze(const char *file, unsigned line) noexcept
          {
            latch.rd_lock(file, line);
          }
      

      Representative backtrace (MariaDB 11.8.5):

      #0  syscall () from /lib64/libc.so.6
      #1  srw_mutex_impl<true>::wait (this=<dict_sys>)
              at storage/innobase/sync/srw_lock.cc:252
      #2  ssux_lock_impl<true>::rd_lock_nospin (this=<dict_sys>)
              at storage/innobase/sync/srw_lock.cc:410
      #3  ssux_lock_impl<false>::rd_lock (this=<optimized out>)
              at storage/innobase/include/srw_lock.h:362
      #4  srw_lock_impl<false>::psi_rd_lock (this=<dict_sys>)
              at storage/innobase/sync/srw_lock.cc:489
      #5  dict_table_open_on_name (table_name="<schema>/<table>",
              dict_locked=<optimized out>, ignore_err=DICT_ERR_IGNORE_FK_NOKEY)
              at storage/innobase/dict/dict0dict.cc:1027
      #6  ha_innobase::open_dict_table (ignore_err=DICT_ERR_IGNORE_FK_NOKEY,
              is_partition=<optimized out>, norm_name="<schema>/<table>")
              at storage/innobase/handler/ha_innodb.cc:6109   
      

      Write lock (dict0dict.cc:1057 in 11.8.5), 33 threads:
      These threads call dict_sys.lock(), which acquires an exclusive (write) lock on the latch:

      #0  syscall () from /lib64/libc.so.6
      #1  srw_mutex_impl<false>::wait (this=<dict_sys>)
              at storage/innobase/sync/srw_lock.cc:252
      #2  srw_mutex_impl<false>::wait_and_lock (this=<dict_sys>)
              at storage/innobase/sync/srw_lock.cc:313
      #3  srw_mutex_impl<false>::wr_lock (this=<dict_sys>)
              at storage/innobase/include/srw_lock.h:162
      #4  ssux_lock_impl<false>::wr_lock (this=<dict_sys>)
              at storage/innobase/include/srw_lock.h:284
      #5  srw_lock_impl<false>::psi_wr_lock (this=<dict_sys>)
              at storage/innobase/sync/srw_lock.cc:519
      #6  dict_table_open_on_name (table_name=<optimized out>,
              dict_locked=<optimized out>, ignore_err=DICT_ERR_IGNORE_FK_NOKEY)
              at storage/innobase/dict/dict0dict.cc:1057
      #7  ha_innobase::open_dict_table (ignore_err=DICT_ERR_IGNORE_FK_NOKEY,
              is_partition=<optimized out>, norm_name="<schema>/<table>")
              at storage/innobase/handler/ha_innodb.cc:6109  
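
Both backtraces pass through dict_table_open_on_name(), once on a shared-lock path and once on an exclusive-lock path. One common shape for such a function is "look up under a shared lock, then load under an exclusive lock on a miss". The sketch below models that shape with std::shared_mutex. It is a simplified illustration, and whether dict_table_open_on_name() follows exactly this structure in 11.8.5 is an assumption. TableCache and its members are hypothetical names, not MariaDB identifiers.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a dictionary cache; not MariaDB's dict_sys_t.
class TableCache {
public:
  // Returns the cached id for `name`, loading the entry on a cache miss.
  int open(const std::string &name) {
    {
      // Fast path: shared lock, analogous to dict_sys.freeze().
      std::shared_lock<std::shared_mutex> rd(latch_);
      auto it = tables_.find(name);
      if (it != tables_.end())
        return it->second;
    }  // shared lock is released here

    // Slow path: exclusive lock, analogous to dict_sys.lock().
    std::unique_lock<std::shared_mutex> wr(latch_);
    // Re-check: another thread may have loaded the entry in the meantime.
    auto it = tables_.find(name);
    if (it != tables_.end())
      return it->second;
    ++loads_;  // stands in for the expensive load from persistent storage
    int id = next_id_++;
    tables_.emplace(name, id);
    return id;
  }

  // Number of times the slow path actually loaded an entry.
  int loads() const {
    std::shared_lock<std::shared_mutex> rd(latch_);
    return loads_;
  }

private:
  mutable std::shared_mutex latch_;
  std::unordered_map<std::string, int> tables_;
  int next_id_ = 1;
  int loads_ = 0;
};
```

In this model, every thread that misses the cache queues for the exclusive lock, and each exclusive holder blocks all shared-lock requests while it works. If many sessions open tables that are not cached, readers and writers pile up on the single latch, which would be consistent with the mix of 156 shared and 33 exclusive waiters reported here.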
      

            People

              marko Marko Mäkelä
              billjin Long Jin