MDEV-34909

DDL during SET GLOBAL innodb_log_file_size may hang when using PMEM

    Description

      While testing MDEV-33853, mleich produced a core dump of a server hang that ended in an intentional crash with the infamous message that the innodb_fatal_semaphore_wait threshold for dict_sys.latch had been exceeded.

      In the stack traces that I analyzed, there were several threads trying to execute SET GLOBAL innodb_log_file_size, as well as one DDL operation that was waiting for an exclusive log_sys.latch while holding an exclusive dict_sys.latch, and was therefore ultimately blocking all other threads.

      I think that the root cause is flawed logic in log_resize_acquire(). It could also make sense to protect that function with LOCK_global_system_variables so that multiple instances of log_resize_acquire() cannot be executed concurrently. A fix might look as follows; it will need some additional stress testing to be validated.

      diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
      index ce2d3958f9c..27adf670b11 100644
      --- a/storage/innobase/handler/ha_innodb.cc
      +++ b/storage/innobase/handler/ha_innodb.cc
      @@ -18518,9 +18518,7 @@ buffer_pool_load_abort(
       static void innodb_log_file_buffering_update(THD *thd, st_mysql_sys_var*,
                                                    void *, const void *save)
       {
      -  mysql_mutex_unlock(&LOCK_global_system_variables);
         log_sys.set_buffered(*static_cast<const my_bool*>(save));
      -  mysql_mutex_lock(&LOCK_global_system_variables);
       }
       #endif
       
      @@ -18528,7 +18526,6 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
                                               void *var, const void *save)
       {
         ut_ad(var == &srv_log_file_size);
      -  mysql_mutex_unlock(&LOCK_global_system_variables);
       
         if (high_level_read_only)
           ib_senderrf(thd, IB_LOG_LEVEL_ERROR, ER_READ_ONLY_MODE);
      @@ -18551,13 +18548,15 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
             ib_senderrf(thd, IB_LOG_LEVEL_ERROR, ER_CANT_CREATE_HANDLER_FILE);
             break;
           case log_t::RESIZE_STARTED:
      +      mysql_mutex_unlock(&LOCK_global_system_variables);
             const lsn_t start{log_sys.resize_in_progress()};
             for (timespec abstime;;)
             {
               if (thd_kill_level(thd))
               {
      +          mysql_mutex_lock(&LOCK_global_system_variables);
                 log_sys.resize_abort();
      -          break;
      +          return;
               }
       
               set_timespec(abstime, 5);
      @@ -18588,9 +18587,9 @@ static void innodb_log_file_size_update(THD *thd, st_mysql_sys_var*,
               if (!resizing || resizing > start /* only wait for our resize */)
                 break;
             }
      +      mysql_mutex_lock(&LOCK_global_system_variables);
           }
         }
      -  mysql_mutex_lock(&LOCK_global_system_variables);
       }
       
       static void innodb_log_spin_wait_delay_update(THD *, st_mysql_sys_var*,
      diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
      index d7aae556ce0..7c94876996c 100644
      --- a/storage/innobase/log/log0log.cc
      +++ b/storage/innobase/log/log0log.cc
      @@ -384,6 +384,8 @@ void log_t::close_file()
       /** Acquire all latches that protect the log. */
       static void log_resize_acquire()
       {
      +  mysql_mutex_assert_owner(&LOCK_global_system_variables);
      +
         if (!log_sys.is_pmem())
         {
           while (flush_lock.acquire(log_sys.get_lsn() + 1, nullptr) !=
      

      I am also wondering if the logic regarding flush_lock.acquire() and write_lock.acquire(), which are partly visible above, is correct.

      wlad, I wonder if there is any chance that another thread in log_write_up_to() could get "in between" and start to wait for log_sys.latch before log_resize_acquire() completes log_sys.latch.wr_lock()? If yes, is there any way to acquire the write_lock and flush_lock with an even larger LSN (such as LSN_MAX) and rewind to the actual LSN once done? I have a concern that even with the above patch, a hang could be possible.
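
      For illustration only, here is a hypothetical sketch of that idea, reusing the acquire() call shape from the patch above. LSN_MAX stands for an LSN that no mini-transaction can ever reach; the "rewind" step is only described in a comment, because the release side of group_commit_lock is not quoted in this ticket:

      /* Hypothetical sketch, not actual code: reserve both locks at an
         unreachable LSN so that no log_write_up_to() caller can slip in
         between the two while loops and the wr_lock() call. */
      static void log_resize_acquire_sketch()
      {
        if (!log_sys.is_pmem())
        {
          while (flush_lock.acquire(LSN_MAX, nullptr) !=
                 group_commit_lock::ACQUIRED);
          while (write_lock.acquire(LSN_MAX, nullptr) !=
                 group_commit_lock::ACQUIRED);
        }
        log_sys.latch.wr_lock(SRW_LOCK_CALL);
        /* On release, the locks would have to be "rewound" to the real
           log_sys.get_lsn() so that later waiters are woken correctly. */
      }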

      Activity

            marko Marko Mäkelä added a comment:

            A cleaner option would seem to be to introduce a log_sys.checkpoint_latch that would be acquired together with an exclusive log_sys.latch, to cover starting and aborting log resizing as well as checkpoints, which is where log resizing can be completed. That way we would not have to extend the hold time of LOCK_global_system_variables, as the above tentative patch does. (It is not a good idea to hold a mutex while creating a potentially huge ib_logfile101 for the resizing.)

            I will have to check whether that could work for log_t::set_buffered() as well, to change the O_DIRECT-ness of the ib_logfile0. Possibly, set_buffered() could actually work based on an exclusive log_sys.latch only.
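
            A hypothetical sketch of the checkpoint_latch idea above (log_sys.checkpoint_latch does not exist yet; the member, its type and the call placement are assumptions, shown only to illustrate the intended latching order):

            /* Hypothetical: a dedicated latch serializing log resizing with
               checkpoints, so that LOCK_global_system_variables need not be
               held across the whole resize. */
            static void log_resize_acquire()
            {
              /* assumed new member; acquired by checkpoints as well */
              log_sys.checkpoint_latch.wr_lock(SRW_LOCK_CALL);
              if (!log_sys.is_pmem())
              {
                while (flush_lock.acquire(log_sys.get_lsn() + 1, nullptr) !=
                       group_commit_lock::ACQUIRED);
                while (write_lock.acquire(log_sys.get_lsn() + 1, nullptr) !=
                       group_commit_lock::ACQUIRED);
              }
              log_sys.latch.wr_lock(SRW_LOCK_CALL);
            }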


            wlad Vladislav Vaintroub added a comment:

            marko, about flush_lock and write_lock: if lock.acquire() does not return group_commit_lock::ACQUIRED, then the code should not execute the operation. This logic is in log_write_up_to().

            So, if flush lock could not be acquired, flush is not executed in this thread.
            If write lock could not be acquired, write is not executed in this thread.

            If you acquire both, no code for writing and flushing should be executed by anyone else at the same time.
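
            A minimal sketch of that contract, using only the acquire() call shape already visible in this ticket (the real log_write_up_to() is not quoted here, and write_up_to_sketch/target_lsn are placeholders):

            /* Sketch of the rule described above, not the real code. */
            static void write_up_to_sketch(lsn_t target_lsn)
            {
              if (write_lock.acquire(target_lsn, nullptr) ==
                  group_commit_lock::ACQUIRED)
              {
                /* only this thread performs the write up to (at least)
                   target_lsn, and then releases write_lock */
              }
              /* else: a write covering target_lsn is performed by another thread */
              if (flush_lock.acquire(target_lsn, nullptr) ==
                  group_commit_lock::ACQUIRED)
              {
                /* only this thread performs the flush, then releases flush_lock */
              }
            }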


            marko Marko Mäkelä added a comment:

            wlad, thank you for the clarification. The current logic (which the patch in the Description does not change, other than that the caller would always hold LOCK_global_system_variables) is as follows:

            static void log_resize_acquire()
            {
              if (!log_sys.is_pmem())
              {
                while (flush_lock.acquire(log_sys.get_lsn() + 1, nullptr) !=
                       group_commit_lock::ACQUIRED);
                while (write_lock.acquire(log_sys.get_lsn() + 1, nullptr) !=
                       group_commit_lock::ACQUIRED);
              }
             
              log_sys.latch.wr_lock(SRW_LOCK_CALL);
            }
            

            As far as I understand, as soon as each while loop terminates, other threads will pile up waiting until log_resize_release() releases these locks. It is clear that log_sys.lsn may advance during all this. The +1 in the expression is carefully chosen: we want an "impossible" LSN, one that lies between the current LSN and the end LSN of any subsequently completed mini-transaction. Each mini-transaction commit grows the LSN by the length of the mini-transaction log, which is always more than 1 byte: the minimum should be 3 bytes for the shortest possible record, plus 1 sequence byte (0 or 1) and 4 bytes of CRC-32C.
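
            In other words (a compilable restatement of the arithmetic above; the constant names are made up for illustration):

            /* Why lsn + 1 cannot be the end LSN of any completed mini-transaction. */
            constexpr unsigned MIN_RECORD_BYTES= 3; /* shortest possible log record */
            constexpr unsigned SEQ_BYTE= 1;         /* sequence byte (0 or 1) */
            constexpr unsigned CRC_BYTES= 4;        /* CRC-32C */
            constexpr unsigned MIN_MTR_BYTES= MIN_RECORD_BYTES + SEQ_BYTE + CRC_BYTES;
            static_assert(MIN_MTR_BYTES == 8, "every commit advances the LSN by at least 8");
            /* Hence, if the current LSN is L, any mini-transaction completed after
               this point ends at L + 8 or beyond, so L + 1 can never match such an
               end LSN. */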

            I thought that the above patch would help by preventing several threads from executing log_resize_acquire() concurrently and thereby causing an excessive amount of context switching in the while loops. I checked the core dump once more. The problem is not excessive looping: there are 3 threads in log_resize_acquire() that somehow got through both while loops. So it looks like, contrary to my assumption, ACQUIRED can be returned to multiple threads at the same time.

            (gdb) p flush_lock
            $1 = {m_owner_id = {_M_thread = 0}, m_mtx = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, <No data fields>}, m_value = std::atomic<unsigned long> = { 0 }, 
              m_pending_value = std::atomic<unsigned long> = { 0 }, m_lock = false, m_waiters_list = 0x0, m_pending_callbacks = std::__debug::vector of length 0, capacity 0}
            (gdb) p write_lock
            $2 = {m_owner_id = {_M_thread = 0}, m_mtx = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, <No data fields>}, m_value = std::atomic<unsigned long> = { 0 }, 
              m_pending_value = std::atomic<unsigned long> = { 0 }, m_lock = false, m_waiters_list = 0x0, m_pending_callbacks = std::__debug::vector of length 0, capacity 0}
            (gdb) p log_sys.lsn._M_i
            $3 = 31926967
            (gdb) thread apply all backtrace
            …
            Thread 16 (Thread 0x7847103b46c0 (LWP 212495)):
            …
            #4  ssux_lock_impl<false>::wr_lock (this=this@entry=0x5f6074b049c0 <log_sys+64>) at /data/Server/bb-11.6-MDEV-33853/storage/innobase/include/srw_lock.h:268
            #5  0x00005f60732e2124 in log_resize_acquire () at /data/Server/bb-11.6-MDEV-33853/storage/innobase/log/log0log.cc:394
            …
            Thread 11 (Thread 0x7847103ff6c0 (LWP 212485)):
            … (identical stack frames)
            Thread 10 (Thread 0x7847102886c0 (LWP 212518)):
            … (identical stack frames)
            

            Above, the contents of flush_lock and write_lock look suspicious to me. Maybe these objects were somehow not included in the core dump. In any case, it looks like it is simplest to implement a separate log_sys.checkpoint_latch and replace the logic with a combination of that and an exclusive log_sys.latch.


            marko Marko Mäkelä added a comment:

            I realized that log_sys.is_pmem() holds in the core dump. That is, neither flush_lock nor write_lock will be used, and the logic around them could very well be solid. The problem must be solely around log_sys.latch:

            (gdb) p/x log_sys.latch
            $3 = {writer = {lock = std::atomic<unsigned int> = { 0x80000007 }}, readers = std::atomic<unsigned int> = { 0x80000000 }}
            

            There are 6 threads waiting for log_sys.latch. The 3 exclusive latch waiters in log_resize_acquire() were already mentioned. There are also 3 waits for a shared latch: Thread 4 in log_print(), as part of executing SHOW ENGINE INNODB STATUS, and Thread 6 and Thread 14 (log_t::persist(), log_write_up_to()). Thread 14 is the one that is holding an exclusive dict_sys.latch, because it is trying to make a RENAME TABLE operation persistent.

            Why is the log_sys.latch.writer.lock 0x80000007 if there are only 6 threads waiting for the exclusive latch to be released? It would seem that somewhere, a thread that had been holding the exclusive latch forgot to invoke log_sys.latch.wr_unlock(). That would have decremented the lock word by 0x80000001 and woken up any waiters. That would seem to be the root cause of the hang.
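
            For concreteness, the arithmetic of that hypothesis (the 0x80000001 contribution of one held exclusive latch is taken from the comment above, not verified against srw_lock.h):

            constexpr uint32_t observed= 0x80000007;      /* writer.lock in the dump */
            constexpr uint32_t one_wr_unlock= 0x80000001; /* what the missing wr_unlock() would have subtracted */
            static_assert(observed - one_wr_unlock == 6,
                          "the remainder matches the 6 threads found waiting");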


            marko Marko Mäkelä added a comment:

            Let me post one piece of data from the original core dump:

            Thread 14 (Thread 0x78471031e6c0 (LWP 212509)):
            #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
            #1  0x00005f60733eb1fd in srw_mutex_impl<false>::wait (this=this@entry=0x5f6074b049c0 <log_sys+64>, lk=lk@entry=2147483650) at /data/Server/bb-11.6-MDEV-33853/storage/innobase/sync/srw_lock.cc:253
            #2  0x00005f60733eba22 in ssux_lock_impl<false>::rd_wait (this=0x5f6074b049c0 <log_sys+64>) at /data/Server/bb-11.6-MDEV-33853/storage/innobase/sync/srw_lock.cc:424
            #3  0x00005f607327b997 in ssux_lock_impl<false>::rd_lock (this=this@entry=0x5f6074b049c0 <log_sys+64>) at /data/Server/bb-11.6-MDEV-33853/storage/innobase/include/srw_lock.h:261
            #4  0x00005f60732e52fc in log_t::persist (this=this@entry=0x5f6074b04980 <log_sys>, lsn=31926967) at /data/Server/bb-11.6-MDEV-33853/storage/innobase/log/log0log.cc:772
            #5  0x00005f60732e39f5 in log_write_and_flush () at /data/Server/bb-11.6-MDEV-33853/storage/innobase/log/log0log.cc:1062
            #6  0x00005f6073313991 in mtr_t::commit_file (this=this@entry=0x784710319f60, space=..., name=name@entry=0x7846680af3e0 "./cool_down/t1.ibd") at /data/Server/bb-11.6-MDEV-33853/storage/innobase/mtr/mtr0mtr.cc:715
            #7  0x00005f60734bc2ed in fil_space_t::rename (this=0x78469802a870, path=path@entry=0x7846680af3e0 "./cool_down/t1.ibd", log=log@entry=true, replace=replace@entry=false) at /data/Server/bb-11.6-MDEV-33853/storage/innobase/fil/fil0fil.cc:2007
            #8  0x00005f607348c47f in dict_table_t::rename_tablespace (this=this@entry=0x784678065e90, new_name=..., replace=replace@entry=false) at /data/Server/bb-11.6-MDEV-33853/storage/innobase/dict/dict0dict.cc:1473
            

            The function log_t::persist() is specific to the memory-mapped log file (PMEM or /dev/shm). Its assumption that the current thread is not holding log_sys.latch is wrong here. We would in fact be holding exclusive log_sys.latch during DDL.

            The simplest fix is to add a function parameter that specifies whether a latch is being held, and to add some assertions that can be tested with #define LOG_LATCH_DEBUG.
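
            A rough sketch of the shape of that fix (the real log_t::persist() body is not quoted here; the parameter name and the debug assertion are assumptions):

            /* Hypothetical signature change: let the caller say whether it
               already holds log_sys.latch (as mtr_t::commit_file() would). */
            void log_t::persist(lsn_t lsn, bool holding_latch) noexcept
            {
              if (!holding_latch)
                latch.rd_lock(SRW_LOCK_CALL);
            #ifdef LOG_LATCH_DEBUG
              /* else: assert that this thread really owns log_sys.latch */
            #endif
              /* ... persist the memory-mapped log up to lsn, as before ... */
              if (!holding_latch)
                latch.rd_unlock();
            }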

