MariaDB Server: MDEV-22340

Server hangs on ibuf_merge_or_delete_for_page during shutdown with innodb_fast_shutdown=0

Details

    Description

      Initially I thought I was hitting MDEV-20934, but I was asked to file a new bug.

      I seem to have a case that matches this behavior on one of the versions that should, in theory, contain the fix.

      I've attached the stack.

      Here is some extra information:

      Threads 4 and 5 both seem to be in ibuf_merge_or_delete_for_page.

      Thread 4 seems to be the thread that dumps the buffer pool.
      Thread 5 seems to be in the state you describe on this JIRA.

      Both seem to be working on the clustered index of the change buffer.

      (gdb) p cursor->index->table->space->name
      $25 = 0x2b29f7036230 "innodb_system"
       
      (gdb) p index->table->name
      $26 = {m_name = 0x2b2a367ff340 "innodb_change_buffer", static part_suffix = "#P#"}
       
      (gdb) p index->name
      $27 = {m_name = 0x2b2a087db270 "CLUST_IND"}
      

      I've printed the dictionary operation lock and the dict_sys mutex:

      (gdb) print dict_operation_lock
      $36 = {lock_word = 536870912, waiters = 0, sx_recursive = 0, writer_is_wait_ex = false, writer_thread = 0, event = 0x2b2c5deec8d0, wait_ex_event = 0x2b2c5deec940,
      cfile_name = 0x55f9dbc4d608 "/local/p4clients/pkgbuild-9zvR5/workspace/src/RDSMariaDB/storage/innobase/dict/dict0dict.cc",
      last_x_file_name = 0x55f9dbc41fe0 "/local/p4clients/pkgbuild-9zvR5/workspace/src/RDSMariaDB/storage/innobase/srv/srv0srv.cc", cline = 1097, is_block_lock = 0,
      last_x_line = 2025, count_os_wait = 0, list = {prev = 0x2b29f7027a78, next = 0x2b2a2c3f4120}, pfs_psi = 0x2b2a0800d680}
       
      (gdb) print dict_sys.mutex
      $37 = {m_impl = {m_lock_word = 0, m_event = 0x2b2c5deec860, m_policy = {m_count = {m_spins = 0, m_waits = 0, m_calls = 0, m_enabled = false}, m_id = LATCH_ID_DICT_SYS}},
      m_ptr = 0x2b29f70bca80}
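
      As a reading aid for the two dumps above, here is a minimal Python sketch that turns the raw lock words into a rough description. It assumes the constants of the pre-10.6 InnoDB rw_lock_t (X_LOCK_DECR = 0x20000000, X_LOCK_HALF_DECR = 0x10000000) and the usual mutex states (0 unlocked, 1 locked, 2 locked with waiters); the helper names are hypothetical and this build may differ. Under those assumptions, neither the dictionary operation lock nor the dict_sys mutex appears to be held.

```python
# Constants as found in InnoDB's sync0rw.h for the pre-10.6 rw_lock_t.
# Assumption: the build in this report uses the same values.
X_LOCK_DECR = 0x20000000
X_LOCK_HALF_DECR = 0x10000000


def describe_rw_lock_word(lock_word: int) -> str:
    """Give a rough, human-readable reading of rw_lock_t::lock_word."""
    if lock_word == X_LOCK_DECR:
        return "unlocked"
    if X_LOCK_HALF_DECR < lock_word < X_LOCK_DECR:
        return f"S-locked by {X_LOCK_DECR - lock_word} reader(s)"
    if lock_word == X_LOCK_HALF_DECR:
        return "SX-locked"
    if 0 < lock_word < X_LOCK_HALF_DECR:
        return "SX-locked and S-locked"
    if lock_word == 0:
        return "X-locked"
    # Negative values cover several cases (recursive X, X plus SX,
    # or a writer waiting for readers); they are not split out here.
    return "held or requested in X mode"


def describe_mutex_word(m_lock_word: int) -> str:
    """Map the mutex m_lock_word to its assumed state name."""
    states = {0: "unlocked", 1: "locked", 2: "locked with waiters"}
    return states.get(m_lock_word, "unknown")


# Values copied from the gdb output above.
print("dict_operation_lock:", describe_rw_lock_word(536870912))
print("dict_sys.mutex:", describe_mutex_word(0))
```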
      

      Locks of the two threads:

      thread 4

      (gdb) p lock->waiters
      $59 = 1
       
      (gdb) p lock->event
      $56 = (os_event_t) 0x2b2ac905a9b0
       
      (gdb) p lock->wait_ex_event
      $57 = (os_event_t) 0x2b2ac905aa20
      

      thread 5

      (gdb) p lock->waiters
      $59 = 1
       
      (gdb) p lock->event
      $54 = (os_event_t) 0x2b2ac905a9b0
       
      (gdb) p lock->wait_ex_event
      $55 = (os_event_t) 0x2b2ac905aa20
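
      The event addresses printed for the two threads are identical. Assuming each rw-lock owns its own event and wait_ex_event objects, this suggests that both threads are looking at the same lock. A small Python sketch, with a hypothetical helper name, that groups threads by those addresses:

```python
from collections import defaultdict

# Values copied from the gdb output above (thread number -> lock fields).
waits = {
    4: {"event": 0x2B2AC905A9B0, "wait_ex_event": 0x2B2AC905AA20, "waiters": 1},
    5: {"event": 0x2B2AC905A9B0, "wait_ex_event": 0x2B2AC905AA20, "waiters": 1},
}


def group_by_event(waits):
    """Group thread numbers by the (event, wait_ex_event) pair of their lock."""
    groups = defaultdict(list)
    for thread, lock in sorted(waits.items()):
        groups[(lock["event"], lock["wait_ex_event"])].append(thread)
    return dict(groups)


for (event, wait_ex), threads in group_by_event(waits).items():
    # A single group containing both threads means the addresses match.
    print(hex(event), hex(wait_ex), threads)
```

      Matching addresses alone do not show which thread, if any, holds the lock; the full stack traces in the attachments are still needed for that.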
      

      Attachments

        1. extra_info.txt
          7 kB
          Bernardo Perez
        2. stack.txt
          37 kB
          Bernardo Perez
        3. stack1.txt
          51 kB
          Bernardo Perez
        4. stack2.txt
          74 kB
          Bernardo Perez
        5. stack3.txt
          38 kB
          Bernardo Perez

          Activity

            Bernardo Perez added a comment:

            Hello Marko Mäkelä, no problem at all. We had to terminate those instances. If/when we encounter the same issue, we will follow your request and extract the data as you suggested.

            I will come back once we get the information.

            Regards,

            Marko Mäkelä (marko) added a comment:

            It is really difficult to tell, but it might be the case that the system tablespace (or the change buffer structures) had been corrupted due to MDEV-24449. Without more data, it is hard to draw any conclusions.

            Marko Mäkelä (marko) added a comment:

            MDEV-27734 disabled the change buffer by default because of corruption like MDEV-27765.

            For the root cause of the hang MDEV-20934, we now have a better hypothesis.

            Marko Mäkelä (marko) added a comment:

            Bernardo Perez, we were finally able to reproduce a hang scenario similar to MDEV-20934 in our internal testing, in MDEV-30009. I can’t be sure if this is exactly what you experienced, but I think that it is plausible.

            Can you reproduce this with MariaDB Server 10.5.19 or later?

            Sergei Golubchik (serg) added a comment:

            I'm going to close it now. If feedback comes, we'll reopen.

            People

              marko Marko Mäkelä
              Bernardo Perez Bernardo Perez