Details

    Description

      The MariaDB server hangs intermittently for us on a production system, and unfortunately the issue is not reliably reproducible. The last time this happened, two subprocesses of the server hung, and htop showed an increase in context switches for them.

      Here are the stack traces captured for these two processes, each of which seemed to have only one thread:

      (gdb) attach 2301451
      Attaching to program: target:/usr/sbin/mariadbd, process 2301451
      `target:/usr/sbin/mariadbd' has disappeared; keeping its symbols.
      warning: .dynamic section for "target:/lib/x86_64-linux-gnu/libuuid.so.1" is not at the expected address (wrong library or version mismatch?)
      Reading symbols from target:/lib/x86_64-linux-gnu/libpcre2-8.so.0...
      (No debugging symbols found in target:/lib/x86_64-linux-gnu/libpcre2-8.so.0)
      Reading symbols from target:/lib/x86_64-linux-gnu/libcrypt.so.1...
      (No debugging symbols found in target:/lib/x86_64-linux-gnu/libcrypt.so.1)
      .
      .
      .
      [Thread debugging using libthread_db enabled]
      Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
      0x00007fd73b7f47f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7fd6841145c0, rem=0x7fd6841145c0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
      78      ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
      (gdb) bt
      #0  0x00007fd73b7f47f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7fd6841145c0, rem=0x7fd6841145c0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
      #1  0x00007fd73b7f9677 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
      #2  0x000056402673f831 in ?? ()
      #3  0x0000564026ed7e8a in ?? ()
      #4  0x0000564026ee29b5 in ?? ()
      #5  0x0000564026e5309b in ?? ()
      #6  0x0000564026e56ac6 in ?? ()
      #7  0x0000564026e56e22 in ?? ()
      #8  0x0000564026e68dbe in ?? ()
      #9  0x0000564026db18fa in ?? ()
      #10 0x0000564026ad01e0 in handler::ha_write_row(unsigned char const*) ()
      #11 0x000056402681d6cd in write_record(THD*, TABLE*, st_copy_info*, select_result*) ()
      #12 0x000056402682745f in mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*) ()
      #13 0x000056402685c148 in mysql_execute_command(THD*, bool) ()
      #14 0x0000564026860b97 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
      #15 0x000056402686338d in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool) ()
      #16 0x0000564026865288 in do_command(THD*, bool) ()
      #17 0x000056402698caef in do_handle_one_connection(CONNECT*, bool) ()
      #18 0x000056402698ce3d in handle_one_connection ()
      #19 0x0000564026cf1d56 in ?? ()
      #20 0x00007fd73b7a3ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
      #21 0x00007fd73b835850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
      

      and

      (gdb) bt
      #0  0x00007fd73b7f47f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
          req=0x7fd6841f5c60, rem=0x7fd6841f5c60) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
      #1  0x00007fd73b7f9677 in __GI___nanosleep (req=<optimized out>, rem=<optimized out>)
          at ../sysdeps/unix/sysv/linux/nanosleep.c:25
      #2  0x000056402673f831 in ?? ()
      #3  0x0000564026ed7e8a in ?? ()
      #4  0x00005640266dbe5b in ?? ()
      #5  0x0000564026ee264a in ?? ()
      #6  0x0000564026e78a5c in ?? ()
      #7  0x0000564026f7d59d in ?? ()
      #8  0x0000564026f7de84 in ?? ()
      #9  0x0000564026e86128 in ?? ()
      #10 0x0000564026e3ca52 in ?? ()
      #11 0x0000564026ea6824 in ?? ()
      #12 0x0000564026ea5655 in ?? ()
      #13 0x0000564026d9d572 in ?? ()
      #14 0x0000564026ac4c40 in ha_rollback_trans(THD*, bool) ()
      #15 0x000056402699db07 in trans_rollback(THD*) ()
      #16 0x000056402685e5ff in mysql_execute_command(THD*, bool) ()
      #17 0x0000564026860b97 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
      #18 0x000056402686338d in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool) ()
      #19 0x0000564026865288 in do_command(THD*, bool) ()
      #20 0x000056402698caef in do_handle_one_connection(CONNECT*, bool) ()
      #21 0x000056402698ce3d in handle_one_connection ()
      #22 0x0000564026cf1d56 in ?? ()
      #23 0x00007fd73b7a3ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
      #24 0x00007fd73b835850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
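
      For completeness, here is a hedged sketch of how full backtraces of all threads can be captured non-interactively (the PID is the one from this report; the output file name is an illustrative assumption):

      # Capture full backtraces of every thread in a single batch run;
      # matching debug symbols are needed for meaningful frames.
      gdb -p 2301451 -batch \
          -ex 'set pagination off' \
          -ex 'thread apply all bt full' > mariadbd_full_bt_all_threads.txt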
      

          Activity

            dhruv9vats Dhruv Vats added a comment -

            We didn't find any log entries about running out of buffer pool space. However, we almost always saw messages similar to the one below:

            [Warning] Aborted connection 820363 to db: 'event_store' user: 'event_store' host: 'localhost' (Got an error reading communication packets)

            Some additional context in this direction. Most of the hangs correlated with one of the following events:
            - during or after a scheduled backup of all databases
            - slightly higher than normal traffic/load
            Please note that while these events surrounded the hangs, the events do not always trigger a hang (backups happen daily, but stalls happen much less frequently).
            On one of the servers, the frequency of hangs temporarily appeared to decrease after increasing `innodb_buffer_pool_size`, but the hangs didn't go away completely. This led us to believe that the issue could be elsewhere and is merely aggravated when the buffer pool falls short (the default was 128MB, which is awfully low). A configuration sketch follows below.
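
            For reference, a minimal sketch of where such a setting would live; the file path and the 4G value are illustrative assumptions, not recommendations from this report:

            # /etc/mysql/mariadb.conf.d/50-server.cnf (path varies by distribution)
            [mysqld]
            # Illustrative value only; size this to the working set, often
            # 50-70% of RAM on a dedicated database server.
            innodb_buffer_pool_size = 4G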

            dhruv9vats Dhruv Vats added a comment -

            Adding a datapoint for another hang that was observed:
            mariadbd_full_bt_all_threads_20240605_1949.txt

            The number of context switches increased substantially too.
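
            One way to quantify such an increase outside htop is sketched below (the PID is the one from the traces above; pidstat ships with the sysstat package):

            # Per-second voluntary/involuntary context switches for the process
            pidstat -w -p 2301451 1
            # Or sample the kernel's raw counters directly:
            grep ctxt_switches /proc/2301451/status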


            marko Marko Mäkelä added a comment -

            dhruv9vats, thank you. I see that buf_page_get_low() in Thread 8 is waiting for a ROW_FORMAT=COMPRESSED page to become available. Maybe it should rather wait by invoking os_aio_wait_until_no_pending_writes() or similar. I am not happy with this design, but I can’t come up with a better one. When the uncompressed copy of a ROW_FORMAT=COMPRESSED page is being allocated or freed, the block descriptor will move, and we can’t simply wait for the page latch as we do in other cases of accessing buffer pool pages.

            I can also observe that a purge_worker_task is being waited for but not being executed. I wonder if MDEV-33669 could have fixed this. Can you reproduce this hang with MariaDB Server 10.11.8?

            Next time the hang occurs (or if you saved a core dump of the previous hang), please post the output of the following as well:

            print write_slots.m_cache.m_pos
            print read_slots.m_cache.m_pos
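
            For what it's worth, a sketch of pulling those values from a saved core dump non-interactively; the core file path is hypothetical, and matching debug symbols are assumed:

            # /path/to/core is a placeholder for the saved core dump
            gdb /usr/sbin/mariadbd /path/to/core -batch \
                -ex 'print write_slots.m_cache.m_pos' \
                -ex 'print read_slots.m_cache.m_pos'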
            

            dhruv9vats Dhruv Vats added a comment -

            Thanks for the quick response @Marko Mäkelä,

            > Can you reproduce this hang with MariaDB Server 10.11.8?
            We'll look into upgrading to the fix version 10.11.8 and see if we encounter this again. But we only encounter this hang intermittently and haven't been able to reproduce it reliably.

            > Next time the hang occurs (or if you saved a core dump of the previous hang), please post the output of the following as well
            Will do.


            marko Marko Mäkelä added a comment -

            wlad agreed with me that MDEV-33669 could explain this. It was a very elusive bug.

            So, dhruv9vats, please upgrade to 10.11.8 and report back in a few weeks if the hang still occurs.
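
            Once upgraded, the running server version can be confirmed with a quick check, e.g.:

            # Verify the server is actually running the upgraded binary
            mariadb -e 'SELECT VERSION();'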


            People

              Assignee: Debarun Banerjee (debarun)
              Reporter: Dhruv Vats (dhruv9vats)
              Votes: 1
              Watchers: 8

